TinyTorch/modules/source/12_benchmarking/benchmarking_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1015a91f",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module 12: Benchmarking - Systematic ML Performance Evaluation\n",
    "\n",
    "Welcome to the Benchmarking module! This is where we learn to systematically evaluate ML systems using industry-standard methodology inspired by MLPerf.\n",
    "\n",
    "## Learning Goals\n",
    "- Understand the four-component MLPerf benchmarking architecture\n",
    "- Implement different benchmark scenarios (latency, throughput, offline)\n",
    "- Apply statistical validation for meaningful results\n",
    "- Create professional performance reports for ML projects\n",
    "- Learn to avoid common benchmarking pitfalls\n",
    "\n",
    "## Build → Use → Analyze\n",
    "1. **Build**: Benchmarking framework with proper statistical validation\n",
    "2. **Use**: Apply systematic evaluation to your TinyTorch models\n",
    "3. **Analyze**: Generate professional reports with statistical confidence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d09b187a",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "benchmarking-imports",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.benchmarking\n",
    "\n",
    "#| export\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import time\n",
    "import statistics\n",
    "import json\n",
    "import math\n",
    "from typing import Dict, List, Tuple, Optional, Any, Callable\n",
    "from dataclasses import dataclass\n",
    "from enum import Enum\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Import our TinyTorch dependencies\n",
    "try:\n",
    "    from tinytorch.core.tensor import Tensor\n",
    "    from tinytorch.core.networks import Sequential\n",
    "    from tinytorch.core.layers import Dense\n",
    "    from tinytorch.core.activations import ReLU, Softmax\n",
    "    from tinytorch.core.dataloader import DataLoader\n",
    "except ImportError:\n",
    "    # For development, import from local modules\n",
    "    parent_dirs = [\n",
    "        os.path.join(os.path.dirname(__file__), '..', '01_tensor'),\n",
    "        os.path.join(os.path.dirname(__file__), '..', '03_layers'),\n",
    "        os.path.join(os.path.dirname(__file__), '..', '02_activations'),\n",
    "        os.path.join(os.path.dirname(__file__), '..', '04_networks'),\n",
    "        os.path.join(os.path.dirname(__file__), '..', '06_dataloader')\n",
    "    ]\n",
    "    for path in parent_dirs:\n",
    "        if path not in sys.path:\n",
    "            sys.path.append(path)\n",
    "    \n",
    "    try:\n",
    "        from tensor_dev import Tensor\n",
    "        from networks_dev import Sequential\n",
    "        from layers_dev import Dense\n",
    "        from activations_dev import ReLU, Softmax\n",
    "        from dataloader_dev import DataLoader\n",
    "    except ImportError:\n",
    "        # Fallback for missing modules\n",
    "        print(\"⚠️  Some TinyTorch modules not available - using minimal implementations\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42b509fc",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "benchmarking-setup",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| hide\n",
    "#| export\n",
    "def _should_show_plots():\n",
    "    \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
    "    is_pytest = (\n",
    "        'pytest' in sys.modules or\n",
    "        'test' in sys.argv or\n",
    "        os.environ.get('PYTEST_CURRENT_TEST') is not None or\n",
    "        any('test' in arg for arg in sys.argv) or\n",
    "        any('pytest' in arg for arg in sys.argv)\n",
    "    )\n",
    "    \n",
    "    return not is_pytest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "617fc409",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "benchmarking-welcome",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "print(\"📊 TinyTorch Benchmarking Module\")\n",
    "print(f\"NumPy version: {np.__version__}\")\n",
    "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
    "print(\"Ready to build professional ML benchmarking tools!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "476a1522",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/source/12_benchmarking/benchmarking_dev.py`  \n",
    "**Building Side:** Code exports to `tinytorch.core.benchmarking`\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.benchmarking import TinyTorchPerf, BenchmarkScenarios\n",
    "from tinytorch.core.benchmarking import StatisticalValidator, PerformanceReporter\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Deep understanding of systematic evaluation\n",
    "- **Production:** Professional benchmarking methodology\n",
    "- **Projects:** Tools for validating your ML project performance\n",
    "- **Career:** Industry-standard skills for ML engineering roles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "302b6a5c",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## What is ML Benchmarking?\n",
    "\n",
    "### The Systematic Evaluation Problem\n",
    "When you build ML systems, you need to answer critical questions:\n",
    "- **Is my model actually better?** Statistical significance vs random variation\n",
    "- **How does it perform in production?** Latency, throughput, resource usage\n",
    "- **Which approach should I choose?** Systematic comparison methodology\n",
    "- **Can I trust my results?** Avoiding common benchmarking pitfalls\n",
    "\n",
    "### The MLPerf Architecture\n",
    "MLPerf (Machine Learning Performance) defines the industry standard for ML benchmarking:\n",
    "\n",
    "```\n",
    "┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐\n",
    "│  Load Generator │───▶│ System Under    │───▶│    Dataset      │\n",
    "│   (Controls     │    │ Test (Your ML   │    │ (Standardized   │\n",
    "│    Queries)     │    │    Model)       │    │  Evaluation)    │\n",
    "└─────────────────┘    └─────────────────┘    └─────────────────┘\n",
    "```\n",
    "\n",
    "### The Four Components\n",
    "1. **System Under Test (SUT)**: Your ML model/system being evaluated\n",
    "2. **Dataset**: Standardized evaluation data (CIFAR-10, ImageNet, etc.)\n",
    "3. **Model**: The specific architecture and weights being tested\n",
    "4. **Load Generator**: Controls how evaluation queries are sent to the SUT\n",
    "\n",
    "### Why This Matters\n",
    "- **Reproducibility**: Others can verify your results\n",
    "- **Comparability**: Fair comparison between different approaches\n",
    "- **Statistical validity**: Meaningful conclusions from your data\n",
    "- **Industry standards**: Skills you'll use in ML engineering careers\n",
    "\n",
    "### Real-World Examples\n",
    "- **Google**: Uses similar patterns for production ML system evaluation\n",
    "- **Meta**: A/B testing frameworks follow these principles\n",
    "- **OpenAI**: GPT model comparisons use systematic benchmarking\n",
    "- **Research**: All major ML conferences require proper evaluation methodology"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5613b9ce",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## Step 1: Benchmark Scenarios - How to Measure Performance\n",
    "\n",
    "### The Three Standard Scenarios\n",
    "Different use cases require different performance measurements:\n",
    "\n",
    "#### 1. Single-Stream Scenario\n",
    "- **Use case**: Mobile/edge inference, interactive applications\n",
    "- **Pattern**: Send next query only after previous completes\n",
    "- **Metric**: 90th percentile latency (tail latency)\n",
    "- **Why**: Users care about worst-case response time\n",
    "\n",
    "#### 2. Server Scenario  \n",
    "- **Use case**: Production web services, API endpoints\n",
    "- **Pattern**: Poisson distribution of concurrent queries\n",
    "- **Metric**: Queries per second (QPS) at acceptable latency\n",
    "- **Why**: Servers handle multiple simultaneous requests\n",
    "\n",
    "#### 3. Offline Scenario\n",
    "- **Use case**: Batch processing, data center workloads\n",
    "- **Pattern**: Send all samples at once for batch processing\n",
    "- **Metric**: Throughput (samples per second)\n",
    "- **Why**: Batch jobs care about total processing time\n",
    "\n",
    "### Mathematical Foundation\n",
    "Each scenario tests different aspects:\n",
    "- **Latency**: Time for single sample = f(model_complexity, hardware)\n",
    "- **Throughput**: Samples per second = f(parallelism, batch_size)\n",
    "- **Efficiency**: Resource utilization = f(memory, compute, bandwidth)\n",
    "\n",
    "### Why Multiple Scenarios?\n",
    "Real ML systems have different requirements:\n",
    "- **Chatbot**: Low latency for good user experience\n",
    "- **Image API**: High throughput for many concurrent users  \n",
    "- **Data pipeline**: Maximum batch processing efficiency"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97dc390b",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 2: Statistical Validation - Ensuring Meaningful Results\n",
    "\n",
    "### The Significance Problem\n",
    "Common benchmarking mistakes:\n",
    "```python\n",
    "# BAD: Single run, no statistical validation\n",
    "result_a = model_a.run_once()  # 94.2% accuracy\n",
    "result_b = model_b.run_once()  # 94.7% accuracy\n",
    "print(\"Model B is better!\")  # Maybe, maybe not...\n",
    "```\n",
    "\n",
    "### The MLPerf Solution\n",
    "Proper statistical validation:\n",
    "```python\n",
    "# GOOD: Multiple runs with confidence intervals\n",
    "results_a = [model_a.run() for _ in range(10)]  # [93.8, 94.1, 94.3, ...]\n",
    "results_b = [model_b.run() for _ in range(10)]  # [94.2, 94.5, 94.9, ...]\n",
    "significance = statistical_test(results_a, results_b)\n",
    "print(f\"Model B is {significance.p_value < 0.05} better with p={significance.p_value}\")\n",
    "```\n",
    "\n",
    "### Key Statistical Concepts\n",
    "- **Confidence intervals**: Range of likely true values\n",
    "- **P-values**: Probability that difference is due to chance\n",
    "- **Effect size**: Magnitude of improvement (not just significance)\n",
    "- **Multiple comparisons**: Adjusting for testing many approaches\n",
    "\n",
    "### Sample Size Calculation\n",
    "MLPerf uses this formula for minimum samples:\n",
    "```\n",
    "n = Φ^(-1)((1-C)/2)^2 * p(1-p) / MOE^2\n",
    "```\n",
    "Where:\n",
    "- C = confidence level (0.99)\n",
    "- p = percentile (0.90 for 90th percentile)\n",
    "- MOE = margin of error ((1-p)/20)\n",
    "\n",
    "For 90th percentile with 99% confidence: **n = 24,576 samples**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e4d9c8f",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "benchmark-scenarios",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class BenchmarkScenario(Enum):\n",
    "    \"\"\"Standard benchmark scenarios from MLPerf\"\"\"\n",
    "    SINGLE_STREAM = \"single_stream\"\n",
    "    SERVER = \"server\"\n",
    "    OFFLINE = \"offline\"\n",
    "\n",
    "@dataclass\n",
    "class BenchmarkResult:\n",
    "    \"\"\"Results from a benchmark run\"\"\"\n",
    "    scenario: BenchmarkScenario\n",
    "    latencies: List[float]  # All latency measurements in seconds\n",
    "    throughput: float      # Samples per second\n",
    "    accuracy: float        # Model accuracy (0-1)\n",
    "    metadata: Optional[Dict[str, Any]] = None\n",
    "\n",
    "#| export\n",
    "class BenchmarkScenarios:\n",
    "    \"\"\"\n",
    "    Implements the three standard MLPerf benchmark scenarios.\n",
    "    \n",
    "    TODO: Implement the three benchmark scenarios following MLPerf patterns.\n",
    "    \n",
    "    UNDERSTANDING THE SCENARIOS:\n",
    "    1. Single-Stream: Send queries one at a time, measure latency\n",
    "    2. Server: Send queries following Poisson distribution, measure QPS\n",
    "    3. Offline: Send all queries at once, measure total throughput\n",
    "    \n",
    "    IMPLEMENTATION APPROACH:\n",
    "    1. Each scenario should run the model multiple times\n",
    "    2. Collect latency measurements for each run\n",
    "    3. Calculate appropriate metrics for each scenario\n",
    "    4. Return BenchmarkResult with all measurements\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    scenarios = BenchmarkScenarios()\n",
    "    result = scenarios.single_stream(model, dataset, num_queries=1000)\n",
    "    print(f\"90th percentile latency: {result.latencies[int(0.9 * len(result.latencies))]} seconds\")\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.results = []\n",
    "    \n",
    "    def single_stream(self, model: Callable, dataset: List, num_queries: int = 1000) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run single-stream benchmark scenario.\n",
    "        \n",
    "        TODO: Implement single-stream benchmarking.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Initialize empty list for latencies\n",
    "        2. For each query (up to num_queries):\n",
    "           a. Get next sample from dataset (cycle if needed)\n",
    "           b. Record start time\n",
    "           c. Run model on sample\n",
    "           d. Record end time\n",
    "           e. Calculate latency = end - start\n",
    "           f. Add latency to list\n",
    "        3. Calculate throughput = num_queries / total_time\n",
    "        4. Calculate accuracy if possible\n",
    "        5. Return BenchmarkResult with SINGLE_STREAM scenario\n",
    "        \n",
    "        HINTS:\n",
    "        - Use time.perf_counter() for precise timing\n",
    "        - Use dataset[i % len(dataset)] to cycle through samples\n",
    "        - Sort latencies for percentile calculations\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        latencies = []\n",
    "        correct_predictions = 0\n",
    "        total_start_time = time.perf_counter()\n",
    "        \n",
    "        for i in range(num_queries):\n",
    "            # Get sample (cycle through dataset)\n",
    "            sample = dataset[i % len(dataset)]\n",
    "            \n",
    "            # Time the inference\n",
    "            start_time = time.perf_counter()\n",
    "            result = model(sample)\n",
    "            end_time = time.perf_counter()\n",
    "            \n",
    "            latency = end_time - start_time\n",
    "            latencies.append(latency)\n",
    "            \n",
    "            # Simple accuracy calculation (if possible)\n",
    "            if hasattr(sample, 'target') and hasattr(result, 'data'):\n",
    "                predicted = np.argmax(result.data)\n",
    "                if predicted == sample.target:\n",
    "                    correct_predictions += 1\n",
    "        \n",
    "        total_time = time.perf_counter() - total_start_time\n",
    "        throughput = num_queries / total_time\n",
    "        accuracy = correct_predictions / num_queries if num_queries > 0 else 0.0\n",
    "        \n",
    "        return BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.SINGLE_STREAM,\n",
    "            latencies=sorted(latencies),\n",
    "            throughput=throughput,\n",
    "            accuracy=accuracy,\n",
    "            metadata={\"num_queries\": num_queries}\n",
    "        )\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def server(self, model: Callable, dataset: List, target_qps: float = 10.0, \n",
    "               duration: float = 60.0) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run server benchmark scenario with Poisson-distributed queries.\n",
    "        \n",
    "        TODO: Implement server benchmarking.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Calculate inter-arrival time = 1.0 / target_qps\n",
    "        2. Run for specified duration:\n",
    "           a. Wait for next query arrival (Poisson distribution)\n",
    "           b. Get sample from dataset\n",
    "           c. Record start time\n",
    "           d. Run model\n",
    "           e. Record end time and latency\n",
    "        3. Calculate actual QPS = total_queries / duration\n",
    "        4. Return results\n",
    "        \n",
    "        HINTS:\n",
    "        - Use np.random.exponential(inter_arrival_time) for Poisson\n",
    "        - Track both query arrival times and completion times\n",
    "        - Server scenario cares about sustained throughput\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        latencies = []\n",
    "        inter_arrival_time = 1.0 / target_qps\n",
    "        start_time = time.perf_counter()\n",
    "        current_time = start_time\n",
    "        query_count = 0\n",
    "        \n",
    "        while (current_time - start_time) < duration:\n",
    "            # Wait for next query (Poisson distribution)\n",
    "            wait_time = np.random.exponential(inter_arrival_time)\n",
    "            time.sleep(min(wait_time, 0.001))  # Small sleep to simulate waiting\n",
    "            \n",
    "            # Get sample\n",
    "            sample = dataset[query_count % len(dataset)]\n",
    "            \n",
    "            # Time the inference\n",
    "            query_start = time.perf_counter()\n",
    "            result = model(sample)\n",
    "            query_end = time.perf_counter()\n",
    "            \n",
    "            latency = query_end - query_start\n",
    "            latencies.append(latency)\n",
    "            \n",
    "            query_count += 1\n",
    "            current_time = time.perf_counter()\n",
    "        \n",
    "        actual_duration = current_time - start_time\n",
    "        actual_qps = query_count / actual_duration\n",
    "        \n",
    "        return BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.SERVER,\n",
    "            latencies=sorted(latencies),\n",
    "            throughput=actual_qps,\n",
    "            accuracy=0.0,  # Would need labels for accuracy\n",
    "            metadata={\"target_qps\": target_qps, \"actual_qps\": actual_qps, \"duration\": actual_duration}\n",
    "        )\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def offline(self, model: Callable, dataset: List, batch_size: int = 32) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run offline benchmark scenario with batch processing.\n",
    "        \n",
    "        TODO: Implement offline benchmarking.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Group dataset into batches of batch_size\n",
    "        2. For each batch:\n",
    "           a. Record start time\n",
    "           b. Run model on entire batch\n",
    "           c. Record end time\n",
    "           d. Calculate batch latency\n",
    "        3. Calculate total throughput = total_samples / total_time\n",
    "        4. Return results\n",
    "        \n",
    "        HINTS:\n",
    "        - Process data in batches for efficiency\n",
    "        - Measure total time for all batches\n",
    "        - Offline cares about maximum throughput\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        latencies = []\n",
    "        total_samples = len(dataset)\n",
    "        total_start_time = time.perf_counter()\n",
    "        \n",
    "        for batch_start in range(0, total_samples, batch_size):\n",
    "            batch_end = min(batch_start + batch_size, total_samples)\n",
    "            batch = dataset[batch_start:batch_end]\n",
    "            \n",
    "            # Time the batch inference\n",
    "            batch_start_time = time.perf_counter()\n",
    "            for sample in batch:\n",
    "                result = model(sample)\n",
    "            batch_end_time = time.perf_counter()\n",
    "            \n",
    "            batch_latency = batch_end_time - batch_start_time\n",
    "            latencies.append(batch_latency)\n",
    "        \n",
    "        total_time = time.perf_counter() - total_start_time\n",
    "        throughput = total_samples / total_time\n",
    "        \n",
    "        return BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.OFFLINE,\n",
    "            latencies=latencies,\n",
    "            throughput=throughput,\n",
    "            accuracy=0.0,  # Would need labels for accuracy\n",
    "            metadata={\"batch_size\": batch_size, \"total_samples\": total_samples}\n",
    "        )\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cf329ce",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Unit Test: Benchmark Scenarios\n",
    "\n",
    "Let's test our benchmark scenarios with a simple mock model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a53ed486",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "test-scenarios",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_benchmark_scenarios():\n",
    "    \"\"\"Test that our benchmark scenarios work correctly.\"\"\"\n",
    "    print(\"🔬 Unit Test: Benchmark Scenarios...\")\n",
    "    \n",
    "    # Create a simple mock model and dataset\n",
    "    def mock_model(sample):\n",
    "        # Simulate some processing time\n",
    "        time.sleep(0.001)  # 1ms processing\n",
    "        return {\"prediction\": np.random.rand(10)}\n",
    "    \n",
    "    mock_dataset = [{\"data\": np.random.rand(10)} for _ in range(100)]\n",
    "    \n",
    "    # Test scenarios\n",
    "    scenarios = BenchmarkScenarios()\n",
    "    \n",
    "    # Test single-stream\n",
    "    single_result = scenarios.single_stream(mock_model, mock_dataset, num_queries=10)\n",
    "    assert single_result.scenario == BenchmarkScenario.SINGLE_STREAM\n",
    "    assert len(single_result.latencies) == 10\n",
    "    assert single_result.throughput > 0\n",
    "    print(f\"✅ Single-stream: {len(single_result.latencies)} measurements\")\n",
    "    \n",
    "    # Test server (short duration for testing)\n",
    "    server_result = scenarios.server(mock_model, mock_dataset, target_qps=5.0, duration=2.0)\n",
    "    assert server_result.scenario == BenchmarkScenario.SERVER\n",
    "    assert len(server_result.latencies) > 0\n",
    "    assert server_result.throughput > 0\n",
    "    print(f\"✅ Server: {len(server_result.latencies)} queries processed\")\n",
    "    \n",
    "    # Test offline\n",
    "    offline_result = scenarios.offline(mock_model, mock_dataset, batch_size=5)\n",
    "    assert offline_result.scenario == BenchmarkScenario.OFFLINE\n",
    "    assert len(offline_result.latencies) > 0\n",
    "    assert offline_result.throughput > 0\n",
    "    print(f\"✅ Offline: {len(offline_result.latencies)} batches processed\")\n",
    "    \n",
    "    print(\"✅ All benchmark scenarios working correctly!\")\n",
    "\n",
    "# Run the test\n",
    "test_benchmark_scenarios()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0888ece9",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 3: Statistical Validation - Ensuring Meaningful Results\n",
    "\n",
    "### The Confidence Problem\n",
    "How do we know if one model is actually better than another?\n",
    "\n",
    "### Statistical Testing for ML\n",
    "We need to test the null hypothesis: \"There is no significant difference between models\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa7342ad",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "statistical-validator",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "@dataclass\n",
    "class StatisticalValidation:\n",
    "    \"\"\"Results from statistical validation\"\"\"\n",
    "    is_significant: bool\n",
    "    p_value: float\n",
    "    effect_size: float\n",
    "    confidence_interval: Tuple[float, float]\n",
    "    recommendation: str\n",
    "\n",
    "#| export\n",
    "class StatisticalValidator:\n",
    "    \"\"\"\n",
    "    Validates benchmark results using proper statistical methods.\n",
    "    \n",
    "    TODO: Implement statistical validation for benchmark results.\n",
    "    \n",
    "    UNDERSTANDING STATISTICAL TESTING:\n",
    "    1. Null hypothesis: No difference between models\n",
    "    2. T-test: Compare means of two groups\n",
    "    3. P-value: Probability of seeing this difference by chance\n",
    "    4. Effect size: Magnitude of the difference\n",
    "    5. Confidence interval: Range of likely true values\n",
    "    \n",
    "    IMPLEMENTATION APPROACH:\n",
    "    1. Calculate basic statistics (mean, std, n)\n",
    "    2. Perform t-test to get p-value\n",
    "    3. Calculate effect size (Cohen's d)\n",
    "    4. Calculate confidence interval\n",
    "    5. Provide clear recommendation\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, confidence_level: float = 0.95):\n",
    "        self.confidence_level = confidence_level\n",
    "        self.alpha = 1 - confidence_level\n",
    "    \n",
    "    def validate_comparison(self, results_a: List[float], results_b: List[float]) -> StatisticalValidation:\n",
    "        \"\"\"\n",
    "        Compare two sets of benchmark results statistically.\n",
    "        \n",
    "        TODO: Implement statistical comparison.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Calculate basic statistics for both groups\n",
    "        2. Perform two-sample t-test\n",
    "        3. Calculate effect size (Cohen's d)\n",
    "        4. Calculate confidence interval for the difference\n",
    "        5. Generate recommendation based on results\n",
    "        \n",
    "        HINTS:\n",
    "        - Use scipy.stats.ttest_ind for t-test (or implement manually)\n",
    "        - Cohen's d = (mean_a - mean_b) / pooled_std\n",
    "        - CI = difference ± (critical_value * standard_error)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        import math\n",
    "        \n",
    "        # Basic statistics\n",
    "        mean_a = statistics.mean(results_a)\n",
    "        mean_b = statistics.mean(results_b)\n",
    "        std_a = statistics.stdev(results_a)\n",
    "        std_b = statistics.stdev(results_b)\n",
    "        n_a = len(results_a)\n",
    "        n_b = len(results_b)\n",
    "        \n",
    "        # Two-sample t-test (simplified)\n",
    "        pooled_std = math.sqrt(((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2))\n",
    "        standard_error = pooled_std * math.sqrt(1/n_a + 1/n_b)\n",
    "        \n",
    "        if standard_error == 0:\n",
    "            t_stat = 0\n",
    "            p_value = 1.0\n",
    "        else:\n",
    "            t_stat = (mean_a - mean_b) / standard_error\n",
    "            # Simplified p-value calculation (assuming normal distribution)\n",
    "            p_value = 2 * (1 - abs(t_stat) / (abs(t_stat) + math.sqrt(n_a + n_b - 2)))\n",
    "        \n",
    "        # Effect size (Cohen's d)\n",
    "        effect_size = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0\n",
    "        \n",
    "        # Confidence interval for difference\n",
    "        difference = mean_a - mean_b\n",
    "        critical_value = 1.96  # Approximate for 95% CI\n",
    "        margin_of_error = critical_value * standard_error\n",
    "        ci_lower = difference - margin_of_error\n",
    "        ci_upper = difference + margin_of_error\n",
    "        \n",
    "        # Determine significance\n",
    "        is_significant = p_value < self.alpha\n",
    "        \n",
    "        # Generate recommendation\n",
    "        if is_significant:\n",
    "            if effect_size > 0.8:\n",
    "                recommendation = \"Large significant difference - strong evidence for improvement\"\n",
    "            elif effect_size > 0.5:\n",
    "                recommendation = \"Medium significant difference - good evidence for improvement\"\n",
    "            else:\n",
    "                recommendation = \"Small significant difference - weak evidence for improvement\"\n",
    "        else:\n",
    "            recommendation = \"No significant difference - insufficient evidence for improvement\"\n",
    "        \n",
    "        return StatisticalValidation(\n",
    "            is_significant=is_significant,\n",
    "            p_value=p_value,\n",
    "            effect_size=effect_size,\n",
    "            confidence_interval=(ci_lower, ci_upper),\n",
    "            recommendation=recommendation\n",
    "        )\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def validate_benchmark_result(self, result: BenchmarkResult, \n",
    "                                 min_samples: int = 100) -> StatisticalValidation:\n",
    "        \"\"\"\n",
    "        Validate that a benchmark result has sufficient statistical power.\n",
    "        \n",
    "        TODO: Implement validation for single benchmark result.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Check if we have enough samples\n",
    "        2. Calculate confidence interval for the metric\n",
    "        3. Check for common pitfalls (outliers, etc.)\n",
    "        4. Provide recommendations\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        latencies = result.latencies\n",
    "        n = len(latencies)\n",
    "        \n",
    "        if n < min_samples:\n",
    "            return StatisticalValidation(\n",
    "                is_significant=False,\n",
    "                p_value=1.0,\n",
    "                effect_size=0.0,\n",
    "                confidence_interval=(0.0, 0.0),\n",
    "                recommendation=f\"Insufficient samples: {n} < {min_samples}. Need more data.\"\n",
    "            )\n",
    "        \n",
    "        # Calculate confidence interval for mean latency\n",
    "        mean_latency = statistics.mean(latencies)\n",
    "        std_latency = statistics.stdev(latencies)\n",
    "        standard_error = std_latency / math.sqrt(n)\n",
    "        \n",
    "        critical_value = 1.96  # 95% CI\n",
    "        margin_of_error = critical_value * standard_error\n",
    "        ci_lower = mean_latency - margin_of_error\n",
    "        ci_upper = mean_latency + margin_of_error\n",
    "        \n",
    "        # Check for outliers (simple check)\n",
    "        q1 = latencies[int(0.25 * n)]\n",
    "        q3 = latencies[int(0.75 * n)]\n",
    "        iqr = q3 - q1\n",
    "        outlier_threshold = q3 + 1.5 * iqr\n",
    "        outliers = [l for l in latencies if l > outlier_threshold]\n",
    "        \n",
    "        if len(outliers) > 0.1 * n:  # More than 10% outliers\n",
    "            recommendation = f\"Warning: {len(outliers)} outliers detected. Results may be unreliable.\"\n",
    "        else:\n",
    "            recommendation = \"Benchmark result appears statistically valid.\"\n",
    "        \n",
    "        return StatisticalValidation(\n",
    "            is_significant=True,\n",
    "            p_value=0.0,  # Not applicable for single result\n",
    "            effect_size=std_latency / mean_latency,  # Coefficient of variation\n",
    "            confidence_interval=(ci_lower, ci_upper),\n",
    "            recommendation=recommendation\n",
    "        )\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")"
   ]
  },
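  {
   "cell_type": "markdown",
   "id": "7a3e51c0",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The single-result validation combines a normal-approximation confidence interval with an IQR outlier fence. Quartiles are positional statistics, so they only make sense when read from a sorted copy of the samples. The snippet below is a minimal standalone sketch of both steps, not part of the graded solution; `summarize_latencies` and the sample values are hypothetical and chosen only for illustration.\n",
    "\n",
    "```python\n",
    "import math\n",
    "import statistics\n",
    "\n",
    "def summarize_latencies(latencies, z=1.96):\n",
    "    # Quartiles are positional, so work on a sorted copy.\n",
    "    ordered = sorted(latencies)\n",
    "    n = len(ordered)\n",
    "    mean = statistics.mean(ordered)\n",
    "    # Standard error of the mean; z=1.96 gives a ~95% normal-approximation CI.\n",
    "    sem = statistics.stdev(ordered) / math.sqrt(n)\n",
    "    q1 = ordered[int(0.25 * n)]\n",
    "    q3 = ordered[int(0.75 * n)]\n",
    "    # Upper Tukey fence: anything above it is flagged as a slow outlier.\n",
    "    fence = q3 + 1.5 * (q3 - q1)\n",
    "    outliers = [x for x in ordered if x > fence]\n",
    "    return {'mean': mean, 'ci': (mean - z * sem, mean + z * sem), 'outliers': outliers}\n",
    "\n",
    "# Hypothetical, deliberately unsorted samples: one slow request among fast ones.\n",
    "samples = [0.010, 0.012, 0.011, 0.250, 0.009, 0.010, 0.011, 0.012]\n",
    "summary = summarize_latencies(samples)\n",
    "print(summary['outliers'])\n",
    "```\n",
    "\n",
    "The 1.96 multiplier assumes the sample mean is approximately normal, which is reasonable at the 100-sample minimum used above; for much smaller samples a Student t critical value would be the safer choice."
   ]
  },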
  {
   "cell_type": "markdown",
   "id": "bb17c05a",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Unit Test: Statistical Validation\n",
    "\n",
    "Let's test our statistical validation with simulated data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d66a905",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "test-validation",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_statistical_validation():\n",
    "    \"\"\"Test statistical validation functionality.\"\"\"\n",
    "    print(\"🔬 Unit Test: Statistical Validation...\")\n",
    "    \n",
    "    validator = StatisticalValidator(confidence_level=0.95)\n",
    "    \n",
    "    # Test 1: No significant difference\n",
    "    results_a = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
    "    results_b = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
    "    \n",
    "    validation = validator.validate_comparison(results_a, results_b)\n",
    "    print(f\"✅ No difference test: significant={validation.is_significant}, p={validation.p_value:.4f}\")\n",
    "    \n",
    "    # Test 2: Clear significant difference\n",
    "    results_a = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
    "    results_b = [0.2 + 0.01 * np.random.randn() for _ in range(100)]\n",
    "    \n",
    "    validation = validator.validate_comparison(results_a, results_b)\n",
    "    print(f\"✅ Clear difference test: significant={validation.is_significant}, p={validation.p_value:.4f}\")\n",
    "    print(f\"    Effect size: {validation.effect_size:.3f}\")\n",
    "    print(f\"    Recommendation: {validation.recommendation}\")\n",
    "    \n",
    "    # Test 3: Single result validation\n",
    "    mock_result = BenchmarkResult(\n",
    "        scenario=BenchmarkScenario.SINGLE_STREAM,\n",
    "        latencies=[0.1 + 0.01 * np.random.randn() for _ in range(200)],\n",
    "        throughput=1000,\n",
    "        accuracy=0.95\n",
    "    )\n",
    "    \n",
    "    validation = validator.validate_benchmark_result(mock_result)\n",
    "    print(f\"✅ Single result validation: {validation.recommendation}\")\n",
    "    print(f\"    Confidence interval: ({validation.confidence_interval[0]:.4f}, {validation.confidence_interval[1]:.4f})\")\n",
    "    \n",
    "    print(\"✅ Statistical validation tests passed!\")\n",
    "\n",
    "# Run the test\n",
    "test_statistical_validation()"
   ]
  },
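  {
   "cell_type": "markdown",
   "id": "c84d2f19",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The comparison test prints an effect size alongside the p-value. One widely used definition is Cohen's d: the difference in means divided by the pooled standard deviation. Whether `validate_comparison` uses exactly this formula is an assumption here; `cohens_d` and the two sample lists are hypothetical and only illustrate the idea.\n",
    "\n",
    "```python\n",
    "import math\n",
    "import statistics\n",
    "\n",
    "def cohens_d(a, b):\n",
    "    # Pool the two sample variances, weighting each by its degrees of freedom.\n",
    "    na, nb = len(a), len(b)\n",
    "    va, vb = statistics.variance(a), statistics.variance(b)\n",
    "    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))\n",
    "    # Positive d means b has the larger mean (here: b is slower).\n",
    "    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd\n",
    "\n",
    "# Hypothetical latencies in seconds for a fast and a slow model.\n",
    "fast = [0.10, 0.11, 0.09, 0.10, 0.10]\n",
    "slow = [0.20, 0.21, 0.19, 0.20, 0.20]\n",
    "d = cohens_d(fast, slow)\n",
    "print(round(d, 2))\n",
    "```\n",
    "\n",
    "By Cohen's rule of thumb, values of |d| near 0.2, 0.5, and 0.8 are read as small, medium, and large effects. A p-value says whether a difference is likely real; the effect size says whether it is big enough to matter."
   ]
  },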
  {
   "cell_type": "markdown",
   "id": "42c283a3",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 4: The TinyTorchPerf Framework - Putting It All Together\n",
    "\n",
    "### The Complete MLPerf-Inspired Framework\n",
    "Now we combine all components into a professional benchmarking framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb8d0fe2",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "tinytorch-perf",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class TinyTorchPerf:\n",
    "    \"\"\"\n",
    "    Complete MLPerf-inspired benchmarking framework for TinyTorch.\n",
    "    \n",
    "    TODO: Implement the complete benchmarking framework.\n",
    "    \n",
    "    UNDERSTANDING THE FRAMEWORK:\n",
    "    1. Combines all benchmark scenarios\n",
    "    2. Integrates statistical validation\n",
    "    3. Provides easy-to-use API\n",
    "    4. Generates professional reports\n",
    "    \n",
    "    IMPLEMENTATION APPROACH:\n",
    "    1. Initialize with model and dataset\n",
    "    2. Provide methods for each scenario\n",
    "    3. Include statistical validation\n",
    "    4. Generate comprehensive reports\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.scenarios = BenchmarkScenarios()\n",
    "        self.validator = StatisticalValidator()\n",
    "        self.model = None\n",
    "        self.dataset = None\n",
    "        self.results = {}\n",
    "    \n",
    "    def set_model(self, model: Callable):\n",
    "        \"\"\"Set the model to benchmark.\"\"\"\n",
    "        self.model = model\n",
    "    \n",
    "    def set_dataset(self, dataset: List):\n",
    "        \"\"\"Set the dataset for benchmarking.\"\"\"\n",
    "        self.dataset = dataset\n",
    "    \n",
    "    def run_single_stream(self, num_queries: int = 1000) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run single-stream benchmark.\n",
    "        \n",
    "        TODO: Implement single-stream benchmark with validation.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Check that model and dataset are set\n",
    "        2. Run single-stream scenario\n",
    "        3. Validate results statistically\n",
    "        4. Store results\n",
    "        5. Return result\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.model is None or self.dataset is None:\n",
    "            raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
    "        \n",
    "        result = self.scenarios.single_stream(self.model, self.dataset, num_queries)\n",
    "        validation = self.validator.validate_benchmark_result(result)\n",
    "        \n",
    "        self.results['single_stream'] = {\n",
    "            'result': result,\n",
    "            'validation': validation\n",
    "        }\n",
    "        \n",
    "        return result\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def run_server(self, target_qps: float = 10.0, duration: float = 60.0) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run server benchmark.\n",
    "        \n",
    "        TODO: Implement server benchmark with validation.\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.model is None or self.dataset is None:\n",
    "            raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
    "        \n",
    "        result = self.scenarios.server(self.model, self.dataset, target_qps, duration)\n",
    "        validation = self.validator.validate_benchmark_result(result)\n",
    "        \n",
    "        self.results['server'] = {\n",
    "            'result': result,\n",
    "            'validation': validation\n",
    "        }\n",
    "        \n",
    "        return result\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def run_offline(self, batch_size: int = 32) -> BenchmarkResult:\n",
    "        \"\"\"\n",
    "        Run offline benchmark.\n",
    "        \n",
    "        TODO: Implement offline benchmark with validation.\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.model is None or self.dataset is None:\n",
    "            raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
    "        \n",
    "        result = self.scenarios.offline(self.model, self.dataset, batch_size)\n",
    "        validation = self.validator.validate_benchmark_result(result)\n",
    "        \n",
    "        self.results['offline'] = {\n",
    "            'result': result,\n",
    "            'validation': validation\n",
    "        }\n",
    "        \n",
    "        return result\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def run_all_scenarios(self, quick_test: bool = False) -> Dict[str, BenchmarkResult]:\n",
    "        \"\"\"\n",
    "        Run all benchmark scenarios.\n",
    "        \n",
    "        TODO: Implement comprehensive benchmarking.\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if quick_test:\n",
    "            # Quick test with smaller parameters\n",
    "            single_result = self.run_single_stream(num_queries=100)\n",
    "            server_result = self.run_server(target_qps=5.0, duration=10.0)\n",
    "            offline_result = self.run_offline(batch_size=16)\n",
    "        else:\n",
    "            # Full benchmarking\n",
    "            single_result = self.run_single_stream(num_queries=1000)\n",
    "            server_result = self.run_server(target_qps=10.0, duration=60.0)\n",
    "            offline_result = self.run_offline(batch_size=32)\n",
    "        \n",
    "        return {\n",
    "            'single_stream': single_result,\n",
    "            'server': server_result,\n",
    "            'offline': offline_result\n",
    "        }\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def compare_models(self, model_a: Callable, model_b: Callable, \n",
    "                      scenario: str = 'single_stream') -> StatisticalValidation:\n",
    "        \"\"\"\n",
    "        Compare two models statistically.\n",
    "        \n",
    "        TODO: Implement model comparison.\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Run both models on the same scenario\n",
    "        self.set_model(model_a)\n",
    "        if scenario == 'single_stream':\n",
    "            result_a = self.run_single_stream(num_queries=100)\n",
    "        elif scenario == 'server':\n",
    "            result_a = self.run_server(target_qps=5.0, duration=10.0)\n",
    "        else:  # offline\n",
    "            result_a = self.run_offline(batch_size=16)\n",
    "        \n",
    "        self.set_model(model_b)\n",
    "        if scenario == 'single_stream':\n",
    "            result_b = self.run_single_stream(num_queries=100)\n",
    "        elif scenario == 'server':\n",
    "            result_b = self.run_server(target_qps=5.0, duration=10.0)\n",
    "        else:  # offline\n",
    "            result_b = self.run_offline(batch_size=16)\n",
    "        \n",
    "        # Compare latencies\n",
    "        return self.validator.validate_comparison(result_a.latencies, result_b.latencies)\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def generate_report(self) -> str:\n",
    "        \"\"\"\n",
    "        Generate a comprehensive benchmark report.\n",
    "        \n",
    "        TODO: Implement professional report generation.\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        report = \"# TinyTorch Benchmark Report\\n\\n\"\n",
    "        \n",
    "        for scenario_name, scenario_data in self.results.items():\n",
    "            result = scenario_data['result']\n",
    "            validation = scenario_data['validation']\n",
    "            \n",
    "            report += f\"## {scenario_name.replace('_', ' ').title()} Scenario\\n\\n\"\n",
    "            report += f\"- **Throughput**: {result.throughput:.2f} samples/second\\n\"\n",
    "            report += f\"- **Mean Latency**: {statistics.mean(result.latencies)*1000:.2f} ms\\n\"\n",
    "            report += f\"- **90th Percentile**: {result.latencies[int(0.9*len(result.latencies))]*1000:.2f} ms\\n\"\n",
    "            report += f\"- **95th Percentile**: {result.latencies[int(0.95*len(result.latencies))]*1000:.2f} ms\\n\"\n",
    "            report += f\"- **Statistical Validation**: {validation.recommendation}\\n\\n\"\n",
    "        \n",
    "        return report\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")"
   ]
  },
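  {
   "cell_type": "markdown",
   "id": "e5b09d63",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The report reads tail latencies by indexing into the list of measurements. That lookup returns a percentile only when the list is sorted, because latencies are recorded in arrival order. A small helper keeps the sort and the index clamp in one place; `percentile` and the sample data below are hypothetical, a sketch rather than the framework's API.\n",
    "\n",
    "```python\n",
    "def percentile(values, pct):\n",
    "    # Positional lookup on a sorted copy; pct is an integer in [0, 100].\n",
    "    ordered = sorted(values)\n",
    "    # Clamp so pct=100 maps to the last element instead of running off the end.\n",
    "    idx = min(pct * len(ordered) // 100, len(ordered) - 1)\n",
    "    return ordered[idx]\n",
    "\n",
    "# Hypothetical latencies in seconds, in the order they were measured.\n",
    "latencies = [0.030, 0.010, 0.020, 0.050, 0.040, 0.100, 0.060, 0.080, 0.070, 0.090]\n",
    "p50 = percentile(latencies, 50)\n",
    "p90 = percentile(latencies, 90)\n",
    "print(p50, p90)\n",
    "```\n",
    "\n",
    "Reading index 9 of the unsorted list would return 0.090 instead of the true 90th-percentile value of 0.100, which is the kind of silent error the sort prevents."
   ]
  },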
  {
   "cell_type": "markdown",
   "id": "c27eb526",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Unit Test: TinyTorchPerf Framework\n",
    "\n",
    "Let's test our complete benchmarking framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "469576f9",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "test-framework",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_tinytorch_perf():\n",
    "    \"\"\"Test the complete TinyTorchPerf framework.\"\"\"\n",
    "    print(\"🔬 Unit Test: TinyTorchPerf Framework...\")\n",
    "    \n",
    "    # Create test model and dataset\n",
    "    def test_model(sample):\n",
    "        time.sleep(0.001)  # Simulate processing\n",
    "        return {\"prediction\": np.random.rand(5)}\n",
    "    \n",
    "    test_dataset = [{\"data\": np.random.rand(10)} for _ in range(50)]\n",
    "    \n",
    "    # Test the framework\n",
    "    benchmark = TinyTorchPerf()\n",
    "    benchmark.set_model(test_model)\n",
    "    benchmark.set_dataset(test_dataset)\n",
    "    \n",
    "    # Test individual scenarios\n",
    "    single_result = benchmark.run_single_stream(num_queries=20)\n",
    "    assert single_result.scenario == BenchmarkScenario.SINGLE_STREAM\n",
    "    print(f\"✅ Single-stream: {single_result.throughput:.2f} samples/sec\")\n",
    "    \n",
    "    server_result = benchmark.run_server(target_qps=5.0, duration=2.0)\n",
    "    assert server_result.scenario == BenchmarkScenario.SERVER\n",
    "    print(f\"✅ Server: {server_result.throughput:.2f} QPS\")\n",
    "    \n",
    "    offline_result = benchmark.run_offline(batch_size=10)\n",
    "    assert offline_result.scenario == BenchmarkScenario.OFFLINE\n",
    "    print(f\"✅ Offline: {offline_result.throughput:.2f} samples/sec\")\n",
    "    \n",
    "    # Test comprehensive benchmarking\n",
    "    all_results = benchmark.run_all_scenarios(quick_test=True)\n",
    "    assert len(all_results) == 3\n",
    "    print(f\"✅ All scenarios: {list(all_results.keys())}\")\n",
    "    \n",
    "    # Test model comparison\n",
    "    def slower_model(sample):\n",
    "        time.sleep(0.002)  # Twice as slow\n",
    "        return {\"prediction\": np.random.rand(5)}\n",
    "    \n",
    "    comparison = benchmark.compare_models(test_model, slower_model)\n",
    "    print(f\"✅ Model comparison: {comparison.recommendation}\")\n",
    "    \n",
    "    # Test report generation\n",
    "    report = benchmark.generate_report()\n",
    "    assert \"TinyTorch Benchmark Report\" in report\n",
    "    print(\"✅ Report generation working\")\n",
    "    \n",
    "    print(\"✅ Complete TinyTorchPerf framework working!\")\n",
    "\n",
    "# Run the test\n",
    "test_tinytorch_perf()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb9212b3",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 5: Professional Reporting - Project-Ready Results\n",
    "\n",
    "### Why Professional Reports Matter\n",
    "Your ML projects need:\n",
    "- **Clear performance metrics** for presentations\n",
    "- **Statistical validation** for credibility\n",
    "- **Comparison baselines** for context\n",
    "- **Professional formatting** for academic/industry standards"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f60ffb3",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "performance-reporter",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class PerformanceReporter:\n",
    "    \"\"\"\n",
    "    Generates professional performance reports for ML projects.\n",
    "    \n",
    "    TODO: Implement professional report generation.\n",
    "    \n",
    "    UNDERSTANDING PROFESSIONAL REPORTS:\n",
    "    1. Executive summary with key metrics\n",
    "    2. Detailed methodology section\n",
    "    3. Statistical validation results\n",
    "    4. Comparison with baselines\n",
    "    5. Recommendations for improvement\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.reports = []\n",
    "    \n",
    "    def generate_project_report(self, benchmark_results: Dict[str, BenchmarkResult], \n",
    "                               model_name: str = \"TinyTorch Model\") -> str:\n",
    "        \"\"\"\n",
    "        Generate a professional performance report for ML projects.\n",
    "        \n",
    "        TODO: Implement project report generation.\n",
    "        \n",
    "        STEP-BY-STEP:\n",
    "        1. Create executive summary\n",
    "        2. Add methodology section\n",
    "        3. Present detailed results\n",
    "        4. Include statistical validation\n",
    "        5. Add recommendations\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        report = f\"\"\"# {model_name} Performance Report\n",
    "\n",
    "## Executive Summary\n",
    "\n",
    "This report presents comprehensive performance benchmarking results for {model_name} using MLPerf-inspired methodology. The evaluation covers three standard scenarios: single-stream (latency), server (throughput), and offline (batch processing).\n",
    "\n",
    "### Key Findings\n",
    "\"\"\"\n",
    "        \n",
    "        # Add key metrics\n",
    "        for scenario_name, result in benchmark_results.items():\n",
    "            mean_latency = statistics.mean(result.latencies) * 1000\n",
    "            p90_latency = result.latencies[int(0.9 * len(result.latencies))] * 1000\n",
    "            \n",
    "            report += f\"- **{scenario_name.replace('_', ' ').title()}**: {result.throughput:.2f} samples/sec, \"\n",
    "            report += f\"{mean_latency:.2f}ms mean latency, {p90_latency:.2f}ms 90th percentile\\n\"\n",
    "        \n",
    "        report += \"\"\"\n",
    "## Methodology\n",
    "\n",
    "### Benchmark Framework\n",
    "- **Architecture**: MLPerf-inspired four-component system\n",
    "- **Scenarios**: Single-stream, server, and offline evaluation\n",
    "- **Statistical Validation**: Multiple runs with confidence intervals\n",
    "- **Metrics**: Latency distribution, throughput, accuracy\n",
    "\n",
    "### Test Environment\n",
    "- **Hardware**: Standard development machine\n",
    "- **Software**: TinyTorch framework\n",
    "- **Dataset**: Standardized evaluation dataset\n",
    "- **Validation**: Statistical significance testing\n",
    "\n",
    "## Detailed Results\n",
    "\n",
    "\"\"\"\n",
    "        \n",
    "        # Add detailed results for each scenario\n",
    "        for scenario_name, result in benchmark_results.items():\n",
    "            report += f\"### {scenario_name.replace('_', ' ').title()} Scenario\\n\\n\"\n",
    "            \n",
    "            latencies_ms = [l * 1000 for l in result.latencies]\n",
    "            \n",
    "            report += f\"- **Sample Count**: {len(result.latencies)}\\n\"\n",
    "            report += f\"- **Mean Latency**: {statistics.mean(latencies_ms):.2f} ms\\n\"\n",
    "            report += f\"- **Median Latency**: {statistics.median(latencies_ms):.2f} ms\\n\"\n",
    "            report += f\"- **90th Percentile**: {latencies_ms[int(0.9 * len(latencies_ms))]:.2f} ms\\n\"\n",
    "            report += f\"- **95th Percentile**: {latencies_ms[int(0.95 * len(latencies_ms))]:.2f} ms\\n\"\n",
    "            report += f\"- **Standard Deviation**: {statistics.stdev(latencies_ms):.2f} ms\\n\"\n",
    "            report += f\"- **Throughput**: {result.throughput:.2f} samples/second\\n\"\n",
    "            \n",
    "            if result.accuracy > 0:\n",
    "                report += f\"- **Accuracy**: {result.accuracy:.4f}\\n\"\n",
    "            \n",
    "            report += \"\\n\"\n",
    "        \n",
    "        report += \"\"\"## Statistical Validation\n",
    "\n",
    "All results include proper statistical validation:\n",
    "- Multiple independent runs for reliability\n",
    "- Confidence intervals for key metrics\n",
    "- Outlier detection and handling\n",
    "- Significance testing for comparisons\n",
    "\n",
    "## Recommendations\n",
    "\n",
    "Based on the benchmark results:\n",
    "1. **Performance Characteristics**: Model shows consistent performance across scenarios\n",
    "2. **Optimization Opportunities**: Focus on reducing tail latency for production deployment\n",
    "3. **Scalability**: Server scenario results indicate good potential for production scaling\n",
    "4. **Further Testing**: Consider testing with larger datasets and different hardware configurations\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "This comprehensive benchmarking demonstrates {model_name}'s performance characteristics using industry-standard methodology. The results provide a solid foundation for production deployment decisions and further optimization efforts.\n",
    "\"\"\"\n",
    "        \n",
    "        return report\n",
    "        ### END SOLUTION\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "    \n",
    "    def save_report(self, report: str, filename: str = \"benchmark_report.md\"):\n",
    "        \"\"\"Save report to file.\"\"\"\n",
    "        with open(filename, 'w') as f:\n",
    "            f.write(report)\n",
    "        print(f\"📄 Report saved to {filename}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c16121e",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Unit Test: Performance Reporter\n",
    "\n",
    "Let's test our professional reporting system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bb183d2",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "test-reporter",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_performance_reporter():\n",
    "    \"\"\"Test the performance reporter.\"\"\"\n",
    "    print(\"🔬 Unit Test: Performance Reporter...\")\n",
    "    \n",
    "    # Create mock benchmark results\n",
    "    mock_results = {\n",
    "        'single_stream': BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.SINGLE_STREAM,\n",
    "            latencies=[0.01 + 0.002 * np.random.randn() for _ in range(100)],\n",
    "            throughput=95.0,\n",
    "            accuracy=0.942\n",
    "        ),\n",
    "        'server': BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.SERVER,\n",
    "            latencies=[0.012 + 0.003 * np.random.randn() for _ in range(150)],\n",
    "            throughput=87.0,\n",
    "            accuracy=0.938\n",
    "        ),\n",
    "        'offline': BenchmarkResult(\n",
    "            scenario=BenchmarkScenario.OFFLINE,\n",
    "            latencies=[0.008 + 0.001 * np.random.randn() for _ in range(50)],\n",
    "            throughput=120.0,\n",
    "            accuracy=0.945\n",
    "        )\n",
    "    }\n",
    "    \n",
    "    # Test report generation\n",
    "    reporter = PerformanceReporter()\n",
    "    report = reporter.generate_project_report(mock_results, \"My Project Model\")\n",
    "    \n",
    "    # Verify report content\n",
    "    assert \"Performance Report\" in report\n",
    "    assert \"Executive Summary\" in report\n",
    "    assert \"Methodology\" in report\n",
    "    assert \"Detailed Results\" in report\n",
    "    assert \"Statistical Validation\" in report\n",
    "    assert \"Recommendations\" in report\n",
    "    \n",
    "    print(\"✅ Report generated successfully\")\n",
    "    print(f\"✅ Report length: {len(report)} characters\")\n",
    "    print(f\"✅ Contains all required sections\")\n",
    "    \n",
    "    # Test saving\n",
    "    reporter.save_report(report, \"test_report.md\")\n",
    "    print(\"✅ Report saving working\")\n",
    "    \n",
    "    print(\"✅ Performance reporter tests passed!\")\n",
    "\n",
    "# Run the test\n",
    "test_performance_reporter()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2f20c6c",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Comprehensive Integration Test\n",
    "\n",
    "Let's test everything together with a realistic TinyTorch model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2755c20",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "integration-test",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_comprehensive_benchmarking():\n",
    "    \"\"\"Test the complete benchmarking system with a realistic model.\"\"\"\n",
    "    print(\"🔬 Comprehensive Integration Test...\")\n",
    "    \n",
    "    # Create a realistic TinyTorch model\n",
    "    def create_simple_model():\n",
    "        \"\"\"Create a simple classification model for testing.\"\"\"\n",
    "        def model(sample):\n",
    "            # Simulate a simple neural network\n",
    "            x = np.array(sample['data'])\n",
    "            \n",
    "            # Layer 1: 10 -> 5\n",
    "            W1 = np.random.randn(10, 5) * 0.1\n",
    "            b1 = np.zeros(5)\n",
    "            h1 = np.maximum(0, x @ W1 + b1)  # ReLU\n",
    "            \n",
    "            # Layer 2: 5 -> 3\n",
    "            W2 = np.random.randn(5, 3) * 0.1\n",
    "            b2 = np.zeros(3)\n",
    "            output = h1 @ W2 + b2\n",
    "            \n",
    "            # Simulate some processing time\n",
    "            time.sleep(0.001)\n",
    "            \n",
    "            return {\"prediction\": output}\n",
    "        \n",
    "        return model\n",
    "    \n",
    "    # Create test dataset\n",
    "    test_dataset = []\n",
    "    for i in range(100):\n",
    "        sample = {\n",
    "            'data': np.random.randn(10),\n",
    "            'target': np.random.randint(0, 3)\n",
    "        }\n",
    "        test_dataset.append(sample)\n",
    "    \n",
    "    # Test complete workflow\n",
    "    model = create_simple_model()\n",
    "    \n",
    "    # 1. Run comprehensive benchmarking\n",
    "    benchmark = TinyTorchPerf()\n",
    "    benchmark.set_model(model)\n",
    "    benchmark.set_dataset(test_dataset)\n",
    "    \n",
    "    print(\"📊 Running comprehensive benchmarking...\")\n",
    "    all_results = benchmark.run_all_scenarios(quick_test=True)\n",
    "    \n",
    "    # 2. Generate professional report\n",
    "    reporter = PerformanceReporter()\n",
    "    report = reporter.generate_project_report(all_results, \"TinyTorch CNN Model\")\n",
    "    \n",
    "    # 3. Validate results\n",
    "    for scenario_name, result in all_results.items():\n",
    "        assert result.throughput > 0, f\"{scenario_name} should have positive throughput\"\n",
    "        assert len(result.latencies) > 0, f\"{scenario_name} should have latency measurements\"\n",
    "        print(f\"✅ {scenario_name}: {result.throughput:.2f} samples/sec\")\n",
    "    \n",
    "    # 4. Test model comparison\n",
    "    def create_slower_model():\n",
    "        \"\"\"Create a slower model for comparison.\"\"\"\n",
    "        def model(sample):\n",
    "            x = np.array(sample['data'])\n",
    "            W1 = np.random.randn(10, 5) * 0.1\n",
    "            b1 = np.zeros(5)\n",
    "            h1 = np.maximum(0, x @ W1 + b1)\n",
    "            \n",
    "            W2 = np.random.randn(5, 3) * 0.1\n",
    "            b2 = np.zeros(3)\n",
    "            output = h1 @ W2 + b2\n",
    "            \n",
    "            time.sleep(0.002)  # Slower\n",
    "            return {\"prediction\": output}\n",
    "        \n",
    "        return model\n",
    "    \n",
    "    slower_model = create_slower_model()\n",
    "    comparison = benchmark.compare_models(model, slower_model)\n",
    "    print(f\"✅ Model comparison: {comparison.recommendation}\")\n",
    "    \n",
    "    # 5. Test report quality\n",
    "    assert len(report) > 1000, \"Report should be comprehensive\"\n",
    "    print(f\"✅ Generated {len(report)} character report\")\n",
    "    \n",
    "    print(\"✅ Comprehensive integration test passed!\")\n",
    "    print(\"🎉 Complete benchmarking system working!\")\n",
    "\n",
    "# Run the comprehensive test\n",
    "test_comprehensive_benchmarking()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7e7df72",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🧪 Module Testing\n",
    "\n",
    "Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n",
    "\n",
    "**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730159c8",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "standardized-testing",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# =============================================================================\n",
    "# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n",
    "# This cell is locked to ensure consistent testing across all TinyTorch modules\n",
    "# =============================================================================\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    from tito.tools.testing import run_module_tests_auto\n",
    "    \n",
    "    # Automatically discover and run all tests in this module\n",
    "    success = run_module_tests_auto(\"Benchmarking\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05e49926",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 Module Summary: Systematic ML Performance Evaluation\n",
    "\n",
    "### What You've Built\n",
    "You've implemented a comprehensive MLPerf-inspired benchmarking framework:\n",
    "\n",
    "1. **Benchmark Scenarios**: Single-stream (latency), server (throughput), and offline (batch processing)\n",
    "2. **Statistical Validation**: Confidence intervals, significance testing, and effect size calculation\n",
    "3. **MLPerf Architecture**: Four-component system with load generator, model, dataset, and evaluation\n",
    "4. **Professional Reporting**: Generate conference-quality performance reports with proper methodology\n",
    "5. **Model Comparison**: Systematic comparison framework with statistical validation\n",
    "\n",
    "### Key Insights\n",
    "- **Systematic evaluation beats intuition**: Proper benchmarking reveals true performance characteristics\n",
    "- **Statistics matter**: Single measurements are meaningless; confidence intervals provide real insights\n",
    "- **Scenarios capture reality**: Different use cases (mobile, server, batch) require different metrics\n",
    "- **Reproducibility is crucial**: Others must be able to verify your results\n",
    "- **Professional presentation**: Clear methodology and statistical validation build credibility\n",
    "\n",
    "### Real-World Connections\n",
    "- **MLPerf**: Uses identical four-component architecture and scenario patterns\n",
    "- **Production systems**: A/B testing frameworks follow these statistical principles\n",
    "- **Research papers**: Proper experimental methodology is required for publication\n",
    "- **ML engineering**: Systematic evaluation prevents costly production mistakes\n",
    "- **Open source**: Contributing benchmarks to libraries like PyTorch and TensorFlow\n",
    "\n",
    "### Next Steps\n",
    "In real ML systems, you'd:\n",
    "1. **GPU benchmarking**: Extend to CUDA/OpenCL performance measurement\n",
    "2. **Distributed evaluation**: Scale benchmarking across multiple machines\n",
    "3. **Continuous monitoring**: Integrate with CI/CD pipelines for regression detection\n",
    "4. **Domain-specific metrics**: Develop specialized benchmarks for your problem domain\n",
    "5. **Hardware optimization**: Evaluate performance across different architectures\n",
    "\n",
    "### 🏆 Achievement Unlocked\n",
    "You've mastered systematic ML evaluation using industry-standard methodology. You understand how to design proper experiments, validate results statistically, and present findings professionally!\n",
    "\n",
    "**You've completed the TinyTorch Benchmarking module!** 🎉"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}