---
quiz: benchmarking_quizzes.json
concepts: benchmarking_concepts.yml
glossary: benchmarking_glossary.json
engine: jupyter
---

# Benchmarking {#sec-benchmarking}

```{python}
#| echo: false
#| label: chapter-start
from mlsys.registry import start_chapter

start_chapter("vol1:benchmarking")
```

::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::

\noindent
![](images/png/cover_ai_benchmarking.png){fig-alt="Olympic podium scene with AI processor chips as medal winners on circuit-board styled pedestals, featuring gold, silver, and bronze medals with AI Olympics banners in the background."}

:::

## Purpose {.unnumbered}

\begin{marginfigure}
\mlsysstack{75}{15}{25}{30}{35}{45}{10}{15}
\end{marginfigure}

_How do you systematically compare ML systems when hardware, models, data, and deployment conditions all interact?_

Every preceding chapter introduced decisions with measurable consequences: which hardware to target, how to compress a model, what data to select, how to serve predictions. Each decision improved one dimension (latency, accuracy, throughput, or energy), but an ML system is the product of all these dimensions simultaneously. A pruned model runs faster on one accelerator but slower on another. A larger batch size improves accelerator utilization but violates a latency SLA. An edge device advertises peak throughput that thermal throttling halves under sustained workloads. The challenge is not whether individual optimizations work in isolation (they do) but how to *measure their combined effect* under conditions that actually matter. Benchmarking is the discipline of making such comparisons systematic rather than anecdotal. It requires defining *what* to measure (accuracy, latency, throughput, energy), *at what granularity* (a single kernel, a full model, an end-to-end pipeline), and *under which conditions* (batch size, input distribution, thermal state, concurrent load). Without this structure, teams compare numbers that were never measured on the same terms, and decisions that looked sound in a spreadsheet collapse under production workloads. The previous chapters optimized the model, selected the data, and matched the hardware; benchmarking is where those optimizations are *validated*: where claims meet evidence, and where the gap between what was promised and what will be delivered is either quantified honestly or discovered painfully in production.

::: {.content-visible when-format="pdf"}
\newpage
:::

::: {.callout-tip title="Learning Objectives"}

- Explain how the **three-dimensional benchmarking framework** (system, model, data) addresses distinct ML evaluation requirements, and identify failure modes that emerge from their interdependence
- Compare training and inference benchmarking approaches through their distinct metrics, workloads, and performance characteristics, including **tail latency** and **end-to-end measurement**
- Select appropriate benchmark granularity levels (micro, macro, end-to-end) based on optimization objectives and development phase
- Apply **MLPerf** standards to evaluate ML systems across training, inference, and power dimensions
- Design benchmark protocols with standardized datasets, metrics, and evaluation procedures that ensure reproducible results
- Implement power measurement techniques that define system boundaries and enable standardized energy efficiency comparisons
- Evaluate model compression trade-offs using multi-dimensional metrics (accuracy, calibration, edge-case robustness) beyond top-line accuracy
- Distinguish laboratory benchmark results from production validation requirements, identifying failure modes (**data drift**, dynamic workloads, **silent degradation**) that controlled benchmarks cannot capture
- Critique benchmark limitations including statistical issues and deployment gaps

:::

```{python}
#| label: benchmarking-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BENCHMARKING SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Chapter-wide constants used across all benchmarking sections,
# │          callouts, and worked examples
# │
# │ Goal: Centralize hardware and model parameters for the entire chapter.
# │ Show: A single source of truth for A100, ResNet, and BERT specs.
# │ How: Retrieve constants from mlsys.constants and Digital Twins.
# │
# │ Imports: mlsys (Hardware, Models), mlsys.constants (*),
# │          mlsys.formatting (fmt_percent, fmt, sci)
# │ Exports: a100_tflops_fp16_str, a100_tflops_fp32_str, a100_bw_gbs_str,
# │          a100_bw_tbs_str, h100_tflops_fp16_str, h100_tflops_fp8_str,
# │          h100_tflops_int8_str, h100_bw_tbs_str, a100_ridge_str,
# │          gpt3_params_b_str, gpt3_params_billion_str, gpt3_tokens_b_str,
# │          h100_tdp_str, energy_fp32_str, energy_fp16_str, energy_int8_str,
# │          energy_dram_str, energy_reg_str, energy_l1_str, energy_l2_str,
# │          dram_energy_pj_str, bert_params_m_str, bert_large_params_m_str,
# │          mobilenet_params_m_str, mobilenet_v1_size_mb_str,
# │          mobilenet_v1_int8_size_mb_str, mobilenet_v1_compression_ratio_str,
# │          mobilenet_flops_ratio_str, edgetpu_latency_ms_str,
# │          cpu_latency_ms_str, resnet50_params_m_str, v100_tflops_fp32_str,
# │          anomaly_params_k_str, anomaly_latency_ms_str, anomaly_auc_str,
# │          anomaly_energy_uj_str, gpt3_energy_mwh_str, accel_power_w_str,
# │          latency_fast_ms_str, latency_slow_ms_str, energy_fast_j_str,
# │          energy_fast_wh_str, energy_slow_wh_str
# │
# │ Note: This is a chapter-wide setup cell. Several exports are used >150 lines
# │       away by design — they are chapter-level constants, not section-local
# │       values. Distant consumers: anomaly_* (~line 1427), gpt3_tokens_b_str
# │       (~line 1609), accel_power_w_str/energy_*_wh_str (~line 1595),
# │       gpt3_energy_mwh_str (~line 2071). No JIT split is appropriate here
# │       because these variables share the same Digital Twin import context.
# └─────────────────────────────────────────────────────────────────────────────

from mlsys import Hardware, Models
from mlsys.constants import *
from mlsys.formatting import fmt_percent, fmt, sci

class BenchmarkingSetup:
    """Chapter-wide hardware and model constants for all benchmarking sections and callouts."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # Hardware specs from Digital Twins
    _a100_fp16 = Hardware.A100.peak_flops.m_as(TFLOPs/second)
    _a100_fp32 = V100_FLOPS_FP32.m_as(TFLOPs/second)                           # V100 constant for fp32 baseline
    _a100_bw_gbs = Hardware.A100.memory_bw.m_as(GB/second)
    _a100_bw_tbs = Hardware.A100.memory_bw.m_as(TB/second)
    _h100_fp16 = Hardware.H100.peak_flops.m_as(TFLOPs/second)
    _h100_fp8 = H100_FLOPS_FP8_TENSOR.m_as(TFLOPs/second)
    _h100_bw_tbs = Hardware.H100.memory_bw.m_as(TB/second)
    _gpt3_params_b = Models.GPT3.parameters.m_as(Bparam)
    _h100_tdp = Hardware.H100.tdp.m_as(watt)
    _e_fp32 = ENERGY_FLOP_FP32_PJ.m_as(ureg.picojoule / ureg.flop)
    _e_fp16 = ENERGY_FLOP_FP16_PJ.m_as(ureg.picojoule / ureg.flop)
    _e_int8 = ENERGY_FLOP_INT8_PJ.m_as(ureg.picojoule / ureg.flop)
    _e_dram = ENERGY_DRAM_ACCESS_PJ.m_as(ureg.picojoule)
    _e_reg = ENERGY_REG_PJ.m_as(ureg.picojoule)
    _e_l1 = ENERGY_SRAM_L1_PJ.m_as(ureg.picojoule)
    _e_l2 = ENERGY_SRAM_L2_PJ.m_as(ureg.picojoule)
    _edgetpu_latency_ms = 2
    _cpu_latency_ms = 15
    _bert_m = Models.Language.BERT_Base.parameters.m_as(Mparam)
    _bert_large_m = Models.Language.BERT_Large.parameters.m_as(Mparam)
    _mobilenet_m = Models.MobileNetV2.parameters.m_as(Mparam)
    _mobilenet_v1_m = Models.Vision.MobileNetV1.parameters.m_as(Mparam)
    _resnet50_m = Models.ResNet50.parameters.m_as(Mparam)
    _v100_fp32 = Hardware.V100.peak_flops.m_as(TFLOPs/second)
    _gpt3_tokens_b = GPT3_TRAINING_TOKENS.m_as(count) / BILLION
    _anomaly_k = Models.Tiny.AnomalyDetector.parameters.m_as(Kparam)
    _anomaly_latency_ms = ANOMALY_MODEL_LATENCY.m_as(ureg.ms)
    _anomaly_auc = ANOMALY_MODEL_AUC
    _anomaly_energy_uj = ANOMALY_MODEL_ENERGY.m_as(ureg.microjoule)
    _gpt3_energy_mwh = 1287
    _accel_power_w = 300
    _lat_fast_ms = 10
    _lat_slow_ms = 100
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    _a100_ridge = Hardware.A100.ridge_point().m_as('flop/byte')
    _mv1_size_mb = _mobilenet_v1_m * 4
    _mv1_int8_mb = _mobilenet_v1_m * 1
    _mv1_compress = _mv1_size_mb / _mv1_int8_mb
    _flops_ratio = RESNET50_FLOPs.m_as(GFLOPs) / MOBILENETV2_FLOPs.m_as(GFLOPs)
    _e_fast_j = _accel_power_w * (_lat_fast_ms / 1000)
    _e_fast_wh = _e_fast_j / SEC_PER_HOUR
    _e_slow_j = _accel_power_w * (_lat_slow_ms / 1000)
    _e_slow_wh = _e_slow_j / SEC_PER_HOUR
    _gpt3_b_str = fmt(_gpt3_params_b, precision=0, commas=False)
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    a100_tflops_fp16_str = fmt(_a100_fp16, precision=0, commas=False)
    a100_tflops_fp32_str = fmt(_a100_fp32, precision=1, commas=False)
    a100_bw_gbs_str = fmt(_a100_bw_gbs, precision=0, commas=True)
    a100_bw_tbs_str = fmt(_a100_bw_tbs, precision=2, commas=False)
    h100_tflops_fp16_str = fmt(_h100_fp16, precision=0, commas=False)
    h100_tflops_fp8_str = fmt(_h100_fp8, precision=0, commas=True)
    h100_tflops_int8_str = fmt(_h100_fp8, precision=0, commas=False)            # same as FP8 for dense
    h100_bw_tbs_str = fmt(_h100_bw_tbs, precision=2, commas=False)
    a100_ridge_str = fmt(_a100_ridge, precision=0, commas=False)
    gpt3_params_b_str = _gpt3_b_str
    gpt3_params_billion_str = f"{_gpt3_b_str} billion"
    gpt3_tokens_b_str = fmt(_gpt3_tokens_b, precision=0, commas=False)
    h100_tdp_str = fmt(_h100_tdp, precision=0, commas=False)
    energy_fp32_str = f"{_e_fp32}"
    energy_fp16_str = f"{_e_fp16}"
    energy_int8_str = f"{_e_int8}"
    energy_dram_str = fmt(_e_dram, precision=0, commas=False)
    energy_reg_str = fmt(_e_reg, precision=2, commas=False)
    energy_l1_str = fmt(_e_l1, precision=1, commas=False)
    energy_l2_str = fmt(_e_l2, precision=1, commas=False)
    dram_energy_pj_str = fmt(ENERGY_DRAM_PJ_PER_BYTE.m_as(ureg.picojoule / byte), precision=0, commas=False)
    bert_params_m_str = fmt(_bert_m, precision=0, commas=False)
    bert_large_params_m_str = fmt(_bert_large_m, precision=0, commas=False)
    mobilenet_params_m_str = fmt(_mobilenet_m, precision=1, commas=False)
    mobilenet_v1_size_mb_str = fmt(_mv1_size_mb, precision=0, commas=False)
    mobilenet_v1_int8_size_mb_str = fmt(_mv1_int8_mb, precision=1, commas=False)
    mobilenet_v1_compression_ratio_str = fmt(_mv1_compress, precision=0, commas=False)
    mobilenet_flops_ratio_str = fmt(_flops_ratio, precision=1, commas=False)
    edgetpu_latency_ms_str = fmt(_edgetpu_latency_ms, precision=1, commas=False)
    cpu_latency_ms_str = fmt(_cpu_latency_ms, precision=0, commas=False)
    resnet50_params_m_str = fmt(_resnet50_m, precision=1, commas=False)
    v100_tflops_fp32_str = fmt(_v100_fp32, precision=1, commas=False)
    anomaly_params_k_str = fmt(_anomaly_k, precision=0, commas=False)
    anomaly_latency_ms_str = fmt(_anomaly_latency_ms, precision=1, commas=False)
    anomaly_auc_str = fmt(_anomaly_auc, precision=2, commas=False)
    anomaly_energy_uj_str = fmt(_anomaly_energy_uj, precision=0, commas=False)
    gpt3_energy_mwh_str = fmt(_gpt3_energy_mwh, precision=0, commas=True)
    accel_power_w_str = fmt(_accel_power_w, precision=0, commas=False)
    latency_fast_ms_str = fmt(_lat_fast_ms, precision=0, commas=False)
    latency_slow_ms_str = fmt(_lat_slow_ms, precision=0, commas=False)
    energy_fast_j_str = fmt(_e_fast_j, precision=0, commas=False)
    energy_fast_wh_str = fmt(_e_fast_wh, precision=6, commas=False)
    energy_slow_wh_str = fmt(_e_slow_wh, precision=5, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
a100_tflops_fp16_str = BenchmarkingSetup.a100_tflops_fp16_str
a100_tflops_fp32_str = BenchmarkingSetup.a100_tflops_fp32_str
a100_bw_gbs_str = BenchmarkingSetup.a100_bw_gbs_str
a100_bw_tbs_str = BenchmarkingSetup.a100_bw_tbs_str
h100_tflops_fp16_str = BenchmarkingSetup.h100_tflops_fp16_str
h100_tflops_fp8_str = BenchmarkingSetup.h100_tflops_fp8_str
h100_tflops_int8_str = BenchmarkingSetup.h100_tflops_int8_str
h100_bw_tbs_str = BenchmarkingSetup.h100_bw_tbs_str
a100_ridge_str = BenchmarkingSetup.a100_ridge_str
gpt3_params_b_str = BenchmarkingSetup.gpt3_params_b_str
gpt3_params_billion_str = BenchmarkingSetup.gpt3_params_billion_str
gpt3_tokens_b_str = BenchmarkingSetup.gpt3_tokens_b_str
h100_tdp_str = BenchmarkingSetup.h100_tdp_str
energy_fp32_str = BenchmarkingSetup.energy_fp32_str
energy_fp16_str = BenchmarkingSetup.energy_fp16_str
energy_int8_str = BenchmarkingSetup.energy_int8_str
energy_dram_str = BenchmarkingSetup.energy_dram_str
energy_reg_str = BenchmarkingSetup.energy_reg_str
energy_l1_str = BenchmarkingSetup.energy_l1_str
energy_l2_str = BenchmarkingSetup.energy_l2_str
dram_energy_pj_str = BenchmarkingSetup.dram_energy_pj_str
bert_params_m_str = BenchmarkingSetup.bert_params_m_str
bert_large_params_m_str = BenchmarkingSetup.bert_large_params_m_str
mobilenet_params_m_str = BenchmarkingSetup.mobilenet_params_m_str
mobilenet_v1_size_mb_str = BenchmarkingSetup.mobilenet_v1_size_mb_str
mobilenet_v1_int8_size_mb_str = BenchmarkingSetup.mobilenet_v1_int8_size_mb_str
mobilenet_v1_compression_ratio_str = BenchmarkingSetup.mobilenet_v1_compression_ratio_str
mobilenet_flops_ratio_str = BenchmarkingSetup.mobilenet_flops_ratio_str
edgetpu_latency_ms_str = BenchmarkingSetup.edgetpu_latency_ms_str
cpu_latency_ms_str = BenchmarkingSetup.cpu_latency_ms_str
resnet50_params_m_str = BenchmarkingSetup.resnet50_params_m_str
v100_tflops_fp32_str = BenchmarkingSetup.v100_tflops_fp32_str
anomaly_params_k_str = BenchmarkingSetup.anomaly_params_k_str
anomaly_latency_ms_str = BenchmarkingSetup.anomaly_latency_ms_str
anomaly_auc_str = BenchmarkingSetup.anomaly_auc_str
anomaly_energy_uj_str = BenchmarkingSetup.anomaly_energy_uj_str
gpt3_energy_mwh_str = BenchmarkingSetup.gpt3_energy_mwh_str
accel_power_w_str = BenchmarkingSetup.accel_power_w_str
latency_fast_ms_str = BenchmarkingSetup.latency_fast_ms_str
latency_slow_ms_str = BenchmarkingSetup.latency_slow_ms_str
energy_fast_j_str = BenchmarkingSetup.energy_fast_j_str
energy_fast_wh_str = BenchmarkingSetup.energy_fast_wh_str
energy_slow_wh_str = BenchmarkingSetup.energy_slow_wh_str
```

## ML Benchmarking Framework {#sec-benchmarking-machine-learning-benchmarking-framework-70b8}

The preceding chapters established physical laws (the Iron Law, the Conservation of Complexity, the Memory Wall) and developed diagnostic methods for applying them. Benchmarking is where those laws face empirical reality. The benchmark-production gap, routinely 2$\times$–10$\times$, is not a failure of methodology but the measure of how much physical reality exceeds our models of it. Closing that gap by designing measurements that predict production behavior with quantitative fidelity is the core competency that distinguishes ML systems engineering from ML research. Benchmarking is the discipline's truth-telling function: the practice that converts theoretical claims into verified engineering knowledge.

The optimization techniques from preceding chapters all claim improvements. Data selection strategies (@sec-data-selection) promise more efficient training. Model compression (@sec-model-compression) promises smaller, faster models. Hardware acceleration (@sec-hardware-acceleration) promises higher throughput. Yet *how* do we know these claims hold in production? A model quantized to INT8 may benchmark 2$\times$ faster on a synthetic workload but show no improvement under real traffic patterns with variable input sizes and concurrent requests. A pruned model may maintain accuracy on the test set but fail on edge cases the benchmark never covered. Benchmarking is the discipline of verifying that optimizations deliver their promised benefits under realistic conditions.

\index{System Benchmarking!definition}
\index{Model Benchmarking!definition}
\index{Data Benchmarking!definition}
ML benchmarking operates across three complementary dimensions that map directly to the components of any deployed system. *System benchmarking* asks: does the hardware deliver promised performance under realistic workloads, or do memory bandwidth saturation and software dispatch overhead erode the gains? *Model benchmarking* asks: did optimization techniques preserve model quality across the full input distribution, not just on curated test sets? *Data benchmarking* asks: does the model generalize to real-world data with all its noise, bias, and distributional shift? Each dimension can independently reveal problems invisible to the others, and a system that passes all three provides far stronger deployment confidence than one evaluated along any single axis.

::: {.callout-definition title="Machine Learning Benchmarking"}

***Machine Learning Benchmarking***\index{Benchmarking!definition} is the empirical measurement of system performance against **Representative Workloads**.

1.  **Significance (Quantitative):** It exists to decouple **Peak Performance ($R_{peak}$)** (marketing specs) from **Sustained Performance** (real-world capability), quantifying the **System Efficiency ($\eta$)** and the impact of software overheads ($L_{lat}$).
2.  **Distinction (Durable):** Unlike **Micro-benchmarks** (which measure individual components like GEMM), ML Benchmarking measures **End-to-End System Performance** on full models and datasets.
3.  **Common Pitfall:** A frequent misconception is that benchmarks are "fixed numbers." In reality, they are **Moving Targets**: as models and hardware evolve, the "state-of-the-art" benchmark result becomes the new baseline.

:::

Unlike traditional systems where benchmarks represent fixed specifications, ML benchmarks capture only a snapshot of a shifting reality. This distinction has profound implications for how we interpret benchmark results.

::: {.callout-perspective title="Benchmarks as Moving Targets"}
In traditional systems (e.g., SPEC CPU), the benchmark is a *rigid specification*. A sorting algorithm is correct if it sorts the list. Correctness is absolute and unchanging. In ML systems, the benchmark is a *soft specification*: correctness is defined by a finite set of examples (ImageNet), and the world moves. A model that scores 99% on ImageNet might fail completely on user photos taken years after the benchmark was created.

In computer architecture, you design for the benchmark because the benchmark represents the workload. In ML engineering, designing solely for the benchmark is *overfitting*. Robustness comes from acknowledging that the benchmark is only a proxy for a shifting reality.
:::

\index{MobileNet!deployment validation example}
To make this three-dimensional framework concrete, we ground it in a running example that threads through the entire chapter. Validating a MobileNet deployment (the **Edge Lighthouse** from @sec-network-architectures) spans all three evaluation dimensions and illustrates *how* each reveals problems the others cannot.

::: {.callout-lighthouse title="MobileNet Deployment Validation"}
Throughout this chapter, we validate the complete optimization pipeline using **MobileNet** (introduced in @sec-network-architectures-lighthouse-roster-model-biographies-a763) as our lighthouse example. The initial compression figures below use MobileNet v1 parameters; later sections reference MobileNetV2, which refines v1's depthwise separable design with inverted residuals and linear bottlenecks while maintaining a similar parameter scale. MobileNet exemplifies the deployment challenges where benchmarking determines success or failure. The validation questions below preview the three dimensions we develop throughout this chapter. Each section makes these questions concrete with specific metrics and methodologies.

**The Optimization Pipeline**:

\index{Quantization!INT8 compression ratio}
\index{Edge Accelerator!inference latency}
1. **Model Compression** (@sec-model-compression): INT8 quantization reduces MobileNet from `{python} mobilenet_v1_size_mb_str` MB to `{python} mobilenet_v1_int8_size_mb_str` MB (`{python} mobilenet_v1_compression_ratio_str`$\times$ compression)
2. **Hardware Acceleration** (@sec-hardware-acceleration): EdgeTPU deployment achieves `{python} edgetpu_latency_ms_str` ms inference versus `{python} cpu_latency_ms_str` ms on CPU
3. **Benchmarking Validation** (this chapter): Verify the pipeline delivers in practice

**Three-Dimensional Validation Questions**:

- **System**: Does EdgeTPU actually achieve `{python} edgetpu_latency_ms_str` ms, or do preprocessing and data transfer add 10 ms of overhead?
- **Model Quality**: Did INT8 quantization preserve accuracy? What about edge cases with unusual lighting?
- **Data**: Does performance hold on real-world smartphone images, not just ImageNet test images?

Each section of this chapter addresses one dimension of this validation stack. By the end, you will understand *how* to answer these questions systematically for any optimization pipeline.
:::
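
The arithmetic behind these pipeline claims is simple enough to check by hand. The sketch below is a back-of-envelope illustration, not the chapter's own computation: the parameter count is an assumed round figure for MobileNetV1, the helper names are hypothetical, and the 10 ms overhead is the hypothetical value from the System question, applied here to both the CPU and EdgeTPU paths.

```python
# Back-of-envelope check of the MobileNet pipeline claims.
# ASSUMPTION: ~4.2 M parameters for MobileNetV1; the chapter reads its own
# value from the mlsys model registry.
PARAMS_M = 4.2                     # parameters, in millions (assumed)
BYTES_FP32, BYTES_INT8 = 4, 1      # bytes per weight at each precision


def model_size_mb(params_m: float, bytes_per_param: int) -> float:
    """Weight storage in MB (10^6 bytes), ignoring metadata and activations."""
    return params_m * bytes_per_param


def end_to_end_ms(inference_ms: float, overhead_ms: float) -> float:
    """Latency the caller sees: inference plus pre/post-processing and transfer."""
    return inference_ms + overhead_ms


fp32_mb = model_size_mb(PARAMS_M, BYTES_FP32)
int8_mb = model_size_mb(PARAMS_M, BYTES_INT8)
compression = fp32_mb / int8_mb

EDGETPU_MS, CPU_MS = 2.0, 15.0     # inference-only latencies quoted in the text
OVERHEAD_MS = 10.0                 # hypothetical overhead from the System question

inference_speedup = CPU_MS / EDGETPU_MS
# ASSUMPTION: both paths pay the same fixed overhead.
e2e_speedup = end_to_end_ms(CPU_MS, OVERHEAD_MS) / end_to_end_ms(EDGETPU_MS, OVERHEAD_MS)

print(f"FP32 {fp32_mb:.1f} MB -> INT8 {int8_mb:.1f} MB ({compression:.0f}x compression)")
print(f"Inference-only speedup: {inference_speedup:.1f}x")
print(f"End-to-end speedup with {OVERHEAD_MS:.0f} ms overhead: {e2e_speedup:.2f}x")
```

Under these assumptions the accelerator's advantage shrinks sharply once fixed overhead is counted, which is exactly what the System question is designed to expose.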

Before examining these dimensions in detail, we must establish the mindset that separates rigorous evaluation from misleading metrics. Three principles distinguish effective practitioners.

\index{Benchmarking!benchmarks as proxies}
First, *benchmarks are proxies, not truth.* Every benchmark measures specific conditions that may not match your deployment. A system achieving 10,000 samples/second in MLPerf's Offline scenario, where all inputs are available up front, might achieve only 200 QPS in the Server scenario, where queries arrive randomly and must meet a latency bound. The critical question is always: "What does this benchmark NOT measure?"

Second, *Goodhart's Law applies everywhere.*[^fn-goodharts-law]\index{Goodhart's Law!metric trap} "When a measure becomes a target, it ceases to be a good measure." Teams that optimize for benchmark rankings often produce systems that excel in evaluation but fail in production. Benchmark-specific optimizations frequently degrade characteristics (robustness, calibration, efficiency) that matter for deployment.

\index{Benchmarking!end-to-end vs. component metrics}
Third, *end-to-end beats component metrics.* Vendors report component latency (5–10 ms for model inference), but production latency includes preprocessing, queuing, and postprocessing (50–100 ms total). By Amdahl's law, a 3$\times$ inference speedup in isolation then yields at most about 1.15$\times$ end-to-end improvement (inference at 10 ms of a 50 ms request), or worse if the optimization increases memory pressure.
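
The gap between component and end-to-end speedup follows from Amdahl's law: only the inference share of a request gets faster. The following is a minimal sketch under that assumption; the function name is hypothetical, and the latency pairs are simply the corners of the illustrative ranges quoted above.

```python
def end_to_end_speedup(inference_ms: float, total_ms: float,
                       component_speedup: float) -> float:
    """Amdahl's law: speed up only the inference portion of a request."""
    if not 0 < inference_ms <= total_ms:
        raise ValueError("inference_ms must be positive and at most total_ms")
    other_ms = total_ms - inference_ms          # preprocessing, queuing, postprocessing
    return total_ms / (other_ms + inference_ms / component_speedup)


# Corners of the illustrative ranges: 5-10 ms inference, 50-100 ms total.
for inference_ms, total_ms in [(10, 50), (5, 50), (10, 100), (5, 100)]:
    s = end_to_end_speedup(inference_ms, total_ms, component_speedup=3.0)
    print(f"inference {inference_ms:>2} ms of {total_ms:>3} ms total -> {s:.2f}x end-to-end")
```

In every one of these cases the end-to-end gain stays far below the 3$\times$ component speedup, because most of the request time lies outside the model.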

[^fn-goodharts-law]: **Goodhart's Law**: Articulated by Charles Goodhart in a 1975 Bank of England paper on monetary policy [@goodhart1984problems]; generalized by Marilyn Strathern in 1997 into the form quoted above [@strathern1997improving]. The original context was macroeconomics: once a monetary aggregate became an official policy target, banks changed behavior to game the metric, destroying its predictive value. In ML, the same failure mode recurs structurally: BLEU scores incentivize n-gram matching over fluency, ImageNet accuracy rewards architecture tricks over robustness, and benchmark leaderboards incentivize test-set overfitting — each a case where the metric's success as a target caused its failure as a measure. \index{Goodhart's Law!etymology}

These principles reappear throughout this chapter and are examined in depth in @sec-benchmarking-fallacies-pitfalls-9781.

Knowing *what* to measure, however, is only half the problem. Measuring incorrectly—with the wrong workloads, biased baselines, or uncontrolled variables—produces numbers that feel precise but mislead decisions. The history of computing benchmarking is littered with examples of technically sound metrics applied with flawed methodology, from compiler-gamed Whetstone scores to cherry-picked GPU benchmarks that predict nothing about sustained workloads. Understanding how measurement methodology evolved, and where it failed, is essential for designing benchmarks that distinguish genuine improvements from measurement artifacts.

We begin with these historical foundations of benchmarking[^fn-benchmark-etymology] to understand which lessons from decades of computing measurement apply to ML, then examine each of the three dimensions in depth, before showing how integrated benchmarking brings them together. This structure reflects the validation sequence practitioners follow: *first verify* hardware delivers promised performance, *then verify* the model and data optimizations built atop that hardware deliver their promised gains.

[^fn-benchmark-etymology]: **Benchmark**: From surveying, where a "bench mark" was a horizontal cut in stone serving as a fixed elevation reference. The term entered computing in the 1970s to describe standardized comparison points, but the surveying metaphor carries a systems lesson: just as an elevation measurement is meaningless without a calibrated reference, an ML throughput number is meaningless without controlled workloads, thermal state, and precision settings. \index{Benchmark!etymology}

## Historical Foundations {#sec-benchmarking-historical-context-7350}

\index{Benchmarking!historical evolution}
In 1976, when Whetstone became one of the first standardized computing benchmarks, vendors immediately began optimizing their compilers specifically for its floating-point tests—producing impressive numbers that predicted nothing about real application performance. This gaming problem has plagued every generation of benchmarks since. Understanding *why* ML benchmarking requires our three-dimensional approach demands tracing *how* measurement methodologies evolved, and often failed, over decades of computing history. Each generation of benchmarks emerged from the limitations of its predecessors, teaching lessons that directly inform modern ML evaluation.

Benchmarking intersects with metrics from several chapters. The following note maps these connections.

::: {.callout-perspective title="Related Efficiency Metrics"}
While this chapter focuses on system-level benchmarking, comprehensive evaluation spans multiple dimensions covered elsewhere. For data selection metrics (PPD, DUE), see @sec-data-selection. For model compression evaluation (Accuracy vs. Compression), see @sec-model-compression. For hardware efficiency metrics (Roofline, TOPS/Watt), see @sec-hardware-acceleration. The system benchmarks in @sec-benchmarking-system-benchmarks-393c through @sec-benchmarking-mlperf-power-case-study-a554 validate these hardware claims; @sec-benchmarking-model-data-benchmarking-e0ca addresses model and data validation.
:::

The evolution from simple performance metrics to ML benchmarking reveals three methodological shifts, each emerging when practitioners discovered that previous evaluation approaches failed to predict real-world performance.

### Performance Benchmarks {#sec-benchmarking-performance-benchmarks-ea8a}

\index{Whetstone!synthetic floating-point benchmark}
\index{LINPACK!matrix operations benchmark}
\index{SPEC CPU!real application workloads}
\index{Benchmark Gaming!vendor optimization for tests}
The earliest computing benchmarks revealed a problem that plagues evaluation to this day: gaming. Mainframe benchmarks like Whetstone (1976) and LINPACK (1979)[^fn-whetstone-linpack] measured isolated operations—floating-point throughput, matrix solve speed—and vendors quickly learned to optimize specifically for these narrow tests rather than for practical performance. The resulting numbers looked impressive on paper but predicted little about how systems performed on actual workloads. SPEC CPU (1989) broke this cycle by pioneering the use of real application workloads, ensuring that evaluation reflected actual deployment scenarios rather than synthetic ideals. This lesson directly shapes ML benchmarking: optimization claims from @sec-model-compression require validation on representative tasks, and MLPerf's inclusion of real models like ResNet-50 and BERT ensures benchmarks capture deployment complexity rather than idealized test cases.

[^fn-whetstone-linpack]: **Whetstone and LINPACK**: Whetstone (1972, published 1976) was named after the English Electric facility in Whetstone, Leicestershire, where the original ALGOL compiler was built; LINPACK (1979) was Jack Dongarra's benchmark for dense linear systems, later adopted by the Top500 list in 1993. Both measured a single operation type so narrowly that compilers could be tuned to game the result -- Whetstone's floating-point loops became a test of compiler optimization rather than hardware performance. ML benchmarking inherited the same vulnerability: single-model benchmarks can be gamed through model-specific kernel tuning, which is why MLPerf requires multiple workloads spanning vision, language, and recommendation. \index{Whetstone!etymology}\index{LINPACK!narrow benchmark}

As deployment contexts diversified, a second limitation emerged: single-metric evaluation proved inadequate. Graphics benchmarks began measuring rendering quality alongside frame rate; mobile benchmarks added battery life as a co-equal concern with performance. The multi-objective challenges from @sec-introduction (balancing accuracy, latency, and energy) manifest directly in modern ML evaluation, where no single metric captures deployment viability.

A third shift occurred when distributed computing revealed that component-level optimization fails to predict system-level performance. A CPU benchmark cannot predict cluster throughput when network communication dominates. ML training similarly depends on the interplay of accelerator compute (@sec-hardware-acceleration), data pipelines, gradient synchronization, and storage throughput. MLPerf evaluates complete workflows, recognizing that performance emerges from component interactions, not from components in isolation.

\index{DAWNBench!time-to-accuracy evaluation}
\index{MLPerf!founding and purpose}
\index{ResNet-50!MLPerf reference model}
\index{BERT!MLPerf reference model}
DAWNBench [@coleman2017dawnbench] emerged as an early ML benchmark that pioneered time-to-accuracy evaluation, directly influencing MLPerf's methodology for measuring training efficiency. These lessons culminate in MLPerf[^fn-mlperf] (2018), which synthesizes representative workloads, multi-objective evaluation, and integrated measurement while addressing ML-specific challenges [@ranganathan2024twenty].
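
Time-to-accuracy replaces "how fast is one training step?" with "how long until the model is good enough?". The sketch below illustrates the idea on made-up training logs; the function name, the log format, and the 0.75 target are all hypothetical and are not taken from DAWNBench or MLPerf.

```python
from typing import Iterable, Optional, Tuple


def time_to_accuracy(log: Iterable[Tuple[float, float]],
                     target: float) -> Optional[float]:
    """Return elapsed seconds at the first evaluation that meets `target`.

    `log` holds (elapsed_seconds, validation_accuracy) pairs in time order.
    Returns None if the run never reaches the target.
    """
    for elapsed_s, accuracy in log:
        if accuracy >= target:
            return elapsed_s
    return None


# Hypothetical runs: B ends with the higher accuracy, but A reaches the target sooner.
run_a = [(600, 0.61), (1200, 0.72), (1800, 0.76), (2400, 0.77)]
run_b = [(900, 0.58), (1800, 0.70), (2700, 0.75), (3600, 0.78)]
TARGET = 0.75  # hypothetical quality threshold

print("A:", time_to_accuracy(run_a, TARGET), "s")
print("B:", time_to_accuracy(run_b, TARGET), "s")
```

Fixing the quality threshold first prevents a system from winning on speed by quietly sacrificing accuracy, and prevents a slower system from winning on final accuracy alone.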

\index{Patterson, David!MLPerf leadership}
\index{MLCommons!benchmark organization}

[^fn-mlperf]: **MLPerf**: Founded in 2018 by researchers from Google, NVIDIA, Intel, Harvard, Stanford, and UC Berkeley, the name combines "ML" with "Perf" (performance), echoing SPEC's benchmarking tradition. MLPerf's design principles — representative workloads, full-system measurement, and open submission — directly address the gaming that plagued Whetstone and LINPACK: vendors who could previously report peak kernel throughput on cherry-picked problem sizes must now report end-to-end system performance on standardized tasks. \index{MLPerf!founding}

### Energy Benchmarks {#sec-benchmarking-energy-benchmarks-709a}

\index{Energy Benchmarking!first-class metric}
\index{SPEC Power!server energy efficiency}
\index{Green500!HPC energy efficiency ranking}
The multi-objective evaluation paradigm naturally extended to energy efficiency as computing diversified beyond mainframes with unlimited power budgets. Mobile devices demanded battery life optimization, while warehouse-scale systems faced energy costs rivaling hardware expenses. This shift established energy as a first-class metric alongside performance, spawning benchmarks like SPEC Power[^fn-spec-power] for servers and Green500[^fn-green500] for supercomputers.

[^fn-spec-power]: **SPEC Power**: Introduced in 2007, SPEC Power measures performance per watt across 11 load levels from idle (0%) through 100% in 10% increments. This granularity matters for ML serving: inference workloads rarely sustain 100% load, and servers that are efficient at peak but wasteful at 30% utilization inflate the energy cost of real-world deployment, where average utilization typically hovers between 20% and 50%. \index{SPEC Power!energy efficiency}

[^fn-green500]: **Green500**: Started in 2007 as a counterpart to the Top500, Green500 ranks systems by FLOPS per watt rather than raw performance. The shift from less than 1 GFLOPS/watt in the early 2000s to over 60 GFLOPS/watt today reveals that efficiency gains have outpaced raw performance gains, a pattern ML systems inherit: the most cost-effective training clusters are not the fastest but the most efficient. \index{Green500!energy efficiency ranking}

\index{MLPerf Power!ML energy measurement}
Diverse workload patterns and system configurations continue to challenge power benchmarking across computing environments. MLPerf Power [@mlperf_power_website] addresses this with specialized methodologies for measuring the energy impact of machine learning workloads, reflecting energy efficiency's central role in AI system design.

Energy benchmarking extends beyond hardware power measurement to include algorithmic efficiency. Model compression techniques (pruning, quantization, knowledge distillation) often achieve greater energy savings than hardware improvements alone. MobileNet architectures achieve approximately `{python} mobilenet_flops_ratio_str`$\times$ fewer FLOPs versus ResNet, translating to proportional energy reduction on hardware that efficiently exploits the smaller model [@howard2017mobilenets]. These techniques, detailed in @sec-model-compression, establish that energy-aware benchmarking must evaluate algorithmic efficiency alongside hardware power consumption; the specific energy breakdown of INT8 versus FP32 is quantified in @sec-benchmarking-training-metrics-0f1a. As AI systems scale, this lesson becomes central to sustainable computing practices.

### Domain-Specific Benchmarks {#sec-benchmarking-domainspecific-benchmarks-b15f}

As computing diversified beyond general-purpose servers, generic benchmarks proved inadequate for specialized domains. Three categories of specialization drove this evolution, each exposing measurement dimensions that general-purpose benchmarks could not address.

Deployment constraints shape core metric priorities. Datacenter workloads optimize for throughput with kilowatt-scale power budgets, while mobile AI operates within 2–5 W thermal envelopes, and IoT devices require milliwatt-scale operation. These constraints, rooted in efficiency principles from @sec-introduction, determine whether benchmarks prioritize total throughput or energy per operation.

Application requirements then impose functional and regulatory constraints beyond raw performance. Healthcare AI demands interpretability metrics alongside accuracy; financial systems require microsecond latency with audit compliance; autonomous vehicles need safety-critical reliability (ASIL-D: $<10^{-8}$ failure/hour). These requirements, connecting to responsible AI principles from @sec-responsible-engineering, extend evaluation beyond traditional performance metrics.

Operational conditions determine real-world viability. Autonomous vehicles face -40°C to +85°C temperatures and degraded sensor inputs; datacenters handle millions of concurrent requests with network partitions; industrial IoT endures years-long deployment without maintenance. The hardware capabilities from @sec-hardware-acceleration only deliver value when validated under these conditions.

\index{MLPerf Training!datacenter multi-node scaling}
\index{MLPerf Inference!server to edge evaluation}
\index{MLPerf Tiny!microcontroller benchmarks}
Machine learning exemplifies this transition to domain-specific evaluation. Traditional CPU and GPU benchmarks prove insufficient for assessing ML workloads, which involve complex interactions between computation, memory bandwidth, and data movement patterns. MLPerf has standardized performance measurement for machine learning models across these three categories: MLPerf Training addresses datacenter deployment constraints with multi-node scaling benchmarks, MLPerf Inference evaluates latency-critical application requirements across server to edge deployments, and MLPerf Tiny assesses ultra-constrained operational conditions for microcontroller deployments. This tiered structure, summarized in @tbl-mlperf-suites, reflects the systematic application of our three-category framework to ML-specific evaluation needs.

: **MLPerf Benchmark Suite Variants.** Each variant addresses a different deployment context, from datacenter-scale training to ultra-constrained microcontroller inference, targeting specific operational constraints and measuring metrics relevant to its deployment scenario. {#tbl-mlperf-suites}

| **MLPerf Variant**   | **Target Domain** | **Key Constraints**                              | **Primary Metrics**                             |
|:---------------------|:------------------|:-------------------------------------------------|:------------------------------------------------|
| **MLPerf Training**  | Datacenter        | Multi-node scaling, high bandwidth interconnects | Time-to-quality, throughput (samples/sec)       |
| **MLPerf Inference** | Server / Edge     | Latency SLAs, throughput requirements            | QPS, latency percentiles, accuracy preservation |
| **MLPerf Tiny**      | MCU / IoT         | Ultra-constrained (<1mW), limited memory (<1MB)  | Latency, accuracy, energy per inference         |
| **MLPerf Power**     | Cross-cutting     | Energy budgets, thermal constraints              | Performance/Watt, energy per query              |

Domain-specific benchmarks drive targeted hardware and software optimizations while ensuring that improvements translate to deployment success rather than narrow laboratory conditions.

This historical progression—from general computing benchmarks through energy-aware measurement to domain-specific evaluation frameworks—provides the foundation for understanding contemporary ML benchmarking challenges. The lessons learned (representative workloads over synthetic tests, multi-objective over single metrics, integrated systems over isolated components) directly shape how we approach AI system evaluation today. @tbl-benchmark-evolution summarizes this progression and the key lessons each generation contributed.

: **Benchmark Evolution.** Evolution of computing benchmarks from synthetic operations to ML-specific evaluation. Each generation addressed limitations of its predecessors, culminating in MLPerf's synthesis of representative workloads, multi-objective metrics, and integrated system measurement. {#tbl-benchmark-evolution}

| **Benchmark**  | **Year** | **Primary Focus**                   | **Key Metric(s)**                       | **Lesson for ML Benchmarking**                                               |
|:---------------|---------:|:------------------------------------|:----------------------------------------|:-----------------------------------------------------------------------------|
| **Whetstone**  |     1976 | Synthetic floating-point operations | MWIPS                                   | Gaming synthetic tests undermines evaluation validity                        |
| **LINPACK**    |     1979 | Linear algebra (matrix operations)  | FLOPS                                   | Isolated operations miss system-level complexity and bottlenecks             |
| **SPEC CPU**   |     1989 | Real application workloads          | SPECrate, SPECspeed                     | Representative workloads reveal true deployment performance                  |
| **SPEC Power** |     2007 | Server energy efficiency            | ssj_ops/Watt across load levels         | Energy efficiency requires multi-load evaluation, not just peak performance  |
| **Green500**   |     2007 | HPC energy efficiency               | GFLOPS/Watt                             | Efficiency rankings complement raw performance rankings                      |
| **MLPerf**     |     2018 | ML systems (training + inference)   | Time-to-quality, QPS, latency, accuracy | Synthesizes all lessons: representative workloads + multi-objective + system |

\index{Benchmarking!probabilistic variability}
These lessons culminate in modern ML benchmarking suites. But ML systems face an additional challenge absent from traditional benchmarks: inherent probabilistic variability. Traditional workloads behave deterministically; ML systems must satisfy all three historical lessons (representative workloads, multi-objective evaluation, integrated measurement) while also accounting for stochastic outcomes that vary with training data, weight initialization, and even operation ordering. This additional dimension of variability demands new measurement methodologies.

Individual organizations learned these lessons independently—often painfully—but isolated measurements cannot drive an industry. When one team measures inference latency including preprocessing and another excludes it, when accuracy benchmarks use different data splits, or when power measurements draw different system boundaries, the resulting numbers are incommensurable. The transition from ad-hoc measurement to standardized benchmarking suites transforms benchmarking from an internal validation exercise into a shared language that enables hardware procurement, architecture comparison, and deployment decisions across organizations.

## System Benchmarking Suites {#sec-benchmarking-system-benchmarking-suites-e946}

A team evaluating edge deployment hardware needs to compare five different SoCs for a smart camera product. Vendor A reports 8 TOPS at INT8; Vendor B reports 15 TOPS at INT4; Vendor C reports inference latency on a proprietary model; Vendor D cites MLPerf scores from two generations ago; Vendor E provides only peak throughput at maximum batch size. None of these numbers are comparable. The team cannot make a procurement decision because every vendor measured a different thing, under different conditions, using different definitions of "performance." This fragmentation—not a lack of data, but a lack of *commensurable* data—is precisely the problem that benchmarking suites exist to solve.

Three lessons from benchmark history—representative workloads, multi-objective evaluation, and integrated measurement—converge with the challenge unique to ML: inherent probabilistic variability. Modern benchmarking suites encode these lessons into standardized frameworks that make the kind of cross-organization comparison our hardware procurement team needs possible.

ML benchmarks must evaluate the interplay between algorithms, hardware, and data, not computational efficiency alone. Early benchmarks focused on algorithmic performance [@lecun1998gradient], but scaling demands expanded the focus to hardware efficiency [@jouppi2017datacenter], and high-profile deployment failures elevated data quality as a third evaluation dimension [@gebru2021datasheets]. ML's probabilistic nature elevates accuracy to a first-class evaluation dimension alongside speed and energy consumption: the same ML system can produce different results depending on the data it encounters. Energy efficiency cuts across all three framework dimensions, since algorithmic choices affect computational complexity, hardware capabilities determine energy-performance trade-offs, and dataset characteristics influence training energy costs [@hernandez2020measuring].

### ML Measurement Challenges {#sec-benchmarking-ml-measurement-challenges-60ea}

\index{Measurement Variability!sources in ML systems}
The unique characteristics of ML systems create measurement challenges that many traditional benchmarks were not designed for. Unlike deterministic algorithms that produce identical outputs given the same inputs, ML systems exhibit inherent variability from multiple sources: algorithmic randomness from weight initialization and data shuffling, hardware thermal states affecting clock speeds, system load variations from concurrent processes, and environmental factors including network conditions and power management. This variability requires rigorous statistical methodology to distinguish genuine performance improvements from measurement noise.

\index{Random Seeds!benchmark protocol requirement}
\index{Confidence Intervals!benchmark reporting}
To address this variability, effective benchmark protocols require multiple experimental runs with different random seeds. Running each benchmark 5–10 times and reporting statistical measures beyond simple means (including standard deviations or 95% confidence intervals) quantifies result stability and allows practitioners to distinguish genuine performance improvements from measurement noise.
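
The reporting step can be sketched as follows. This is a minimal illustration, assuming the per-seed measurements have already been collected; the function name `summarize_runs`, the hardcoded Student's t critical values, and the latency numbers are hypothetical and not part of any benchmark suite.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only the 5-10 run range discussed in the text is covered (assumption).
T_CRIT_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize_runs(measurements):
    """Return mean, sample standard deviation, and 95% CI for 5-10 runs."""
    n = len(measurements)
    if n - 1 not in T_CRIT_95:
        raise ValueError("expected between 5 and 10 runs")
    mean = statistics.mean(measurements)
    std = statistics.stdev(measurements)  # sample (n - 1) standard deviation
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
    }


# Hypothetical latencies (ms) from five runs with different random seeds.
latencies_ms = [10.2, 9.8, 10.5, 10.1, 9.9]
summary = summarize_runs(latencies_ms)
low, high = summary["ci95"]
print(f"mean={summary['mean']:.2f} ms, std={summary['std']:.2f} ms, "
      f"95% CI=[{low:.2f}, {high:.2f}] ms")
```

With so few runs, the t-distribution rather than the normal distribution sets the interval width, which is why the critical value shrinks as the number of runs grows.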

Empirical studies have shown how inadequate statistical rigor can lead to misleading conclusions. Many reinforcement learning papers report improvements that fall within statistical noise [@henderson2018deep], while GAN comparisons often lack proper experimental protocols, leading to inconsistent rankings across different random seeds [@lucic2018gans]. These findings underscore the importance of establishing measurement protocols that account for ML's probabilistic nature.

Representative workload selection determines benchmark validity. Synthetic microbenchmarks often fail to capture the complexity of real ML workloads where data movement, memory allocation, and dynamic batching create performance patterns not visible in simplified tests. Comprehensive benchmarking therefore requires workloads that reflect actual deployment patterns: variable sequence lengths in language models, mixed precision training regimes, and realistic data loading patterns that include preprocessing overhead.

Beyond workload representativeness, the distinction between statistical significance and practical significance requires careful interpretation. A small performance improvement might achieve statistical significance across hundreds of trials yet prove operationally irrelevant if it is too small to matter in deployment or if its costs exceed its benefits. Conversely, a difference that looks meaningful may be unresolvable at the sample size used. This creates what we call *the statistical confidence trap*, where seemingly rigorous evaluation still misleads.

::: {.callout-notebook title="The Statistical Confidence Trap"}
**Problem**: You are optimizing an image classifier that currently has **95% accuracy**. You deploy a "compressed" version and measure its accuracy on a **1,000-image** test set. You get **94%**. Did your optimization cause a real regression, or is it just noise?

**The Math**:

1.  **Expected Errors**: At 95%, you expect 50 errors. At 94%, you expect 60 errors.
2.  **Standard Deviation ($\sigma$)**: Using the binomial distribution $\sqrt{N p (1-p)}$:

    $$ \sigma \approx \sqrt{1000 \times 0.05 \times 0.95} \approx \mathbf{6.9 \text{ images}} $$

3.  **95% Confidence Interval**: $50 \pm 1.96 \times 6.9 \approx \mathbf{[36, 64]}$.

**The Systems Conclusion**: Both 50 and 60 fall inside the **same confidence interval**. A 1,000-sample test set **cannot reliably detect** a one-percentage-point accuracy drop. To distinguish a change of that size with high confidence, you need $\approx \mathbf{10,000}$ samples.

**The Moral**: Small benchmarks are the "Laboratory Fallacy." In AI Engineering, your sensors (the test set) must be sized to match the precision of the change you are trying to measure.
:::
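
The arithmetic in the callout can be reproduced with a short sketch. It uses the same normal approximation to the binomial distribution; the function names are illustrative assumptions, and the check is deliberately simple (does the new error count fall outside the baseline interval?) rather than a full hypothesis test.

```python
import math


def error_count_interval(n_samples, accuracy, z=1.96):
    """Expected error count, its binomial sigma, and the z-interval bounds."""
    p_err = 1.0 - accuracy
    expected = n_samples * p_err
    sigma = math.sqrt(n_samples * p_err * (1.0 - p_err))
    return expected, sigma, expected - z * sigma, expected + z * sigma


def detects_drop(n_samples, baseline_acc, new_acc, z=1.96):
    """True if the new model's expected errors fall outside the baseline interval."""
    _, _, low, high = error_count_interval(n_samples, baseline_acc, z)
    new_errors = n_samples * (1.0 - new_acc)
    return not (low <= new_errors <= high)


for n in (1_000, 10_000):
    expected, sigma, low, high = error_count_interval(n, 0.95)
    print(f"N={n}: expected errors={expected:.0f}, sigma={sigma:.1f}, "
          f"95% interval=[{low:.0f}, {high:.0f}], "
          f"detects 95% -> 94%: {detects_drop(n, 0.95, 0.94)}")
```

The interval half-width grows with $\sqrt{N}$ while the gap between the two expected error counts grows with $N$, which is why a larger test set eventually separates them.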

This principle manifests concretely in practice through *Goodhart's Law in action*:

::: {.callout-notebook title="Goodhart's Law in Action"}
**The Metric Trap**: Optimizing for a single metric often degrades others.

\index{Goodhart's Law!BLEU score example}
\index{BLEU Score!metric trap example}
**Scenario**: You optimize a translation model for **BLEU score**.

*   **Original Model**: BLEU = 28.0, Inference = 50 ms.
*   **Optimized Model**: BLEU = 28.5 (Better!), Inference = 200 ms (4$\times$ slower).

**The Math**:

*   The 0.5 BLEU gain comes from using a larger beam search (beam_size=10 vs beam_size=1).
*   **Cost**: $10 \times$ more candidate evaluations per step.
*   **Result**: You won the leaderboard but destroyed the product.

**The Systems Conclusion**: Always constrain your optimization. Maximize Accuracy *subject to* Latency < 100 ms.
:::
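
Constrained selection of this kind can be expressed in a few lines. The sketch below reuses the BLEU and latency figures from the callout; the `Candidate` class and `select_model` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    bleu: float
    latency_ms: float


def select_model(candidates, latency_budget_ms):
    """Pick the highest-BLEU candidate whose latency is under the budget."""
    feasible = [c for c in candidates if c.latency_ms < latency_budget_ms]
    if not feasible:
        return None  # no candidate satisfies the constraint
    return max(feasible, key=lambda c: c.bleu)


candidates = [
    Candidate("beam_size=1", bleu=28.0, latency_ms=50.0),
    Candidate("beam_size=10", bleu=28.5, latency_ms=200.0),
]

unconstrained = max(candidates, key=lambda c: c.bleu)
constrained = select_model(candidates, latency_budget_ms=100.0)
print("leaderboard winner:", unconstrained.name)
print("deployable winner: ", constrained.name)
```

Filtering on the constraint before ranking on the metric makes the trade-off explicit: the leaderboard winner and the deployable winner are different models.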

Current benchmarking paradigms measure narrow task performance on static datasets, primarily testing pattern recognition rather than the adaptability production demands. When models achieve excellent benchmark scores yet fail under slightly different conditions, the limitation is clear: comprehensive evaluation must also measure learning efficiency, continual learning, and out-of-distribution generalization.

The measurement challenges above motivate evaluating each dimension of our three-dimensional framework—system, model, and data—with distinct methodologies. The bulk of this chapter focuses on system benchmarking (training benchmarks, inference benchmarks, and power measurement) because these form the foundation of standardized evaluation through MLPerf. Model and data benchmarking require different methodologies and are treated in detail in @sec-benchmarking-model-data-benchmarking-e0ca after we establish system evaluation foundations.

### System Benchmarks {#sec-benchmarking-system-benchmarks-393c}

\index{Fallacy of Peak Performance!GPU utilization gap}
\index{Memory Wall!impact on GPU utilization}
System benchmarks measure the computational foundation that enables model capabilities, examining how hardware architectures, memory systems, and interconnects affect overall performance. This validation is critical because hardware specifications often describe theoretical peaks that real workloads never achieve. A GPU advertising 300 TFLOPS might deliver only 30 TFLOPS on memory-bound transformer inference. This discrepancy is so common it constitutes *the fallacy of peak performance*. System benchmarks reveal these gaps by running standardized ML workloads rather than synthetic microbenchmarks.

::: {.callout-perspective title="The Fallacy of Peak Performance"}
Dave Patterson often refers to peak performance as "the performance the manufacturer guarantees you will not exceed." For ML systems, this gap between peak and achieved performance is especially wide because of the **Memory Wall**. A GPU might advertise 300 TFLOPS, but if your model is memory-bound, you might see only 30 TFLOPS. Standardized benchmarks like MLPerf are essential because they force systems to run *real* models on *real* data, revealing the true "sustained performance" that engineers can actually rely on.
:::

Armed with this understanding, we can critically evaluate the benchmark claims encountered in vendor documentation and marketing materials.

::: {.callout-warning title="Decoding Vendor Benchmark Claims"}
When evaluating hardware or software based on vendor-reported benchmarks, ask these critical questions:

**What is measured?**

- "10 ms inference latency": is this model-only, or including preprocessing/postprocessing?
- "1000 TOPS": at what precision? INT4 TOPS are typically 2$\times$ INT8 TOPS on the same hardware
- "2$\times$ faster than competitor": on which workload? What batch size? What precision?

**What is excluded?**

- Memory transfer time between CPU and accelerator
- Model loading and initialization overhead
- Thermal throttling under sustained workloads
- Power consumption at the claimed performance level

**What conditions produced these results?**

- Batch size (larger batches inflate throughput numbers but increase latency)
- Precision (FP32 vs. FP16 vs. INT8 vs. INT4)
- Model variant (smaller models benchmark faster but may not meet your accuracy needs)
- Thermal state (fresh cold start vs. sustained operation)

**Translation guide for common claims:**

| **Vendor Claim**               | **What It Often Means**                                            |
|:-------------------------------|:-------------------------------------------------------------------|
| **"Up to 10,000 images/sec"**  | Peak throughput at maximum batch size, INT8, without preprocessing |
| **"Sub-millisecond latency"**  | Accelerator compute only, excluding data transfer                  |
| **"5$\times$ more efficient"** | Per-operation efficiency, not total system efficiency              |
| **"Optimized for AI"**         | May only accelerate specific operations or precisions              |
:::
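
The batch-size effect behind "up to" throughput claims can be illustrated with a toy cost model. Everything here is an assumption made for illustration: latency is modeled as a fixed overhead plus a per-sample cost, and the constants (4 ms, 0.5 ms, a 10 ms budget, a maximum batch of 64) are hypothetical rather than measurements of any device.

```python
def latency_ms(batch_size, fixed_ms, per_sample_ms):
    """Toy model: fixed launch/transfer overhead plus linear per-sample cost."""
    return fixed_ms + batch_size * per_sample_ms


def throughput_per_s(batch_size, fixed_ms, per_sample_ms):
    """Samples per second when batches run back to back."""
    return batch_size / latency_ms(batch_size, fixed_ms, per_sample_ms) * 1000.0


def largest_batch_within(budget_ms, fixed_ms, per_sample_ms, max_batch):
    """Largest batch size whose latency stays within the budget, or None."""
    feasible = [
        b for b in range(1, max_batch + 1)
        if latency_ms(b, fixed_ms, per_sample_ms) <= budget_ms
    ]
    return max(feasible) if feasible else None


FIXED_MS, PER_SAMPLE_MS, MAX_BATCH = 4.0, 0.5, 64  # hypothetical constants

advertised = throughput_per_s(MAX_BATCH, FIXED_MS, PER_SAMPLE_MS)
usable_batch = largest_batch_within(10.0, FIXED_MS, PER_SAMPLE_MS, MAX_BATCH)
usable = throughput_per_s(usable_batch, FIXED_MS, PER_SAMPLE_MS)

print(f"'up to' throughput at batch {MAX_BATCH}: {advertised:.0f} samples/s "
      f"({latency_ms(MAX_BATCH, FIXED_MS, PER_SAMPLE_MS):.0f} ms per batch)")
print(f"throughput within a 10 ms budget (batch {usable_batch}): {usable:.0f} samples/s")
```

Under this model the headline number is only reachable at a latency that violates the budget, so the figure worth comparing is throughput at the largest batch that still meets the latency requirement.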

The underlying hardware infrastructure (CPUs, GPUs, TPUs[^fn-bench-tpu], and ASICs[^fn-asic]) determines the speed, efficiency, and scalability of ML systems. System benchmarks establish standardized methodologies for evaluating hardware performance across AI workloads, measuring metrics including computational throughput, memory bandwidth, power efficiency, and scaling characteristics [@reddi2020mlperf; @mattson2020mlperf].

[^fn-bench-tpu]: **TPU (Tensor Processing Unit)**: Google's custom ASIC for neural network workloads (architecture details in @sec-hardware-acceleration). A TPU v4 pod (4,096 chips) delivers 1.1 exaFLOPS peak BF16, but benchmarking TPUs requires caution: their systolic-array architecture favors regular tensor operations, so peak FLOPS overstate performance on irregular workloads like sparse attention or dynamic control flow. \index{TPU!benchmarking considerations}

[^fn-asic]: **ASIC (Application-Specific Integrated Circuit)**: An ASIC's peak TOPS number applies only to the specific operators it was designed for. A single unsupported layer forces fallback to a general-purpose processor, potentially negating the entire efficiency advantage. This makes operator coverage the first question in any ASIC benchmark: the gap between peak and achieved throughput is not a hardware limitation but a workload-compatibility limitation. \index{ASIC!benchmarking trade-off}

System benchmarks serve two functions. For practitioners, they enable informed hardware selection by providing comparative data across configurations. For manufacturers, they quantify generational improvements and guide accelerator development. The co-evolution has been dramatic: as GPU adoption grew, accuracy improved rapidly, demonstrating that hardware and algorithmic advances drive progress in tandem.

::: {.callout-definition title="Machine Learning System Benchmarks"}

***Machine Learning System Benchmarks***\index{ML System Benchmarks!definition} standardize the measurement of **Infrastructure Efficiency** by fixing the workload and quality target.

1.  **Significance (Quantitative):** They isolate the contribution of hardware ($R_{peak}, BW$) and software stacks ($L_{lat}$) to **System Throughput** and **Latency**, effectively measuring the system's ability to execute the **Silicon Contract**.
2.  **Distinction (Durable):** Unlike **Algorithmic Benchmarks** (which focus on **Convergence Accuracy**), System Benchmarks focus on the **Execution Efficiency ($\eta$)** of the implementation.
3.  **Common Pitfall:** A frequent misconception is that System Benchmarks measure "the model." In reality, they measure the **Hardware-Software Intersection**: the same model can yield 10$\times$ different results on different hardware stacks or with different compiler optimizations.

:::

\index{Arithmetic Intensity!threshold for compute vs. memory bound}
\index{Accelerator Specifications!roofline analysis example}
Effective benchmark interpretation requires knowing the performance characteristics of target hardware. Whether a specific AI workload is compute-bound or memory-bound provides essential insight for optimization decisions. Arithmetic intensity, measured as FLOPs[^fn-flops-throughput] per byte of data movement, determines performance limits. Consider an NVIDIA A100 GPU with `{python} a100_tflops_fp16_str` TFLOPS of FP16 Tensor Core performance (FP32 is `{python} a100_tflops_fp32_str` TFLOPS) and `{python} a100_bw_tbs_str` TB/s memory bandwidth (SXM variant). Dividing peak compute by peak bandwidth yields an arithmetic intensity threshold of `{python} a100_ridge_str` FLOPs/byte. Workloads below this threshold are bottlenecked by memory bandwidth, while those above are bottlenecked by compute capacity. The architectural foundations for understanding these hardware characteristics, including the roofline model for analyzing compute-bound versus memory-bound workloads, are established in @sec-hardware-acceleration, which provides context for interpreting system benchmark results.

```{python}
#| label: roofline-example-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ROOFLINE ANALYSIS EXAMPLES (RESNET-50 AND BERT PREVIEW)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Prose explaining compute-bound vs memory-bound workloads, leading
# │          into the BERT roofline worked example callout
# │
# │ Goal: Ground the roofline model in concrete hardware numbers.
# │ Show: The contrast between compute-bound ResNet and memory-bound BERT.
# │ How: Calculate arithmetic intensity and utilization for both models on A100.
# │
# │ Imports: mlsys.constants (A100_MEM_BW, A100_FLOPS_FP16_TENSOR, TB, TFLOPs,
# │          second, BILLION, MILLION), mlsys.formatting (fmt, fmt_percent, check)
# │ Exports: a100_tflops_fp16_str, a100_bw_tbs_str, a100_ridge_str,
# │          resnet_ai_str, resnet_util_min_str, resnet_util_max_str,
# │          resnet_perf_tflops_str, bert_ai_b1_str, bert_perf_b1_str,
# │          bert_util_b1_str, utilization_peak_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import A100_MEM_BW, A100_FLOPS_FP16_TENSOR, TB, TFLOPs, second
from mlsys.constants import BILLION, MILLION
from mlsys.formatting import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class RooflineExamples:
    """
    Namespace for Roofline Analysis Examples (ResNet vs BERT).
    Scenario: Comparing compute-bound vs memory-bound workloads on A100.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # A100 Specs (re-derived locally for safety)
    peak_flops = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs/second)
    peak_bw = A100_MEM_BW.m_as(TB/second)
    ridge_point = peak_flops / peak_bw  # FLOPs/byte; TFLOPs/s over TB/s, the 1e12 factors cancel (~153)

    # ResNet (Compute Bound)
    resnet_ai = 300.0
    resnet_util_min = 85
    resnet_util_max = 90

    # BERT (Memory Bound at Batch=1)
    bert_flops_b = 22.0
    bert_weight_mb = 440.0
    bert_util_peak = 0.85

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: ResNet Performance
    resnet_perf_tflops = peak_flops * (resnet_util_max / 100.0)

    # Step 2: BERT Performance
    bert_ai_b1 = (bert_flops_b * BILLION) / (bert_weight_mb * MILLION)
    bert_perf_b1 = bert_ai_b1 * peak_bw
    bert_util_b1 = (bert_perf_b1 / peak_flops) * 100.0

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(resnet_ai > ridge_point, f"ResNet AI ({resnet_ai}) must be > Ridge ({ridge_point:.0f}) to be compute-bound.")
    check(bert_ai_b1 < ridge_point, f"BERT AI ({bert_ai_b1:.0f}) must be < Ridge ({ridge_point:.0f}) to be memory-bound.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    # A100 context
    a100_tflops_fp16_str = fmt(peak_flops, precision=0, commas=False)
    a100_bw_tbs_str = fmt(peak_bw, precision=1, commas=False)
    a100_ridge_str = fmt(ridge_point, precision=0, commas=False)

    # ResNet
    resnet_ai_str = fmt(resnet_ai, precision=0, commas=False)
    resnet_util_min_str = fmt(resnet_util_min, precision=0, commas=False)
    resnet_util_max_str = fmt(resnet_util_max, precision=0, commas=False)
    resnet_perf_tflops_str = fmt(resnet_perf_tflops, precision=0, commas=False)

    # BERT
    bert_ai_b1_str = fmt(bert_ai_b1, precision=0, commas=False)
    bert_perf_b1_str = fmt(bert_perf_b1, precision=0, commas=False)
    bert_util_b1_str = fmt(bert_util_b1, precision=0, commas=False)
    utilization_peak_pct_str = fmt_percent(bert_util_peak, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
a100_tflops_fp16_str = RooflineExamples.a100_tflops_fp16_str
a100_bw_tbs_str = RooflineExamples.a100_bw_tbs_str
a100_ridge_str = RooflineExamples.a100_ridge_str # Needed for next cell too
resnet_ai_str = RooflineExamples.resnet_ai_str
resnet_util_min_str = RooflineExamples.resnet_util_min_str
resnet_util_max_str = RooflineExamples.resnet_util_max_str
resnet_perf_tflops_str = RooflineExamples.resnet_perf_tflops_str
bert_ai_b1_str = RooflineExamples.bert_ai_b1_str
bert_perf_b1_str = RooflineExamples.bert_perf_b1_str
bert_util_b1_str = RooflineExamples.bert_util_b1_str
utilization_peak_pct_str = RooflineExamples.utilization_peak_pct_str
```

[^fn-flops-throughput]: **FLOPS (Floating-Point Operations Per Second)**: The gap between advertised peak FLOPS and achieved FLOPS is the central tension in hardware benchmarking. The A100 advertises `{python} a100_tflops_fp16_str` TFLOPS FP16 Tensor Core, but real workloads achieve 30--60% of peak depending on arithmetic intensity and memory access patterns. Reporting peak FLOPS without utilization context is the most common benchmarking distortion. \index{FLOPS!peak vs. achieved}

\index{ResNet-50!roofline analysis}
High-intensity operations such as dense matrix multiplication (typically >150 FLOPs/byte) achieve near-peak computational throughput on the A100. For example, a ResNet-50 forward pass on large batch sizes (256+) achieves arithmetic intensity of ~`{python} resnet_ai_str` FLOPs/byte, enabling `{python} resnet_util_min_str`–`{python} resnet_util_max_str`% of peak tensor performance (approximately `{python} resnet_perf_tflops_str` TFLOPS achieved vs. `{python} a100_tflops_fp16_str` TFLOPS theoretical) [@nvidia2020a100]. Conversely, low-intensity operations such as activation functions and other lightweight operations (<10 FLOPs/byte) become memory-bandwidth limited, using only a fraction of the GPU's computational capacity. When all data movement is considered (weights, KV cache, and activations), a BERT inference with batch size 1 achieves only ~`{python} bert_ai_b1_str` FLOPs/byte arithmetic intensity, limiting performance to ~`{python} bert_perf_b1_str` TFLOPS (`{python} a100_bw_tbs_str` TB/s $\times$ `{python} bert_ai_b1_str` FLOPs/byte), representing roughly `{python} bert_util_b1_str`% of peak computational capability. A simplified analysis considering only weight loading yields a higher estimate (~50 FLOPs/byte, as shown in the worked example below), illustrating how data movement assumptions significantly affect roofline predictions.

\index{BERT!roofline analysis at batch 1}
\index{Roofline Model!batch size effect on utilization}
This quantitative analysis, formalized in roofline models[^fn-roofline-model], guides both algorithm design and hardware selection by identifying the dominant performance constraint for a given workload. For instance, increasing batch size from 1 to 32 for transformer inference can shift operations from memory-bound to compute-bound, improving GPU utilization from `{python} bert_util_b1_str`% to `{python} utilization_peak_pct_str`% [@pope2023efficiently].

\index{Roofline Model!etymology and origin}
\index{Williams, Samuel!roofline model developer}

[^fn-roofline-model]: **Roofline Model**: Introduced by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley in 2009 [@williams2009roofline], named for the visual shape of its performance ceiling. The model's diagnostic power lies in a single number: the ridge point (peak FLOPS / peak bandwidth). Workloads below this threshold are memory-bound and cannot benefit from faster arithmetic; workloads above it are compute-bound and cannot benefit from faster memory. This diagnosis determines whether optimization effort should target the Data Term or the Compute Term of the Iron Law. \index{Roofline Model!origin}
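
The roofline bound itself is a one-line formula: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a minimal illustration, assuming A100 SXM figures of 312 TFLOPS (FP16 Tensor Core) and 2.039 TB/s taken from the public spec sheet; the `roofline` function name and the two example intensities (50 and 300 FLOPs/byte) are illustrative choices, and the result is an upper bound rather than a prediction of measured performance.

```python
def roofline(peak_tflops, mem_bw_tbs, arithmetic_intensity):
    """Return (attainable TFLOPS, regime) under the basic roofline model.

    arithmetic_intensity is in FLOPs/byte. Because compute is in TFLOPS and
    bandwidth in TB/s, the 1e12 factors cancel and no unit conversion is needed.
    """
    ridge = peak_tflops / mem_bw_tbs  # FLOPs/byte at which the two ceilings meet
    bandwidth_ceiling = arithmetic_intensity * mem_bw_tbs
    attainable = min(peak_tflops, bandwidth_ceiling)
    regime = "memory-bound" if arithmetic_intensity < ridge else "compute-bound"
    return attainable, regime


PEAK_TFLOPS = 312.0  # assumed A100 FP16 Tensor Core peak
MEM_BW_TBS = 2.039   # assumed A100 SXM (80 GB) memory bandwidth

for label, intensity in [("low intensity", 50.0), ("high intensity", 300.0)]:
    attainable, regime = roofline(PEAK_TFLOPS, MEM_BW_TBS, intensity)
    utilization = 100.0 * attainable / PEAK_TFLOPS
    print(f"{label} ({intensity:.0f} FLOPs/byte): {regime}, "
          f"at most {attainable:.0f} TFLOPS ({utilization:.0f}% of peak)")
```

Measured utilization for compute-bound workloads sits below this ceiling because the model ignores kernel launch overhead, imperfect tiling, and other effects that the roofline does not capture.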

\index{BERT!inference deployment prediction}
The following worked example applies *roofline analysis for BERT inference* to demonstrate how these principles translate into concrete deployment predictions.

```{python}
#| label: bert-roofline-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BERT ROOFLINE CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Roofline Analysis for BERT Inference" — the worked
# │          example applying roofline analysis to BERT-Base on A100
# │
# │ Goal: Demonstrate the complete roofline analysis workflow.
# │ Show: How batching shifts BERT from the memory-bound to compute-bound regime.
# │ How: Calculate arithmetic intensity and predict throughput across batch sizes.
# │
# │ Imports: mlsys (Models, Hardware), mlsys.constants (A100_FLOPS_FP16_TENSOR,
# │          TFLOPs, second, BILLION, MILLION, Mparam, Bparam, BYTES_FP32, MB, TB),
# │          mlsys.formatting (fmt_percent, fmt, check)
# │ Exports: bert_params_m_str, bert_flops_b_str, bert_weight_mb_str,
# │          bert_ai_b1_str, bert_perf_b1_str, bert_util_b1_str,
# │          bert_batch32_flops_str, bert_ai_b32_str, bert_ai_eq_str,
# │          bert_b32_flops_eq_str, bert_perf_b32_str, batch32_str,
# │          utilization_peak_str, utilization_peak_pct_str,
# │          a100_tflops_fp16_str, a100_bw_tbs_str, a100_ridge_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys import Models, Hardware
from mlsys.constants import (
    A100_FLOPS_FP16_TENSOR, TFLOPs, second, BILLION, MILLION,
    Mparam, Bparam, BYTES_FP32, MB, TB
)
from mlsys.formatting import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class BertRoofline:
    """
    Namespace for BERT Roofline Calculation.
    Scenario: Comparing Batch-1 (Memory Bound) vs Batch-32 (Shift to Compute).
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Model
    m_bert = Models.Language.BERT_Base
    params_m = m_bert.parameters.m_as(Mparam)
    flops_b_per_inf = m_bert.inference_flops.m_as(Bparam)
    weight_mb = m_bert.size_in_bytes(BYTES_FP32).m_as(MB)

    # Hardware (A100)
    h_a100 = Hardware.Cloud.A100
    peak_flops = h_a100.peak_flops.m_as(TFLOPs/second)
    peak_bw = h_a100.memory_bw.m_as(TB/second)
    ridge_point = h_a100.ridge_point().m_as('flop/byte')

    # Scenarios
    batch_1 = 1
    batch_32 = 32
    util_peak = 0.85

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: Batch 1
    ai_b1 = (flops_b_per_inf * BILLION) / (weight_mb * MILLION)
    perf_b1 = ai_b1 * peak_bw
    util_b1 = (perf_b1 / peak_flops) * 100.0

    # Step 2: Batch 32
    # Weights are loaded once per batch, so FLOPs scale with batch size
    # while bytes moved stay fixed: AI = (FLOPs/Inf * Batch) / Weights
    flops_b32 = flops_b_per_inf * batch_32
    ai_b32 = (flops_b32 * BILLION) / (weight_mb * MILLION)

    # Step 3: Is it compute bound now?
    is_compute_bound_b32 = ai_b32 > ridge_point

    # Step 4: Performance at Batch 32 (capped by compute if AI > Ridge)
    perf_b32 = peak_flops * util_peak if is_compute_bound_b32 else (ai_b32 * peak_bw)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(ai_b32 > ai_b1, "Batching must increase Arithmetic Intensity.")
    # The callout text states that batch 32 is compute-bound; fail loudly if not.
    check(is_compute_bound_b32, "Batch 32 must be compute-bound for the callout narrative to hold.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    bert_params_m_str = fmt(params_m, precision=0, commas=False)
    bert_flops_b_str = fmt(flops_b_per_inf, precision=0, commas=False)
    bert_weight_mb_str = fmt(weight_mb, precision=0, commas=False)

    bert_ai_b1_str = fmt(ai_b1, precision=0, commas=False)
    bert_perf_b1_str = fmt(perf_b1, precision=0, commas=False)
    bert_util_b1_str = fmt(util_b1, precision=0, commas=False)

    bert_batch32_flops_str = f"{flops_b32:.0f}"
    bert_ai_b32_str = fmt(ai_b32, precision=0, commas=False)

    bert_ai_eq_str = f"({flops_b_per_inf} $\\times$ $10^{{9}}$) ÷ ({weight_mb} $\\times$ $10^{{6}}$)"
    bert_b32_flops_eq_str = f"{flops_b_per_inf} $\\times$ $10^{{9}}$ $\\times$ {batch_32}"

    bert_perf_b32_str = fmt(perf_b32, precision=0, commas=False)

    batch32_str = str(batch_32)
    utilization_peak_str = f"{util_peak}"
    utilization_peak_pct_str = fmt_percent(util_peak, precision=0, commas=False)

    # Re-export A100 constants for this cell context
    a100_tflops_fp16_str = fmt(peak_flops, precision=0, commas=False)
    a100_bw_tbs_str = fmt(peak_bw, precision=1, commas=False)
    a100_ridge_str = fmt(ridge_point, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
bert_params_m_str = BertRoofline.bert_params_m_str
bert_flops_b_str = BertRoofline.bert_flops_b_str
bert_weight_mb_str = BertRoofline.bert_weight_mb_str
bert_ai_b1_str = BertRoofline.bert_ai_b1_str
bert_perf_b1_str = BertRoofline.bert_perf_b1_str
bert_util_b1_str = BertRoofline.bert_util_b1_str
bert_batch32_flops_str = BertRoofline.bert_batch32_flops_str
bert_ai_b32_str = BertRoofline.bert_ai_b32_str
bert_ai_eq_str = BertRoofline.bert_ai_eq_str
bert_b32_flops_eq_str = BertRoofline.bert_b32_flops_eq_str
bert_perf_b32_str = BertRoofline.bert_perf_b32_str
batch32_str = BertRoofline.batch32_str
utilization_peak_str = BertRoofline.utilization_peak_str
utilization_peak_pct_str = BertRoofline.utilization_peak_pct_str
a100_tflops_fp16_str = BertRoofline.a100_tflops_fp16_str
a100_bw_tbs_str = BertRoofline.a100_bw_tbs_str
a100_ridge_str = BertRoofline.a100_ridge_str

```

::: {.callout-notebook title="Roofline Analysis for BERT Inference"}

**Problem**: You need to deploy BERT-Base for inference on an A100 GPU. Management expects high GPU utilization. What performance should you predict, and how can you improve it?

#### Step 1: Hardware Limits {.unnumbered}

- Peak compute: `{python} a100_tflops_fp16_str` TFLOPS (FP16 Tensor Core)
- Memory bandwidth: `{python} a100_bw_tbs_str` TB/s
- Ridge point: `{python} a100_tflops_fp16_str` ÷ `{python} a100_bw_tbs_str` = `{python} a100_ridge_str` FLOPs/byte

Any workload with arithmetic intensity below `{python} a100_ridge_str` FLOPs/byte is memory-bound; above is compute-bound.

#### Step 2: BERT-Base Characteristics {.unnumbered}

- Parameters: `{python} bert_params_m_str` M (`{python} bert_weight_mb_str` MB in FP32)
- FLOPs per inference: ~`{python} bert_flops_b_str` billion (forward pass with sequence length $S=128$)
- Data movement: ~`{python} bert_weight_mb_str` MB (must load all weights from memory)
- Arithmetic intensity: `{python} bert_ai_eq_str` = `{python} bert_ai_b1_str` FLOPs/byte (weights-only model; see note in main text)

#### Step 3: Performance Prediction {.unnumbered}

Since `{python} bert_ai_b1_str` < `{python} a100_ridge_str`, BERT at batch=1 is **memory-bound**:

Achievable perf = `{python} bert_ai_b1_str` FLOPs/byte $\times$ `{python} a100_bw_tbs_str` TB/s = **`{python} bert_perf_b1_str` TFLOPS**

GPU utilization = `{python} bert_perf_b1_str` ÷ `{python} a100_tflops_fp16_str` = **`{python} bert_util_b1_str`%**

#### Step 4: Optimization via Batching {.unnumbered}

Increase batch size to `{python} batch32_str`:

- Same `{python} bert_weight_mb_str` MB of weights, but `{python} batch32_str`$\times$ more compute
- New FLOPs: `{python} bert_b32_flops_eq_str` = `{python} bert_batch32_flops_str`$\times$ $10^{9}$
- New intensity: (`{python} bert_batch32_flops_str` $\times$ $10^{9}$) ÷ (`{python} bert_weight_mb_str` $\times$ $10^{6}$) = `{python} bert_ai_b32_str` FLOPs/byte

Since `{python} bert_ai_b32_str` > `{python} a100_ridge_str`, batch=32 is **compute-bound**:

Achievable perf ≈ `{python} utilization_peak_str` $\times$ `{python} a100_tflops_fp16_str` = **`{python} bert_perf_b32_str` TFLOPS**

GPU utilization ≈ **`{python} utilization_peak_pct_str`%**

**The Systems Insight**: Batch size transforms memory-bound inference (`{python} bert_util_b1_str`% utilization) into compute-bound inference (`{python} utilization_peak_pct_str`% utilization). Batching, however, increases latency because you must wait to accumulate requests. This is the fundamental throughput-latency tradeoff that MLPerf scenarios capture: SingleStream (batch=1, latency-optimized) versus Offline (maximum batch, throughput-optimized).
:::
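
The batching argument in the callout can be sketched as a small sweep under the same weights-only model. The constants below are illustrative assumptions (roughly A100-class and BERT-Base-class figures, chosen so that batch 1 lands at 50 FLOPs/byte); they are not read from the chapter's `mlsys` registry, and the helper names are hypothetical.

```python
# Weights-only roofline sketch. All constants are illustrative assumptions.
PEAK_TFLOPS = 312.0    # assumed peak FP16 tensor throughput (TFLOP/s)
PEAK_BW_TBS = 2.0      # assumed memory bandwidth (TB/s)
FLOPS_PER_INF = 22e9   # assumed FLOPs per inference
WEIGHT_BYTES = 440e6   # assumed weight footprint in bytes


def arithmetic_intensity(batch: int) -> float:
    """FLOPs per byte when the weights are loaded once per batch."""
    return FLOPS_PER_INF * batch / WEIGHT_BYTES


def attainable_tflops(ai: float) -> float:
    """Roofline: the lower of the compute ceiling and the bandwidth slope."""
    return min(PEAK_TFLOPS, ai * PEAK_BW_TBS)


def crossover_batch() -> int:
    """Smallest batch whose intensity reaches the ridge point."""
    ridge = PEAK_TFLOPS / PEAK_BW_TBS
    batch = 1
    while arithmetic_intensity(batch) < ridge:
        batch += 1
    return batch


if __name__ == "__main__":
    ridge = PEAK_TFLOPS / PEAK_BW_TBS
    for b in (1, 2, 4, 8, 32):
        ai = arithmetic_intensity(b)
        bound = "compute" if ai >= ridge else "memory"
        print(f"batch={b:>2}  AI={ai:7.1f}  "
              f"{attainable_tflops(ai):6.1f} TFLOP/s  ({bound}-bound)")
```

Under these assumed constants the weights-only model reaches the ridge point at a small batch size. Counting activation traffic as well lowers the intensity at every batch size and moves the crossover to a larger batch, which is why the data movement assumptions matter.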

System benchmarks evaluate performance across scales, ranging from single-chip configurations to large distributed systems, and AI workloads including both training and inference tasks. This evaluation approach ensures that benchmarks accurately reflect real-world deployment scenarios and deliver insights that inform both hardware selection decisions and system architecture design. @fig-imagenet-gpus reveals the striking correlation between GPU adoption and ImageNet classification error rates from 2010 to 2014: as GPU entries surged from 0 to 110, error rates plummeted from 28.2% to 7.3%, demonstrating how hardware capabilities and algorithmic advances drive progress in tandem.

::: {#fig-imagenet-gpus fig-env="figure" fig-pos="htb" fig-cap="**GPU Adoption and Error Reduction**: As GPU entries in ImageNet surged from 0 to 110 between 2010 and 2014, top-5 error rates dropped from 28.2% to 7.3%, demonstrating the co-evolution of hardware capabilities and algorithmic advances." fig-alt="Dual-axis chart with blue line showing top-5 error rate declining from 28% to 7% and green bars showing GPU entries rising from 0 to 110 between 2010 and 2014."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]

\pgfplotsset{myaxis/.style={
  axis line style={draw=none},
  /pgf/number format/.cd,
  1000 sep={},
   width=105mm,
   height=60mm,
   axis lines=left,
   axis line style={thick,-latex},
    tick label style={/pgf/number format/assume math mode=true},
    yticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
    /pgf/number format/.cd, fixed, fixed zerofill, precision=0},
    xticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
    ylabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    xlabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    y tick style={draw=none},
    x tick style={draw=none,thin},
    tick align=outside,
    major tick length=1mm,
    title style={yshift=-4pt,font=\footnotesize\usefont{T1}{phv}{m}{n}},
    xmin=2009.5,xmax=2014.5,
    xtick={2010,2011,2012,2013,2014},
    }}
  %grid
\begin{axis}[myaxis,
    grid=both,
    major grid style={thin,black!60},
    minor tick num=1,
    ymin=0,    ymax=33,
    ytick={0,10,...,30},
    xticklabels={,,,,},
]
\end{axis}
%bar
\begin{axis}[myaxis,
    axis y line*=right,
    axis x line=none,
    ylabel={\color{green!60!black}\# of Entries Using GPUs},
    xlabel={Year},
    ymin=0,    ymax=133,
    ytick={0,25,...,125},
    every axis plot/.append style={
          ybar,
          bar width=0.55,
          bar shift=0pt,
          fill
        }]
      \addplot[draw=none]coordinates {(2010,0)};
      \addplot[draw=none]coordinates {(2011,0)};
      \addplot[green!60!black]coordinates {(2012,4)};
      \addplot[green!60!black]coordinates{(2013,60)};
      \addplot[green!60!black]coordinates{(2014,110)};
      %
\end{axis}
  %line
\begin{axis}[myaxis,
    ylabel={\color{blue!50!black}Top-5 Error Rate (\%)},
    xlabel={Year},
    ymin=0,    ymax=33,
    ytick={0,10,...,30},
    xticklabels={2010,2011,2012,2013,2014},
]
\addplot[
  mark=*,
  mark size=2pt,
  line width=1.5pt,
  draw=blue!50!black, %
  mark options={fill=red, draw=red} %
]
table[x=Year,y=Y, col sep=comma] {
Year,Y
  2010,28.2
  2011,25.8
  2012,16.4
  2013,11.7
  2014,7.3
};
\end{axis}
\end{tikzpicture}
```
:::

The ImageNet example demonstrates *how* hardware advances enable algorithmic breakthroughs. (We revisit this progression with model-specific architectural milestones in @sec-benchmarking-model-benchmarking-4847.) But effective system benchmarking requires understanding the relationship between workload characteristics and hardware utilization. Modern AI systems rarely achieve theoretical peak performance due to interactions between computational patterns, memory hierarchies, and system architectures. This gap between theoretical and achieved performance shapes how we design meaningful system benchmarks.

Realistic hardware utilization patterns are essential for actionable benchmark design. As the roofline analysis above demonstrated, GPU utilization varies dramatically with batch size and model architecture—from `{python} utilization_peak_pct_str`% for compute-bound workloads to `{python} bert_util_b1_str`% for memory-bound single-request inference. These patterns extend to memory bandwidth: utilization ranges from 20% for parameter-heavy transformer models to 90% for activation-heavy convolutional networks [@you2019scaling], directly impacting achievable performance across different precision levels.

Performance per watt varies by three orders of magnitude across platforms, making energy efficiency a key benchmark dimension. Underutilized GPUs consume disproportionate power relative to their output, creating efficiency penalties that affect both operational costs and environmental impact.
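
A rough sketch of why underutilization hurts efficiency, assuming a simple linear power model in which a device draws a fixed idle power plus a dynamic component proportional to utilization. The power and throughput figures are hypothetical, chosen only to show the shape of the effect, and `perf_per_watt` is an illustrative helper rather than a measured model.

```python
def perf_per_watt(utilization: float,
                  peak_tflops: float = 300.0,  # hypothetical peak throughput
                  idle_watts: float = 100.0,   # hypothetical idle power draw
                  max_watts: float = 400.0     # hypothetical full-load power
                  ) -> float:
    """TFLOP/s delivered per watt under an assumed linear power model."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    delivered = peak_tflops * utilization
    power = idle_watts + (max_watts - idle_watts) * utilization
    return delivered / power


if __name__ == "__main__":
    for u in (0.1, 0.5, 1.0):
        print(f"utilization={u:.0%}  {perf_per_watt(u):.3f} TFLOP/s per watt")
```

In this toy model, a device at 10% utilization delivers a tenth of its peak work while still drawing roughly a third of its full-load power, so efficiency falls much faster than throughput.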

Distributed system performance introduces additional complexity beyond single-machine evaluation. Multi-node training involves communication bottlenecks, network topology effects, and coordination overhead that single-node benchmarks cannot capture. For deployments spanning multiple machines, specialized distributed benchmarking methodologies, including scaling efficiency measurement and network performance profiling, become essential. These distributed benchmarking approaches are critical for scaling ML systems across multiple machines but require dedicated treatment beyond the scope of single-node evaluation.

Within the single-machine scope of this book, multi-GPU benchmarking focuses on intra-node communication patterns, memory bandwidth utilization across accelerators, and the efficiency of gradient synchronization within shared-memory systems. Modern workstations with 4-8 GPUs connected via NVLink or PCIe provide substantial parallelism while avoiding the network communication challenges that characterize multi-node deployments.

Effective system benchmarks must therefore evaluate performance across realistic utilization scenarios rather than peak theoretical capabilities, ensuring results translate to practical deployment guidance.

### Community-Driven Standardization {#sec-benchmarking-communitydriven-standardization-5c56}

The hardware utilization insights above are only useful for comparison when measured consistently, which requires community-driven standardization. When one team measures inference latency with preprocessing included and another excludes it, when accuracy benchmarks use different data splits, or when power measurements employ different system boundaries, meaningful comparison becomes impossible. Individual organizations cannot establish measurement standards alone; the proliferation of benchmarks across our three dimensions creates fragmentation that only coordinated effort can resolve.

The most successful benchmarks emerge through broad collaboration among academic institutions, industry partners, and domain experts. ImageNet's lasting impact demonstrates how sustained community engagement—through workshops, challenges, and open datasets—establishes authority that corporate-driven benchmarks rarely achieve. This collaborative development creates a foundation for formal standardization: IEEE working groups [@ieee_working_groups] and ISO/IEC technical committees [@iso_tc] codify community-developed methodologies into official standards (e.g., IEEE 2416-2019 [@ieee_2416_2019] for system power modeling), providing precise measurement specifications that enable reliable cross-institutional comparison. Projects that provide open-source reference implementations, containerized evaluation environments, and comprehensive validation suites further reduce barriers and ensure consistent interpretation across research groups.

ML benchmarks must balance academic rigor with industry practicality, since theoretical advances must translate to practical improvements in deployed systems [@patterson2021carbon]. Benchmarks that emerge from this balance, with transparent governance and regular evolution, become authoritative standards; those developed in isolation struggle to gain traction regardless of technical sophistication. These evaluation methodology principles guide both training and inference benchmark design throughout this chapter.

Community standards ensure reproducibility, but they do not prescribe the level of detail at which measurements should be taken. A benchmark could time a single matrix multiplication or an entire training run—and each choice reveals different kinds of information. The depth of measurement, from individual operations to complete systems, determines what insights benchmarks can provide and which problems they can diagnose.

## Benchmarking Granularity {#sec-benchmarking-benchmarking-granularity-3855}

A GPU kernel that runs 3$\times$ faster in isolation may deliver zero end-to-end speedup if the data pipeline cannot keep pace. This diagnostic failure illustrates a fundamental design question: at what level of detail should evaluation occur? Standardization answers "how do we measure consistently?" while granularity answers "what exactly do we measure?" Each validation dimension can be assessed at different scales, from individual operations to complete workflows, with each granularity level revealing different kinds of problems:

\index{Micro Benchmarks!component isolation}
\index{Macro Benchmarks!subsystem evaluation}
\index{End-to-End Benchmarks!complete workflow measurement}
- **Micro benchmarks** isolate individual components: kernel execution time, memory bandwidth utilization, single-layer accuracy. These diagnose *where* problems occur.
- **Macro benchmarks** evaluate subsystems: full model training convergence, inference pipeline throughput, dataset bias metrics. These reveal *what* problems exist.
- **End-to-end benchmarks** measure complete workflows: request-to-response latency including preprocessing, training time-to-accuracy including data loading, model performance on production data distributions. These show *whether* the system works.

The optimization techniques from Part III operate at different granularities (kernel fusion targets micro performance, pruning affects macro model behavior, data curation determines end-to-end generalization) and validation must match. A micro benchmark might show kernel speedup while a macro benchmark reveals memory bottlenecks that negate the gain; an end-to-end benchmark might expose data pipeline stalls invisible at any other level.
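
The mismatch between granularity levels can be sketched with a toy pipeline model. It assumes data loading and accelerator compute overlap perfectly, so the steady-state step time is set by the slower stage; the stage timings and function names are hypothetical.

```python
def step_time_ms(data_ms: float, compute_ms: float) -> float:
    """Steady-state step time when loading and compute fully overlap."""
    return max(data_ms, compute_ms)


def end_to_end_speedup(data_ms: float, compute_ms: float,
                       kernel_speedup: float) -> float:
    """End-to-end speedup after accelerating only the compute stage."""
    before = step_time_ms(data_ms, compute_ms)
    after = step_time_ms(data_ms, compute_ms / kernel_speedup)
    return before / after


if __name__ == "__main__":
    # Same 3x kernel speedup, three different data pipelines.
    for data_ms in (10.0, 20.0, 30.0):
        s = end_to_end_speedup(data_ms, compute_ms=30.0, kernel_speedup=3.0)
        print(f"data={data_ms:4.0f} ms  end-to-end speedup = {s:.2f}x")
```

The micro benchmark reports the same 3$\times$ kernel gain in all three cases; only the end-to-end measurement shows how much of it survives once the data pipeline becomes the limiting stage.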

To visualize how these granularity levels map onto actual ML systems, @fig-granularity breaks down the stack into four distinct evaluation scopes. Notice how each scope progressively expands the measurement boundary: micro-benchmarks isolate neural network layers, macro-benchmarks encompass complete models, application benchmarks add supporting compute, and end-to-end benchmarks capture the full deployment context including non-AI components.

::: {#fig-granularity fig-env="figure" fig-pos="htb" fig-cap="**Benchmarking Granularity**: Four-panel block diagram showing micro, model, application, and end-to-end evaluation layers. Each panel maps a distinct scope of assessment, from isolated kernel operations through full-system deployment, enabling targeted optimization at every level of the ML stack." fig-alt="Block diagram showing four evaluation scopes from left to right: a grid of neural network layer nodes, two model blocks, an AI task with supporting compute, and an end-to-end application with AI and non-AI compute nodes, connected by dashed lines."}
```{.tikz}
\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}\small]
 \tikzset{%
 Box/.style={
 node distance=1.25,
    inner xsep=2pt,
    draw=RedLine,
    line width=0.75pt,
    fill=RedL!20,
    align=flush center,
    text width=20mm,
    minimum width=20mm, minimum height=9mm
  },
   Box2/.style={Box, draw=BlueLine,fill=BlueL!20},
   Box3/.style={Box, draw=GreenLine,fill=GreenL!40},
   Box4/.style={Box, draw=BrownLine,fill=BrownL!40,text width=24mm,},
 %
  Line/.style={line width=0.35pt,black!60,text=black},
  LineD/.style={line width=0.75pt,black!60,text=black,dashed,dash pattern=on 3pt off 2pt},
  Line2/.style={line width=0.85pt,black!60,text=black,-{Latex[length=6pt, width=4pt]}}
}

  \def\rows{3}
  \def\cols{4}
  \def\r{0.35}
  \def\xgap{0.15}
  \def\ygap{0.5}

\begin{scope}[local bounding box=CEN,shift={($(0,0)+(0,0)$)}]
  \foreach \i [count=\c] in {0,...,\numexpr\rows-1} {
    \foreach \j  in {0,...,\numexpr\cols-1} {
      \pgfmathtruncatemacro{\newX}{\j + 1} %
      % Circle position
      \pgfmathsetmacro\x{\j*(2*\r + \xgap)}
      \pgfmathsetmacro\y{-\i*(2*\r + \ygap)}
\definecolor{cellcol}{RGB}{253,226,240}
     \node[fill=VioletLine!60,circle,minimum size=\r](C\c\newX)at(\x,-\y){};
    }
  }
\foreach \a/\b in {C3/C2, C2/C1}{%
  \foreach \i in {1,2,3,4}{%
    \foreach \j in {1,2,3,4}{%
      \draw[Line] (\a\i) -- (\b\j);
    }%
  }%
}
\end{scope}
\scoped[on background layer]
\node[outer sep=0pt,draw=BackLine,inner xsep=3mm,inner ysep=2mm,yshift=0mm,
           fill=BackColor!20,fit=(C11)(C34),line width=0.75pt](BB1){};
\node[above=2pt of  BB1.north,anchor=south]{ML Layers};
%%%
\node[Box,below right =0.19 and 1 of BB1.north east](MA){Model A};
\node[Box,above right =0.19 and 1 of BB1.south east](MB){Model B};
\scoped[on background layer]
\node[outer sep=0pt,draw=BackLine,inner xsep=3mm,inner ysep=2mm,yshift=0mm,
           fill=BackColor!20,fit=(MA)(MB),line width=0.75pt](BB2){};
\node[above=2pt of  BB2.north,anchor=south]{ML Model};
%
\node[Box2,right = of MA](T1){AI Task 1};
\node[Box2, right = of MB](T2){Supporting Compute};
\scoped[on background layer]
\node[outer sep=0pt,draw=BackLine,inner xsep=3mm,inner ysep=2mm,yshift=0mm,
           fill=BackColor!20,fit=(T1)(T2),line width=0.75pt](BB3){};
\node[above=2pt of  BB3.north,anchor=south]{AI Task};
%
\node[Box3,right = of T1](CN1){AI Compute Node};
\node[Box4,right = 0.75 of CN1](NA1){Non-AI Compute Node};
\node[Box3, right = of T2](CN2){AI Compute Node};
\node[Box4,right = 0.75 of CN2](NA2){Non-AI Compute Node};
\scoped[on background layer]
\node[outer sep=0pt,draw=BackLine,inner xsep=3mm,inner ysep=2mm,yshift=0mm,
           fill=BackColor!20,fit=(CN1)(NA2),line width=0.75pt](BB4){};
\node[above=2pt of  BB4.north,anchor=south]{End-to-End Application};
%
\draw[Line2](MA)--(MB);
\draw[Line2](T1)--(T2);
\draw[Line2](CN1)--(NA1);
\draw[Line2](CN2)--(NA2);
\draw[Line2](NA1)--++(0,-0.8)-|(CN2);
\draw[LineD](BB1.north east)--(MA.170);
\draw[LineD](BB1.south east)--(MA.190);
\draw[LineD](BB2.north east)--(T1.170);
\draw[LineD](BB2.south east)--(T1.190);
\draw[LineD](BB3.north east)--(CN1.170);
\draw[LineD](BB3.south east)--(CN2.190);
\end{tikzpicture}
```
:::

### Micro Benchmarks {#sec-benchmarking-micro-benchmarks-7f5a}

\index{Micro Benchmarks!diagnostic purpose}
While end-to-end benchmarks reveal overall system behavior, optimization requires pinpointing exactly which operations consume time and energy. Micro-benchmarks serve this diagnostic purpose by isolating individual tensor operations, the mathematical primitives whose hardware optimization we examined in @sec-hardware-acceleration.

Consider debugging a slow inference pipeline: macro benchmarks might show unacceptable latency, but only micro-benchmarks reveal whether the bottleneck lies in convolutions, attention mechanisms, or memory copies. This diagnostic precision makes micro-benchmarks essential for the targeted optimization that transforms theoretical hardware capabilities into realized performance gains. These benchmarks isolate individual tasks to provide detailed insights into the computational demands of particular system elements, from neural network layers to optimization techniques to activation functions.

A key area of micro-benchmarking focuses on tensor operations (see @sec-hardware-acceleration for the hardware paths that accelerate them), which are the computational core of deep learning. Libraries like cuDNN [@chetlur2014cudnn][^fn-cudnn] by NVIDIA provide benchmarks for measuring core computations such as convolutions and matrix multiplications across different hardware configurations. These measurements help developers understand how their hardware handles the core mathematical operations that dominate ML workloads.

[^fn-cudnn]: **cuDNN (CUDA Deep Neural Network Library)**: Released by NVIDIA in 2014, cuDNN provides hand-tuned kernel implementations for convolutions, pooling, and normalization that often achieve 2--5$\times$ speedups over naive CUDA implementations. The benchmarking implication: reported inference latencies depend heavily on which cuDNN version and algorithm autotuner settings were used, making cuDNN version a mandatory element of any reproducible benchmark specification. \index{cuDNN!benchmarking dependency}

Measuring these operations correctly requires discipline. The following *micro-benchmarking rules* prevent common measurement errors that can invalidate results entirely.

::: {.callout-perspective title="Micro-Benchmarking Rules"}

To avoid measuring hardware artifacts instead of kernel performance, follow the **Systems Detective's Rules**:

\index{DVFS!warm-up measurement artifact}
1.  **The Warm-up Rule**: Never measure the first 10–50 iterations. Modern hardware uses **DVFS (Dynamic Voltage and Frequency Scaling)** and **Turbo Boost**. A "cold" GPU may take 100 ms to ramp from 300 MHz to 1.5 GHz. Your first batch will appear 5$\times$ slower than reality.
2.  **The Variance Rule**: Report the **Coefficient of Variation (CV)** ($CV = \sigma / \mu$). If $CV > 0.05$ (5%), your measurement is noisy. This usually indicates background OS jitter, thermal throttling\index{Thermal Throttling!impact on sustained perf}, or memory contention.
\index{Speed of Light Check!utilization diagnosis}
3.  **The "Speed of Light" (SOL) Check**: Compare your achieved throughput against the roofline. If your kernel achieves 10 TFLOPS on an H100 (peak ~`{python} h100_tflops_fp16_str` TFLOPS FP16, or ~`{python} h100_tflops_fp8_str` TFLOPS FP8 dense), do not just optimize the code; ask *why* the utilization is so low. Is it a kernel launch latency issue (too many small kernels)?
4.  **The Flush Rule**: When measuring memory bandwidth, ensure you flush the L2 cache between runs, or your "bandwidth" will reflect cache speed (~5–10 TB/s) rather than DRAM speed (~1–2 TB/s).

:::
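
The Warm-up and Variance rules can be sketched as a small timing harness. This is a minimal, CPU-side illustration using only the standard library; the `benchmark` helper and its default iteration counts are assumptions, not a prescribed tool. For GPU kernels the timed callable would also need to wait for device completion before the timer stops (for example, by calling `torch.cuda.synchronize()` in PyTorch), since kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 20, iters: int = 100) -> dict:
    """Time fn after discarding warm-up runs; report mean, stdev, and CV."""
    if iters < 2:
        raise ValueError("need at least 2 timed iterations to compute a CV")
    for _ in range(warmup):      # Warm-up Rule: early iterations are discarded
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    cv = stdev / mean if mean > 0 else float("inf")
    # Variance Rule: flag the run as noisy when CV exceeds 5%
    return {"mean_s": mean, "stdev_s": stdev, "cv": cv, "noisy": cv > 0.05}


if __name__ == "__main__":
    result = benchmark(lambda: sum(i * i for i in range(50_000)))
    print(f"mean={result['mean_s'] * 1e3:.3f} ms  "
          f"CV={result['cv']:.1%}  noisy={result['noisy']}")
```

A run flagged as noisy is a prompt to investigate the environment (background processes, thermal state, memory contention) before trusting the mean.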

With these measurement principles established, we can now examine how to diagnose specific bottlenecks by *measuring the Iron Law terms* from the framework introduced in @sec-hardware-acceleration, which decomposes execution time into data movement, compute throughput, and latency overhead.

::: {.callout-notebook title="Measuring the Iron Law Terms"}
**From Theory to Trace**: How to map the **Iron Law** equation (from @sec-hardware-acceleration) to a profiler timeline (like Nsight Systems or PyTorch Profiler).

**1. Measuring the Data Term ($\frac{D_{vol}}{BW}$)**

*   **Signal:** Look for the **"Memory Throughput"** or **"DRAM Bandwidth"** line.
*   **Calculation:** $\text{Effective BW} = \frac{\text{Total Bytes Transferred}}{\text{Kernel Duration}}$.
*   **Diagnosis:** If $\text{Effective BW} \approx \text{Peak BW}$ (e.g., >1.6 TB/s on A100), your kernel is **Memory Bound**. Optimizing compute (Ops) will do nothing.

**2. Measuring the Throughput Term ($\eta$)**

*   **Signal:** Look for **"SM Active"** or **"Compute Throughput"**.
*   **Calculation:** $\text{Achieved TFLOPS} = \frac{\text{FLOP Count}}{\text{Kernel Duration}}$.
*   **Diagnosis:** If $\text{Achieved TFLOPS} \ll \text{Peak TFLOPS}$ AND $\text{Memory BW} \ll \text{Peak BW}$, you are in the **"Utilization Trap"**: likely Latency Bound (kernels too small) or Grid Bound (not enough threads).

**3. Measuring the Latency Term ($L_{lat}$)**

*   **Signal:** Look for **Gaps** (empty space) between colored kernel bars on the timeline.
*   **Calculation:** $\text{Overhead Ratio} = \frac{\text{Gap Duration}}{\text{Kernel Duration} + \text{Gap Duration}}$.
\index{Operator Fusion!latency gap elimination}
\index{Dispatch Overhead Reduction!kernel launch batching}
*   **Diagnosis:** A "Sawtooth" pattern (Compute, Gap, Compute, Gap) indicates high software overhead. You need **Operator Fusion** (@sec-ml-frameworks) or **CUDA Graphs** to remove the gaps.
:::
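
The three calculations in the callout can be collected into a short sketch that takes numbers read off a profiler timeline. The function names, the 80% "near peak" cutoff, and the 30% overhead cutoff are hypothetical choices for illustration; real diagnoses depend on the workload and the profiler's own counters.

```python
def effective_bw_tbs(bytes_transferred: float, kernel_s: float) -> float:
    """Data term: bytes moved per second of kernel time, in TB/s."""
    return bytes_transferred / kernel_s / 1e12


def achieved_tflops(flop_count: float, kernel_s: float) -> float:
    """Throughput term: FLOPs per second of kernel time, in TFLOP/s."""
    return flop_count / kernel_s / 1e12


def overhead_ratio(gap_s: float, kernel_s: float) -> float:
    """Latency term: share of the timeline spent in gaps between kernels."""
    return gap_s / (kernel_s + gap_s)


def diagnose(bw_tbs: float, tflops: float, overhead: float,
             peak_bw_tbs: float, peak_tflops: float,
             near_peak: float = 0.8,        # assumed cutoff
             high_overhead: float = 0.3     # assumed cutoff
             ) -> str:
    """Coarse classification from the three measured terms."""
    if overhead >= high_overhead:
        return "latency-bound"      # sawtooth timeline: fuse ops or batch launches
    if bw_tbs >= near_peak * peak_bw_tbs:
        return "memory-bound"       # faster arithmetic will not help
    if tflops >= near_peak * peak_tflops:
        return "compute-bound"
    return "underutilized"          # neither ceiling is close: the utilization trap


if __name__ == "__main__":
    bw = effective_bw_tbs(bytes_transferred=1.7e9, kernel_s=1e-3)
    fl = achieved_tflops(flop_count=4e10, kernel_s=1e-3)
    oh = overhead_ratio(gap_s=0.05e-3, kernel_s=1e-3)
    print(diagnose(bw, fl, oh, peak_bw_tbs=2.0, peak_tflops=312.0))
```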

While benchmarks like MLPerf reveal *how fast* a system is, micro-benchmarking tools reveal *why* it is slow. To perform this diagnosis, engineers use kernel-level profilers that peer inside the execution of individual operations.

#### Framework Profilers {.unnumbered}

\index{Framework Profiler!performance analysis tool}
Tools like PyTorch Profiler capture the "logical" execution flow:

*   Which layer is taking the most time?
*   Are CPU and GPU synchronized or overlapped?
*   Is the data loader keeping up?

The key metric is step time breakdown (data loading vs. compute vs. communication).
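
As a framework-agnostic illustration of a step time breakdown, the sketch below accumulates wall-clock time per named phase. The `StepBreakdown` class is a hypothetical stand-in, not part of any profiler; a real framework profiler collects this automatically and accounts for asynchronous GPU execution, which plain wall-clock timing does not.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepBreakdown:
    """Accumulates wall-clock seconds per named phase of a training step."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self) -> dict:
        """Share of total measured time spent in each phase."""
        total = sum(self.totals.values())
        if total <= 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


if __name__ == "__main__":
    breakdown = StepBreakdown()
    for _ in range(3):
        with breakdown.phase("data_loading"):
            time.sleep(0.01)   # placeholder for fetching a batch
        with breakdown.phase("compute"):
            time.sleep(0.02)   # placeholder for forward/backward pass
    for name, frac in breakdown.fractions().items():
        print(f"{name:>12}: {frac:.0%}")
```

A large data-loading share in such a breakdown suggests the loader is not keeping up, which is the kind of question the framework profiler is meant to answer first.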

#### Kernel Profilers {.unnumbered}

\index{Kernel Profiler!hardware-level analysis}
Tools like NVIDIA Nsight Systems and Nsight Compute capture "physical" execution on the hardware:

*   Is this matrix multiplication compute-bound or memory-bound?
*   Is the system hitting 100% occupancy on the Streaming Multiprocessors?
*   Are memory coalescing rules being respected?

The key metric is roofline analysis (FLOPS vs. memory bandwidth).

The recommended workflow is to start with the Framework Profiler to find the slow layer (e.g., "The Attention Block is slow"). Then, use the Kernel Profiler to diagnose the physics (e.g., "The Softmax kernel is memory-bound because it is reading too many bytes per FLOP"). This targeted approach avoids the "optimization without measurement" trap.

Micro-benchmarks also examine activation functions and neural network layers in isolation. This includes measuring the performance of various activation functions like ReLU, Sigmoid, and Tanh under controlled conditions, and evaluating the computational efficiency of distinct neural network components such as LSTM cells or Transformer blocks when processing standardized inputs.

\index{Micro-Benchmark Suite!operator-level testing}
DeepBench [@deepbench_github], developed by Baidu, was one of the first suites to demonstrate the value of comprehensive micro-benchmarking. It evaluates core tensor operations across different hardware platforms, providing detailed performance data that helps developers optimize their deep learning implementations. By isolating and measuring individual operations, DeepBench enables precise comparison of hardware platforms and identification of potential performance bottlenecks.

These granular measurements enable precise optimization, but they cannot reveal how components interact when assembled into complete models. Macro-benchmarks address this gap.

### Macro Benchmarks {#sec-benchmarking-macro-benchmarks-4283}

Micro-benchmarks confirm that individual convolution kernels run fast; macro-benchmarks test whether the complete model works under realistic conditions. This shift from component-level to model-level assessment exposes how architectural choices and component interactions affect overall model behavior. For instance, while micro-benchmarks might show optimal performance for individual convolutional layers, macro-benchmarks show how these layers work together within a complete convolutional neural network.

Macro-benchmarks measure multiple performance dimensions that emerge only at the model level. These include prediction accuracy, which shows how well the model generalizes to new data; memory consumption patterns across different batch sizes and sequence lengths; throughput under varying computational loads; and latency across different hardware configurations. Understanding these metrics helps developers make informed decisions about model architecture, optimization strategies, and deployment configurations.

\index{ImageNet!macro benchmark reference}
The assessment of complete models occurs under standardized conditions using established datasets and tasks. For example, computer vision models might be evaluated on ImageNet [@imagenet_website], measuring both computational efficiency and prediction accuracy. Natural language processing models might be assessed on translation tasks, examining how they balance quality and speed across different language pairs.

Several industry-standard benchmarks enable consistent model evaluation across platforms. The MLPerf family (Inference, Mobile, Client, and Tiny) provides comprehensive testing suites adapted for computational environments from datacenter to microcontroller, detailed in @sec-benchmarking-mlperf-inference-benchmarks-e878. For embedded systems, EEMBC's MLMark emphasizes both performance and power efficiency, while the AI-Benchmark [@ai_benchmark_website] suite specializes in mobile platforms.

### End-to-End Benchmarks {#sec-benchmarking-endtoend-benchmarks-51bb}

End-to-end benchmarks provide the most inclusive evaluation by encompassing the entire pipeline of an AI system, not just the model. This includes ETL (Extract-Transform-Load) data processing, model inference, post-processing of results, and critical infrastructure components like storage and network systems.

Data processing (extracting from source systems, transforming through cleaning and feature engineering, and loading into model-ready formats) forms the foundation of the pipeline. These preprocessing steps directly affect overall performance, so end-to-end benchmarks must run standardized datasets through the complete pipeline to expose cases where data preparation becomes the bottleneck. Post-processing similarly affects real-world performance: a computer vision system must refine detection boundaries, apply confidence thresholds, and format results for downstream applications before the user sees a response.

Infrastructure components heavily influence overall performance beyond the AI workload itself. Storage solutions can dominate data retrieval times with large AI datasets, and network interactions in distributed systems can become performance bottlenecks. End-to-end benchmarks must evaluate these components under specified environmental conditions to ensure reproducible measurements of the entire system.
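
One way to make the pipeline view concrete is to time every stage separately and report each stage's share of total request time. The sketch below assumes three hypothetical stages (`preprocess`, `infer`, `postprocess`) with toy implementations and leaves out storage and network entirely; a production harness would instrument those as additional stages in the same way.

```python
import time
from collections import defaultdict


def preprocess(raw):
    """Hypothetical transform step: scale raw readings by their peak value."""
    peak = max(raw) or 1.0
    return [value / peak for value in raw]


def infer(features):
    """Hypothetical model: the mean feature value serves as a score."""
    return sum(features) / len(features)


def postprocess(score, threshold=0.5):
    """Apply a confidence threshold and format the result for consumers."""
    label = "anomaly" if score > threshold else "normal"
    return {"label": label, "score": round(score, 3)}


STAGES = (("preprocess", preprocess), ("infer", infer), ("postprocess", postprocess))


def run_pipeline(raw, timings):
    """Run one request through every stage, accumulating per-stage time."""
    data = raw
    for name, stage in STAGES:
        start = time.perf_counter()
        data = stage(data)
        timings[name] += time.perf_counter() - start
    return data


timings = defaultdict(float)
requests = [[float((i + j) % 50) for j in range(2_000)] for i in range(100)]
outputs = [run_pipeline(raw, timings) for raw in requests]

total = sum(timings.values())
shares = {name: elapsed / total for name, elapsed in timings.items()}
for name, share in shares.items():
    print(f"{name:12s} {share:6.1%} of pipeline time")
```

Reporting shares rather than only the end-to-end total keeps some diagnostic power at the system level, since a stage that dominates the total is the first candidate for a focused micro-benchmark.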

To date, there are no public, end-to-end benchmarks that fully account for data storage, network, and compute performance. While MLPerf Training and Inference approach end-to-end evaluation, they primarily focus on model performance rather than real-world deployment scenarios. Nonetheless, they provide valuable baseline metrics for assessing AI system capabilities.

Given the inherent specificity of end-to-end benchmarking, organizations typically perform these evaluations internally by instrumenting production deployments. The sensitivity of these measurements means they rarely appear publicly, but their absence from the literature does not diminish their importance.

### Granularity Trade-offs and Selection Criteria {#sec-benchmarking-granularity-tradeoffs-selection-criteria-b7d9}

@tbl-benchmark-comparison reveals how different challenges emerge at different stages of an AI system's lifecycle. Each benchmarking approach provides unique insights: micro-benchmarks help engineers optimize specific components like GPU kernel implementations or data loading operations, macro-benchmarks guide model architecture decisions and algorithm selection, while end-to-end benchmarks reveal system-level bottlenecks in production environments.

| **Component**   | **Micro Benchmarks**                                      | **Macro Benchmarks**                                   | **End-to-End Benchmarks**                              |
|:----------------|:----------------------------------------------------------|:-------------------------------------------------------|:-------------------------------------------------------|
| **Focus**       | Individual operations                                     | Complete models                                        | Full system pipeline                                   |
| **Scope**       | Tensor ops, layers, activations                           | Model architecture, training, inference                | ETL, model, infrastructure                             |
| **Example**     | Conv layer performance on cuDNN                           | ResNet-50 on ImageNet                                  | Production recommendation system                       |
| **Advantages**  | Precise bottleneck identification, Component optimization | Model architecture comparison, Standardized evaluation | Realistic performance assessment, System-wide insights |
| **Challenges**  | May miss interaction effects                              | Limited infrastructure insights                        | Complex to standardize, Often proprietary              |
| **Typical Use** | Hardware selection, Operation optimization                | Model selection, Research comparison                   | Production system evaluation                           |

: **Benchmarking Granularity Levels.** Different benchmark scopes target distinct stages of ML system development. Micro-benchmarks isolate individual operations for low-level optimization, macro-benchmarks evaluate complete models to guide architectural choices, and end-to-end benchmarks assess full system performance in production environments. {#tbl-benchmark-comparison}

Why not just pick one granularity level and stick with it? Because a core tension exists between diagnostic precision and real-world fidelity. @fig-benchmark-tradeoffs maps this trade-off, placing micro-benchmarks at the high-isolation end (precise but narrow) and end-to-end benchmarks at the high-representativeness end (realistic but harder to diagnose). Notice that no single point on this spectrum provides both: micro-benchmarks pinpoint exactly which kernel is slow but miss system-level bottlenecks, while end-to-end benchmarks capture production behavior but obscure root causes. The practical takeaway is that effective ML system evaluation requires combining insights from all three levels.

::: {#fig-benchmark-tradeoffs fig-env="figure" fig-pos="htb" fig-cap="**Isolation vs. Representativeness**: The core trade-off in benchmarking granularity. Micro-benchmarks provide high diagnostic precision but limited real-world relevance, while end-to-end benchmarks capture realistic system behavior but offer less precise component-level insights. Effective ML system evaluation requires strategic combination of all three levels." fig-alt="Scatter plot with three labeled points along diagonal: micro-benchmarks at high isolation, macro-benchmarks at medium, and end-to-end benchmarks at high representativeness."}
```{.tikz}
\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}\small]
\tikzset{%
  axis/.style={-latex,thick,black},
  grid/.style={very thin,gray!40},
  point/.style={circle,fill,inner sep=1.75pt}
}
% Draw axes
\draw[axis] (0,0) -- node[pos=.5, sloped, above=17pt]{\footnotesize Isolation / Diagnostic Power} (0,5);
\draw[axis] (0,0) --node[below=10pt] {\footnotesize Real-World Representativeness} (8,0) ;
% Draw grid lines
\foreach \x in {1,2,3,4,5,6,7}
  \draw[grid] (\x,0) -- (\x,4.75);
\foreach \y in {1,2,3,4}
  \draw[grid] (0,\y) -- (7.25,\y);
% Add trend line (passes through all three points)
\draw[dashed,thick,gray!60] (1.5,4) -- (4.5,1);
% Draw benchmark points
\node[point,color=RedLine] (micro) at (1.5,4) {};
\node[point,color=BlueLine] (macro) at (3,2.5) {};
\node[point,color=GreenD] (endtoend) at (4.5,1) {};
% Add labels
\node[right=4pt ,color=RedLine] at (micro) {\footnotesize \textbf{Micro-benchmarks}};
\node[right=4pt ,color=BlueLine] at (macro) {\footnotesize \textbf{Macro-benchmarks}};
\node[right=4pt ,color=GreenD] at (endtoend) {\footnotesize \textbf{End-to-End benchmarks}};
% Add axis labels
\node[rotate=90] at (-0.3,4.5) {\footnotesize High};
\node[rotate=90] at (-0.3,0.45) {\footnotesize Low};
\node[below] at (0.5,-0.0) {\footnotesize Low};
\node[below] at (7.5,-0.0) {\footnotesize High};
\end{tikzpicture}
```
:::

Component interaction often produces unexpected behaviors that single-level benchmarks miss. While micro-benchmarks might show excellent performance for individual operations and macro-benchmarks might demonstrate strong model accuracy, end-to-end evaluation can reveal that data preprocessing creates unexpected bottlenecks during high-traffic periods. These system-level insights remain hidden when components undergo isolated testing.

Choosing a granularity level, however, is only half the design problem. The other half is specifying the concrete ingredients every benchmark requires: what task does it evaluate, on what data, using what model, and measured by what metrics? Without answers to these questions, even the right granularity level produces meaningless numbers. The components of a benchmark determine whether results translate into actionable engineering insight or merely generate impressive-looking numbers that collapse under scrutiny.

## Benchmark Components {#sec-benchmarking-benchmark-components-97cc}

Choosing between micro, macro, and end-to-end granularity determines what a benchmark can diagnose, but every benchmark at every granularity must still answer the same implementation questions: what task are we measuring, on what data, with which model, against which metrics, and under what rules? Micro-benchmarks require synthetic inputs that isolate specific computational patterns; macro-benchmarks demand representative datasets like ImageNet; end-to-end benchmarks must incorporate real-world data with all its noise and distributional shift. Despite this variation, all benchmarks share common implementation components that enable consistent evaluation.

The essential components interconnect to form a complete evaluation pipeline. Study the workflow in @fig-benchmark-components carefully: each stage—task definition, dataset selection, model selection, and evaluation metrics—feeds directly into the next, creating a chain where decisions made early constrain every downstream choice.

::: {#fig-benchmark-components fig-env="figure" fig-pos="htb" fig-cap="**Anomaly Detection Pipeline**: Nine-stage benchmark workflow applied to an industrial audio anomaly detection task. The pipeline progresses from problem definition through dataset selection, model training, quantization, and ARM embedded deployment, illustrating how each benchmark component feeds the next." fig-alt="Workflow diagram showing nine stages from problem definition through deployment, with detailed views of anomaly detection system, model training, quantization, and ARM embedded implementation."}
```{.tikz}
\begin{tikzpicture}[line cap=round,line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={align=center,outer sep=0pt ,
    inner xsep=2pt,
    node distance=0.45,
    draw=GreenLine,
    line width=0.75pt,
    fill=GreenL!60,
    text width=32mm,
    minimum width=17mm, minimum height=11mm
  },
   Box2/.style={Box, fill=BrownL!60,draw=BrownLine},
   Box3/.style={Box, fill=RedL!60,draw=RedLine},
   Box4/.style={Box, fill=GreenD,  text width=3mm,minimum width=3mm, minimum height=22mm,draw=none},
   Box5/.style={Box, fill=red,  text width=5mm,minimum width=5mm, minimum height=5mm,draw=none},
   Box6/.style={Box, fill=BrownL!70,text width=17mm,minimum width=17mm, minimum height=9mm,draw=none},
   Box7/.style={Box6, fill=magenta!20},
   Box8/.style={Box6, fill=magenta!20,minimum width=27mm, minimum height=18mm},
   Box9/.style={Box, node distance=0.2,fill=white,text width=22mm,minimum width=22mm,
                        minimum height=14mm,draw=none,font=\usefont{T1}{phv}{m}{n}\small},
   Trap/.style={trapezium, trapezium stretches = true, fill=GreenD,draw=none,
   minimum width=15mm,minimum height=10mm, draw=none, thick,rotate=270},
Line/.style={violet!50, line width=1.1pt,shorten <=1pt,shorten >=2pt},
LineA/.style={violet!50,line width=1.0pt,{-{Triangle[width=1.1*4pt,length=1.5*6pt]}},shorten <=1pt,shorten >=1pt},
ALine/.style={black!50, line width=1.1pt,{{Triangle[width=0.9*6pt,length=1.2*6pt]}-}},
Larrow/.style={fill=violet!50, single arrow,  inner sep=2pt, single arrow head extend=3pt,
            single arrow head indent=0pt,minimum height=10mm, minimum width=3pt}
}

\tikzset{
channel/.pic={
\pgfkeys{/channel/.cd, #1}
\begin{scope}[yscale=\scalefac,xscale=\scalefac,every node/.append style={scale=\scalefac}]
\draw[draw=BrownLine,fill=BrownLine!10](0,0.20)coordinate(W1)--
(0.75,-0.20)coordinate(W2)coordinate(\picname-W2)--(1.75,0.4)coordinate(W3)--
(1.0,0.8)coordinate(W4)coordinate(\picname-W4)--cycle;
\draw[BrownLine,shorten <=4pt,shorten >=5pt]($(W4)!0.3!(W1)$)--($(W3)!0.3!(W2)$);
\draw[BrownLine,shorten <=4pt,shorten >=7pt]($(W4)!0.5!(W1)$)--($(W3)!0.5!(W2)$);
\draw[BrownLine,shorten <=4pt,shorten >=9pt]($(W4)!0.7!(W1)$)--($(W3)!0.7!(W2)$);
\end{scope}
        },
}
\pgfkeys{
  /channel/.cd,
  channelcolor/.store in=\channelcolor,
  drawchannelcolor/.store in=\drawchannelcolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname,
  channelcolor=BrownLine,
  drawchannelcolor=BrownLine,
  scalefac=1,
  picname=C
}
%Graph1
\begin{scope}[local bounding box=GRAPH1,shift={($(0,0)+(0,0)$)},scale=1, every node/.append style={transform shape}]
\begin{axis}[axis lines=none,  ticks=none, clip=false, width=3cm, height=2cm,
  scale only axis, enlargelimits=false,samples=600]
\addplot[smooth, color=GreenD,  domain=2:7.9] (\x,{sin((22.9*(27*deg(x))) )*cos(((1*deg(x))) )});
\end{axis}
%%fitting
\scoped[on background layer]
\node[draw=OrangeLine,fill=OrangeL!20, inner ysep=1mm, inner xsep=1mm,
fit=(GRAPH1),yshift=0mm](BB1){};
\end{scope}
%
\node[Box,below=1.3 of GRAPH1](ASDS){Anomalous Sound Detection System};
\node[Box2, minimum height=7mm,below=1.3 of ASDS](NORM){Normal};
\node[Box3, minimum height=7mm,below=0 of NORM](ANOM){Anomaly};
\draw[LineA](GRAPH1)--(ASDS);
\draw[LineA](ASDS)--(NORM);
%Graph2
\begin{scope}[local bounding box=GRAPH2,shift={($(GRAPH1)+(5.5,-1.0)$)},scale=1, every node/.append style={transform shape}]
\begin{axis}[ axis x line=bottom,  axis y line=left,  axis line style={-latex},
ticklabel style={font=\tiny\usefont{T1}{phv}{m}{n}},axis background/.style={fill=gray!10},
%clip=false,
width=4cm, height=2cm,ymax=0.99,xmax=16,
enlarge x limits=0.1,
 scale only axis, %enlargelimits=false,
 samples=600]
\addplot[smooth, color=cyan,  domain=2:14.9] (\x,{sin((3*(147*deg(x))) )*cos(((1*deg(x))) )}) ;
\end{axis}
\end{scope}
%Graph3
\begin{scope}[local bounding box=GRAPH3,shift={($(GRAPH2.south)+(-1.5,-2.3)$)},scale=1, every node/.append style={transform shape}]
\pgfdeclareverticalshading{rainbow}{100bp}
 {color(0bp)=(blue); color(25bp)=(blue); color(35bp)=(blue);
  color(45bp)=(green); color(55bp)=(cyan); color(65bp)=(blue);
  color(75bp)=(violet); color(100bp)=(violet)}
 \shade[shading=rainbow] (0.1,0.1) rectangle (3.6,2.1);
\draw[-latex](0,0)--(4,0);
\draw[-latex](0,0)--(0,2.5);
\end{scope}
%diagram
\begin{scope}[local bounding box=DIAGRAM1,shift={($(GRAPH3.south)+(-2.7,-1.5)$)}]
\node[Box4](T1){};
\node[Trap,right=1.7 of T1,anchor=north](T2){};
\node[Box5,right=2.3 of T1](T3){};
\node[Trap,right=0.65 of T3,anchor=north,yscale=-1,fill=cyan](T4){};
\node[Box4,right=2.3 of T3,fill=cyan](T5){};
\draw[LineA](T1)--(T2.south);
\draw[LineA](T2)--(T3.west);
\draw[LineA](T3)--(T4.north);
\draw[LineA](T4.south)--(T5);
\end{scope}
%%fitting2
\scoped[on background layer]
\node[draw=BackLine,fill=BackColor!40, inner ysep=2mm, inner xsep=3mm,
fit=(GRAPH2)(DIAGRAM1),yshift=0mm](BB2){};
\fill[BrownL!50](ASDS.north east)--(BB2.north west)--(BB2.south west)--(ASDS.south east)--cycle;
%%%right
\node[Box6,below right=0.9 and 0.9 of BB2.north east](FP){FP32};
\node[Box7,right=1.0  of FP](IN){INT8};
\node[Box8,right=1.1  of IN](ARM){{\large\textbf{ARM}}\\ mbed OS};
%%table
\coordinate(S) at ($(ARM.south)+(0,-1.3)$);
\begin{scope}[local bounding box=TAB2,shift={(S)},anchor=north]
\colorlet{col1}{BrownLine!35}
\colorlet{col2}{BrownLine!15}
\colorlet{col3}{BrownLine!5}
\matrix(T)[%nodes in empty cells,
  matrix of nodes,
  row sep =3\pgflinewidth,
  column sep = 3\pgflinewidth,
  nodes={text height=1.5ex,text depth=0.25ex, text width=2mm, draw=white,
  line width=0.25pt, font=\footnotesize\usefont{T1}{phv}{m}{n}},
  row 1/.style={nodes={align=center,fill=col1}},
  column 1/.style = {nodes={text width=23mm,align=left}},
  column 2/.style = {nodes={text width=16mm,align=center}},
  ]
  {
\textbf{Problem}&\textbf{AD}\\
|[fill=col3]| Model &|[fill=col3]| FC-AE\\
% NOTE: Values hardcoded because inline {python} doesn't work inside .tikz blocks.
% Source: ANOMALY_MODEL_* constants in mlsys/constants.py
|[fill=col2]| Size&|[fill=col2]| 270 Kpar\\
|[fill=col3]| Latency &|[fill=col3]| 10.4 ms/inf.\\
|[fill=col2]| Accuracy &|[fill=col2]| 0.86 AUC\\
|[fill=col3]| Energy &|[fill=col3]| 516 $\mu$J/inf.\\
  };
\end{scope}
%
\begin{scope}[local bounding box=F1,shift={($(FP)+(-0.1,-4.25)$)}]
\foreach \j in {1,2,3} {
\pic[shift={(0,0)}]  at ({\j*0.02}, {0.16*\j}) {channel={scalefac=1.5,picname=1\j}};
}
\node[below=3pt of 11-W2,align=center]{Training Code};
\end{scope}
%
\draw[LineA](FP)--(IN);
\draw[LineA](IN)--(ARM);
\draw[LineA](ARM.south)--(S);
\draw[LineA](13-W4)--++(0,0.8)-|(FP);
%above
\coordinate(AB)at($(GRAPH1.north)+(-0.2,1.7)$);
\node[Box9](B1)at(AB){Problem\\ definition};
\node[Box9,right=of B1](B2){Database \\ selection \\ (public domain)};
\node[Box9,right=of B2](B3){Model \\ selection};
\node[Box9,right=of B3](B4){Model \\ training code};
\node[Box9,right=of B4](B5){Derive "Tiny" \\ version:\\ Quantization};
\node[Box9,right=of B5](B6){Embedded\\ implementation};
\node[Box9,right=of B6](B7){Benchmarking \\ harness\\ integration};
\node[Box9,right=of B7](B8){Deploy on \\ device};
\node[Box9,right=of B8](B9){Example \\ benchmark\\ run};
%%fitting arrow
\node[draw=none,fill=none, inner ysep=4mm, inner xsep=6mm,fit=(B1)(B9),xshift=-3mm](A){};
\coordinate(AL)at($($(A.north west)!0.5!(A.south west)$)+(0.6,0)$);
\coordinate(AD)at($($(A.north east)!0.5!(A.south east)$)+(0.6,0)$);
\scoped[on background layer]
\draw[draw=none,fill=cyan!50](A.north west)--(A.north east)--(AD)--(A.south east)--(A.south west)--(AL)--cycle;
\end{tikzpicture}
```
:::

Effective benchmark design must account for the optimization techniques established in preceding chapters. Quantization and pruning affect model accuracy-efficiency trade-offs, requiring benchmarks that measure both speedup and accuracy preservation simultaneously. Hardware acceleration techniques influence arithmetic intensity and memory bandwidth utilization, necessitating roofline model analysis to interpret results correctly. Understanding these optimization foundations enables benchmark selection that validates claimed improvements rather than measuring artificial scenarios.
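
The roofline bound mentioned above reduces to a single `min`: attainable performance is limited either by peak compute or by arithmetic intensity times memory bandwidth. The sketch below uses a hypothetical accelerator (100 TFLOP/s peak, 2 TB/s bandwidth) and illustrative intensities; none of these figures describe a specific device.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound on achievable performance.

    intensity  -- arithmetic intensity in FLOP per byte moved
    peak_flops -- peak compute rate in FLOP/s
    bandwidth  -- memory bandwidth in bytes/s
    """
    return min(peak_flops, intensity * bandwidth)


# Hypothetical accelerator, chosen only to keep the arithmetic easy to follow.
PEAK = 100e12          # 100 TFLOP/s
BANDWIDTH = 2e12       # 2 TB/s
ridge_point = PEAK / BANDWIDTH   # intensity where the two limits meet

kernels = [("elementwise op", 0.25), ("small matmul", 10.0), ("large matmul", 200.0)]
for name, intensity in kernels:
    bound = attainable_flops(intensity, PEAK, BANDWIDTH)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"{name:15s} {bound / 1e12:6.1f} TFLOP/s attainable ({regime})")
```

Comparing a measured benchmark result against this bound, rather than against the advertised peak, indicates whether a shortfall reflects a tuning problem or a limit of the workload itself.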

### Problem Definition {#sec-benchmarking-problem-definition-79e4}

Every benchmark begins by asking: *what exactly must this system do?* The anomaly detection system in @fig-benchmark-components processes audio signals to identify deviations from normal operation patterns—an industrial monitoring application that exemplifies how formal task specifications translate into practical implementations. While specific tasks vary widely by domain (natural language processing tasks include machine translation, question answering [@hirschberg2015advances], and text classification; computer vision employs object detection, image segmentation, and facial recognition [@everingham2010pascal]), every benchmark task specification must define three essential elements: an *input specification* (what data the system processes), an *output specification* (what response the system must produce), and a *performance specification* (quantitative requirements for accuracy, speed, and resource utilization).

Task design directly impacts the benchmark's ability to evaluate AI systems. The audio anomaly detection example illustrates this through its specific requirements: processing continuous signal data, adapting to varying noise conditions, and operating within strict time constraints. These practical constraints create a framework for assessment that reflects real-world operational demands. Each subsequent phase of benchmark implementation—from dataset selection through deployment—builds directly upon these initial specifications.
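
The three elements of a task specification can be written down as a small data structure, which makes the performance requirements checkable rather than implicit. In this sketch the class name `TaskSpec`, its field names, the per-window output format, and all three thresholds are hypothetical; the measured values passed at the end are the ones reported for the example system in @fig-benchmark-components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """The three elements every benchmark task specification defines."""
    name: str
    input_spec: str          # what data the system processes
    output_spec: str         # what response the system must produce
    min_auc: float           # performance specification: accuracy floor
    max_latency_ms: float    # performance specification: speed ceiling
    max_energy_uj: float     # performance specification: energy ceiling

    def is_met_by(self, auc, latency_ms, energy_uj):
        """Return True only if every performance requirement is satisfied."""
        return (auc >= self.min_auc
                and latency_ms <= self.max_latency_ms
                and energy_uj <= self.max_energy_uj)


# Thresholds are hypothetical, chosen only for illustration.
anomaly_task = TaskSpec(
    name="audio anomaly detection",
    input_spec="continuous audio signal from a monitored machine",
    output_spec="normal / anomaly decision per window",
    min_auc=0.85,
    max_latency_ms=15.0,
    max_energy_uj=600.0,
)

# Measured values as reported for the example system in the figure.
print(anomaly_task.is_met_by(auc=0.86, latency_ms=10.4, energy_uj=516.0))
```

Writing the specification before any model exists keeps the requirements from drifting toward whatever the first implementation happens to achieve.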

### Standardized Datasets {#sec-benchmarking-standardized-datasets-123f}

A task definition is only as good as the data used to evaluate it. Standardized datasets ensure that all models undergo testing under identical conditions, enabling direct comparisons across different approaches—without them, every team would evaluate on private data, making cross-lab comparison impossible. \index{ImageNet!standardized dataset}
\index{COCO!object detection dataset}
\index{CIFAR-10!classification reference dataset}
\index{SQuAD!reading comprehension dataset}
\index{GLUE!language understanding benchmark}
In computer vision, ImageNet [@imagenet_website] [@deng2009imagenet], COCO [@lin2014microsoft], and CIFAR-10 [@cifar10_website] [@krizhevsky2009learning] serve as reference standards; in natural language processing, SQuAD [@squad_website][^fn-squad] [@rajpurkar2016squad], GLUE[^fn-glue-saturation] [@wang2018glue], and WikiText [@wikitext_website] [@merity2016pointer] fulfill similar roles, each encompassing a range of complexities and edge cases.

[^fn-squad]: **SQuAD (Stanford Question Answering Dataset)**: Introduced in 2016 with 100,000+ question-answer pairs from Wikipedia. AI systems exceeded the 86.8% human F1 baseline by 2018, but this "superhuman" result illustrates a benchmarking failure mode: the task's extractive format (answers are text spans within the passage) makes it easier than open-ended question answering, inflating perceived capability relative to production NLP systems. \index{SQuAD!saturation}

[^fn-glue-saturation]: **GLUE**: GLUE's saturation arc is the canonical benchmark obsolescence case study. The benchmark launched in 2018 with a human baseline of 87.1%; BERT [@devlin2019bert] reached 80.2% within months, and models exceeded the human baseline by mid-2019, roughly a year after launch. This is Goodhart's Law in action: once GLUE became a target, it ceased to be a good measure, as models learned to exploit dataset artifacts rather than develop genuine language understanding. The pattern forced the creation of SuperGLUE and now BIG-bench, each requiring progressively harder tasks. \index{GLUE!benchmark saturation}

Dataset selection shapes everything downstream. In the audio anomaly detection example (@fig-benchmark-components), the dataset must include representative waveform samples of normal operation alongside comprehensive examples of anomalous conditions; domain-specific collections like ToyADMOS[^fn-toyadmos] [@koizumi2019toyadmos] for industrial manufacturing and Google Speech Commands for keyword spotting address these requirements. Effective benchmark datasets must balance two competing demands: representing real-world challenges accurately, and remaining controlled enough to produce reproducible results that still differentiate model performance. Simplified datasets like ToyADMOS are valuable for methodological development but may not capture the full complexity of production environments.

[^fn-toyadmos]: **ToyADMOS**: Developed by NTT Media Intelligence Laboratories in 2019 for acoustic anomaly detection, containing audio recordings from toy car and conveyor belt operations (1,000+ normal, 300+ anomalous samples per machine type). The "toy" prefix is intentional: the controlled environment enables reproducible benchmarking but creates a domain gap -- models achieving 95%+ AUC on ToyADMOS may drop to 70--80% on factory floors with background noise, vibration, and sensor degradation. \index{ToyADMOS!domain gap}

### Model Selection {#sec-benchmarking-model-selection-01e6}

With task and data specified, the benchmark must define which models to evaluate and what baselines to compare against. This choice is less straightforward than it appears: a benchmark's model selection determines whether results reflect architectural innovation, implementation quality, or simply framework-specific optimizations. The selection process builds upon the architectural foundations established in @sec-network-architectures and must account for the framework considerations discussed in @sec-ml-frameworks.

\index{Baseline Models!reference point selection}
Baseline models serve as reference points spanning from basic implementations (linear regression, logistic regression) to advanced architectures with proven success in comparable domains. In NLP, models like BERT[^fn-bert] have emerged as standard baselines. Critically, the choice of baseline depends on the deployment framework: a PyTorch implementation may exhibit different performance characteristics than its TensorFlow equivalent due to framework-specific optimizations and operator implementations, meaning the benchmark must control for this variable.

[^fn-bert]: **BERT (Bidirectional Encoder Representations from Transformers)**: BERT-Large (`{python} bert_large_params_m_str`M parameters) became the default NLP baseline because its fixed-size encoder produces deterministic latency per input, unlike autoregressive models whose cost scales with output length. This predictability is precisely why MLPerf Inference adopted BERT as its NLP reference workload: a baseline must isolate hardware and software differences from model-inherent variability, and BERT's constant-cost forward pass achieves that separation. \index{BERT!benchmarking baseline}

Once the architecture is selected, model development follows two parallel optimization paths that the benchmark must track. Training optimization focuses on achieving target accuracy within computational constraints. Inference optimization addresses the transition to production—particularly precision reduction from FP32 to INT8 or lower, which demands careful calibration to maintain accuracy while reducing resource requirements. The benchmark must specify requirements for both paths, because a model that trains efficiently but deploys poorly (or vice versa) fails the full evaluation. This dual optimization naturally demands quantitative evaluation metrics that span all three dimensions of our benchmarking framework.
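
To see what precision reduction costs numerically, the sketch below applies symmetric per-tensor INT8 quantization to a handful of values and measures the reconstruction error. It assumes the simplest possible calibration (scale set by the largest absolute value) and hypothetical weights; production toolchains typically use more careful calibration, and reconstruction error on weights is only a proxy for the task accuracy a benchmark must ultimately report.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # max-abs calibration
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical FP32 weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33, 0.0, 0.96, -0.08]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.5f}  max abs error={max_error:.5f}  bound={scale / 2:.5f}")
```

Rounding to the nearest code bounds the per-value error by half the scale, which is why a single outlier that inflates the scale degrades every other value in the tensor.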

### Evaluation Metrics {#sec-benchmarking-evaluation-metrics-6bac}

Evaluation metrics[^fn-metric-etymology] translate raw model behavior into numbers that can be compared, ranked, and used to make engineering decisions. The challenge is choosing the *right* numbers: a metric that captures accuracy but ignores latency may declare the winner to be a model too slow for production; one that rewards throughput but ignores energy may optimize for a deployment budget that does not exist.

\index{Metric!etymology}

[^fn-metric-etymology]: **Metric**: In mathematics, a metric is a distance function satisfying strict axioms: non-negativity, symmetry, and the triangle inequality. ML borrows the term loosely for any quantitative measure, and most ML "metrics" (BLEU, perplexity) are not distance functions at all. A single scalar score on a fixed test set still orders models consistently, but rankings built from pairwise comparisons, several metrics, or different evaluation subsets can be intransitive: model A beats B, B beats C, yet C beats A. Leaderboard rankings can therefore change depending on which models and measurements are compared, making the choice of metric an engineering decision that shapes which system wins, not just how we measure it. \index{Metric!intransitivity}

[]{#sec-benchmarking-metric-taxonomy-d4cd}Organizing these metrics into a coherent taxonomy helps practitioners select the right measurements for their evaluation goals. @tbl-metric-taxonomy categorizes metrics by what each measures and when it should be applied:

| **Category**   | **Metric**                   | **Unit**               | **Primary Use Case**   |
|:---------------|:-----------------------------|:-----------------------|:-----------------------|
| **Accuracy**   | Top-1/Top-5 Accuracy         | Percentage             | Classification         |
|                | mAP (mean Average Precision) | 0-1 score              | Object detection       |
|                | BLEU/ROUGE                   | 0-100 score            | NLP generation         |
|                | Perplexity                   | Score (lower = better) | Language modeling      |
| **Throughput** | Samples/second               | Samples/s              | Batch inference        |
|                | Tokens/second                | Tokens/s               | LLM inference          |
|                | Time-to-train                | Hours/days             | Training benchmarks    |
| **Latency**    | p50 latency                  | Milliseconds           | Median response time   |
|                | p99 latency                  | Milliseconds           | Tail latency (SLA)     |
|                | First-token latency          | Milliseconds           | LLM responsiveness     |
| **Efficiency** | Samples/second/watt          | Samples/s/W            | Energy efficiency      |
|                | Accuracy/FLOP                | %/PFLOP                | Algorithmic efficiency |
|                | TCO per inference            | $/inference            | Economic efficiency    |

: **ML Benchmarking Metric Taxonomy.** Metrics organized by evaluation category, unit, and primary use case. Accuracy metrics quantify model quality, throughput and latency metrics capture system speed, and efficiency metrics combine multiple dimensions. Selecting the right metric for your deployment context is often more important than optimizing any single metric to its maximum. {#tbl-metric-taxonomy}

\index{Throughput!vs. latency tradeoff}
\index{Percentile Latency!p50 p95 p99 reporting}
Several distinctions within this taxonomy deserve emphasis. Throughput measures aggregate capacity (ideal for batch processing), while latency measures individual request timing (critical for interactive applications). These metrics frequently conflict: maximizing throughput through batching often increases per-request latency. Mean latency can hide problematic tail behavior—a system with 10 ms mean latency might have 500 ms p99 latency, failing SLA requirements. In production, percentiles (p50, p95, p99) are far more informative than means. Finally, compound metrics like samples/second/watt combine multiple dimensions into a single number, enabling quick comparisons but obscuring individual bottlenecks. Reporting both atomic and compound metrics provides a complete picture.
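
The gap between mean and tail latency is easy to reproduce. The sketch below builds a hypothetical latency sample in which 1.5% of requests hit a slow path, then compares the mean with nearest-rank percentiles; the helper name `percentile` and the sample values are illustrative assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 985 fast requests and 15 that hit a slow path.
latencies_ms = [2.5] * 985 + [500.0] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.2f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

Here the mean is just under 10 ms while p99 is 500 ms, and p95 still looks healthy: each percentile answers a different question, which is why an SLA should name the percentile it is written against.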

Metrics must align with task objectives while providing meaningful insight into model behavior across both training and deployment scenarios. Their computation can also vary between frameworks. The training methodologies from @sec-model-training demonstrate how different frameworks handle loss computation and gradient accumulation differently, affecting reported metrics. For example, PyTorch and TensorFlow ship different batch normalization defaults (epsilon and the momentum convention), so the same model can show accuracy discrepancies of 0.1-0.5% across frameworks unless these settings are matched.

\index{Precision (Metric)!positive prediction accuracy}
\index{Recall!positive case detection rate}
\index{F1 Score!precision-recall harmonic mean}
Task-specific metrics quantify a model's performance on its intended function. For example, classification tasks employ metrics including accuracy (overall correct predictions), precision (positive prediction accuracy), recall (positive case detection rate), and F1 score (precision-recall harmonic mean) [@sokolova2009systematic]. Regression problems use error measurements like Mean Squared Error (MSE) and Mean Absolute Error (MAE) to assess prediction accuracy. Domain-specific applications often require specialized metrics; for example, machine translation uses the BLEU score[^fn-bleu] to measure n-gram overlap between machine-generated and human reference translations as a proxy for translation quality [@papineni2002bleu].

[^fn-bleu]: **BLEU (Bilingual Evaluation Understudy)**: Introduced by IBM in 2002, BLEU measures translation quality via n-gram overlap with human references (0--100 scale; 30+ useful, 50+ good). BLEU is a canonical example of Goodhart's Law in ML: optimizing for n-gram matches produces fluent-sounding but semantically incorrect translations, because the metric rewards surface-level word patterns rather than meaning preservation. \index{BLEU!Goodhart's Law example}

However, as models transition from research to production deployment, implementation metrics become equally important. Model size, measured in parameters or memory footprint, directly affects deployment feasibility across different hardware platforms. Processing latency, typically measured in milliseconds per inference, determines whether the model meets real-time requirements. Energy consumption, measured in joules per inference (or as average power in watts), indicates operational efficiency. These practical considerations reflect the growing need for solutions that balance accuracy with computational efficiency. The operational challenges of maintaining these metrics in production environments are explored in deployment strategies (@sec-ml-operations).

Consequently, the selection of appropriate metrics requires careful consideration of both task requirements and deployment constraints. A single metric rarely captures all relevant aspects of performance in real-world scenarios. For instance, in anomaly detection systems, high accuracy alone may not indicate good performance if the model generates frequent false alarms. Similarly, a fast model with poor accuracy fails to provide practical value.

\index{AUC!anomaly detection performance}
This multi-metric evaluation approach appears in our anomaly detection system, which reports performance across multiple dimensions: model size (`{python} anomaly_params_k_str` K parameters), processing speed (`{python} anomaly_latency_ms_str` ms/inference), and detection accuracy (`{python} anomaly_auc_str` AUC). This combination of metrics ensures the model meets both technical and operational requirements in real-world deployment scenarios.

### Benchmark Harness {#sec-benchmarking-benchmark-harness-09ea}

\index{Benchmark Harness!test infrastructure}
\index{Reproducibility!benchmark requirement}
Metrics define *what* to measure; the benchmark harness determines *how* to measure it. A harness is the test infrastructure that delivers inputs to the system under test, collects measurements, and ensures that the entire process is reproducible. Without a well-designed harness, even perfectly chosen metrics produce unreliable numbers.

Harness design must align with the intended deployment scenario. For server deployments, the harness generates request patterns that simulate real-world traffic, typically using a Poisson distribution[^fn-poisson] to model random but statistically consistent workloads, while managing concurrent requests and varying load intensities.

[^fn-poisson]: **Poisson Distribution**: Named after Siméon Denis Poisson, who formalized it in 1837 while modeling wrongful conviction rates in French courts. The distribution models independent events at a constant average rate ($\lambda$), making it the standard assumption for server request arrivals. The benchmarking consequence: real ML serving traffic often violates the Poisson assumption due to bursty patterns (e.g., viral content spikes), so benchmarks using Poisson arrivals systematically underestimate tail latency in production. \index{Poisson Distribution!traffic modeling}
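
A load generator for the server scenario can be sketched in a few lines, because the gaps between arrivals in a Poisson process are exponentially distributed with mean $1/\lambda$. The function name and parameters below are illustrative, not taken from any particular benchmark suite.

```python
import random

def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Generate request timestamps (seconds) for a Poisson arrival process."""
    rng = random.Random(seed)          # seeded so the schedule is repeatable
    t, arrivals = 0.0, []
    while True:
        # Exponential inter-arrival gap with mean 1 / rate_per_s.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# Roughly 100 requests per second for 10 seconds.
arrivals = poisson_arrival_times(rate_per_s=100.0, duration_s=10.0)
print(len(arrivals), "requests scheduled")
```

Because the schedule is generated up front from a fixed seed, every system under test can be replayed against exactly the same arrival pattern. Modeling bursty traffic would require a different generator.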

For embedded and mobile applications, the harness generates input patterns that reflect actual deployment conditions. This might involve sequential image injection for mobile vision applications or synchronized multi-sensor streams for autonomous systems. Such precise input generation and timing control ensures the system experiences realistic operational patterns, revealing performance characteristics that would emerge in actual device deployment.

The harness must also accommodate different throughput models. Batch processing scenarios require the ability to evaluate system performance on large volumes of parallel inputs, while real-time applications need precise timing control for sequential processing. In the embedded implementation phase, the harness must support precise measurement of inference time and energy consumption per operation.

Reproducibility demands that the harness maintain consistent testing conditions across different evaluation runs. This includes controlling environmental factors such as background processes, thermal conditions, and power states that might affect performance measurements. The harness must also provide mechanisms for collecting and logging performance metrics without measurably impacting the system under test.
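
The core of a harness is a timing loop that separates warm-up from measurement and keeps input generation outside the timed region. The following is a minimal sketch under those assumptions; the `benchmark` function and the stand-in workload are hypothetical, and a production harness would additionally pin clocks, log thermal state, and isolate background processes.

```python
import time

def benchmark(fn, make_input, warmup=10, iterations=100):
    """Time fn on fresh inputs, discarding the warm-up iterations."""
    for _ in range(warmup):
        fn(make_input())               # warm caches before measuring
    samples_ms = []
    for _ in range(iterations):
        x = make_input()               # input generation is not timed
        start = time.perf_counter()
        fn(x)
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms

# Stand-in workload (hypothetical): sum of squares over a small list.
samples = benchmark(lambda xs: sum(v * v for v in xs),
                    lambda: list(range(1000)))
print(f"collected {len(samples)} samples, min {min(samples):.4f} ms")
```

Collecting raw samples in memory and summarizing afterward keeps logging overhead out of the measured path.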

### System Specifications {#sec-benchmarking-system-specifications-6e80}

Complementing the harness that controls test execution, system specifications document the complete computational environment—the hardware and software stack on which the benchmark runs. Without precise specifications, a reported throughput number is meaningless: the same model can train ten times faster on an H100 than on a V100, making the hardware context inseparable from the result.

On the hardware side, specifications must capture the processor type and clock rate, accelerator model and memory (GPU, TPU, or custom ASIC), system RAM, storage type, and network configuration for distributed setups. On the software side, they must record the operating system, framework versions (e.g., PyTorch 2.1 vs. TensorFlow 2.14), compiler flags, and environment management tools such as Docker containers or virtual environments. This level of detail enables other researchers to replicate the benchmark environment with high fidelity and provides critical context for interpreting performance differences.
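
Part of this record can be captured automatically at the start of every run. The sketch below collects only what the Python standard library exposes; accelerator model, driver, and framework versions require framework-specific queries that are intentionally left out here.

```python
import json
import os
import platform

def capture_system_spec():
    """Collect host and interpreter facts available from the stdlib."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }

spec = capture_system_spec()
spec_json = json.dumps(spec, indent=2, sort_keys=True)
print(spec_json)
```

Storing the resulting JSON next to each result file means a throughput number is never separated from the environment that produced it.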

Many benchmarks include results across multiple hardware configurations, precisely because the trade-offs between model complexity, computational resources, and performance only become visible through comparative analysis. As the field increasingly prioritizes sustainability, specifications now extend to energy consumption metrics—FLOPS/watt, total power draw over training time—reflecting growing awareness that computational efficiency is an engineering requirement, not merely an environmental aspiration.

### Run Rules {#sec-benchmarking-run-rules-c33f}

System specifications describe *what* the benchmark runs on; run rules govern *how* it runs. These procedural constraints ensure that results can be reliably replicated, which is harder than it sounds in a field where stochastic processes—weight initialization, data shuffling, dropout masks—mean that two identical runs on identical hardware can produce different numbers. Run rules tame this randomness by mandating fixed seeds, controlled data ordering, and systematic handling of every source of non-determinism.
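
The effect of a fixed seed on data ordering can be shown with the standard library alone. This is a sketch of the principle, not a complete determinism recipe: real training runs must also seed the framework's own generators and address non-deterministic kernels.

```python
import random

def shuffled_order(num_samples, seed):
    """Return a data ordering that is fully determined by the seed."""
    rng = random.Random(seed)          # local generator, no global state
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

run_a = shuffled_order(1000, seed=42)
run_b = shuffled_order(1000, seed=42)
run_c = shuffled_order(1000, seed=43)
print(run_a == run_b, run_a == run_c)
```

Two runs with the same seed visit samples in the same order, while changing the seed changes the order. Using a local `random.Random` instance avoids interference from other code that touches the global generator.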

Hyperparameter documentation is equally critical. A learning rate change from $10^{-3}$ to $3 \times 10^{-4}$ can shift accuracy by several percentage points, so benchmarks require exhaustive recording of every configuration setting. Similarly, benchmarks mandate the preservation and sharing of training and evaluation datasets; when privacy or licensing constraints prevent direct sharing, detailed preprocessing specifications enable construction of comparable datasets.

Code provenance completes the reproducibility chain. Contemporary benchmarks typically require publication of implementation code in version-controlled repositories—not just the model, but the full pipeline of preprocessing, training, and evaluation scripts. Advanced benchmarks distribute containerized environments that encapsulate all dependencies and configurations, while mandating detailed experimental logging: training metrics, model checkpoints, and documentation of any mid-experiment adjustments. Together, these protocols transform benchmarking from a one-time measurement into a verifiable, iterable scientific process.

### Result Interpretation {#sec-benchmarking-result-interpretation-cd29}

Producing benchmark numbers is the easy part; interpreting them correctly is where most engineers go wrong. A raw throughput figure or accuracy score is meaningless without understanding the conditions that produced it, the statistical confidence behind it, and the deployment context that determines whether the number matters. {#sec-benchmarking-benchmark-result-interpretation-framework-16a7}

::: {.callout-example title="Benchmarking a Vision Model for Edge Deployment"}
**Scenario**: A team validates MobileNetV2 for a wildlife camera trap running on a Raspberry Pi 4.

**Benchmark Results**:

| **Precision** | **Latency (ms)** | **Accuracy (Top-1)** | **Model Size (MB)** |
|:--------------|-----------------:|---------------------:|--------------------:|
| **FP32**      |              120 |                71.8% |                14.0 |
| **INT8**      |               35 |                71.2% |                 3.5 |

**Interpretation**:
The 3.4$\times$ speedup and 4$\times$ size reduction from quantization come at a minimal cost of a 0.6 percentage-point accuracy drop. For a battery-powered real-time system, INT8 is the clear choice, enabling 28 FPS processing compared to 8 FPS with FP32.
:::

Before drawing conclusions from benchmark results, apply the vendor claim analysis framework introduced earlier (see the "Decoding Vendor Benchmark Claims" checklist) and extend it with two additional dimensions. First, *is the comparison fair?* Comparing ResNet-50 against MobileNet conflates architecture differences with optimization choices; precision differences (FP32 vs. INT8) alone can explain 2--4$\times$ performance gaps, and batch size, hardware generation, and software framework must all be controlled. Second, *are the statistics meaningful?* Reliable results require multiple runs (minimum 5, preferably 10+), reported variance with confidence intervals, clear handling of outliers, and steady-state operation rather than cold-start effects.
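
For the small run counts recommended above, a confidence interval based on Student's t distribution is a reasonable default. The sketch below hard-codes the two-sided 95% critical values for 5 to 10 runs and assumes run-to-run variation is roughly normal; the latency values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def mean_with_ci(samples):
    """Return (mean, half_width) of a 95% confidence interval."""
    n = len(samples)
    if n - 1 not in T_95:
        raise ValueError("this sketch supports 5 to 10 runs")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)   # standard error
    return mean, T_95[n - 1] * sem

# Hypothetical steady-state latencies (ms) from five independent runs.
mean, half_width = mean_with_ci([35.2, 34.8, 35.5, 36.1, 34.9])
print(f"{mean:.2f} ms +/- {half_width:.2f} ms (95% CI)")
```

If two configurations have overlapping intervals, the measured difference may not be meaningful and more runs are warranted before drawing a conclusion.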

Applying these questions to *interpreting a benchmark claim* illustrates how incomplete specifications obscure real performance.

::: {.callout-notebook title="Interpreting a Benchmark Claim"}
**Problem**: A vendor claims "Our system achieves 10,000 images/second on ResNet-50." Should you trust this number for your deployment planning?

**Critical Questions**:

1. **What batch size?** Batch 256 achieves high throughput but 256 ms latency; batch 1 achieves low latency but lower throughput.
2. **What precision?** INT8 is 2--4$\times$ faster than FP32 but may have accuracy implications.
3. **What is included?** Pure inference, or including preprocessing?
4. **What accuracy?** Matching the original 76.1% Top-1, or degraded?

**A Complete Specification**: "10,000 images/second on ResNet-50 at batch size 32, INT8 precision, 76.0% Top-1 accuracy, including JPEG decoding, on NVIDIA H100 at `{python} h100_tdp_str` W TDP."

**The Systems Insight**: Understanding whether a performance difference is meaningful requires both statistical rigor *and* contextual validation. A benchmark number without these details is a marketing claim, not an engineering specification.
:::
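
The critical questions above can be turned into a mechanical completeness check. The `BenchmarkClaim` record below is a hypothetical structure for illustration: any field left unset is reported as missing, which separates the bare vendor claim from the complete specification.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class BenchmarkClaim:
    throughput_img_s: float
    model: str
    batch_size: Optional[int] = None
    precision: Optional[str] = None
    top1_accuracy_pct: Optional[float] = None
    includes_preprocessing: Optional[bool] = None
    hardware: Optional[str] = None

    def missing_fields(self):
        """List every field the claim leaves unspecified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

vendor = BenchmarkClaim(throughput_img_s=10_000, model="ResNet-50")
complete = BenchmarkClaim(10_000, "ResNet-50", batch_size=32,
                          precision="INT8", top1_accuracy_pct=76.0,
                          includes_preprocessing=True, hardware="NVIDIA H100")
print(vendor.missing_fields())
print(complete.missing_fields())
```

The vendor claim leaves five fields unspecified while the complete specification leaves none.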

Beyond vendor claims, context determines which metrics matter most. A 1% accuracy improvement may be decisive for medical diagnostics but irrelevant for an application that prioritizes inference speed. Practitioners should also guard against *benchmark overfitting* — models excessively optimized for specific benchmark tasks at the expense of real-world generalization — by evaluating performance on related but distinct tasks and considering practical deployment scenarios.

### Example Benchmark {#sec-benchmarking-example-benchmark-229f}

To see how these components work together in practice, walk through the anomaly detection pipeline in @fig-benchmark-components one more time, now focusing on the output stage. The benchmark produces three complementary measurements: a model size of `{python} anomaly_params_k_str` K parameters with `{python} anomaly_latency_ms_str` milliseconds per inference (computational resources), a detection accuracy of `{python} anomaly_auc_str` AUC in distinguishing normal from anomalous audio patterns (task effectiveness), and an energy consumption of `{python} anomaly_energy_uj_str` µJ per inference (operational efficiency).

Which of these metrics matters most depends entirely on the deployment context. Energy consumption per inference is critical for battery-powered devices but irrelevant for always-on server racks. Model size constrains embedded devices with limited memory but barely registers for cloud deployments. Processing speed determines whether the system can operate in real-time or must batch inputs. These metrics also reveal inherent trade-offs: reducing model size from `{python} anomaly_params_k_str` K parameters might improve speed and energy efficiency but degrade the `{python} anomaly_auc_str` AUC detection accuracy. Whether these measurements constitute a "passing" benchmark depends on the deployment constraints: the framework provides structure for consistent evaluation, but acceptance criteria must come from the application requirements.

### Compression Benchmarks {#sec-benchmarking-compression-benchmarks-9cf0}

Neural network compression—pruning, quantization, knowledge distillation, and architecture optimization—requires specialized benchmarks because compression reshapes the trade-off landscape: every byte saved or operation eliminated must be weighed against potential accuracy loss and hardware compatibility.

The most basic compression metric is raw size reduction: parameter count, memory footprint in bytes, and compressed storage requirements. But size alone is misleading. MobileNetV2 achieves approximately 72% ImageNet top-1 accuracy with `{python} mobilenet_params_m_str` million parameters versus ResNet-50's 76% accuracy with `{python} resnet50_params_m_str` million parameters—a 7.5$\times$ efficiency improvement in the parameter-to-accuracy ratio that matters far more than raw parameter counts.

Pruning benchmarks must distinguish between structured and unstructured approaches, because they produce qualitatively different results on real hardware. Structured pruning removes entire neurons or filters, achieving consistent speedups but typically lower compression ratios (2--4$\times$). Unstructured pruning eliminates individual weights for higher compression ratios (10--100$\times$), but realizing actual speedups requires specialized sparse computation support—meaning benchmark protocols must specify hardware platform and software implementation.

\index{Mixed-Precision!layer-wise precision assignment}
\index{Knowledge Distillation!benchmark evaluation}
Quantization benchmarks evaluate precision reduction across data types. INT8 quantization reduces memory footprint by 4$\times$ and accelerates inference by 2--4$\times$ (the precision-accuracy trade-off is analyzed in @sec-benchmarking-inference-metrics-78d4, and the energy implications in @sec-benchmarking-training-metrics-0f1a). Mixed-precision approaches push further by applying different precision levels to different layers—critical layers retain FP16 while computation-heavy layers use INT8 or INT4—enabling fine-grained efficiency optimization. Knowledge distillation adds another dimension: successful transfer achieves 90–95% of the teacher model's accuracy while reducing size by 5--10$\times$, but benchmarking must verify that the student generalizes rather than merely memorizing the teacher's outputs.

Critically, acceleration factors vary dramatically across hardware platforms: sparse models deliver 2--5$\times$ speedup on CPUs, reduced-precision models achieve 2--8$\times$ on mobile processors, and efficient architectures provide 5--20$\times$ on specialized edge accelerators. Current benchmark suites like MLPerf focus primarily on dense, unoptimized models that do not represent production deployments, where compressed models are ubiquitous. This gap between what benchmarks measure and what production actually runs remains one of the field's most consequential blind spots.

### Mobile and Edge Benchmarks {#sec-benchmarking-mobile-edge-benchmarks-fd4f}

\index{Edge Deployment!constraint triangle}
Mobile and edge deployments face constraints radically different from cloud environments, requiring specialized benchmarking approaches that capture the unique trade-offs in resource-constrained settings. These constraints form an interdependent triangle of power consumption, inference latency, and model accuracy, where improving any two typically degrades the third.

Edge deployment requires navigating trade-offs that cloud deployments can largely ignore:

| **Constraint** | **Cloud Impact**               | **Edge Impact**               |
|:---------------|:-------------------------------|:------------------------------|
| **Power**      | Operational cost (~\$0.10/kWh) | Hard limit (battery capacity) |
| **Latency**    | User experience metric         | Safety-critical deadline      |
| **Accuracy**   | Primary optimization target    | Constrained by power/latency  |

\index{Thermal Throttling!sustained vs. peak performance}
As a concrete example, a smartphone camera AI for real-time object detection must process 30 frames/second (33 ms/frame) while consuming <1 W to avoid excessive battery drain and thermal throttling. A MobileNetV3 model achieving 75% accuracy at 15 ms/frame and 0.8 W meets these constraints; a ResNet-50 achieving 80% accuracy at 45 ms/frame and 2.5 W does not, despite being "better" by accuracy-only benchmarks. An *edge benchmark reality check* exposes these gaps between marketed specifications and sustained operational behavior. The gap between *peak* and *sustained* performance is particularly dangerous: a vendor may report burst-mode numbers that halve under thermal throttling within minutes, making *benchmarking the edge* a categorically different exercise than benchmarking the cloud.
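
The smartphone example amounts to a feasibility filter applied before accuracy is even considered. The sketch below encodes the numbers from the paragraph above; the function and dictionary layout are illustrative.

```python
def meets_budget(latency_ms, power_w, target_fps=30.0, power_budget_w=1.0):
    """Check a model against the frame-time and power constraints."""
    frame_budget_ms = 1000.0 / target_fps      # 30 FPS is about 33.3 ms
    return latency_ms <= frame_budget_ms and power_w < power_budget_w

candidates = {
    "MobileNetV3": {"accuracy": 0.75, "latency_ms": 15.0, "power_w": 0.8},
    "ResNet-50": {"accuracy": 0.80, "latency_ms": 45.0, "power_w": 2.5},
}
feasible = [name for name, m in candidates.items()
            if meets_budget(m["latency_ms"], m["power_w"])]
print(feasible)
```

Only the MobileNetV3 configuration survives the filter, so the higher accuracy of the ResNet-50 configuration never enters the comparison.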

::: {.callout-example title="Benchmarking the Edge"}
**The Scenario**: You are selecting a device for a smart doorbell. The vendor claims their chip runs "AI at 1 Watt."

**The Benchmark**: You run a continuous object detection loop.

**The Reality**:

1.  **Minute 0-1**: The chip runs fast (30 FPS) at 1W.
2.  **Minute 2**: The chip heats up.
3.  **Minute 5**: **Thermal Throttling** kicks in. The clock speed drops by 50% to prevent melting.
4.  **Steady State**: The chip stabilizes at 15 FPS.

**The Conclusion**: The "Benchmark" was 30 FPS. The "Product Reality" is 15 FPS. If you designed your user experience for 30 FPS, your product is broken. Always benchmark **sustained performance**, not just peak.
:::
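
A sustained-performance report can be derived from a throughput trace sampled over time. The sketch below assumes one FPS sample per minute and treats the mean of the final few samples as the steady state; the trace is hypothetical and shaped like the doorbell scenario.

```python
import statistics

def peak_and_sustained(fps_per_minute, steady_window=3):
    """Compare the best minute with the mean of the final minutes."""
    peak = max(fps_per_minute)
    sustained = statistics.fmean(fps_per_minute[-steady_window:])
    return peak, sustained, sustained / peak

# Hypothetical trace: full speed at first, throttled by minute 5.
trace_fps = [30, 30, 27, 22, 18, 15, 15, 15, 15, 15]
peak, sustained, ratio = peak_and_sustained(trace_fps)
print(f"peak={peak} FPS sustained={sustained} FPS ratio={ratio:.0%}")
```

Reporting the ratio alongside both numbers makes the throttling penalty explicit rather than leaving readers to infer it.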

This pattern of optimistic marketing claims versus production reality is endemic to edge hardware. The following checklist provides a systematic approach to cutting through the noise.

::: {.callout-perspective title="Edge Benchmark Reality Check"}
When evaluating edge hardware claims:

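1. **Peak vs. Sustained**: Snapdragon 8 Gen 3 advertises 35 TOPS peak but delivers 20 TOPS sustained under thermal throttling. Always benchmark under sustained workloads (>30 seconds at an absolute minimum, and preferably several minutes, because throttling often sets in only after minutes of continuous load).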
2. **Power at idle vs. active**: A device consuming 50 mW idle and 2 W active may report "2 W" for marketing, but if your application runs inference 1% of the time, effective power draw is ~70 mW, not 2 W.
3. **Thermal envelope**: Edge devices typically target 3–5 W thermal design power (TDP). Exceeding this triggers throttling within seconds. Benchmark reports omitting thermal conditions are incomplete.
4. **End-to-end vs. accelerator-only**: NPU benchmarks often exclude data transfer overhead. Moving image data from camera to NPU and back can exceed inference time for small models.
:::
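
The idle-versus-active arithmetic in the checklist is a duty-cycle weighted average. The sketch below reproduces it and extends it to battery life, where the 10 Wh battery capacity is a hypothetical value chosen only for illustration.

```python
def effective_power_mw(idle_mw, active_mw, duty_cycle):
    """Average power for a device active a given fraction of the time."""
    return (1.0 - duty_cycle) * idle_mw + duty_cycle * active_mw

avg_mw = effective_power_mw(idle_mw=50.0, active_mw=2000.0, duty_cycle=0.01)

# Hypothetical 10 Wh battery, ignoring conversion losses and self-discharge.
battery_mwh = 10_000.0
hours = battery_mwh / avg_mw
print(f"average draw {avg_mw:.1f} mW, about {hours:.0f} hours of runtime")
```

At a 1% duty cycle the idle term contributes 49.5 mW of the 69.5 mW total, which is why standby power dominates for intermittent workloads.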

#### Heterogeneous Processor Coordination {#sec-benchmarking-heterogeneous-processor-coordination-68cf}

Mobile SoCs integrate heterogeneous processors (CPU, GPU, DSP, NPU) requiring specialized benchmarking that captures workload distribution complexity while accounting for thermal and battery constraints. Effective processor coordination achieves 3--5$\times$ performance improvements through intelligent work distribution.\index{NPU!neural network acceleration} Each processor excels at different workload profiles: CPUs handle control flow, small batches, and sequential processing; GPUs accelerate parallel floating-point operations and general ML inference; DSPs excel at fixed-point signal processing and always-on detection tasks; and NPUs target specific neural network architectures with INT8/INT4 precision.

Benchmarks must evaluate workload placement decisions, not just individual processor performance. A voice assistant, for example, might use the DSP for always-on wake-word detection (5 mW continuous), switch to the NPU for speech recognition (200 mW burst), and use the CPU for language understanding (100 mW). Single-processor benchmarks miss these orchestration dynamics entirely.

#### Battery and Thermal Benchmarking {#sec-benchmarking-battery-thermal-benchmarking-79dd}

\index{Battery Benchmarking!duty cycle specification}
Battery impact varies dramatically by use case: computational photography consumes 2–5 W during active capture, while background AI for activity recognition requires 5–50 mW for acceptable all-day endurance. The challenge is that instantaneous power draw during inference tells only part of the story—what matters for battery life is the *total energy budget* across a realistic usage pattern.

The most important factor is the workload duty cycle: what fraction of time the system actually runs inference. A doorbell camera that processes 100 frames per day spends nearly all its time idle, making standby power the dominant concern. A real-time video analytics pipeline running at 30 FPS, by contrast, is inference-bound almost continuously, making per-inference energy the critical metric. Background power—the energy consumed when the model is loaded but waiting for input—bridges these extremes and often exceeds inference energy for intermittent workloads. Finally, sustained thermal behavior must be characterized over minutes rather than seconds, because edge devices that deliver impressive burst performance frequently throttle within 2–5 minutes as junction temperatures rise, settling at substantially lower steady-state throughput.

#### Edge-Cloud Coordination {#sec-benchmarking-edgecloud-coordination-061d}

Mobile benchmarking must also evaluate 5G/Wi-Fi edge-cloud coordination, with URLLC[^fn-urllc] demanding <1 ms latency for critical applications. This coordination introduces benchmarking dimensions absent from purely local evaluation. Network latency variability (4G/5G latency ranges from 10 ms to 100 ms+ depending on congestion) means that inference pipelines splitting work between device and cloud face unpredictable round-trip costs. Fallback behavior determines what happens when connectivity fails entirely: does the device degrade gracefully to a smaller on-device model, or does it queue requests until connectivity resumes? Workload splitting decisions (what computation runs locally versus remotely) and privacy constraints (what data can be transmitted for cloud inference) further shape the benchmark design space. Each of these dimensions must be measured under realistic network conditions rather than idealized lab connectivity.

[^fn-urllc]: **URLLC (Ultra-Reliable Low-Latency Communication)**: 5G service category requiring 99.999% reliability and <1 ms latency. These dual constraints force a systems trade-off: achieving both simultaneously requires edge compute placement within 10 km of users (speed-of-light constraint), which limits available hardware to low-power accelerators, which in turn constrains model size. URLLC benchmarking must therefore measure the entire chain: radio latency + compute latency + model accuracy at the constrained size. \index{URLLC!edge inference constraint}

Automotive deployments add Automotive Safety Integrity Level (ASIL) validation, multi-sensor fusion, and -40°C to +85°C environmental testing. Meeting these requirements calls for benchmarking frameworks that evaluate sustained performance under thermal constraints, battery efficiency across usage patterns, and connectivity-dependent behavior, rather than isolated peak measurements.

Whether benchmarking cloud servers or microcontrollers, however, a critical distinction cuts across all deployment contexts: the same neural network behaves entirely differently depending on whether it is *learning* or *predicting*. This distinction shapes what we measure, how we measure it, and which metrics matter—and it is so fundamental that separate benchmarking frameworks have emerged for each phase.

## Training vs. Inference {#sec-benchmarking-training-vs-inference-evaluation-a3be}

Training and inference pursue fundamentally different objectives, and these contrasting goals create evaluation requirements so different that separate benchmarking frameworks emerged for each: MLPerf Training and MLPerf Inference. The critical question is whether theoretical TFLOPS translate to practical time-to-train or queries-per-second. Training seeks optimal parameters through iterative refinement (@sec-model-training), processing billions of examples over hours or days, stressing memory bandwidth, multi-GPU scaling, and sustained throughput. Inference applies those parameters to individual inputs under deployment strategies (@sec-ml-operations), often within millisecond deadlines, stressing latency consistency, cold-start time, and power efficiency.

The differences cascade through every aspect of system design. Training involves bidirectional computation (forward and backward passes), while inference performs single forward passes with fixed parameters. Memory allocation diverges sharply: training requires simultaneous access to parameters, gradients, optimizer states, and activations, creating 3--4$\times$ memory overhead compared to inference. Training employs mixed-precision computation and gradient compression to manage this overhead, while inference uses more aggressive precision reduction (detailed in @sec-benchmarking-inference-metrics-78d4) and techniques like post-training quantization and knowledge distillation. Resource utilization patterns also contrast: training targets sustained GPU saturation, whereas inference contends with variable request patterns that leave hardware underutilized, as the roofline analysis in @sec-benchmarking-system-benchmarks-393c demonstrated.
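
The 3--4$\times$ figure can be recovered from a simple count of per-parameter state. The sketch below assumes everything is stored in FP32 and ignores activations, which add further batch-dependent memory; an optimizer with one extra value per parameter (such as SGD with momentum) gives 3$\times$, and one with two (such as Adam) gives 4$\times$. The parameter count is hypothetical.

```python
def training_state_bytes(num_params, bytes_per_value=4, optimizer_slots=2):
    """Weights + gradients + optimizer state, ignoring activations."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer = num_params * bytes_per_value * optimizer_slots
    return weights + gradients + optimizer

num_params = 25_000_000                 # hypothetical model size
inference_bytes = num_params * 4        # weights only, FP32

adam_ratio = training_state_bytes(num_params) / inference_bytes
momentum_ratio = training_state_bytes(num_params, optimizer_slots=1) / inference_bytes
print(f"Adam: {adam_ratio:.0f}x, SGD with momentum: {momentum_ratio:.0f}x")
```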

Energy costs follow different patterns. Training energy costs are amortized across model lifetime and measured in total energy per trained model; estimates for large training runs can reach the scale of thousands of megawatt-hours (GPT-3 has been estimated at roughly 1,287 MWh) [@patterson2021carbon]. Inference energy costs accumulate per query and can become a dominant operational consideration at scale. A durable way to reason about per-query energy is the identity \(E = P \times t\). For example, a `{python} accel_power_w_str` W accelerator running a `{python} latency_fast_ms_str` ms inference consumes \(`{python} accel_power_w_str` \times 0.01 = `{python} energy_fast_j_str`\) joules, which is about \(`{python} energy_fast_wh_str`\) Wh; at `{python} latency_slow_ms_str` ms, that becomes about \(`{python} energy_slow_wh_str`\) Wh.
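
The identity is simple enough to script for any accelerator and latency pair. The 300 W power draw and 10 ms latency below are assumed values for illustration and are not necessarily the figures rendered in the paragraph above.

```python
def energy_per_query(power_w, latency_ms):
    """Apply E = P * t and return (joules, watt_hours) for one query."""
    joules = power_w * (latency_ms / 1000.0)
    watt_hours = joules / 3600.0        # 1 Wh = 3600 J
    return joules, watt_hours

joules, watt_hours = energy_per_query(power_w=300.0, latency_ms=10.0)
print(f"{joules:.2f} J per query = {watt_hours:.2e} Wh")
```

Because energy is linear in latency at fixed power, a tenfold latency increase produces a tenfold increase in per-query energy.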

This comparative framework guides benchmark design by highlighting which metrics matter most for each phase and how evaluation methodologies must differ. Training benchmarks emphasize convergence time and scaling efficiency; inference benchmarks prioritize latency consistency and resource efficiency across diverse deployment scenarios. We examine training benchmarks first, because the quality of the trained model sets the ceiling for everything inference can deliver.

## Training Benchmarks {#sec-benchmarking-training-benchmarks-96da}

\index{Training Benchmarks!convergence throughput scalability}
\index{Training Benchmarks!definition}
A team purchases a \$10M GPU cluster expecting 5$\times$ the training speed of their \$2M setup, only to discover that communication overhead and memory bottlenecks limit the actual speedup to 2.8$\times$. Training benchmarks exist to catch this kind of gap before procurement. They divide into three categories: convergence metrics that measure learning progress, throughput metrics that measure computational efficiency, and scalability metrics that measure distributed performance.

Training benchmarks validate whether hardware acceleration delivers promised training throughput. The GPU clusters, TPU pods, and distributed training strategies examined in @sec-hardware-acceleration all claim dramatic speedups, and training benchmarks reveal which claims hold under realistic workloads. They evaluate how hardware configurations, data loading mechanisms, and distributed training strategies perform when training production-scale models.

These benchmarks are vital because training represents the largest capital expenditure in ML systems. A cluster that costs \$10M should demonstrably outperform a \$2M cluster on training time-to-accuracy, but only rigorous benchmarking reveals whether the 5$\times$ cost delivers proportional value or falls victim to scaling inefficiencies, memory bottlenecks, or communication overhead.
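
Using the figures from the cluster example earlier in this section, the shortfall can be stated as a scaling efficiency and as speedup per unit of cost. The function name is illustrative.

```python
def scaling_efficiency(actual_speedup, expected_speedup):
    """Fraction of the expected speedup that was actually delivered."""
    return actual_speedup / expected_speedup

efficiency = scaling_efficiency(actual_speedup=2.8, expected_speedup=5.0)

cost_ratio = 10.0 / 2.0                 # $10M cluster versus $2M cluster
speedup_per_cost = 2.8 / cost_ratio
print(f"efficiency={efficiency:.0%}, speedup per unit cost={speedup_per_cost:.2f}")
```

An efficiency of 56% means that nearly half of the expected benefit was lost to overheads that a training benchmark would have exposed before purchase.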

For instance, large-scale models like OpenAI's GPT-3[^fn-bench-gpt3] [@brown2020language], which consists of `{python} gpt3_params_b_str` billion parameters trained on approximately 570GB of filtered CommonCrawl text (from a ~45TB raw dataset, combined with other sources to form `{python} gpt3_tokens_b_str` billion training tokens), highlight the immense computational demands of modern training. Standardized *ML training benchmarks* provide systematic evaluation of the underlying systems to ensure that hardware and software configurations can meet these unprecedented demands efficiently.

[^fn-bench-gpt3]: **GPT-3**: OpenAI's 2020 language model (`{python} gpt3_params_b_str`B parameters, `{python} gpt3_tokens_b_str`B training tokens) consumed an estimated 3,640 petaFLOP/s-days on 10,000 V100 GPUs at an estimated cost exceeding \$4.6M [@patterson2021carbon]. GPT-3 illustrated that, for a fixed number of training tokens, training compute scales roughly linearly with parameter count, making training benchmarks essential for predicting whether a planned training run is economically viable before committing the compute. \index{GPT-3!training cost}

::: {.callout-definition title="ML Training Benchmarks"}

***ML Training Benchmarks***\index{ML Training Benchmarks!definition} measure the **Rate of Convergence** per unit of resource (time, energy, cost).

1.  **Significance (Quantitative):** They validate the system's ability to sustain high **Arithmetic Intensity** across distributed accelerators while managing the **Communication Overhead** ($L_{lat}$) of gradient synchronization.
2.  **Distinction (Durable):** Unlike **Inference Benchmarks**, which focus on **Input-Output Latency**, Training Benchmarks focus on **Throughput ($\eta$)** and **Total Training Time ($T_{train}$)**.
3.  **Common Pitfall:** A frequent misconception is that training benchmarks only measure "how fast the GPU runs." In reality, for large models, the **Interconnect Bandwidth ($BW$)** and the **Fault Tolerance Overhead** are often more critical to the benchmark result than the raw FLOPs.

:::

\index{MLPerf Training!standardized framework}
MLPerf Training [@mlperf_training_website] provides the standardized framework referenced throughout this analysis of training benchmarks.

### Training Benchmark Motivation {#sec-benchmarking-training-benchmark-motivation-f365}

The impact of standardized training measurement is striking. \index{Moore's Law!outpaced by ML training improvements}
@fig-mlperf-training-improve demonstrates that performance improvements across successive MLPerf Training benchmark versions have consistently outpaced Moore's Law, with ResNet training speedups exceeding 30$\times$ over five years while semiconductor scaling would predict only 6.6$\times$. This exponential improvement illustrates a core principle: what gets measured gets improved. The standardized benchmarking framework creates competitive pressure that drives rapid optimization across the entire ML computing stack.

::: {#fig-mlperf-training-improve fig-env="figure" fig-pos="htb" fig-cap="**MLPerf Training Progress**: Standardized benchmarks reveal that machine learning training performance consistently surpasses Moore's Law, indicating substantial gains from systems-level optimizations. These trends emphasize how focused measurement and iterative improvement drive rapid advancements in ML training efficiency and scalability. Source: [@tschand2024mlperf]." fig-alt="Line chart with nine model benchmarks from 2018 to 2024 showing relative performance gains up to 48$\times$ for Mask R-CNN, all exceeding the Moore's Law baseline of 6.6$\times$."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
%\node[anchor=south west]at(-0.4,-12){%
%\includegraphics[width=281.1mm,height=188.2mm]{1}};

\makeatletter
\newcommand*\short[1]{\expandafter\@gobbletwo\number\numexpr#1\relax}
\makeatother

\begin{axis}[
   axis line style={draw=none},
  /pgf/number format/.cd,
  width=163mm,
  height=93mm,
  legend style={at={(0.16,0.98)}, anchor=north},
  legend cell align=left,
  legend style={fill=BrownL!40,draw=BrownLine,row sep=-1.1pt,
  font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
  date coordinates in=x,
  table/col sep=comma,
  xticklabel=\month/\short{\year},
  xtick={2018-12-01,2019-06-01,2019-12-01,
  2020-06-01,2020-12-01,2021-06-01,2021-12-01,2022-06-01,2022-12-01,
 2023-06-01,2023-12-01, 2024-06-01},
  x tick label style={rotate=0, anchor=north},
  xmin=2018-10-18,
  xmax=2024-07-30,
  ymin=0.95, ymax=64,
  ymode=log,
  log basis y=2,
  ytick={1,2,4,8,16,32,64},
  yticklabels={1,2,4,8,16,32,64},
  ylabel={},
  title={Relative performance - Best results - Closed, available, on premises},
  grid=both,
  major grid style={black!60},
        tick label style={/pgf/number format/assume math mode=true},
        ticklabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
        xticklabel style={yshift=-3pt},
]
%green-ResNet
\addplot[green!70!black,mark=Mercedes star,
mark options={line width=1pt},
mark size=3pt,line width=1.15pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2018-12-15
4.87, 2019-07-15
8.2, 2020-07-15
15.5, 2021-06-15
17.85, 2021-12-15
32.5, 2022-06-15
32.5, 2022-11-15
33.8, 2023-06-15
33.8, 2023-11-15
33.8, 2024-06-15
};
\addlegendentry{ResNet}
%diamond-Mask R-CNN
\addplot[cyan!90!black,mark=diamond*,
mark size=2pt,line width=1.15pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2018-12-15
 3.95, 2019-07-15
6.95, 2020-07-15
18.25, 2021-06-15
22.15, 2021-12-15
32.5, 2022-06-15
32.5, 2022-11-15
48.8, 2023-06-15
48.8, 2023-11-15
};
\addlegendentry{Mask R-CNN}
%triangle RetinaNet
\addplot[OliveLine,
line width=1.15pt,
mark size=2pt,mark=triangle*,
mark options={line width=1pt}
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
3.45, 2022-06-15
4.35, 2022-11-15
5.3, 2023-06-15
8.6, 2023-11-15
10.3, 2024-06-15
};
\addlegendentry{RetinaNet}
%red 3D-U-Net
\addplot[red,line width=1.15pt,
mark=square*,mark size=1.5pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
2.45, 2021-06-15
5.8, 2021-12-15
6.05, 2022-06-15
6.05, 2022-11-15
8.94, 2023-06-15
9.45, 2023-11-15
9.4, 2024-06-15
};
\addlegendentry{3D-U-Net}
%pentagon-Bert-large
\addplot[BlueLine,line width=1.15pt,
  mark=pentagon*,
  mark size=2pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1.78, 2020-07-15
4.45, 2021-06-15
6.3, 2021-12-15
7.95, 2022-06-15
6.9, 2022-11-15
10.6, 2023-06-15
11.9, 2023-11-15
11.99, 2024-06-15
};
\addlegendentry{BERT-large}
%violet GPT3
\addplot[pink!59!orange,line width=1.15pt,
mark=|,
  mark options={line width=1pt},
  mark size=2pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
4.8, 2023-06-15
13.45, 2023-11-15
15.39, 2024-06-15
};
\addlegendentry{GPT-3}
%red-DLRM
\addplot[RedLine,line width=1.15pt,
  mark=star,
  mark size=3pt,mark options={line width=1pt}
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1.78, 2020-07-15
5.95, 2021-06-15
9.3, 2021-12-15
10.0, 2022-06-15
10, 2022-11-15
};
\addlegendentry{DLRM}
%plus DLRM-dcnv2
\addplot[BrownLine,
line width=1.15pt,
mark size=2pt,mark=+,
mark options={line width=1pt}
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
4.8, 2023-06-15
7.55, 2023-11-15
7.9, 2024-06-15
};
\addlegendentry{DLRM-dcnv2}
%violet-Stable diffusion v2
\addplot[VioletLine,
line width=1.25pt,
mark size=2pt,mark=x,
mark options={line width=1pt}
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
5.5, 2023-11-15
9.8, 2024-06-15
};
\addlegendentry{Stable Diffusion v2}
%orange-Moore's Law Cumulative
\addplot[orange,line width=1.25pt,
mark size=2pt,mark=*,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2018-12-15
1.23, 2019-07-15
1.78, 2020-07-15
2.45, 2021-06-15
2.85, 2021-12-15
3.45, 2022-06-15
3.9, 2022-11-15
4.8, 2023-06-15
5.5, 2023-11-15
6.6, 2024-06-15
};
\addlegendentry{Moore's Law Cumulative}
\end{axis}
\end{tikzpicture}
```
:::

The improvement curve traced in @fig-mlperf-training-improve did not happen by accident: it emerged because standardized measurement created competitive pressure to optimize. Training benchmarks uncover inefficiencies that stay invisible without systematic evaluation: slow data loading, underutilized accelerators, excessive memory overhead, and communication bottlenecks that erode scaling efficiency. The theoretical hardware capabilities established in @sec-hardware-acceleration (e.g., GPU TFLOPS, TPU tensor throughput) can be counted on as actual training speedups only once benchmarks verify them under realistic conditions.

\index{Mixed-Precision Training!FP16 and FP32}
Training benchmarks serve four interconnected functions. First, they enable hardware and software optimization by providing vendor-neutral comparisons across accelerator architectures and frameworks (TensorFlow, PyTorch) on standardized tasks, guiding hardware selection for data centers and cloud environments. Software optimizations including mixed-precision training[^fn-bench-mixed-precision] and memory-efficient data loading are similarly quantified. Second, they evaluate scalability: adding GPUs should reduce training time proportionally, but communication overhead, synchronization latency, and memory bottlenecks limit scaling efficiency in practice. Training benchmarks quantify these losses, revealing whether infrastructure investments deliver proportional returns. Third, they provide cost and energy accountability: with large-scale training runs consuming thousands of megawatt-hours, benchmarks that track cost per training run and power consumption per unit of progress help organizations balance computational power with sustainability goals. Finally, they ensure fair, reproducible comparison through standardized evaluation criteria, controlled randomness, and strict submission guidelines that guarantee performance results reflect genuine system capabilities rather than implementation-specific tuning.

[^fn-bench-mixed-precision]: **Mixed-Precision Training**: Uses FP16 for computation and FP32 for accumulation, achieving 1.5--2$\times$ speedups on Tensor Cores while reducing memory by ~40%. The benchmarking consequence: mixed-precision and full-precision runs are not directly comparable because reduced memory enables larger batch sizes, which change convergence dynamics. MLPerf addresses this by fixing the accuracy target, making time-to-accuracy the comparable quantity regardless of precision strategy. \index{Mixed-Precision Training!benchmarking comparability}

### Training Metrics {#sec-benchmarking-training-metrics-0f1a}

From a systems perspective, training benchmarks assess how efficiently a model reaches a predefined accuracy threshold. Metrics like throughput and scalability are only meaningful relative to whether the model achieves its target accuracy; without this constraint, optimizing raw speed may be misleading.

MLPerf Training codifies this by defining specific accuracy targets per task: a system that trains quickly but misses the target is invalid, and one that converges accurately but too slowly is impractical. Effective benchmarking balances speed, efficiency, and accuracy convergence.

#### Time and Throughput {#sec-benchmarking-time-throughput-1252}

\index{Time-to-Accuracy!training benchmark primary metric}
One of the primary metrics for evaluating training efficiency is the time required to reach a predefined accuracy threshold. Training time ($T_{\text{train}}$) measures how long a model takes to converge to an acceptable performance level, reflecting the overall computational efficiency of the system. Let $\text{accuracy}(t)$ be the model's accuracy at training time $t$, and let **target accuracy** be the benchmark-specific threshold (e.g., 75.9% top-1 accuracy for ResNet-50 on ImageNet in MLPerf). @eq-training-time-benchmark formally defines this metric:

$$T_{\text{train}} = \min \big\{\, t : \text{accuracy}(t) \geq \text{target accuracy} \,\big\}$$ {#eq-training-time-benchmark}

This metric ensures that benchmarking focuses on how quickly and effectively a system can achieve meaningful results.
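
In practice, accuracy is only observed at periodic evaluation points, so @eq-training-time-benchmark is applied to a log rather than a continuous curve. The following is a minimal sketch under that assumption; the function name `time_to_accuracy` and the evaluation log values are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def time_to_accuracy(
    log: Sequence[Tuple[float, float]],
    target_accuracy: float,
) -> Optional[float]:
    """Return the earliest logged time at which accuracy >= target, else None.

    `log` holds (elapsed_seconds, accuracy) pairs from periodic evaluation.
    """
    for elapsed, accuracy in sorted(log):
        if accuracy >= target_accuracy:
            return elapsed
    return None  # target never reached: no valid time-to-accuracy


# Hypothetical evaluation log: (elapsed seconds, top-1 accuracy).
eval_log = [
    (600.0, 0.412),
    (1200.0, 0.688),
    (1800.0, 0.747),
    (2400.0, 0.761),
    (3000.0, 0.765),
]

t_train = time_to_accuracy(eval_log, target_accuracy=0.759)
never = time_to_accuracy(eval_log, target_accuracy=0.80)
print(t_train, never)
```

Because the target can only be detected at an evaluation point, the measured time is rounded up to the next evaluation; how often a system evaluates is therefore part of what the benchmark measures.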

\index{Throughput!definition}
Throughput[^fn-throughput-etymology], often expressed as the number of training samples processed per second, provides an additional measure of system performance. Let $N_{\text{samples}}$ be the total number of training samples processed and $T_{\text{train}}$ the training time from @eq-training-time-benchmark. @eq-throughput-benchmark defines throughput as their ratio:

[^fn-throughput-etymology]: **Throughput**: From manufacturing, where it measured units passing through a production line per unit time. The term entered computing in the 1960s batch-processing era. The manufacturing origin carries a systems lesson: throughput and latency are inherently opposed, because batching increases throughput (more units per hour) at the cost of individual item wait time. In ML serving, this manifests as the batch-size trade-off: larger batches improve GPU utilization but increase per-request latency. \index{Throughput!etymology}

$$\text{Throughput} = \frac{N_{\text{samples}}}{T_{\text{train}}}$$ {#eq-throughput-benchmark}

Throughput alone does not guarantee meaningful results, as a model may process a large number of samples quickly without necessarily reaching the desired accuracy.

For example, the MLPerf Training benchmark for ResNet-50 requires reaching 75.9% top-1 accuracy on the ImageNet dataset. A system that processes 10,000 images per second but fails to reach this target does not produce a valid benchmark result, while a system that processes fewer images per second but converges to the target does. Throughput should therefore be evaluated in relation to time-to-accuracy rather than as an independent performance measure.
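
The interaction between @eq-throughput-benchmark and the accuracy target can be sketched as a filter followed by a ranking: runs that miss the target are discarded regardless of throughput, and the remaining runs are ordered by training time. The `TrainingRun` class, the system names, and all numbers below are hypothetical, not MLPerf results or MLPerf tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingRun:
    name: str
    samples_processed: int   # N_samples
    train_seconds: float     # wall-clock time until the run stopped
    final_accuracy: float    # accuracy when the run stopped


def throughput(run: TrainingRun) -> float:
    """Samples per second, as in the throughput equation."""
    return run.samples_processed / run.train_seconds


def rank_valid_runs(runs: List[TrainingRun], target_accuracy: float) -> List[TrainingRun]:
    """Keep only runs that reached the target, fastest first."""
    valid = [r for r in runs if r.final_accuracy >= target_accuracy]
    return sorted(valid, key=lambda r: r.train_seconds)


# Hypothetical runs: system_a is faster per sample but misses the target.
runs = [
    TrainingRun("system_a", 36_000_000, 3_600.0, 0.741),
    TrainingRun("system_b", 32_000_000, 4_000.0, 0.760),
]
TARGET = 0.759

for r in runs:
    status = "valid" if r.final_accuracy >= TARGET else "invalid"
    print(f"{r.name}: {throughput(r):,.0f} samples/s, {status}")

ranking = [r.name for r in rank_valid_runs(runs, TARGET)]
print(ranking)
```

Here the higher-throughput system drops out of the ranking entirely, which is the sense in which time-to-accuracy acts as a constraint on throughput rather than a competing metric.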

#### Scalability and Parallelism {#sec-benchmarking-scalability-parallelism-1124}

\index{Scalability!distributed training metric}
Scalability measures how effectively training performance improves as resources are added. Ideally, doubling GPU count should halve training time. In practice, communication overhead, memory bandwidth limits, and parallelization inefficiencies constrain scaling well below linear.

Training GPT-3, for example, used approximately 10,000 NVIDIA V100 GPUs in a distributed setup. Google's systems have shown similar scaling challenges on 4,096-chip TPU v4 pods, where adding computational resources provides more raw power but the realized speedup is constrained by network communication overhead between nodes. Benchmarks such as MLPerf quantify how well a system scales across multiple accelerators, showing where inefficiencies arise in distributed training.

\index{Data Parallelism!training strategy}
\index{Model Parallelism!training strategy}
\index{Pipeline Parallelism!training strategy}
Parallelism in training is categorized into data parallelism, model parallelism, and pipeline parallelism (see @sec-model-training), each presenting distinct challenges. Data parallelism, the most commonly used strategy, splits the training dataset across multiple compute nodes, so its efficiency depends on synchronization mechanisms and gradient communication overhead. Model parallelism instead partitions the neural network itself, requiring efficient coordination between processors. Pipeline parallelism assigns consecutive groups of layers to different devices and streams micro-batches through them, so its efficiency depends on how much time devices spend idle waiting for work from neighboring stages. Benchmarks evaluate how well a system manages these parallelism strategies without degrading accuracy convergence.

\index{Scaling Efficiency!strong scaling definition}
A key metric for evaluating parallelism is *scaling efficiency*, which quantifies how much of the added computational capacity translates into actual speedup.

```{python}
#| label: scaling-efficiency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SCALING EFFICIENCY CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Scaling Efficiency Calculation" — worked example showing
# │          strong scaling efficiency for multi-GPU ResNet-50 training
# │
# │ Goal: Demonstrate the practical impact of scaling efficiency.
# │ Show: That an 8-GPU cluster can lose 25% of its potential to overhead.
# │ How: Calculate the efficiency ratio between single-GPU and multi-GPU training times.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: ideal_str, eff_str, loss_str, eff_denom_str, t1_hours_str,
# │          n_gpus_str, tn_hours_str, scaling_eq_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check

class ScalingEfficiencyCalc:
    """Strong scaling efficiency for 8-GPU ResNet-50 training: 75% efficiency, 25% overhead loss."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    t1_hours = 24                                                               # single-GPU training time
    n_gpus = 8
    tn_hours = 4                                                                # actual N-GPU training time
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    ideal_hours = t1_hours / n_gpus
    efficiency_pct = t1_hours / (n_gpus * tn_hours) * 100
    loss_pct = 100 - efficiency_pct
    eff_denom = n_gpus * tn_hours
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    ideal_str = fmt(ideal_hours, precision=0, commas=False)
    eff_str = fmt(efficiency_pct, precision=0, commas=False)
    loss_str = fmt(loss_pct, precision=0, commas=False)
    eff_denom_str = f"{eff_denom}"
    t1_hours_str = fmt(t1_hours, precision=0, commas=False)
    n_gpus_str = fmt(n_gpus, precision=0, commas=False)
    tn_hours_str = fmt(tn_hours, precision=0, commas=False)
    scaling_eq_str = f"Efficiency({n_gpus}) = {t1_hours} hours / ({n_gpus} × {tn_hours} hours) × 100%"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
ideal_str = ScalingEfficiencyCalc.ideal_str
eff_str = ScalingEfficiencyCalc.eff_str
loss_str = ScalingEfficiencyCalc.loss_str
eff_denom_str = ScalingEfficiencyCalc.eff_denom_str
t1_hours_str = ScalingEfficiencyCalc.t1_hours_str
n_gpus_str = ScalingEfficiencyCalc.n_gpus_str
tn_hours_str = ScalingEfficiencyCalc.tn_hours_str
scaling_eq_str = ScalingEfficiencyCalc.scaling_eq_str
```

::: {.callout-notebook title="Scaling Efficiency Calculation"}

**Problem**: Your team trains ResNet-50 on ImageNet. Single-GPU training takes `{python} t1_hours_str` hours. With `{python} n_gpus_str` GPUs, training takes `{python} tn_hours_str` hours. Is this good scaling? Where did the efficiency go?

#### Step 1: Define Scaling Efficiency {.unnumbered}

For **strong scaling** (fixed problem size, more processors), let $T(1)$ be the training time on a single GPU, $T(N)$ the training time on $N$ GPUs, and $N$ the GPU count. @eq-scaling-efficiency defines efficiency:

$$\text{Scaling Efficiency}(N) = \frac{T(1)}{N \times T(N)} \times 100\%$$ {#eq-scaling-efficiency}

#### Step 2: Calculate Efficiency {.unnumbered}

`{python} scaling_eq_str` = `{python} t1_hours_str`/`{python} eff_denom_str` = **`{python} eff_str`%**

With perfect scaling, `{python} n_gpus_str` GPUs would complete in `{python} ideal_str` hours (`{python} t1_hours_str`/`{python} n_gpus_str`). The actual `{python} tn_hours_str` hours represents `{python} eff_str`% efficiency.

#### Step 3: Account for the Efficiency Loss {.unnumbered}

The "missing" `{python} loss_str`% decomposes into measurable overhead:

| **Source**                   | **Typical Contribution** | **Measurement**                     |
|:-----------------------------|-------------------------:|:------------------------------------|
| **Gradient synchronization** |                   10-15% | AllReduce time per step             |
| **Memory copy (CPU↔GPU)**    |                     3-5% | Data transfer profiling             |
| **Load imbalance**           |                     2-5% | Per-GPU step time variance          |
| **Batch size effects**       |                     2-5% | Larger batches converge differently |

#### Step 4: The Systems Insight {.unnumbered}

Scaling efficiency decreases as $N$ grows because communication overhead scales with GPU count while per-GPU compute shrinks. At 8 GPUs, 75% efficiency is typical. At 64 GPUs, efficiency often drops to 50-60%. At 1000+ GPUs, even 30-40% efficiency requires sophisticated optimization.

This is why benchmark results should be read for both raw performance and scaling efficiency: a system achieving 2$\times$ throughput at 50% efficiency may be worse than one achieving 1.5$\times$ throughput at 90% efficiency, depending on your cost constraints.
:::
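
The strong-scaling definition in @eq-scaling-efficiency is simple enough to apply to any set of measured training times. The sketch below reuses the 24-hour single-GPU and 4-hour 8-GPU figures from the worked example; the 2-GPU and 4-GPU times are hypothetical values added only to show a trend, and the function names are illustrative.

```python
from typing import Dict


def speedup(t1: float, tn: float) -> float:
    """Observed speedup of an N-GPU run over the single-GPU run."""
    return t1 / tn


def scaling_efficiency(t1: float, n: int, tn: float) -> float:
    """Strong-scaling efficiency: fraction of the ideal N-fold speedup achieved."""
    return t1 / (n * tn)


T1_HOURS = 24.0  # single-GPU training time from the worked example

# GPU count -> measured training hours.
# The 8-GPU entry is from the worked example; 2 and 4 are hypothetical.
measured_hours: Dict[int, float] = {2: 12.6, 4: 6.9, 8: 4.0}

for n, tn in sorted(measured_hours.items()):
    eff = scaling_efficiency(T1_HOURS, n, tn)
    print(
        f"N={n}: speedup {speedup(T1_HOURS, tn):.2f}x, "
        f"efficiency {eff:.0%}, overhead {1 - eff:.0%}"
    )
```

Reporting the overhead as `1 - efficiency` makes it directly comparable to the per-source contributions in the table above, which should sum to roughly the same figure if the decomposition is complete.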

#### Resource Utilization {#sec-benchmarking-resource-utilization-336b}

\index{Resource Utilization!compute and memory}
The efficiency of machine learning training depends not only on speed and scalability but also on how well available hardware resources are utilized. Compute utilization measures the extent to which processing units, such as GPUs or TPUs, are actively engaged during training. Low utilization may indicate bottlenecks in data movement, memory access, or inefficient workload scheduling.

For instance, when training BERT on a TPU cluster, researchers observed that input pipeline inefficiencies were limiting overall throughput. Although the TPUs had high raw compute power, the system was not keeping them fully utilized due to slow data retrieval from storage. By profiling the resource utilization, engineers identified the bottleneck and optimized the input pipeline using TFRecord and data prefetching, leading to improved performance.

Memory bandwidth is another critical factor, as deep learning models require frequent access to large volumes of data during training. If memory bandwidth becomes a limiting factor, increasing compute power alone will not improve training speed. Benchmarks assess how well models use available memory, ensuring that data transfer rates between storage, main memory, and processing units do not become performance bottlenecks.

I/O performance also plays a direct role in training efficiency, particularly when working with large datasets that cannot fit entirely in memory. Benchmarks evaluate the efficiency of data loading pipelines, including preprocessing operations, caching mechanisms, and storage retrieval speeds. Systems that fail to optimize data loading can experience order-of-magnitude slowdowns, regardless of computational power.
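
One way to make an input-pipeline bottleneck visible is to split each training step into time spent waiting for data and time spent computing. The following sketch assumes the two phases do not overlap and that per-step timings have already been collected by a profiler; the function name and the timing values are hypothetical.

```python
from typing import Sequence, Tuple


def compute_utilization(steps: Sequence[Tuple[float, float]]) -> float:
    """Fraction of wall-clock time the accelerator spent computing.

    Each step is (data_wait_seconds, compute_seconds), assuming the
    accelerator is idle while it waits for the next batch.
    """
    total_wait = sum(wait for wait, _ in steps)
    total_compute = sum(compute for _, compute in steps)
    total = total_wait + total_compute
    if total <= 0:
        raise ValueError("steps must contain positive total time")
    return total_compute / total


# Hypothetical per-step timings (seconds) before and after adding prefetching.
baseline = [(0.375, 0.125)] * 100    # accelerator mostly waits on storage
prefetched = [(0.025, 0.125)] * 100  # next batch is usually ready in time

print(f"baseline utilization:   {compute_utilization(baseline):.0%}")
print(f"prefetched utilization: {compute_utilization(prefetched):.0%}")
```

In this illustration the compute time per step is unchanged; only the wait shrinks. That is the signature of an I/O-bound workload, where faster accelerators would not have helped.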

#### Energy Efficiency and Cost {#sec-benchmarking-energy-efficiency-cost-15ca}

\index{Energy Efficiency!training cost accounting}
Training large-scale machine learning models requires substantial computational resources, leading to considerable energy consumption and financial costs. Energy efficiency metrics quantify the energy used by training workloads, helping identify systems that optimize computational efficiency while minimizing energy waste. The increasing focus on sustainability has led to the inclusion of energy-based benchmarks, such as those in MLPerf Training, which measure the energy consumed per training run. To understand what these energy benchmarks actually measure, it helps to decompose energy at the hardware level, using *why INT8 saves energy* during inference as a worked example.

```{python}
#| label: energy-breakdown-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ENERGY BREAKDOWN CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Why INT8 Saves Energy" — decomposes MobileNet inference
# │          energy into memory-load and compute components for FP32 vs INT8
# │
# │ Goal: Demonstrate that INT8 quantization saves energy primarily through
# │       reduced DRAM traffic (4× fewer bytes loaded), not cheaper arithmetic.
# │ Show: FP32 vs INT8 load and compute energy costs, and total savings ratio.
# │ How: Multiply MobileNet parameter count and MAC count by pJ-per-operation
# │      constants; compare FP32 and INT8 totals to get savings ratios.
# │
# │ Imports: mlsys.constants (ENERGY_DRAM_PJ_PER_BYTE, ENERGY_FLOP_FP32_PJ,
# │          ENERGY_FLOP_INT8_PJ), mlsys.formatting (fmt)
# │ Exports: dram_energy_pj_str, m_params_str, m_macs_str, m_fp32_mb_str,
# │          m_int8_mb_str, e_fp32_load_str, e_fp32_compute_str,
# │          e_fp32_total_str, e_int8_load_str, e_int8_compute_str,
# │          e_int8_total_str, s_load_str, s_compute_str, s_total_str,
# │          e_fp32_load_mj_str, e_fp32_compute_mj_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check
from mlsys.constants import ENERGY_DRAM_PJ_PER_BYTE, ENERGY_FLOP_FP32_PJ, ENERGY_FLOP_INT8_PJ

class EnergyBreakdownCalc:
    """FP32 vs INT8 MobileNet inference energy: memory load dominates; INT8 attacks both sources."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    m_params = 4.3                                                              # Million params
    m_macs = 300                                                                # Million MACs
    _dram_pj = ENERGY_DRAM_PJ_PER_BYTE.m_as(ureg.picojoule / byte)
    _fp32_pj = ENERGY_FLOP_FP32_PJ.m_as(ureg.picojoule / ureg.flop)
    _int8_pj = ENERGY_FLOP_INT8_PJ.m_as(ureg.picojoule / ureg.flop)
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # FP32: 4 bytes/param for load; INT8: 1 byte/param
    e_fp32_load = (m_params * MILLION * 4 * _dram_pj) / MILLION               # uJ
    e_fp32_compute = (m_macs * MILLION * _fp32_pj) / MILLION                  # uJ
    e_int8_load = (m_params * MILLION * 1 * _dram_pj) / MILLION               # uJ
    e_int8_compute = (m_macs * MILLION * _int8_pj) / MILLION                  # uJ
    m_fp32_mb = m_params * 4
    m_int8_mb = m_params
    e_fp32_total = e_fp32_load + e_fp32_compute
    e_int8_total = e_int8_load + e_int8_compute
    s_load = e_fp32_load / e_int8_load
    s_compute = e_fp32_compute / e_int8_compute
    s_total = e_fp32_total / e_int8_total
    e_fp32_load_mj = e_fp32_load / 1000
    e_fp32_compute_mj = e_fp32_compute / 1000
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    dram_energy_pj_str = fmt(_dram_pj, precision=0, commas=False)
    m_params_str = f"{m_params}"
    m_macs_str = fmt(m_macs, precision=0, commas=False)
    m_fp32_mb_str = fmt(m_fp32_mb, precision=0, commas=False)
    m_int8_mb_str = f"{m_int8_mb}"
    e_fp32_load_str = fmt(e_fp32_load, precision=0, commas=False)
    e_fp32_compute_str = fmt(e_fp32_compute, precision=0, commas=True)
    e_fp32_total_str = fmt(e_fp32_total, precision=0, commas=True)
    e_int8_load_str = fmt(e_int8_load, precision=0, commas=False)
    e_int8_compute_str = fmt(e_int8_compute, precision=0, commas=False)
    e_int8_total_str = fmt(e_int8_total, precision=0, commas=True)
    s_load_str = fmt(s_load, precision=0, commas=False)
    s_compute_str = fmt(s_compute, precision=0, commas=False)
    s_total_str = fmt(s_total, precision=0, commas=False)
    e_fp32_load_mj_str = fmt(e_fp32_load_mj, precision=1, commas=False)
    e_fp32_compute_mj_str = fmt(e_fp32_compute_mj, precision=1, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
dram_energy_pj_str = EnergyBreakdownCalc.dram_energy_pj_str
m_params_str = EnergyBreakdownCalc.m_params_str
m_macs_str = EnergyBreakdownCalc.m_macs_str
m_fp32_mb_str = EnergyBreakdownCalc.m_fp32_mb_str
m_int8_mb_str = EnergyBreakdownCalc.m_int8_mb_str
e_fp32_load_str = EnergyBreakdownCalc.e_fp32_load_str
e_fp32_compute_str = EnergyBreakdownCalc.e_fp32_compute_str
e_fp32_total_str = EnergyBreakdownCalc.e_fp32_total_str
e_int8_load_str = EnergyBreakdownCalc.e_int8_load_str
e_int8_compute_str = EnergyBreakdownCalc.e_int8_compute_str
e_int8_total_str = EnergyBreakdownCalc.e_int8_total_str
s_load_str = EnergyBreakdownCalc.s_load_str
s_compute_str = EnergyBreakdownCalc.s_compute_str
s_total_str = EnergyBreakdownCalc.s_total_str
e_fp32_load_mj_str = EnergyBreakdownCalc.e_fp32_load_mj_str
e_fp32_compute_mj_str = EnergyBreakdownCalc.e_fp32_compute_mj_str
```

::: {.callout-notebook title="Why INT8 Saves Energy"}

Recall from @sec-hardware-acceleration that moving data costs far more energy than computing on it (the Energy-Movement Invariant formalized in @sec-data-engineering). Understanding *why* quantization reduces energy consumption\index{INT8 Savings!compute vs. memory energy} requires decomposing energy into its physical sources. Two dominant factors determine inference energy: compute operations and memory access.

**Compute Energy (per multiply-accumulate operation)**:

| **Precision** |          **Multiplier Energy** | **Relative Cost** |
|:--------------|-------------------------------:|------------------:|
| **FP32**      | ~`{python} energy_fp32_str` pJ |       1.0$\times$ |
| **FP16**      | ~`{python} energy_fp16_str` pJ |       0.3$\times$ |
| **INT8**      | ~`{python} energy_int8_str` pJ |      0.05$\times$ |

An 8-bit multiplier uses ~20$\times$ less energy than a 32-bit floating-point multiplier because transistor count scales roughly with bit-width squared, and switching energy scales with transistor count.

**Memory Access Energy (per byte)**:

| **Memory Level** |               **Energy per Byte** | **Relative Cost** |
|:-----------------|----------------------------------:|------------------:|
| **Register**     |     ~`{python} energy_reg_str` pJ |         1$\times$ |
| **L1 Cache**     |      ~`{python} energy_l1_str` pJ |        50$\times$ |
| **L2 Cache**     |      ~`{python} energy_l2_str` pJ |       200$\times$ |
| **DRAM**         | ~`{python} dram_energy_pj_str` pJ |    16,000$\times$ |

Memory access dominates: reading one byte from DRAM costs over 10,000$\times$ more energy than a register access.

**Combined Effect for MobileNet Inference**:

| **Component**                              | **FP32 (`{python} m_fp32_mb_str` MB)** | **INT8 (`{python} m_int8_mb_str` MB)** |                        **Savings** |
|:-------------------------------------------|---------------------------------------:|---------------------------------------:|-----------------------------------:|
| **Model load from DRAM**                   |          `{python} e_fp32_load_str` μJ |          `{python} e_int8_load_str` μJ |      `{python} s_load_str`$\times$ |
| **Compute (`{python} m_macs_str` M MACs)** |       `{python} e_fp32_compute_str` μJ |       `{python} e_int8_compute_str` μJ |   `{python} s_compute_str`$\times$ |
| **Total**                                  |     **`{python} e_fp32_total_str` μJ** |     **`{python} e_int8_total_str` μJ** | **`{python} s_total_str`$\times$** |

**The Systems Insight**: Memory access dominates FP32 energy consumption (~`{python} e_fp32_load_mj_str` mJ for the model load vs. `{python} e_fp32_compute_mj_str` mJ for compute). INT8 quantization provides `{python} s_load_str`$\times$ memory energy reduction and ~`{python} s_compute_str`$\times$ compute energy reduction. The combined effect explains why quantized models on edge devices achieve dramatic battery life improvements: they attack the dominant memory bottleneck while also making each arithmetic operation cheaper.
:::
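
The decomposition above reduces to two products: bytes loaded times energy per byte, and operations executed times energy per operation. The sketch below makes that structure explicit. The parameter and MAC counts follow the MobileNet example, but the three per-operation energy constants are assumed placeholder values, not the constants used to generate the tables above, so the printed numbers may differ from the callout.

```python
from typing import Dict

PJ_PER_UJ = 1e6  # picojoules per microjoule


def inference_energy_uj(
    n_params: float,
    n_macs: float,
    bytes_per_param: int,
    dram_pj_per_byte: float,
    pj_per_mac: float,
) -> Dict[str, float]:
    """Split one inference into weight-load energy and compute energy (in uJ)."""
    load_uj = n_params * bytes_per_param * dram_pj_per_byte / PJ_PER_UJ
    compute_uj = n_macs * pj_per_mac / PJ_PER_UJ
    return {"load": load_uj, "compute": compute_uj, "total": load_uj + compute_uj}


# Model size from the MobileNet example: 4.3 M parameters, 300 M MACs.
N_PARAMS = 4.3e6
N_MACS = 300e6

# ASSUMED per-operation energies (illustrative placeholders only).
DRAM_PJ_PER_BYTE = 160.0
FP32_PJ_PER_MAC = 3.7
INT8_PJ_PER_MAC = 0.2

fp32 = inference_energy_uj(N_PARAMS, N_MACS, 4, DRAM_PJ_PER_BYTE, FP32_PJ_PER_MAC)
int8 = inference_energy_uj(N_PARAMS, N_MACS, 1, DRAM_PJ_PER_BYTE, INT8_PJ_PER_MAC)

savings = {key: fp32[key] / int8[key] for key in fp32}
for key in ("load", "compute", "total"):
    print(
        f"{key:8s} FP32 {fp32[key]:8.0f} uJ   "
        f"INT8 {int8[key]:8.0f} uJ   savings {savings[key]:.1f}x"
    )
```

Whatever constants are substituted, the load savings is exactly 4$\times$ because only the bytes per parameter change, and the total savings always lies between the load and compute savings, pulled toward whichever component dominates.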

Training GPT-3 was estimated to consume `{python} gpt3_energy_mwh_str` MWh of electricity [@patterson2021carbon]. If a system can achieve the same accuracy with fewer training iterations, it directly reduces energy consumption. Energy-aware benchmarks help guide the development of hardware and training strategies that optimize power efficiency while maintaining accuracy targets.

Cost considerations extend beyond electricity usage to include hardware expenses, cloud computing costs, and infrastructure maintenance. Training benchmarks provide insights into the cost-effectiveness of different hardware and software configurations by measuring training time in relation to resource expenditure. Organizations can use these benchmarks to balance performance and budget constraints when selecting training infrastructure.
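
A first-order cost and energy estimate follows directly from accelerator-hours. The sketch below applies it to the 1-GPU and 8-GPU runs from the scaling efficiency example; the price per accelerator-hour and the average power draw are assumed values for illustration, and the estimate ignores host CPUs, networking, cooling, and storage.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    n_accelerators: int
    train_hours: float          # time-to-accuracy in hours
    usd_per_accel_hour: float   # ASSUMED rental price
    avg_watts_per_accel: float  # ASSUMED average draw per accelerator


def accelerator_hours(run: RunConfig) -> float:
    return run.n_accelerators * run.train_hours


def run_cost_usd(run: RunConfig) -> float:
    return accelerator_hours(run) * run.usd_per_accel_hour


def run_energy_kwh(run: RunConfig) -> float:
    return accelerator_hours(run) * run.avg_watts_per_accel / 1000.0


# Training times from the scaling example; price and power are hypothetical.
single = RunConfig(n_accelerators=1, train_hours=24.0,
                   usd_per_accel_hour=2.0, avg_watts_per_accel=300.0)
multi = RunConfig(n_accelerators=8, train_hours=4.0,
                  usd_per_accel_hour=2.0, avg_watts_per_accel=300.0)

for label, run in (("1 GPU", single), ("8 GPUs", multi)):
    print(
        f"{label}: {accelerator_hours(run):.0f} accelerator-hours, "
        f"${run_cost_usd(run):.2f}, {run_energy_kwh(run):.1f} kWh"
    )
```

Under these assumptions the 8-GPU run finishes six times sooner but consumes 32 accelerator-hours instead of 24, so both cost and energy rise by the inverse of the 75% scaling efficiency. Time-to-accuracy and cost-to-accuracy can therefore rank the same two configurations in opposite order.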

#### Fault Tolerance and Robustness {#sec-benchmarking-fault-tolerance-robustness-93e8}

\index{Fault Tolerance!training checkpoint strategy}
Training workloads often run for extended periods, sometimes spanning days or weeks, making fault tolerance an essential consideration. A resilient system must handle unexpected failures (hardware malfunctions, network disruptions, and memory errors) without compromising accuracy convergence.

In large-scale cloud-based training, node failures are common due to hardware instability. If a GPU node in a distributed cluster fails, training must continue without corrupting the model. MLPerf Training includes evaluations of fault-tolerant training strategies, such as checkpointing, where models periodically save their progress. This ensures that failures do not require restarting the entire training process.
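
Checkpointing trades steady-state overhead against work lost on failure. The sketch below captures that trade-off under two simplifying assumptions: checkpoint writes are synchronous (training pauses while saving), and a failure is equally likely at any point between checkpoints. The function names and the interval and save-time values are hypothetical.

```python
def checkpoint_overhead_fraction(interval_s: float, save_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Assumes training pauses for `save_s` seconds after every
    `interval_s` seconds of useful work.
    """
    return save_s / (interval_s + save_s)


def expected_lost_work_s(interval_s: float) -> float:
    """Average useful work lost per failure.

    Assumes a failure is equally likely anywhere within an interval,
    so on average half an interval must be recomputed.
    """
    return interval_s / 2.0


SAVE_SECONDS = 120.0  # hypothetical time to write one checkpoint

for interval in (600.0, 1800.0, 7200.0):
    overhead = checkpoint_overhead_fraction(interval, SAVE_SECONDS)
    lost = expected_lost_work_s(interval)
    print(
        f"interval {interval / 60:.0f} min: overhead {overhead:.1%}, "
        f"expected loss per failure {lost / 60:.0f} min"
    )
```

Shorter intervals reduce the work lost per failure but raise the steady-state overhead, so the appropriate interval depends on how often failures actually occur on the target cluster.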

#### Reproducibility and Standardization {#sec-benchmarking-reproducibility-standardization-7ecf}

In 2019, a research team reported a 2% accuracy improvement on a standard NLP benchmark, but three independent groups failed to replicate the result—the improvement vanished when different random seeds, GPU models, or PyTorch versions were used. This reproducibility failure illustrates a pervasive problem: training benchmarks involve stochastic processes (weight initialization, data shuffling, dropout masks) that interact with hardware-specific behaviors (floating-point rounding, memory layout, compiler optimizations) to produce results that can vary meaningfully across environments. Without explicit controls for these sources of variability, benchmark numbers reflect a specific confluence of conditions rather than a system's genuine capability.

MLPerf Training addresses this by enforcing strict reproducibility requirements: fixed random seeds, standardized data preprocessing, and mandatory multi-run submissions that demonstrate result stability. When NVIDIA submitted ResNet-50 results, for instance, they had to show consistent training times across different GPU configurations—demonstrating that the reported performance reflected hardware capability rather than a lucky combination of stochastic factors.
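
Result stability across repeated runs can be summarized with a mean, a standard deviation, and their ratio. The sketch below computes these for a set of time-to-accuracy measurements; it is a generic illustration rather than MLPerf's scoring procedure, and the run times are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def run_variability(values: Sequence[float]) -> Dict[str, float]:
    """Summarize run-to-run spread of a benchmark metric.

    Uses the sample standard deviation, so at least two runs are required.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return {"mean": mean, "stdev": stdev, "cv_percent": 100.0 * stdev / mean}


# Hypothetical time-to-accuracy (minutes) for five runs with different seeds.
times_min = [28.4, 29.1, 28.7, 30.2, 28.9]

summary = run_variability(times_min)
print(
    f"mean {summary['mean']:.1f} min, "
    f"stdev {summary['stdev']:.2f} min, "
    f"CV {summary['cv_percent']:.1f}%"
)
```

A claimed improvement smaller than the run-to-run spread cannot be distinguished from seed noise, which is why a single-run comparison is not sufficient evidence.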

### Training Performance Evaluation {#sec-benchmarking-training-performance-evaluation-bdc3}

A comprehensive training benchmark considers multiple dimensions of system behavior. The metrics used depend on whether the goal is speed, resource efficiency, energy consumption, or reproducibility.

@tbl-training-metrics summarizes the core categories and associated metrics commonly used to benchmark system-level training performance, providing a framework for understanding how training systems behave under different workloads and configurations.

| **Category**                            | **Key Metrics**                                                                                                        | **Example Benchmark Use**                                   |
|:----------------------------------------|:-----------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------|
| **Training Time and Throughput**        | Time-to-accuracy (seconds, minutes, hours); Throughput (samples/sec)                                                   | Comparing training speed across different GPU architectures |
| **Scalability and Parallelism**         | Scaling efficiency (% of ideal speedup); Communication overhead (latency, bandwidth)                                   | Analyzing distributed training performance for large models |
| **Resource Utilization**                | Compute utilization (% GPU/TPU usage); Memory bandwidth (GB/s); I/O efficiency (data loading speed)                    | Optimizing data pipelines to improve GPU utilization        |
| **Energy Efficiency and Cost**          | Energy consumption per run (MWh, kWh); Performance per watt (TOPS/W)                                                   | Evaluating energy-efficient training strategies             |
| **Fault Tolerance and Robustness**      | Checkpoint overhead (time per save); Recovery success rate (%)                                                         | Assessing failure recovery in cloud-based training systems  |
| **Reproducibility and Standardization** | Variance across runs (% difference in accuracy, training time); Framework consistency (TensorFlow vs. PyTorch vs. JAX) | Ensuring consistency in benchmark results across hardware   |

: **Training Benchmark Dimensions.** Key categories and metrics for evaluating machine learning training systems beyond simple speed, covering resource efficiency, reproducibility, and overall performance tradeoffs across different training approaches and infrastructure configurations. {#tbl-training-metrics}

These dimensions interact in ways that tables cannot capture. Higher throughput from reduced precision (e.g., TF32) is meaningless if it increases the iterations required to reach target accuracy, making time-to-accuracy the essential corrective metric. Scaling efficiency often looks linear up to 64 GPUs but tapers beyond that threshold as gradient synchronization costs dominate. Resource utilization metrics reveal why: a BERT pretraining task with moderate GPU utilization may be bottlenecked by its data pipeline, not its accelerators. And checkpointing for fault tolerance introduces its own overhead (typically 5–10% throughput reduction), requiring balance between resilience and performance.

Across all dimensions, measurement accuracy depends on controlling for hardware variability. GPU boost clock[^fn-gpu-boost] behavior and thermal throttling[^fn-thermal-throttling] can shift results by 20–50%, making repeated runs and statistical rigor (as established earlier) essential for distinguishing genuine performance differences from noise.

[^fn-gpu-boost]: **GPU Boost Clock**: NVIDIA's dynamic frequency scaling raises clocks 10--30% above base when thermal headroom permits (e.g., RTX 4090: 2230 MHz base, 2520 MHz boost). The benchmarking trap: short benchmark runs capture boost-clock performance, but sustained ML training settles to base frequency within minutes as junction temperature rises. Reporting burst-phase results overstates throughput by the same 10--30% margin. \index{GPU Boost Clock!benchmark variability}

[^fn-thermal-throttling]: **Thermal Throttling**: Frequency reduction triggered when junction temperature exceeds safe limits (83--90°C for GPUs, 100--105°C for CPUs), cutting throughput by 20--50%. For edge devices without active cooling, throttling can begin within 2--5 minutes of sustained inference, meaning peak throughput numbers from short benchmarks misrepresent steady-state performance by a factor of 2$\times$ or more. \index{Thermal Throttling!sustained workload impact}
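
The gap between burst-phase and sustained performance can be estimated from a throughput time series. This is a minimal sketch under stated assumptions: the readings are hypothetical, one per fixed interval, and `burst_vs_sustained` is a helper introduced here rather than part of any benchmarking tool.

```python
import statistics


def burst_vs_sustained(throughput_series, burst_windows, steady_windows):
    """Compare mean throughput of the first windows against the last windows."""
    burst = statistics.mean(throughput_series[:burst_windows])
    sustained = statistics.mean(throughput_series[-steady_windows:])
    # Fraction by which a short benchmark would overstate sustained throughput.
    overstatement = burst / sustained - 1.0
    return burst, sustained, overstatement


# Hypothetical samples/sec readings, one per 30-second interval.
series = [1200, 1190, 1180, 1100, 1020, 1000, 1000, 990, 1010, 1000]

burst, sustained, overstatement = burst_vs_sustained(
    series, burst_windows=3, steady_windows=5
)
print(f"burst {burst:.0f}/s, sustained {sustained:.0f}/s, "
      f"overstatement {overstatement:.0%}")
```

Reporting both numbers, along with how long the run lasted, lets readers judge whether a result reflects boost clocks or steady-state behavior.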

Despite the availability of well-defined benchmarking methodologies, certain misconceptions and flawed evaluation practices often lead to misleading conclusions. Understanding these pitfalls is important for interpreting benchmark results correctly.

#### Overemphasis on Raw Throughput {#sec-benchmarking-overemphasis-raw-throughput-e4df}

A common mistake in training benchmarks is assuming that higher throughput always translates to better training performance. It is possible to artificially increase throughput by using lower numerical precision, reducing synchronization, or even bypassing certain computations. However, these optimizations do not necessarily lead to faster convergence.

For example, a system using TF32 precision may achieve higher throughput than one using FP32, but if TF32 introduces numerical instability that increases the number of iterations required to reach the target accuracy, the overall training time may be longer. The correct way to evaluate throughput is in relation to time-to-accuracy, ensuring that speed optimizations do not come at the expense of convergence efficiency.
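
The relationship between throughput and time-to-accuracy reduces to simple arithmetic. The sketch below uses hypothetical sample counts and throughputs chosen only to illustrate the effect; `time_to_accuracy` is a name introduced here.

```python
def time_to_accuracy(samples_to_target, samples_per_sec):
    """Wall-clock seconds needed to reach the target accuracy."""
    return samples_to_target / samples_per_sec


# Hypothetical measurements: the reduced-precision run processes samples
# 30% faster but needs 40% more samples to reach the same accuracy.
fp32_seconds = time_to_accuracy(samples_to_target=10_000_000, samples_per_sec=2_000)
tf32_seconds = time_to_accuracy(samples_to_target=14_000_000, samples_per_sec=2_600)

print(f"FP32: {fp32_seconds:.0f} s to target")
print(f"TF32: {tf32_seconds:.0f} s to target")
print("Higher-throughput run is slower overall:", tf32_seconds > fp32_seconds)
```

In this illustration the configuration with higher throughput finishes later, which is exactly the case that a throughput-only benchmark would misreport.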

#### Scaling Extrapolation {#sec-benchmarking-scaling-extrapolation-9477}

As the scaling efficiency calculation above demonstrated (where `{python} n_gpus_str` GPUs achieved only `{python} eff_str`% efficiency), extrapolating single-node results to clusters is a common error. Google's experience with 4,096-node TPU v4 clusters shows this effect at extreme scale, where synchronization challenges become the dominant performance factor. Proper benchmarking should measure scaling efficiency explicitly rather than assuming linear improvement.
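
Measuring scaling efficiency explicitly takes only a few lines once throughput has been recorded at each device count. The numbers below are hypothetical and the helper name `scaling_efficiency` is introduced here for illustration.

```python
def scaling_efficiency(throughput_1, throughput_n, n_devices):
    """Achieved speedup divided by ideal (linear) speedup."""
    speedup = throughput_n / throughput_1
    return speedup / n_devices


# Hypothetical measured throughput (samples/sec) at each device count.
measurements = {1: 1_000, 8: 7_400, 64: 48_000, 256: 140_800}

efficiencies = {
    n: scaling_efficiency(measurements[1], tput, n)
    for n, tput in measurements.items()
}
for n, eff in efficiencies.items():
    print(f"{n:4d} devices: {eff:.1%} of ideal")

# Linear extrapolation from the single-device result overstates the
# measured 256-device throughput.
linear_prediction = measurements[1] * 256
overestimate = linear_prediction / measurements[256]
print(f"Linear extrapolation overestimates by {overestimate:.2f}x")
```

The point of the sketch is the comparison in the last lines: the extrapolated figure and the measured figure diverge, and only the measured one belongs in a benchmark report.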

#### Ignoring Failures and Interference {#sec-benchmarking-ignoring-failures-interference-e488}

Many benchmarks assume idealized conditions where hardware failures, network instability, and workload interference do not occur. In reality, these are routine at scale. Effective benchmarking should account for checkpointing overhead, failure recovery efficiency, and resource contention rather than reporting only best-case performance.

#### Ignoring Reproducibility {#sec-benchmarking-ignoring-reproducibility-15c5}

Benchmark results are often reported without verifying their reproducibility across different hardware and software frameworks. Even minor variations in floating-point arithmetic, memory layouts, or optimization strategies can introduce measurable differences in training time and accuracy.

For example, a benchmark run on TensorFlow with XLA optimizations may exhibit different convergence characteristics compared to the same model trained using PyTorch with Automatic Mixed Precision (AMP). Proper benchmarking requires evaluating results across multiple frameworks to ensure that software-specific optimizations do not distort performance comparisons.

Avoiding these pitfalls requires evaluating throughput in relation to accuracy convergence, measuring scaling efficiency directly rather than extrapolating it, accounting for real-world failures rather than assuming idealized conditions, and verifying that results reproduce across hardware and frameworks. A model trained efficiently, however, still requires validation of its deployment performance, which shifts the evaluation framework entirely.

## Inference Benchmarks {#sec-benchmarking-inference-benchmarks-2c1f}

Where training benchmarks ask "how quickly can we learn?" inference benchmarks ask "how reliably can we serve?" This shift changes nearly every aspect of evaluation. Training tolerates variable iteration times as long as convergence proceeds; inference requires consistent latency because users experience every slow response. Training optimizes for aggregate throughput across hours; inference must handle unpredictable request patterns with millisecond-level guarantees. Training runs on dedicated high-performance hardware; inference spans environments from datacenter GPUs to mobile phones to microcontrollers.

This is where the optimization chapters converge: the accelerated hardware from @sec-hardware-acceleration runs compressed models from @sec-model-compression to deliver real-time predictions. Inference benchmarks reveal whether those theoretical speedups become actual latency reductions under realistic deployment conditions.

::: {.callout-definition title="ML Inference Benchmarks"}

***ML Inference Benchmarks***\index{ML Inference Benchmarks!definition} quantify the system's ability to meet **Latency Constraints** ($L_{lat}$) under load.

1.  **Significance (Quantitative):** They measure the **Tail Latency** (p99) and **Jitter** of the serving stack, validating its suitability for interactive applications.
2.  **Distinction (Durable):** Unlike **Training Benchmarks**, which prioritize **Throughput ($\eta$)**, Inference Benchmarks prioritize **Response Time** and **Determinism**.
3.  **Common Pitfall:** A frequent misconception is that "Average Latency" is a sufficient benchmark. In reality, for production systems, the **Tail at Scale** (the slowest 1% of requests) is what defines the user experience and the system's reliability.

:::

\index{NPU!inference benchmarking considerations}
Unlike training, which runs on dedicated data center hardware, inference must be optimized for widely varying deployment scenarios, from real-time applications like autonomous driving and conversational AI to mobile devices, IoT systems, and embedded processors. This diversity extends to hardware: while GPUs and TPUs dominate training, inference workloads often require specialized accelerators like NPUs, FPGAs, and dedicated inference chips such as Google's Edge TPU[^fn-edge-tpu]. Inference benchmarks evaluate how well hardware selection, model optimization, and data pipeline design work together across these deployment environments.

[^fn-edge-tpu]: **Edge TPU**: Google's 2-watt AI accelerator delivering 4 TOPS for ~\$25 per unit. The Edge TPU illustrates a benchmarking constraint specific to fixed-function accelerators: it supports only quantized TensorFlow Lite models with specific operator types, so its 4 TOPS rating applies only to the subset of models that map fully to its hardware. Models requiring even one unsupported operator fall back to the host CPU at orders-of-magnitude lower throughput. \index{Edge TPU!operator coverage}

Scaling inference workloads across cloud servers, edge platforms, mobile devices, and tinyML systems introduces additional complexity. @fig-power-differentials reveals the staggering power consumption differentials among these systems, spanning roughly eight orders of magnitude from milliwatts in tiny embedded devices to hundreds of kilowatts in datacenter training clusters. The ranges are representative rather than exhaustive. This spread explains why no single benchmark can serve all deployment contexts: a metric meaningful for datacenter optimization (kilowatts per rack) becomes irrelevant for battery-powered edge devices (millijoules per inference). Inference benchmarks must evaluate the trade-offs between latency, cost, and energy efficiency within each scale to assist organizations in making informed deployment decisions.

::: {#fig-power-differentials fig-env="figure" fig-pos="htb" fig-cap="**Power Consumption Differentials**: Power usage spans roughly eight orders of magnitude across ML system types, from milliwatts in tinyML devices through watts at the edge to kilowatts in datacenter inference and hundreds of kilowatts for training clusters. Ranges are representative and vary by hardware and workload." fig-alt="Dumbbell chart showing power consumption ranges: Tiny 5.6 to 167 mW, Edge 3.9 to 1100 W, Datacenter 267 to 6300 W, Training 5.5 to 498,000 W on logarithmic scale."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ POWER CONSUMPTION DIFFERENTIALS FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-power-differentials — dumbbell chart showing power ranges
# │          across Tiny, Edge, Datacenter, and Training system types
# │
# │ Goal: Visualize the massive spread in power consumption across system types.
# │ Show: The roughly eight-order-of-magnitude gap from TinyML to Training clusters.
# │ How: Plot power ranges (mW to kW) on a logarithmic dumbbell chart.
# │
# │ Imports: mlsys.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
from mlsys import viz

fig, ax, COLORS, plt = viz.setup_plot()
system_types = ["Tiny", "Edge", "Datacenter", "Training"]
# All values are in watts. The Tiny range (5.6 to 166.6 mW) is converted from
# milliwatts so that every category shares one axis unit.
min_power = [5.6e-3, 3.9, 266.9, 5.5]
max_power = [166.6e-3, 1100, 6300, 498000]

for i, (minp, maxp) in enumerate(zip(min_power, max_power)):
    ax.plot([i, i], [minp, maxp], color=COLORS['grid'], linewidth=2, zorder=1)

ax.scatter(range(len(system_types)), min_power, color=COLORS['BlueLine'], s=80, label='Minimum Power', zorder=2, edgecolors='white')
ax.scatter(range(len(system_types)), max_power, color=COLORS['RedLine'], s=80, label='Maximum Power', zorder=2, edgecolors='white')

ax.set_yscale('log')
ax.set_ylabel('Power Consumption (W, log scale)')
ax.set_xlabel('System Type')
ax.set_xticks(range(len(system_types)))
ax.set_xticklabels(system_types)
ax.legend(loc='upper left', fontsize=8)
plt.show()
```
:::

MLPerf's inference benchmarks provide standardized evaluation across deployment scenarios from cloud to edge devices.

### Inference Benchmark Motivation {#sec-benchmarking-inference-benchmark-motivation-eee1}

\index{Inference Compiler!graph optimization framework}
\index{Cross-Platform Inference Engine!model portability}
\index{ML Compiler Stack!automated hardware optimization}
Inference benchmarks evaluate the bottlenecks that emerge when models transition from development to production serving. The motivating factors parallel those for training (hardware optimization, scalability, cost, fair comparison) but differ in specifics. Software optimization frameworks apply inference-specific techniques -- operator fusion (see @sec-model-compression and @sec-hardware-acceleration), precision calibration, and kernel tuning -- whose impact on latency, throughput, and power efficiency must be measured under realistic conditions to confirm they deliver real improvements without degrading accuracy. Auto-tuning compilers add a hidden variable: the compiler itself can require hours of optimization per model-hardware pair, meaning benchmark results reflect the tuning budget as much as the hardware capability, and comparing results across submissions requires normalizing for compiler optimization time.

\index{Cold-Start Performance!model loading latency}
Scalability concerns also shift character. Training scales by adding GPUs to reduce time-to-accuracy on a fixed workload, whereas inference must scale dynamically in response to fluctuating user demand, handling traffic spikes without violating latency guarantees. Cold-start performance, the time required for a model to load and begin processing queries, becomes a distinct inference concern with no training analog. Applications that load models on demand, such as serverless AI deployments, are particularly sensitive to this overhead.

The cost and energy profile of inference differs sharply from training. Training costs are incurred once and amortized over the model's lifetime, while inference costs accumulate continuously as models serve production traffic. Running an inefficient model at scale can multiply cloud compute expenses, and on battery-powered devices, excessive computation directly impacts usability. Benchmarks that measure cost per inference request and efficiency per watt help organizations optimize for both performance and sustainability across deployment platforms.

MLPerf Inference extends the standardized comparison principles established for training benchmarks to deployment scenarios, defining evaluation criteria for tasks such as image classification, object detection, and speech recognition across different hardware platforms. This ensures that inference performance comparisons remain meaningful and reproducible while accounting for deployment-specific constraints like latency requirements and energy efficiency [@reddi2020mlperf].

### Inference Metrics {#sec-benchmarking-inference-metrics-78d4}

A voice assistant must respond within 200 milliseconds or users perceive lag; a recommendation engine must score thousands of candidates per second to keep pace with user scrolling. These constraints—latency and throughput—define the performance envelope within which all serving optimizations must operate. Inference metrics formalize these real-world demands into measurable quantities, and they differ from training metrics in kind, not just degree, because the optimization target shifts from "how fast can we learn?" to "how reliably can we serve?" Training cares about throughput and time-to-accuracy; inference cares about latency consistency, resource efficiency, and deployment practicality—from cloud data centers handling millions of requests to edge devices operating under strict power constraints.

#### Latency and Tail Latency {#sec-benchmarking-latency-tail-latency-5cde}

\index{Latency!inference real-time requirement}
Latency (introduced in @sec-ml-systems) measures the time for an inference system to process an input and produce a prediction. Average latency is useful, but it does not capture worst-case delays that degrade reliability in high-demand scenarios.

To account for this, benchmarks often measure tail latency\index{Tail Latency!p99 significance}[^fn-tail-latency], which captures the slowest requests rather than the typical one. It is usually reported as the 95th percentile (p95) or 99th percentile (p99) latency: the time within which 95% or 99% of inferences complete. For applications such as autonomous driving or real-time trading, maintaining low tail latency is essential to avoid unpredictable delays that could lead to catastrophic outcomes.

[^fn-tail-latency]: **Tail Latency**: The 95th or 99th percentile response time, which determines production SLA compliance. Jeff Dean's "Tail at Scale" analysis showed that in fan-out architectures (common in recommendation systems), even 1% slow responses compound: a request touching 100 backend shards has a 63% chance that at least one shard hits its 1% tail, making p99 latency the effective average [@dean2013tail]. Benchmarks reporting only mean latency hide this failure mode. \index{Tail Latency!fan-out amplification}
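
The fan-out arithmetic in the tail-latency footnote can be checked directly. This minimal sketch assumes shards are slow independently of one another, which real systems only approximate; `prob_any_slow` is a name introduced here.

```python
def prob_any_slow(fanout, per_shard_slow_prob=0.01):
    """Probability that at least one of `fanout` independent shards is slow."""
    return 1.0 - (1.0 - per_shard_slow_prob) ** fanout


# How quickly a 1% per-shard tail becomes the common case as fan-out grows.
for fanout in (1, 10, 100, 1000):
    print(f"fan-out {fanout:5d}: {prob_any_slow(fanout):.1%} of requests hit a slow shard")
```

At a fan-out of 100 the result is roughly 63%, matching the figure quoted above.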

These measurements form the basis for Service Level Objectives (SLOs) and Service Level Agreements (SLAs), which formalize performance expectations.

::: {.callout-definition title="SLO vs. SLA"}

***SLOs and SLAs***\index{SLO!definition}\index{SLA!definition} define the **Operating Envelope** and **Contractual Commitments** for a system's performance.

1.  **Significance (Quantitative):** They provide the **Constraint Boundary** ($L_{lat}$) against which all architectural trade-offs are evaluated, specifying strict upper bounds on tail latency and error rate.
2.  **Distinction (Durable):** An **SLO (Service Level Objective)** is the internal **Engineering Target**, while an **SLA (Service Level Agreement)** is the external **Financial Obligation** for reliability violations.
3.  **Common Pitfall:** A frequent misconception is that SLOs and SLAs should be identical. In reality, the gap between them represents the **Error Budget**—the amount of degradation a system can absorb before breaching external commitments.

:::

Understanding this distinction matters: your engineering team optimizes for SLOs while your business commits to SLAs. Choosing the wrong metric to optimize can waste engineering effort or violate customer guarantees.

::: {.callout-checkpoint title="Metric Selection" collapse="false"}
The metric shapes the optimization.

**The Golden Rules**

- [ ] **Throughput vs. Latency**: Are you optimizing for cost (Throughput) or user experience (Latency)? You cannot maximize both simultaneously.
- [ ] **Tail Latency**: Do you measure p99? (Averages hide the failures that drive users away).
- [ ] **End-to-End**: Does your "inference latency" include preprocessing? (If not, your benchmark is a lie).
:::

Tail latency's connection to user experience at scale becomes critical in production systems serving millions of users. Even small P99 latency degradations create compounding effects across large user bases: if 1% of requests experience 10$\times$ latency (e.g., 1000 ms instead of 100 ms), 10,000 of every million requests are affected, potentially leading to timeout errors, poor user experience, and customer churn. Search engines and recommendation systems demonstrate this sensitivity: industry studies have shown that latency increases on the order of hundreds of milliseconds can reduce engagement by 10–20% and conversions by measurable percentages, making sub-100 ms response times a common target for interactive services.

Service level objectives (SLOs) in production systems therefore focus on tail latency rather than mean latency to ensure consistent user experience. Typical production SLOs specify P95 < 100 ms and P99 < 500 ms for interactive services, recognizing that occasional slow responses have disproportionate impact on user satisfaction. Large-scale systems like Netflix and Uber optimize for P99.9 latency to handle traffic spikes and infrastructure variations that affect service reliability.
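
A short sketch shows how a healthy-looking mean can coexist with an SLO violation. The latency trace is hypothetical, the thresholds are the illustrative P95 and P99 values from the text, and `percentile` is a nearest-rank helper written here rather than a library function.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical trace: 98% of requests take 80 ms, 2% take 800 ms.
latencies_ms = [80.0] * 980 + [800.0] * 20

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

slo_ms = {"p95": 100.0, "p99": 500.0}  # illustrative thresholds from the text
meets_slo = p95_ms < slo_ms["p95"] and p99_ms < slo_ms["p99"]

print(f"mean {mean_ms:.1f} ms, p95 {p95_ms:.0f} ms, p99 {p99_ms:.0f} ms")
print("meets SLO:", meets_slo)
```

The mean stays under 100 ms and p95 passes, yet p99 breaches its threshold. A dashboard tracking only the average would report this service as healthy.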

::: {.callout-war-story title="The Tail Latency Death"}
**The Context**: Discord, a real-time chat platform, used Go for its core services. The system required low latency for millions of concurrent users.

**The Failure**: Engineers observed massive latency spikes every few minutes. The culprit was Go's Garbage Collector (GC). While the average request was fast, the "Stop-the-World" GC pauses (collecting memory from millions of objects) froze the entire server for significant intervals.

**The Consequence**: These spikes caused "lag" for users and instability in the cluster. Discord eventually rewrote the service in Rust (which has no GC) to eliminate these pauses, achieving consistent tail latency.

**The Systems Lesson**: Average latency is a vanity metric. In high-throughput systems, the tail (P99) *is* the experience. Language runtime choices (GC vs. manual memory management) are architectural constraints that determine your tail latency floor [@discord2020rust].
:::

#### End-to-End vs. Component Latency {#sec-benchmarking-endtoend-vs-component-latency-9952}

A critical distinction in inference benchmarking is between component latency (time spent in model computation) and end-to-end latency (total time from request arrival to response delivery). Many benchmarks report only model inference time, obscuring the remaining overhead that determines actual user experience.

::: {.callout-war-story title="The JSON Serialization Trap"}
**The Context**: Researchers at Berkeley developed Clipper, a low-latency model serving system. They benchmarked standard serving approaches using Python-based web servers.

**The Failure**: They found that for simple models like linear regression or small CNNs, the API overhead (JSON serialization/deserialization) consumed more CPU time than the actual inference.

**The Consequence**: The system's throughput was capped not by the model's math, but by the text processing of the input data. The GPU sat idle while the CPU parsed JSON strings.

**The Systems Lesson**: Text protocols (JSON/HTTP) can become CPU-bound bottlenecks for high-throughput ML serving. Binary protocols (gRPC/Protobuf) or zero-copy in-memory formats (Apache Arrow) avoid most of that parsing cost and are the usual choice for high-performance serving. The "wrapper" often costs more than the "gift" [@crankshaw2017clipper_paper].
:::

@tbl-latency-breakdown quantifies a typical latency breakdown for an inference request. Notice that model inference (the "benchmark" number) may represent only 10–50% of total request time, with queue wait time potentially dominating under load:

| **Component**              | **Typical Range** | **Notes**                  |
|:---------------------------|------------------:|:---------------------------|
| **Network round-trip**     |         10–100 ms | Varies by region           |
| **Request parsing**        |          0.1–1 ms | JSON/protobuf              |
| **Input preprocessing**    |           1–50 ms | Tokenization, image resize |
| **Queue wait time**        |        0–1000+ ms | Load-dependent             |
| **Model inference**        |          5–100 ms | The "benchmark"            |
| **Output postprocessing**  |         0.5–10 ms | Decoding, format           |
| **Response serialization** |          0.1–1 ms | JSON/protobuf              |

: **Inference Latency Breakdown.** Different pipeline components contribute to end-to-end latency, with model inference—the number vendors typically report—often representing only 10–50% of total request time. Queue wait time can dominate under load, making end-to-end measurement essential for realistic performance assessment. {#tbl-latency-breakdown}
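
To see how the components combine, the sketch below sums one hypothetical request. Each value is an assumption picked from inside the table's ranges, not a measurement.

```python
# Hypothetical per-request timings in ms, each chosen inside the table's ranges.
components_ms = {
    "network_round_trip": 40.0,
    "request_parsing": 0.5,
    "input_preprocessing": 15.0,
    "queue_wait": 20.0,
    "model_inference": 30.0,
    "output_postprocessing": 3.0,
    "response_serialization": 0.5,
}

end_to_end_ms = sum(components_ms.values())
inference_share = components_ms["model_inference"] / end_to_end_ms

# List components from largest to smallest contribution.
for name, ms in sorted(components_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {ms:6.1f} ms  {ms / end_to_end_ms:6.1%}")
print(f"end-to-end: {end_to_end_ms:.1f} ms; model inference is {inference_share:.1%}")
```

For this request the model accounts for a little over a quarter of the total, so a benchmark that reports only the 30 ms inference time describes a minority of what the user waits for.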

\index{Amdahl's Law!optimization ceiling}
These component-level contributions explain why optimizing any single stage yields diminishing returns on end-to-end performance, an *optimization ceiling* formalized by Amdahl's Law.

```{python}
#| label: amdahl-benchmark-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL'S LAW BENCHMARK CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Amdahl's Law: Optimization Ceiling" — demonstrates why
# │          optimizing only the inference component of a pipeline yields
# │          diminishing end-to-end returns
# │
# │ Goal: Demonstrate that a 5× inference speedup yields only ~1.8× end-to-end
# │       improvement when the preprocessing stage is left unoptimized.
# │ Show: Pre-optimization and post-optimization total latency, Amdahl ceiling.
# │ How: Apply Amdahl's Law: new total = non-optimized + optimized/speedup.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: bench_preprocess_ms_str, bench_inference_ms_str,
# │          bench_inf_speedup_str, bench_total_ms_str, bench_e2e_str,
# │          bench_opt_total_str, bench_opt_inf_str, amdahl_ceiling_str,
# │          preprocess_pct_str, preprocess_fraction_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check

class AmdahlBenchmarkCalc:
    """Amdahl ceiling: 5× inference speedup → only 1.8× end-to-end when preprocessing dominates."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    bench_preprocess_ms = 8
    bench_inference_ms = 10
    bench_inf_speedup = 5
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    bench_total_ms = bench_preprocess_ms + bench_inference_ms
    bench_opt_inference_ms = bench_inference_ms / bench_inf_speedup
    bench_opt_total_ms = bench_preprocess_ms + bench_opt_inference_ms
    bench_e2e_improvement = bench_total_ms / bench_opt_total_ms
    preprocess_fraction = bench_preprocess_ms / bench_total_ms
    amdahl_ceiling = 1 / preprocess_fraction
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    bench_preprocess_ms_str = fmt(bench_preprocess_ms, precision=0, commas=False)
    bench_inference_ms_str = fmt(bench_inference_ms, precision=0, commas=False)
    bench_inf_speedup_str = fmt(bench_inf_speedup, precision=0, commas=False)
    bench_total_ms_str = fmt(bench_total_ms, precision=0, commas=False)
    bench_e2e_str = fmt(bench_e2e_improvement, precision=1, commas=False)
    bench_opt_total_str = fmt(bench_opt_total_ms, precision=0, commas=False)
    bench_opt_inf_str = fmt(bench_opt_inference_ms, precision=0, commas=False)
    amdahl_ceiling_str = fmt(amdahl_ceiling, precision=2, commas=False)
    preprocess_pct_str = fmt_percent(preprocess_fraction, precision=0, commas=False)
    preprocess_fraction_str = fmt(preprocess_fraction, precision=2, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
bench_preprocess_ms_str = AmdahlBenchmarkCalc.bench_preprocess_ms_str
bench_inference_ms_str = AmdahlBenchmarkCalc.bench_inference_ms_str
bench_inf_speedup_str = AmdahlBenchmarkCalc.bench_inf_speedup_str
bench_total_ms_str = AmdahlBenchmarkCalc.bench_total_ms_str
bench_e2e_str = AmdahlBenchmarkCalc.bench_e2e_str
bench_opt_total_str = AmdahlBenchmarkCalc.bench_opt_total_str
bench_opt_inf_str = AmdahlBenchmarkCalc.bench_opt_inf_str
amdahl_ceiling_str = AmdahlBenchmarkCalc.amdahl_ceiling_str
preprocess_pct_str = AmdahlBenchmarkCalc.preprocess_pct_str
preprocess_fraction_str = AmdahlBenchmarkCalc.preprocess_fraction_str
```

::: {.callout-notebook title="Amdahl's Law: Optimization Ceiling"}

The latency breakdown reveals why aggressive model optimization often yields disappointing end-to-end results. Consider a vision pipeline where preprocessing (JPEG decode, resize, normalize) consumes `{python} bench_preprocess_ms_str` ms and inference consumes `{python} bench_inference_ms_str` ms. Optimizing inference by `{python} bench_inf_speedup_str`$\times$ (from `{python} bench_inference_ms_str` ms to `{python} bench_opt_inf_str` ms) reduces total latency from `{python} bench_total_ms_str` ms to only `{python} bench_opt_total_str` ms, a `{python} bench_e2e_str`$\times$ improvement rather than `{python} bench_inf_speedup_str`$\times$.

Amdahl's Law formalizes this ceiling: if preprocessing consumes fraction $f$ of total latency, then even infinitely fast inference yields at most $1/f$ speedup. With preprocessing at `{python} preprocess_pct_str`% of latency ($f \approx$ `{python} preprocess_fraction_str`), the maximum achievable speedup is $1/f \approx$ `{python} amdahl_ceiling_str`$\times$ regardless of model optimization.

This principle has direct implications for benchmarking interpretation. A 3$\times$ inference speedup reported in isolation might translate to only 1.5$\times$ end-to-end improvement in production. Comprehensive benchmarks must either include preprocessing in measurements or clearly state that reported speedups apply only to the inference component.

:::
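
The same reasoning generalizes to any pipeline stage. The following sketch expresses Amdahl's Law in terms of the fraction of latency being optimized; the function names are introduced here, and the inputs reuse the 8 ms preprocessing and 10 ms inference figures from the example above.

```python
def end_to_end_speedup(optimized_fraction, component_speedup):
    """Amdahl's Law: overall speedup when only part of the latency is accelerated."""
    untouched = 1.0 - optimized_fraction
    return 1.0 / (untouched + optimized_fraction / component_speedup)


def amdahl_ceiling(optimized_fraction):
    """Upper bound on speedup as the optimized component's time approaches zero."""
    return 1.0 / (1.0 - optimized_fraction)


preprocess_ms, inference_ms = 8.0, 10.0
inference_fraction = inference_ms / (preprocess_ms + inference_ms)

print(f"5x inference speedup -> {end_to_end_speedup(inference_fraction, 5):.2f}x end-to-end")
print(f"ceiling with free inference -> {amdahl_ceiling(inference_fraction):.2f}x")
# If inference were exactly half of total latency, a 3x component speedup
# would yield 1.5x end-to-end.
print(f"3x speedup at 50% share -> {end_to_end_speedup(0.5, 3):.2f}x")
```

Applying the function to each stage in turn is a quick way to decide which stage is worth optimizing next.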

These mathematical constraints highlight why rigorous benchmarking methodology matters. Before interpreting any benchmark result, verify that the measurement approach itself is sound.

::: {.callout-checkpoint title="Benchmarking Methodology" collapse="false"}
Bad benchmarks optimize the wrong things.

**Best Practices**

- [ ] **Representative Data**: Does your benchmark use real data distributions, or "clean" academic datasets? (Data determines performance).
- [ ] **Warm-up**: Did you discard the first 50 runs? (JIT compilation and caching skew initial results).
- [ ] **Isolation**: Are you running on a dedicated machine? (Noisy neighbors invalidate benchmarks).
:::
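
The warm-up practice from the checklist can be sketched as a small timing harness. This is an illustration under assumptions, not a production tool: `benchmark` and `fake_inference` are hypothetical names, the workload is a CPU-only stand-in for a model call, and accelerator workloads would additionally need a device synchronization before each timestamp because kernel launches are asynchronous.

```python
import time


def benchmark(fn, warmup_runs=50, measured_runs=200):
    """Time `fn` after discarding warm-up iterations; returns per-call ms."""
    for _ in range(warmup_runs):
        fn()  # results ignored: absorbs JIT compilation and cache warm-up
    timings_ms = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return timings_ms


def fake_inference():
    """Stand-in workload; replace with the real model call."""
    return sum(i * i for i in range(10_000))


timings = benchmark(fake_inference)
ordered = sorted(timings)
print(f"median {ordered[len(ordered) // 2]:.3f} ms over {len(timings)} measured runs")
```

Keeping the full list of timings, rather than only an average, makes it possible to report percentiles later.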

Comprehensive latency reporting therefore requires specifying which components are included, measuring under realistic load conditions, and distinguishing component from end-to-end metrics.

#### Throughput and Batch Efficiency {#sec-benchmarking-throughput-batch-efficiency-fe85}

\index{Queries per Second!inference throughput metric}
Throughput measures how many inference requests a system can process per second, typically expressed as queries per second (QPS) or frames per second (FPS). Single-instance systems process each input independently on arrival; batch systems process multiple inputs in parallel, exploiting hardware parallelism for higher efficiency.

For example, cloud-based services handling millions of queries per second benefit from batch inference, where large groups of inputs are processed together to maximize computational efficiency. In contrast, applications like robotics, interactive AI, and augmented reality require low-latency single-instance inference, where the system must respond immediately to each new input.

Benchmarks must consider both single-instance and batch throughput to provide a comprehensive understanding of inference performance across different deployment scenarios.
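
The tension between batch throughput and per-request latency can be illustrated with a toy cost model. Everything in this sketch is an assumption: the fixed overhead, per-item cost, and latency budget are hypothetical, and the model ignores the time requests spend waiting for a batch to fill.

```python
def batch_latency_ms(batch_size, fixed_ms=4.0, per_item_ms=0.5):
    """Toy cost model: fixed launch overhead plus a linear per-item cost."""
    return fixed_ms + per_item_ms * batch_size


def throughput_qps(batch_size):
    """Requests completed per second when batches run back to back."""
    return batch_size * 1000.0 / batch_latency_ms(batch_size)


latency_budget_ms = 10.0  # hypothetical per-request latency limit
candidates = [1, 2, 4, 8, 16, 32]

for b in candidates:
    print(f"batch {b:2d}: {batch_latency_ms(b):5.1f} ms, {throughput_qps(b):7.1f} QPS")

# Largest batch size that still fits inside the latency budget.
best = max(b for b in candidates if batch_latency_ms(b) <= latency_budget_ms)
print(f"largest batch within {latency_budget_ms:.0f} ms budget: {best}")
```

Under these assumptions throughput keeps rising with batch size, but the latency budget caps the usable batch well below the throughput-optimal setting, which is why both figures need to be reported.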

#### Precision and Accuracy Trade-offs {#sec-benchmarking-precision-accuracy-tradeoffs-3bbb}

Optimizing inference performance often involves reducing numerical precision, which can accelerate computation by 2--4$\times$ while reducing memory and energy consumption. However, lower-precision calculations can introduce accuracy degradation, making it essential to benchmark the trade-offs between speed and predictive quality.

Inference benchmarks evaluate how well models perform under different numerical settings, such as FP32, FP16, and INT8[^fn-int8]. Many modern AI accelerators support mixed-precision inference, allowing systems to dynamically adjust numerical representation based on workload requirements. Model compression techniques[^fn-model-compression-benchmarking] further improve efficiency, but their impact on model accuracy varies depending on the task and dataset. Benchmarks help determine whether these optimizations are viable for deployment, ensuring that improvements in efficiency do not come at the cost of unacceptable accuracy loss.

[^fn-int8]: **INT8 (8-bit Integer)**: INT8 sits at the aggressive end of the precision hierarchy (FP32 baseline, FP16 halves memory, INT8 quarters it), and each step demands increasing care to preserve accuracy. The benchmarking catch: INT8 requires post-training calibration using a representative dataset, and accuracy preservation (typically 95--99% of FP32) depends on the calibration data's similarity to deployment data. INT8 benchmarks without specifying the calibration dataset and procedure are not reproducible. \index{INT8!calibration dependency}

[^fn-model-compression-benchmarking]: **Model Compression Benchmarking**: Compression impact must be measured across four dimensions simultaneously: accuracy degradation, inference speedup, memory reduction, and energy savings. A technique achieving 10$\times$ size reduction with 1% accuracy loss may still be unsuitable if latency does not improve proportionally -- unstructured pruning, for example, reduces parameter count but rarely improves latency on dense hardware because sparse operations lack efficient hardware support on most GPUs. \index{Model Compression!multi-dimensional benchmarking}
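
To make the precision trade-off concrete, the following sketch applies symmetric per-tensor INT8 quantization to synthetic weights and measures the reconstruction error. The Gaussian weights and the helper names are hypothetical, and the error measured here is a proxy only: a real benchmark would report task accuracy on representative data after calibration.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.1) for _ in range(10_000)]  # synthetic stand-in for FP32 weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.6f}  max reconstruction error={max_error:.6f}")
```

Rounding bounds the per-weight error at half the scale, so a single outlier that inflates `max_abs` coarsens every other weight. That sensitivity is one reason calibration data must resemble deployment data.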

#### Memory Footprint and Model Size {#sec-benchmarking-memory-footprint-model-size-bf31}

Memory footprint is critical for inference, especially on resource-constrained devices. Unlike training, where models can span multiple accelerators, inference often runs within strict memory budgets. Total model size determines storage requirements, RAM usage reflects working memory during execution, and memory bandwidth can bottleneck data transfer between processing units.

Inference benchmarks evaluate these factors to ensure that models can be deployed effectively across a range of devices. A model that achieves high accuracy but exceeds memory constraints may be impractical for real-world use. To address this, various compression techniques are often applied to reduce model size while maintaining accuracy. Benchmarks help assess whether these optimizations strike the right balance between memory efficiency and predictive performance.
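
A back-of-the-envelope check of whether a model fits a device budget at different precisions can be sketched as follows. The parameter count, the 8 MB budget, and the function names are hypothetical; the estimate covers weights only, so activations and runtime buffers must be added separately.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_footprint_mb(num_params, precision):
    """Storage for weights only; activations and runtime buffers are extra."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


def fits(num_params, precision, budget_mb, runtime_overhead_mb=0.0):
    """True if weights plus an assumed runtime overhead stay within the budget."""
    return weight_footprint_mb(num_params, precision) + runtime_overhead_mb <= budget_mb


params = 4_200_000  # hypothetical mobile-scale model
budget_mb = 8.0     # hypothetical device budget
for precision in BYTES_PER_PARAM:
    size = weight_footprint_mb(params, precision)
    print(f"{precision}: {size:5.1f} MB  fits={fits(params, precision, budget_mb)}")
```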

#### Cold-Start and Model Load Time {#sec-benchmarking-coldstart-model-load-time-b06a}

\index{Cold-Start Latency!serverless AI impact}
Cold-start performance becomes critical when models are loaded on demand rather than kept resident in memory. In serverless AI environments[^fn-serverless-ai], where resources scale dynamically with incoming requests, the time from idle to active execution determines whether users experience acceptable response times.

[^fn-serverless-ai]: **Serverless AI**: Deployment paradigm where models scale from zero instances on demand. The benchmarking trap: reported inference latency often excludes cold-start time, but for intermittent workloads, cold starts (100 ms for small models, 10+ seconds for LLMs) dominate the user-perceived latency. Benchmark results from warm instances systematically understate real-world latency for workloads with low request rates. \index{Serverless AI!cold-start benchmarking}

Model load time refers to the duration required to load a trained model into memory before it can process inputs. In some cases, particularly on resource-limited devices, models must be reloaded frequently to free up memory for other applications. The time taken for the first inference request is also an important consideration, as it reflects the total delay users experience when interacting with an AI-powered service. Benchmarks help quantify these delays, ensuring that inference systems can meet real-world responsiveness requirements.

#### Dynamic Workload Scaling {#sec-benchmarking-dynamic-workload-scaling-5842}

Inference workloads must scale across fluctuating usage patterns. Cloud services must handle millions of concurrent users efficiently; mobile devices must manage multiple simultaneous AI models without overloading the system.

Scalability measures how well inference performance improves when additional computational resources are allocated. In some cases, adding more GPUs or TPUs increases throughput proportionally, but in other scenarios, bottlenecks such as memory bandwidth limitations or network latency may limit scaling efficiency. Benchmarks also assess how well a system balances multiple concurrent models in real-world deployment, where different AI-powered features may need to run at the same time without interference.

For cloud-based AI, benchmarks evaluate how efficiently a system handles fluctuating demand, ensuring that inference servers can dynamically allocate resources without compromising latency. In mobile and embedded AI, efficient multi-model execution is essential for running multiple AI-powered features simultaneously without degrading system performance.
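
Scaling efficiency can be expressed as measured throughput divided by the ideal linear projection from a single device. The throughput figures below are hypothetical measurements used only to illustrate the calculation.

```python
def scaling_efficiency(throughput_by_devices):
    """Efficiency = measured throughput / (N x single-device throughput)."""
    base = throughput_by_devices[1]
    return {
        n: qps / (n * base)
        for n, qps in sorted(throughput_by_devices.items())
    }


# Hypothetical measurements: device count -> sustained QPS
measured = {1: 1_000, 2: 1_900, 4: 3_400, 8: 5_600}
efficiency = scaling_efficiency(measured)

for n, eff in efficiency.items():
    print(f"{n} device(s): {measured[n]:>5} QPS  efficiency={eff:.0%}")
```

An efficiency that falls as devices are added points to a shared bottleneck such as memory bandwidth or network latency rather than insufficient compute.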

#### Energy Consumption and Efficiency {#sec-benchmarking-energy-consumption-efficiency-adad}

Since inference workloads run continuously in production, power consumption and energy efficiency are critical considerations. Mobile and edge devices face the most acute constraints, where battery life and thermal limits restrict available computational resources. Even in large-scale cloud environments, power efficiency directly impacts operational costs and sustainability goals.

The energy required for a single inference is often measured in joules per inference, reflecting how efficiently a system processes inputs while minimizing power draw. In cloud-based inference, efficiency is commonly expressed as queries per second per watt (QPS/W) to quantify how well a system balances performance and energy consumption. For mobile AI applications, optimizing inference power consumption extends battery life and allows models to run efficiently on resource-constrained devices. Reducing energy use also plays a key role in making large-scale AI systems more environmentally sustainable, ensuring that computational advancements align with energy-conscious deployment strategies.
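
Both energy metrics follow from average power and latency. The sketch below uses two hypothetical devices to show that lower power draw does not imply lower energy per inference; it assumes sequential, back-to-back inference so that throughput is the reciprocal of latency.

```python
def joules_per_inference(avg_power_w, latency_s):
    """Energy = average power x time spent on one inference."""
    return avg_power_w * latency_s


def qps_per_watt(qps, avg_power_w):
    """Throughput delivered per watt of average power."""
    return qps / avg_power_w


# Hypothetical devices: (average power in watts, latency in seconds)
devices = {"accelerator": (2.0, 0.005), "low_power_cpu": (0.5, 0.030)}

for name, (power_w, latency_s) in devices.items():
    energy_mj = joules_per_inference(power_w, latency_s) * 1e3
    efficiency = qps_per_watt(1.0 / latency_s, power_w)
    print(f"{name}: {energy_mj:.1f} mJ/inference, {efficiency:.1f} QPS/W")
```

Here the accelerator draws four times the power but finishes six times sooner, so it uses less energy per inference.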

### Inference Performance Evaluation {#sec-benchmarking-inference-performance-evaluation-6793}

Unlike training, inference systems must process inputs and deliver predictions efficiently across diverse deployment scenarios. Latency, throughput, memory usage, and energy efficiency provide the structured measures for evaluating this performance.

@tbl-inference-metrics highlights key metrics for evaluating inference systems and their relevance to different deployment contexts. While each metric offers unique insights, it is important to approach inference benchmarking holistically. Trade-offs between metrics, including speed versus accuracy and throughput versus power consumption, are common, and understanding these trade-offs is essential for effective system design.

| **Category**                    | **Key Metrics**                                                      | **Example Benchmark Use**                                |
|:--------------------------------|:---------------------------------------------------------------------|:---------------------------------------------------------|
| **Latency and Tail Latency**    | Mean latency (ms/request); Tail latency (p95, p99, p99.9)            | Evaluating real-time performance for safety-critical AI  |
| **Throughput and Efficiency**   | Queries per second (QPS); Frames per second (FPS); Batch throughput  | Comparing large-scale cloud inference systems            |
| **Numerical Precision Impact**  | Accuracy degradation (FP32 vs. INT8); Speedup from reduced precision | Balancing accuracy vs. efficiency in optimized inference |
| **Memory Footprint**            | Model size (MB/GB); RAM usage (MB); Memory bandwidth utilization     | Assessing feasibility for edge and mobile deployments    |
| **Cold-Start and Load Time**    | Model load time (s); First inference latency (s)                     | Evaluating responsiveness in serverless AI               |
| **Scalability**                 | Efficiency under load; Multi-model serving performance               | Measuring robustness for dynamic, high-demand systems    |
| **Power and Energy Efficiency** | Power consumption (Watts); Performance per Watt (QPS/W)              | Optimizing energy use for mobile and sustainable AI      |

: **Inference Performance Metrics.** Latency, throughput, and resource usage metrics provide a quantitative basis for optimizing deployed machine learning systems and selecting appropriate hardware configurations, balancing speed, cost, and accuracy in production applications. {#tbl-inference-metrics}

These metrics interact through unavoidable trade-offs. Optimizing for high throughput via large batch sizes increases latency, making a system unsuitable for real-time applications. Reducing numerical precision improves power efficiency and speed but may degrade accuracy. The deployment environment determines which trade-offs are acceptable: cloud systems prioritize scalability and throughput, while edge devices are dominated by memory and power constraints. Evaluating inference performance holistically — rather than fixating on a single metric — ensures that systems meet their functional, resource, and performance goals in context.

Different deployment scenarios require different metric priorities because operational constraints and success criteria vary widely across contexts. Understanding these priorities allows engineers to focus benchmarking efforts effectively and interpret results within appropriate decision frameworks. @tbl-metric-priorities illustrates how performance priorities shift across five major deployment contexts, revealing the systematic relationship between operational constraints and optimization targets.

| **Deployment Context**     | **Primary Priority**  | **Secondary Priority** | **Tertiary Priority** | **Key Design Constraint**                         |
|:---------------------------|:----------------------|:-----------------------|:----------------------|:--------------------------------------------------|
| **Real-Time Applications** | Latency (p95 < 50 ms) | Reliability (99.9%)    | Memory Footprint      | User experience demands immediate response        |
| **Cloud-Scale Services**   | Throughput (QPS)      | Cost Efficiency        | Average Latency       | Business viability requires massive scale         |
| **Edge/Mobile Devices**    | Power Consumption     | Memory Footprint       | Latency               | Battery life and resource limits dominate         |
| **Training Workloads**     | Training Time         | GPU Utilization        | Memory Efficiency     | Research velocity enables faster experimentation  |
| **Scientific/Medical**     | Accuracy              | Reliability            | Explainability        | Correctness cannot be compromised for performance |

: **Performance Metric Priorities by Deployment Context.** Different operational environments demand distinct optimization focuses, reflecting varying constraints and success criteria. These priorities guide both benchmark selection and result interpretation. {#tbl-metric-priorities}

The key insight from @tbl-metric-priorities is that the *same metric* can be primary in one context and a low priority in another. Latency ranks first for real-time applications (autonomous vehicles must process sensor data within strict timing deadlines) but third for cloud services (which accept higher latency in exchange for cost efficiency per query). A smartphone AI assistant that improves throughput by 50% but increases power consumption by 30% can represent a net regression when the extra throughput goes unused, since battery life directly impacts user satisfaction. Medical diagnostic systems treat accuracy as non-negotiable: achieving 99.2% accuracy at 10 ms latency provides more value than 98.8% at 5 ms. This context-dependence means that a 2$\times$ throughput improvement represents substantial value for cloud deployments but minimal benefit for battery-powered edge devices, where a 20% power reduction delivers greater operational impact.

Even with well-defined metrics, benchmarking inference systems can be challenging. Missteps during the evaluation process often lead to misleading conclusions. Students and practitioners should be aware of common pitfalls when analyzing inference performance.

#### Overemphasis on Average Latency {#sec-benchmarking-overemphasis-average-latency-dc31}

As established above, tail latency (p95, p99) determines production reliability, not averages. Conversational AI systems failing to maintain tail latency targets will exhibit unacceptable response delays regardless of acceptable average performance.
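
A small synthetic example shows how an average can hide the tail. The latency values are fabricated for illustration (98% of requests at 10 ms, 2% stalled at 500 ms), and the nearest-rank percentile is one of several common percentile definitions.

```python
import math
import statistics


def nearest_rank(sorted_values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Synthetic distribution: 980 fast requests, 20 stalled requests
latencies_ms = sorted([10.0] * 980 + [500.0] * 20)

mean_ms = statistics.fmean(latencies_ms)
p50_ms = nearest_rank(latencies_ms, 50)
p99_ms = nearest_rank(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.1f} ms  p99={p99_ms:.1f} ms")
```

The mean stays under 20 ms while p99 sits at 500 ms, so a target stated only in terms of average latency would pass a system that stalls for one request in fifty.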

#### Ignoring Memory and Energy Constraints {#sec-benchmarking-ignoring-memory-energy-constraints-014f}

A model with excellent throughput or latency may be unsuitable for mobile or edge deployments if it requires excessive memory or power. For example, an inference system designed for cloud environments might fail to operate efficiently on a battery-powered device. Proper benchmarks must consider memory footprint and energy consumption to ensure practicality across deployment contexts.

#### Ignoring Cold-Start Performance {#sec-benchmarking-ignoring-coldstart-performance-1e83}

In serverless environments, where models are loaded on demand, cold-start latency[^fn-cold-start] is a critical factor. Ignoring the time it takes to initialize a model and process the first request can result in unrealistic expectations for responsiveness. Evaluating both model load time and first-inference latency ensures that systems are designed to meet real-world responsiveness requirements.

[^fn-cold-start]: **Cold-Start Latency**: The initialization time from idle state, dominated by model weight loading from storage to accelerator memory. For a 7B-parameter model in FP16 (~14 GB), cold start on PCIe 4.0 (25 GB/s effective) takes ~560 ms for weight transfer alone, plus framework initialization overhead. This physical lower bound means that cold-start mitigation (model caching, speculative loading) is a systems design requirement, not just an operational convenience. \index{Cold-Start Latency!physical lower bound}
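
The lower bound quoted in the footnote is a bandwidth calculation, sketched below with the same figures (7B parameters, FP16, 25 GB/s effective). The function name and the optional `init_overhead_ms` parameter are hypothetical; framework initialization time varies and is left at zero here.

```python
def cold_start_lower_bound_ms(num_params, bytes_per_param, bandwidth_gb_s, init_overhead_ms=0.0):
    """Weight-transfer time from storage or host to accelerator, plus assumed init overhead."""
    size_gb = num_params * bytes_per_param / 1e9
    return size_gb / bandwidth_gb_s * 1e3 + init_overhead_ms


transfer_ms = cold_start_lower_bound_ms(
    num_params=7e9,      # 7B-parameter model
    bytes_per_param=2,   # FP16
    bandwidth_gb_s=25.0  # effective PCIe 4.0 bandwidth from the footnote
)
print(f"weight transfer alone: ~{transfer_ms:.0f} ms")
```

Because the bound scales linearly with model size and inversely with bandwidth, halving precision or keeping weights resident are the only ways to move it substantially.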

#### Isolated Metrics Evaluation {#sec-benchmarking-isolated-metrics-evaluation-922d}

Benchmarking inference systems often involves balancing competing metrics. For example, maximizing batch throughput might degrade latency, while aggressive precision reduction could reduce accuracy. Focusing on a single metric without considering its impact on others can lead to incomplete or misleading evaluations.

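Numerical precision optimization exemplifies this challenge. Individual accelerator benchmarks show INT8 operations achieving 4$\times$ higher TOPS[^fn-tops] (Tera Operations Per Second) compared to FP32, creating compelling performance narratives. Peak TOPS, however, describes only the arithmetic units: the realized end-to-end speedup depends on how much of the runtime is actually compute-bound, and the accuracy cost of quantization appears only when the model is evaluated on the target task.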

[^fn-tops]: **TOPS (Tera Operations Per Second)**: A measure of raw computational throughput (trillions of operations/second). The H100 delivers `{python} h100_tflops_int8_str` TOPS INT8 versus the Apple M2 Neural Engine at 15.8 TOPS and Edge TPU at 4 TOPS, but these numbers conflate different operation types (MAC vs. accumulate vs. activation). TOPS comparisons across vendors are meaningful only when the operation definition, precision, and sparsity assumptions are identical -- conditions rarely met in vendor specifications. \index{TOPS!comparability limitation}
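
The gap between a peak-throughput ratio and a realized speedup can be estimated with Amdahl's law. In this sketch, `accelerated_fraction` is the share of end-to-end time that actually benefits from the faster arithmetic; the 60% figure is an assumption for illustration, not a measured value.

```python
def end_to_end_speedup(accelerated_fraction, component_speedup):
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


# A 4x peak INT8 advantage applied to an assumed 60% of end-to-end time
int8_overall = end_to_end_speedup(accelerated_fraction=0.6, component_speedup=4.0)

# A 3x inference speedup when inference is half of the pipeline
pipeline_overall = end_to_end_speedup(accelerated_fraction=0.5, component_speedup=3.0)

print(f"INT8 overall speedup: {int8_overall:.2f}x")
print(f"Pipeline overall speedup: {pipeline_overall:.2f}x")
```

The second case reproduces the earlier observation that a 3$\times$ inference speedup can shrink to 1.5$\times$ end to end when inference accounts for half of the pipeline time.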

#### Linear Scaling Assumption {#sec-benchmarking-linear-scaling-assumption-6a7a}

The linear scaling pitfall discussed for training benchmarks applies equally to inference, though the bottlenecks differ. Where training scaling is limited primarily by gradient synchronization overhead, inference scaling encounters bottlenecks from memory bandwidth saturation, thermal throttling under sustained load, and request routing overhead in distributed serving. As discussed in @sec-hardware-acceleration, these limitations arise from physical hardware constraints and interconnect architectures. Benchmarks that assume linear scaling behavior may overestimate system performance, particularly in distributed deployments.

#### Ignoring Application Requirements {#sec-benchmarking-ignoring-application-requirements-9268}

Generic benchmarking results may fail to account for the specific needs of an application. For instance, a benchmark optimized for cloud inference might be irrelevant for edge devices, where energy and memory constraints dominate. Tailoring benchmarks to the deployment context ensures that results are meaningful and actionable.

#### Statistical Significance and Noise {#sec-benchmarking-statistical-significance-noise-20ee}

Distinguishing meaningful performance improvements from measurement noise requires proper statistical analysis. Following the evaluation methodology principles established earlier, MLPerf addresses measurement variability by requiring multiple benchmark runs and reporting percentile-based metrics rather than single measurements [@reddi2020mlperf]. For instance, MLPerf Inference reports 99th percentile latency alongside mean performance, capturing both typical behavior and worst-case scenarios that single-run measurements might miss. This approach recognizes that system performance naturally varies due to factors like thermal throttling, memory allocation patterns, and background processes.
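
One rough way to separate improvement from noise is to compare the gap between mean latencies against the combined standard error of repeated runs. The sketch below uses fabricated run data and a two-standard-error threshold as a heuristic; it is not MLPerf's procedure, and a formal test (such as Welch's t-test) is more appropriate for small samples.

```python
import statistics


def summarize(runs):
    """Mean and standard error of the mean for a list of repeated measurements."""
    mean = statistics.fmean(runs)
    sem = statistics.stdev(runs) / len(runs) ** 0.5
    return mean, sem


def difference_exceeds_noise(baseline, candidate, z=2.0):
    """Heuristic: is the mean gap larger than z combined standard errors?"""
    mean_b, sem_b = summarize(baseline)
    mean_c, sem_c = summarize(candidate)
    return abs(mean_b - mean_c) > z * (sem_b**2 + sem_c**2) ** 0.5


# Fabricated latency measurements in ms, eight runs each
baseline = [102.1, 99.8, 101.5, 100.4, 98.9, 101.0, 100.2, 99.6]
small_change = [99.5, 101.2, 98.7, 100.9, 99.9, 100.6, 98.8, 100.1]
large_change = [91.0, 90.2, 92.1, 89.8, 90.7, 91.5, 90.1, 90.9]

print("small change distinguishable:", difference_exceeds_noise(baseline, small_change))
print("large change distinguishable:", difference_exceeds_noise(baseline, large_change))
```

With this data, the sub-1% gap is indistinguishable from run-to-run variation, while the roughly 10% gap is not.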

Avoiding these pitfalls requires treating benchmarking as a process of balancing multiple priorities — latency, throughput, memory, energy, and accuracy — rather than optimizing for any single metric in isolation.

### MLPerf Inference Benchmarks {#sec-benchmarking-mlperf-inference-benchmarks-e878}

\index{MLPerf Inference!benchmark family evolution}
\index{MLCommons!non-profit benchmark consortium}
The MLPerf Inference benchmark, developed by MLCommons[^fn-mlcommons], provides a standardized framework for evaluating machine learning inference performance across a range of deployment environments. Initially, MLPerf started with a single inference benchmark, but as machine learning systems expanded into diverse applications, it became clear that a one-size-fits-all benchmark was insufficient. Different inference scenarios, including cloud-based AI services and resource-constrained embedded devices, demanded tailored evaluations. This realization led to the development of a family of MLPerf inference benchmarks, each designed to assess performance within a specific deployment setting.

[^fn-mlcommons]: **MLCommons**: Non-profit consortium (launched in 2020 to steward the MLPerf benchmarks, which began in 2018) with members including Google, NVIDIA, Intel, and leading universities. MLCommons addresses benchmark credibility by requiring open submissions with full system specifications, preventing the cherry-picking that plagued earlier benchmarks. Published results reveal 10$\times$ performance differences between vendors on identical workloads, making MLCommons the closest the field has to SPEC-style apples-to-apples hardware comparison. \index{MLCommons!open submission}

#### MLPerf Inference {#sec-benchmarking-mlperf-inference-4b4a}

MLPerf Inference [@reddi2020mlperf] serves as the baseline benchmark, originally designed to evaluate large-scale inference systems. It primarily focuses on data center and cloud-based inference workloads, where high throughput, low latency, and efficient resource utilization are essential. The benchmark assesses performance across a range of deep learning models, including image classification, object detection, natural language processing, and recommendation systems. This version of MLPerf is a widely used reference point for comparing AI accelerators, GPUs, TPUs, and CPUs in high-performance computing environments.

\index{DLRM!recommendation model benchmark}
Major technology companies regularly reference MLPerf results for hardware procurement decisions. When evaluating hardware for recommendation systems infrastructure, MLPerf benchmark scores on DLRM[^fn-dlrm-benchmarking] (Deep Learning Recommendation Model) workloads can inform choices between different accelerator generations. Across generations, benchmark results often show substantial throughput improvements, although the magnitude depends on workload, software stack, and system configuration. This illustrates how standardized benchmarks can translate into consequential infrastructure decisions.

[^fn-dlrm-benchmarking]: **DLRM (Deep Learning Recommendation Model)**: Facebook's 2019 recommendation architecture combining embedding tables (categorical features) with MLPs (continuous features). DLRM stresses benchmarks differently than vision or language models: its embedding tables can consume terabytes, making memory capacity and bandwidth the dominant constraints rather than compute throughput. This makes DLRM the canonical memory-bound MLPerf workload, revealing hardware limitations invisible to compute-bound benchmarks. \index{DLRM!memory-bound benchmark}

These standardized evaluations provide invaluable comparisons, but the cost of comprehensive benchmarking limits who can participate and how thoroughly systems are evaluated.

::: {.callout-perspective title="The Cost of Comprehensive Benchmarking"}

While benchmarking is essential for ML system development, it comes with substantial costs that limit participation to well-resourced organizations. A comprehensive MLPerf Training submission can involve months of engineering time for optimization, tuning, and validation across multiple hardware configurations, along with dedicated hardware or cloud compute budgets that can reach six figures in dollars depending on the scope.

This cost barrier explains why MLPerf submissions are dominated by major technology companies and hardware vendors, while smaller organizations rely on published results rather than conducting their own comprehensive evaluations. The high barrier to entry motivates the need for more lightweight, internal benchmarking practices that organizations can use to make informed decisions without the expense of full-scale standardized benchmarking.

:::

#### MLPerf Mobile {#sec-benchmarking-mlperf-mobile-83cc}

MLPerf Mobile [@mlperf_mobile_website] extends MLPerf's evaluation framework to smartphones and other mobile devices. Unlike cloud-based inference, mobile inference operates under strict power and memory constraints, requiring models to be optimized for efficiency without sacrificing responsiveness. The benchmark measures latency and responsiveness for real-time AI tasks, such as camera-based scene detection, speech recognition, and augmented reality applications. MLPerf Mobile has become an industry standard for assessing AI performance on flagship smartphones and mobile AI chips, helping developers optimize models for on-device AI workloads.

#### MLPerf Client {#sec-benchmarking-mlperf-client-c48d}

MLPerf Client [@mlperf_client_website] focuses on inference performance on consumer computing devices, such as laptops, desktops, and workstations. This benchmark addresses local AI workloads that run directly on personal devices, eliminating reliance on cloud inference. Tasks such as real-time video editing, speech-to-text transcription, and AI-enhanced productivity applications fall under this category. Unlike cloud-based benchmarks, MLPerf Client evaluates how AI workloads interact with general-purpose hardware, such as CPUs, discrete GPUs, and integrated Neural Processing Units (NPUs), making it relevant for consumer and enterprise AI applications.

#### MLPerf Tiny {#sec-benchmarking-mlperf-tiny-2346}

\index{MLPerf Tiny!embedded AI benchmark}
MLPerf Tiny [@banbury2021mlperftiny] was created to benchmark embedded and ultra-low-power AI systems, such as IoT devices, wearables, and microcontrollers. Unlike other MLPerf benchmarks, which assess performance on powerful accelerators, MLPerf Tiny evaluates inference on devices with limited compute, memory, and power resources. This benchmark is particularly relevant for applications such as smart sensors, AI-driven automation, and real-time industrial monitoring, where models must run efficiently on hardware with minimal processing capabilities. MLPerf Tiny helps developers optimize models for constrained environments and has become the standard benchmark for edge AI performance.

#### MLPerf Execution Scenarios {#sec-benchmarking-mlperf-execution-scenarios-89f0}

The same hardware can report dramatically different benchmark numbers depending on *how* requests arrive—a fact that explains why vendor claims often fail to predict production performance. MLPerf addresses this by defining four execution scenarios that characterize distinct traffic patterns, each requiring different optimization strategies.

##### SingleStream {#sec-benchmarking-singlestream-c709}

\index{SingleStream!sequential inference scenario}
SingleStream processes one request at a time, measuring latency for sequential inference. This scenario models mobile and embedded applications where a single user interacts with the device: a smartphone camera app classifying images, a voice assistant processing speech, or a wearable detecting gestures. The key metric is per-request latency, and batching provides no benefit since requests arrive only after the previous result is consumed. Optimization focuses on preprocessing efficiency and power consumption rather than throughput.

##### MultiStream {#sec-benchmarking-multistream-38d1}

\index{MultiStream!synchronized sensor fusion}
MultiStream processes multiple synchronized input streams simultaneously, modeling scenarios like autonomous vehicles whose multiple cameras must be processed together for spatial fusion. Unlike SingleStream's sequential requests, MultiStream requires processing frames from all sensors within tight timing deadlines (typically 33 ms for 30 FPS). It differs from Server mode in that MultiStream inputs arrive in lockstep, while Server requests arrive independently and unpredictably. The binding constraint is synchronization: all streams must complete before the planning module can act. Optimization focuses on jitter handling and meeting hard deadlines rather than average throughput.

##### Server {#sec-benchmarking-server-e69f}

\index{Server Scenario!Poisson-distributed traffic}
Server generates requests whose arrival times follow a Poisson process, simulating cloud API traffic where requests arrive independently and unpredictably. This scenario models web services handling millions of queries from different users. Unlike SingleStream, where each request is issued only after the previous one completes, Server traffic creates queuing dynamics where multiple requests compete for resources. The key metrics are throughput (queries per second) and tail latency (p99), and dynamic batching can improve efficiency by grouping requests that arrive within a time window. Optimization balances throughput against latency SLOs.
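
The queuing effect of Poisson arrivals can be illustrated with a small simulation. This is a toy model under stated assumptions (one worker, FIFO order, a fixed 5 ms service time, no batching), not MLPerf's load generator; the function name and parameters are hypothetical.

```python
import random


def simulate_server(arrival_qps, service_ms, num_requests=20_000, seed=0):
    """Single-worker FIFO queue with Poisson arrivals; returns (mean, p99) latency in ms."""
    rng = random.Random(seed)
    arrival_ms = 0.0
    worker_free_ms = 0.0
    latencies = []
    for _ in range(num_requests):
        # Exponential inter-arrival gaps produce a Poisson arrival process
        arrival_ms += rng.expovariate(arrival_qps) * 1e3
        start_ms = max(arrival_ms, worker_free_ms)  # wait if the worker is busy
        worker_free_ms = start_ms + service_ms
        latencies.append(worker_free_ms - arrival_ms)
    latencies.sort()
    p99 = latencies[len(latencies) * 99 // 100 - 1]
    return sum(latencies) / len(latencies), p99


# A 5 ms service time gives a capacity of 200 QPS
light_mean, light_p99 = simulate_server(arrival_qps=100, service_ms=5.0)
heavy_mean, heavy_p99 = simulate_server(arrival_qps=180, service_ms=5.0)

print(f"100 QPS: mean={light_mean:.1f} ms  p99={light_p99:.1f} ms")
print(f"180 QPS: mean={heavy_mean:.1f} ms  p99={heavy_p99:.1f} ms")
```

Even though every request takes the same time to serve, bursts in arrival push tail latency well above the service time as load approaches capacity, which is why Server results are reported against a latency bound rather than as raw throughput.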

##### Offline {#sec-benchmarking-offline-6ece}

\index{Offline Scenario!maximum throughput mode}
Offline provides all inputs upfront, measuring maximum throughput when latency constraints are removed. This scenario models batch processing pipelines: overnight data processing, scientific computing, or pre-computing recommendations. With no latency requirement, systems can use maximum batch sizes to saturate hardware utilization. The key metric is pure throughput (samples per second), and optimization focuses entirely on hardware efficiency.

@tbl-mlperf-scenarios maps these execution scenarios to their deployment contexts and optimization strategies:

| **Scenario**     | **Context**                         | **Strategy**                  | **Focus**                            |
|:-----------------|:------------------------------------|:------------------------------|:-------------------------------------|
| **SingleStream** | Mobile apps, embedded devices       | No batching (batch=1)         | Preprocessing, power efficiency      |
| **MultiStream**  | Autonomous driving, video analytics | Synchronized sensor fusion    | Jitter handling, deadline guarantees |
| **Server**       | Cloud APIs, web services            | Dynamic batching with timeout | Throughput-latency tradeoff tuning   |
| **Offline**      | Batch processing, data pipelines    | Maximum batch size            | Throughput, hardware utilization     |

: **MLPerf Execution Scenarios.** The four MLPerf inference scenarios map to distinct deployment contexts, each requiring different optimization strategies. SingleStream and MultiStream prioritize latency, Server balances throughput and latency, and Offline maximizes throughput. Matching the scenario to deployment context determines which benchmark results are relevant. {#tbl-mlperf-scenarios}

These scenarios explain why the same hardware can report dramatically different benchmark numbers. An accelerator achieving 10,000 samples/second in Offline mode might achieve only 200 queries/second in Server mode with p99 latency constraints, because Server mode includes queuing overhead and cannot use maximum batch sizes. When evaluating hardware for a specific application, selecting the appropriate scenario ensures benchmark results predict production performance. To demonstrate scenario-based validation concretely, we return to our *MobileNet on EdgeTPU* lighthouse.

```{python}
#| label: edgetpu-speedup-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ EDGETPU SPEEDUP CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "MobileNet on EdgeTPU" — validates hardware acceleration
# │          claims by comparing EdgeTPU vs ARM Cortex-M7 latency and energy
# │
# │ Goal: Contrast raw inference gains with end-to-end efficiency.
# │ Show: That instantaneous power spikes (EdgeTPU) can yield lower total energy.
# │ How: Compare EdgeTPU and CPU on latency, end-to-end speed, and Joules per inference.
# │
# │ Imports: mlsys.formatting (fmt, fmt_percent, check)
# │ Exports: edgetpu_latency_ms_str, cpu_latency_ms_str, edgetpu_e2e_ms_str,
# │          cpu_e2e_ms_str, edgetpu_power_mw_str, cpu_power_mw_str,
# │          inference_speedup_str, e2e_speedup_str, edgetpu_power_ratio_str,
# │          cpu_energy_mj_str, edgetpu_energy_mj_str, energy_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check

class EdgeTPUSpeedupCalc:
    """EdgeTPU vs Cortex-M7: 7.5× inference speedup, higher peak power, lower energy per inference."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    edgetpu_latency_ms = 2
    cpu_latency_ms = 15
    edgetpu_e2e_ms = 6
    cpu_e2e_ms = 18
    edgetpu_power_mw = 500
    cpu_power_mw = 120
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    inference_speedup = cpu_latency_ms / edgetpu_latency_ms
    e2e_speedup = cpu_e2e_ms / edgetpu_e2e_ms
    power_ratio = edgetpu_power_mw / cpu_power_mw
    cpu_energy_mj = cpu_power_mw * cpu_latency_ms / 1000
    edgetpu_energy_mj = edgetpu_power_mw * edgetpu_latency_ms / 1000
    energy_ratio = cpu_energy_mj / edgetpu_energy_mj
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    edgetpu_latency_ms_str = fmt(edgetpu_latency_ms, precision=0, commas=False)
    cpu_latency_ms_str = fmt(cpu_latency_ms, precision=0, commas=False)
    edgetpu_e2e_ms_str = fmt(edgetpu_e2e_ms, precision=0, commas=False)
    cpu_e2e_ms_str = fmt(cpu_e2e_ms, precision=0, commas=False)
    edgetpu_power_mw_str = fmt(edgetpu_power_mw, precision=0, commas=False)
    cpu_power_mw_str = fmt(cpu_power_mw, precision=0, commas=False)
    inference_speedup_str = fmt(inference_speedup, precision=1, commas=False)  # 15/2 = 7.5; keep one decimal so the text matches the docstring
    e2e_speedup_str = fmt(e2e_speedup, precision=0, commas=False)
    edgetpu_power_ratio_str = fmt(power_ratio, precision=0, commas=False)
    cpu_energy_mj_str = fmt(cpu_energy_mj, precision=1, commas=False)
    edgetpu_energy_mj_str = fmt(edgetpu_energy_mj, precision=1, commas=False)
    energy_ratio_str = fmt(energy_ratio, precision=1, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
edgetpu_latency_ms_str = EdgeTPUSpeedupCalc.edgetpu_latency_ms_str
cpu_latency_ms_str = EdgeTPUSpeedupCalc.cpu_latency_ms_str
edgetpu_e2e_ms_str = EdgeTPUSpeedupCalc.edgetpu_e2e_ms_str
cpu_e2e_ms_str = EdgeTPUSpeedupCalc.cpu_e2e_ms_str
edgetpu_power_mw_str = EdgeTPUSpeedupCalc.edgetpu_power_mw_str
cpu_power_mw_str = EdgeTPUSpeedupCalc.cpu_power_mw_str
inference_speedup_str = EdgeTPUSpeedupCalc.inference_speedup_str
e2e_speedup_str = EdgeTPUSpeedupCalc.e2e_speedup_str
edgetpu_power_ratio_str = EdgeTPUSpeedupCalc.edgetpu_power_ratio_str
cpu_energy_mj_str = EdgeTPUSpeedupCalc.cpu_energy_mj_str
edgetpu_energy_mj_str = EdgeTPUSpeedupCalc.edgetpu_energy_mj_str
energy_ratio_str = EdgeTPUSpeedupCalc.energy_ratio_str
```

::: {.callout-lighthouse title="MobileNet on EdgeTPU"}
Completing our MobileNet lighthouse example, we validate the hardware acceleration claims from @sec-hardware-acceleration using MLPerf Tiny scenarios.

*Note: The following values are illustrative, based on typical EdgeTPU and Cortex-M7 performance characteristics. Actual results vary with clock frequency, thermal conditions, and specific implementation. Always benchmark your specific configuration.*

**Hardware acceleration claim**: EdgeTPU achieves ~`{python} edgetpu_latency_ms_str` ms inference for INT8 MobileNetV2, approximately `{python} inference_speedup_str`$\times$ speedup over ARM Cortex-M7 CPU (~`{python} cpu_latency_ms_str` ms).

**Validation protocol** (SingleStream scenario):

| **Metric**               |               **CPU (Cortex-M7)** |                           **EdgeTPU** | **Claimed**                                     |                                      **Validated?** |
|:-------------------------|----------------------------------:|--------------------------------------:|:------------------------------------------------|----------------------------------------------------:|
| **Inference latency**    | ~`{python} cpu_latency_ms_str` ms | ~`{python} edgetpu_latency_ms_str` ms | `{python} inference_speedup_str`$\times$ faster |                                                   ✓ |
| **End-to-end latency**   |     ~`{python} cpu_e2e_ms_str` ms |     ~`{python} edgetpu_e2e_ms_str` ms | —                                               |          ~`{python} e2e_speedup_str`$\times$ faster |
| **Power consumption**    |   ~`{python} cpu_power_mw_str` mW |   ~`{python} edgetpu_power_mw_str` mW | —                                               |  ~`{python} edgetpu_power_ratio_str`$\times$ higher |
| **Energy per inference** |  ~`{python} cpu_energy_mj_str` mJ |  ~`{python} edgetpu_energy_mj_str` mJ | —                                               | ~`{python} energy_ratio_str`$\times$ more efficient |

**What this reveals**: The `{python} inference_speedup_str`$\times$ inference speedup is real, but end-to-end improvement is only ~`{python} e2e_speedup_str`$\times$ because preprocessing (image capture, resize, normalize) runs on the CPU in both cases. EdgeTPU consumes more power but completes faster, yielding better energy efficiency per inference.

**The deployment decision**: For battery-powered devices running infrequently (doorbell camera: ~100 inferences/day), the CPU-only design can be more energy-efficient overall: the device spends most of its time in sleep mode, so standby power rather than per-inference energy dominates the daily budget, and an accelerator adds to that standby draw. For continuous operation (real-time video analytics: 30 FPS), EdgeTPU's per-inference energy efficiency dominates.

This illustrates why benchmarking requires matching the MLPerf scenario to your deployment context: SingleStream validates mobile applications; Offline benchmarks would give different conclusions optimized for throughput rather than latency.
:::
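
The deployment decision in the callout can be sketched numerically. The snippet below is a minimal sketch rather than a measurement: the active power and latency figures are the callout's illustrative values, while the idle powers (`CPU_IDLE_MW`, `TPU_IDLE_MW`) are hypothetical numbers chosen only to show how standby draw can flip the conclusion.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Illustrative active figures from the callout above.
CPU_ACTIVE_MW, CPU_LATENCY_MS = 120, 15
TPU_ACTIVE_MW, TPU_LATENCY_MS = 500, 2
# Hypothetical idle powers (assumptions, not measured values).
CPU_IDLE_MW, TPU_IDLE_MW = 0.5, 5.0


def energy_per_inference_mj(active_mw, latency_ms):
    """mW x ms gives microjoules; divide by 1000 for millijoules."""
    return active_mw * latency_ms / 1000


def daily_energy_j(active_mw, latency_ms, idle_mw, inferences_per_day):
    """Total joules per day: active bursts plus idle time in between."""
    active_s = inferences_per_day * latency_ms / 1000
    idle_s = max(SECONDS_PER_DAY - active_s, 0.0)
    return (active_mw * active_s + idle_mw * idle_s) / 1000


for name, per_day in [("doorbell", 100), ("30 FPS video", 30 * SECONDS_PER_DAY)]:
    cpu = daily_energy_j(CPU_ACTIVE_MW, CPU_LATENCY_MS, CPU_IDLE_MW, per_day)
    tpu = daily_energy_j(TPU_ACTIVE_MW, TPU_LATENCY_MS, TPU_IDLE_MW, per_day)
    print(f"{name}: CPU {cpu:.1f} J/day, EdgeTPU {tpu:.1f} J/day")
```

With these assumed idle figures, the CPU uses less energy per day in the doorbell case and the EdgeTPU uses less in the continuous-video case. Different idle powers would move the crossover point, which is why idle draw needs to be measured rather than assumed.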

\index{Power Measurement!evaluation triad completion}
Training benchmarks measure learning speed; inference benchmarks measure serving speed. But both measures share a critical blind spot: they say nothing about *how much energy* the system consumes to achieve that speed. A system that sets throughput records while consuming kilowatts of power may be economically unsustainable or physically impossible to deploy at the edge. Completing the evaluation picture requires measuring the energy cost of performance.

## Power Measurement Techniques {#sec-benchmarking-power-measurement-techniques-bcc2}

A chip vendor advertises "10 TOPS at 0.5 W" (20 TOPS/W), but under sustained inference load, thermal throttling drops actual throughput to 3 TOPS at 2 W (1.5 TOPS/W). Without standardized power measurement, this 13.3$\times$ efficiency gap between the datasheet and reality goes undetected until deployment.

\index{TOPS per Watt!primary design objective}
This third dimension is critical because @sec-hardware-acceleration established TOPS/Watt as a primary design objective alongside raw TOPS. Power benchmarks validate whether efficiency-optimized accelerators deliver their promised energy savings. Power claims are particularly susceptible to gaming: a headline figure such as "10 TOPS at 0.5 W" might hold only at minimal utilization, while sustained load with thermal throttling and voltage scaling may deliver 3 TOPS at 2 W. Power benchmarks expose these gaps.
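
The arithmetic behind such a comparison is short enough to sketch. The helper name `tops_per_watt` is illustrative, and the inputs are the datasheet and sustained-load figures from the example above.

```python
def tops_per_watt(tops, watts):
    """Efficiency as throughput divided by power draw."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return tops / watts


datasheet = tops_per_watt(10, 0.5)  # advertised operating point
sustained = tops_per_watt(3, 2.0)   # throttled operating point under load

print(f"datasheet: {datasheet:.1f} TOPS/W")
print(f"sustained: {sustained:.1f} TOPS/W")
print(f"efficiency gap: {datasheet / sustained:.1f}x")
```

Note that the gap compounds two effects: throughput falls while power rises, so the efficiency ratio is larger than either change alone.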

However, measuring power consumption in machine learning systems presents challenges distinct from measuring time or throughput. Power varies with temperature, workload phase, and system configuration in ways that performance metrics do not. @tbl-power quantifies how energy demands of ML models vary dramatically across deployment environments, spanning multiple orders of magnitude from TinyML devices consuming mere microwatts to data center racks requiring kilowatts. This wide spectrum illustrates the central challenge in creating standardized benchmarking methodologies [@henderson2020towards].

| **Category** | **Device Type**                 | **Power Consumption** |
|:-------------|:--------------------------------|----------------------:|
| **Tiny**     | Neural Decision Processor (NDP) |                150 µW |
| **Tiny**     | M7 Microcontroller              |                 25 mW |
| **Mobile**   | Raspberry Pi 4                  |                 3.5 W |
| **Mobile**   | Smartphone                      |                   4 W |
| **Edge**     | Smart Camera                    |               10-15 W |
| **Edge**     | Edge Server                     |               65-95 W |
| **Cloud**    | ML Server Node                  |             300-500 W |
| **Cloud**    | ML Server Rack                  |               4-10 kW |

: **Power Consumption Spectrum.** Machine learning deployments span nearly eight orders of magnitude in power demands, from microwatt-scale TinyML devices to kilowatt-scale server racks. This enormous range explains why no single measurement technique or efficiency metric applies universally: benchmarking a 150 µW neural processor requires fundamentally different instrumentation than measuring a 10 kW server rack. {#tbl-power}

Creating a unified methodology across a range of nearly eight orders of magnitude requires accounting for each scale's characteristics: instruments that resolve microwatt-level TinyML draw are not the ones that monitor kilowatt-scale server racks. A comprehensive framework must accommodate these scales while maintaining consistency, fairness, and reproducibility.

### Power Measurement Boundaries {#sec-benchmarking-power-measurement-boundaries-982c}

To address these measurement challenges, we must understand how power consumption is measured at different system scales, from TinyML devices to full-scale data center inference nodes. @fig-power-diagram lays out the distinct measurement boundaries for each scenario—pay attention to the color coding: components in green fall inside the energy accounting boundary, while components with red dashed outlines are explicitly excluded from power measurements. This distinction matters because where the boundary is drawn determines what counts as "efficient."

::: {#fig-power-diagram fig-env="figure" fig-pos="htb" fig-cap="**Power Measurement Boundaries**: MLPerf defines system boundaries for power measurement, ranging from single-chip devices to full data center nodes, to enable fair comparisons of energy efficiency across diverse hardware platforms. These boundaries delineate which components' power consumption is included in reported metrics, impacting the interpretation of performance results. Source: [@tschand2024mlperf]." fig-alt="System diagram showing four measurement boundaries: Tiny SoC with compute units, Inference SoC with accelerators and DRAM, Inference Node with cooling and NIC, and Training Rack with compute nodes."}
```{.tikz}
\begin{tikzpicture}[font=\footnotesize\usefont{T1}{phv}{m}{n}]
\tikzset{%
Line/.style={line width=1.0pt,black!50,text=black,align=center},
BoxG/.style={inner xsep=4pt,
    node distance=0.3,
    draw=GreenLine,
    line width=0.5pt,
    fill=GreenL!60,
    align=flush center,
    rounded corners=2pt,
    minimum height=7.5mm
  },
BoxFill/.style={draw=BackLine,inner xsep=2mm,inner ysep=2mm,
yshift=0mm,fill=BackColor!60,line width=1pt},
BoxFill2/.style={draw=BackLine,inner sep=1pt,fill=BackColor!60,line width=1pt,align=flush center},
BoxDash2/.style={draw=RedLine,inner sep=1pt,fill=white,line width=1pt,dashed,align=flush center},
BoxDash/.style={draw=RedLine,inner xsep=2mm,inner ysep=2mm,
yshift=0mm,fill=white,line width=1pt,dashed,align=flush center},
BoxB/.style={BoxG,fill=cyan!10},
BoxR/.style={BoxG,fill=magenta!15},
BoxO/.style={BoxG,fill=orange!15},
BoxV/.style={BoxG,fill=violet!15}
}
%%%Tiny Example
\foreach \j in {1,2} {
\node[BoxG](1C\j) at({0}, {-0.15*\j}){Compute Unit};
}
\node[BoxB,below =0.4 of  1C2.south west,minimum height=11mm](1C3){Basic\\ Switch};
\node[BoxR,below =0.4 of 1C2.south east,minimum height=11mm](1C4){On Chip\\ SRAM};
\scoped[on background layer]
\node[BoxFill,inner xsep=5mm,fit=(1C1)(1C3)(1C4)](BB1){};
\node[above=4pt of  BB1.north,inner sep=0pt, anchor=south](THE){\textbf{Tiny Example}};
\node[below=4pt of  BB1.south,inner sep=0pt, anchor=north]{Traditional (ultra) Low Power SoC};
%%%Diagram Key
\node[BoxFill,below =1.4 of BB1.219,minimum width=5mm](PMB){};
\node[right=1mm of PMB,yshift=-1pt](PMBT){Power Measurement Boundary};
\node[BoxDash,below =0.13of PMB,,minimum width=5mm](NIB){};
\node[right=1mm of NIB,yshift=-1pt](NIBT){Not in Boundary};
\scoped[on background layer]
\node[BoxFill,fill=white,inner ysep=4mm,yshift=2mm,fit=(PMB)(PMBT)(NIB)](1BB1){};
\node[below left=4pt and -4ptof  1BB1.north west,inner sep=0pt, anchor=north west]{\textbf{Diagram Key}};
%%%Inference Example
%%Typical Inference SoC 1
\foreach \j in {1,2} {
\node[BoxG,minimum height=12mm,yshift=-8mm](2C\j) at({5.6}, {-0.15*\j}){Compute\\ Unit};
}
\node[BoxB,below=of 2C2.south west,anchor=north west,minimum height=12mm](2C3){On Chip\\SRAM};
\node[BoxR,right=of 2C1.north east,anchor=north west,minimum height=10mm](2C4){Switching\\NoC};
\coordinate(S1)at($(2C3.north east)+(1,-0.25)$);
\begin{scope}[local bounding box=CU2,shift={($(S1)+(0,0)$)}]
\foreach \j in {1,2} {
\node[BoxG,minimum height=10mm](22C\j) at({0.15*\j}, {0}){Compute\\ Unit};
}
\end{scope}
\scoped[on background layer]
\node[BoxFill,fill=white,fit=(2C1)(2C3)(2C4)(CU2)](2BB1){};
\node[above left=2pt and -4pt of  2BB1.north west,inner sep=0pt, anchor=south west](TIS){Typical Inference SoC 1};
\node[BoxO,xshift=2mm,below=0mm of 2BB1.east,rotate=90,minimum height=6mm](OCD1){Off-Chip DRAM};
\node[BoxO,xshift=-2mm,above=0mm of 2BB1.west,rotate=90,minimum height=6mm](OCD2){Off-Chip DRAM};
\node[BoxO,below =11mm of OCD2.west,minimum height=8mm,minimum width=6mm](OCD4){};
\node[BoxO,below =11mm of OCD1.west,minimum height=8mm,minimum width=6mm](OCD3){};
%
\path[red](OCD4)-|coordinate(S2)(2C3.south west);
\path[red](OCD3)-|coordinate(S3)(22C2.south east);
\node[BoxG,anchor=west,minimum height=6mm,minimum width=6mm](2B1)at(S2){};
\node[BoxR,anchor=east,minimum height=6mm,minimum width=6mm](2B4)at(S3){};
\node[BoxB,minimum height=6mm,minimum width=6mm](2B3)at($(2B1)!0.66!(2B4)$){};
\node[BoxO,minimum height=6mm,minimum width=6mm](2B2)at($(2B1)!0.33!(2B4)$){};
\scoped[on background layer]
\node[BoxFill,inner xsep=5mm,fit=(OCD3)(TIS)(OCD4)](BB2){};
\scoped[on background layer]
\node[BoxFill,fill=white,fit=(2B1)(2B4),inner ysep=1.5mm,](2BB2){};
\node[above left=2pt and -4pt of  2BB2.north west,inner sep=0pt, anchor=south west]{Typical Inference SoC n};
\scoped[on background layer]
\node[BoxFill,fill=white,fit=(2C1)(2C3)(2C4)(CU2)](2BB1){};
%%%Typical Inference Node 1
\begin{scope}[local bounding box=CU3,shift={($(15,-0.45)+(0,0)$)}]
\foreach \j in {1,2} {
\node[BoxG,minimum height=17mm](3C\j) at({0}, {-0.2*\j}){Accelerator (s) +\\ Local RAM};
}
\node[BoxV,below=4mm of 3C2.south east,minimum width=15mm,minimum height=9mm,anchor=north east](3C3){Active\\ Cooling};
\node[BoxR,below=4mm of 3C3.south east,minimum width=15mm,minimum height=11mm,anchor=north east](3C4){NIC};
\node[BoxR,left=4mm of 3C2.south west,minimum width=15mm,minimum height=9mm,anchor=south east](3C5){Local\\ Storage};
\node[BoxO,below=4mm of 3C2.south west,minimum width=15mm,minimum height=15mm,anchor=north east](3C6){Host\\ DRAM};
\coordinate(S4)at($(3C6.230)+(0,-0.75)$);
\begin{scope}[local bounding box=CU2,shift={($(S4)+(0,0)$)}]
\foreach \j in {1,2} {
\node[BoxG,minimum width=16mm,minimum height=9mm](33C\j) at({0.15*\j}, {0}){Host (s)};
}
\end{scope}
%
\scoped[on background layer]
\node[BoxFill,inner xsep=5mm,fit=(3C1)(3C4)(33C2)](BB4){};
\node[below=4pt of  BB4.south west,inner sep=0pt, anchor=north west]{Traditional Inference Node 1};
\end{scope}
%%%Training Example
\def\ra{1.89mm}
\node[BoxFill2,right=24mm of 3C1.north east,minimum width=41mm,minimum height=6mm,anchor=north west](4C1){
Compute Node 1 (Measured)};
\node[BoxFill2,below=\ra of 4C1.south,minimum width=41mm,minimum height=6mm,anchor=north](4C2){
Compute Node 2 (Measured)};
\node[BoxFill2,below=\ra of 4C2.south,minimum width=41mm,minimum height=9mm,anchor=north](4C3){
Network Switches\\ (Measured/Estimated)};
\node[BoxDash2,below=\ra of 4C3.south,minimum width=41mm,minimum height=6mm,anchor=north](4C4){
Storage Node};
\node[BoxFill2,below=\ra of 4C4.south,minimum width=41mm,minimum height=6mm,anchor=north](4C5){
Compute Node n (Measured)};
\node[BoxDash2,below=\ra of 4C5.south,minimum width=41mm,minimum height=6mm,anchor=north](4C6){
DC Cooling Components};
%
\scoped[on background layer]
\node[BoxFill,inner xsep=5mm,fit=(4C1)(4C6),fill=white,draw=BrownLine,line width=0.75pt](BB6){};
\node[below=4pt of  BB6.south west,inner sep=0pt, anchor=north west]{Training Rack 1};
%%%Right
\node[BoxFill2,right=22mm of 4C1.east,minimum width=7mm,minimum height=6mm,anchor=west](5C1){};
\node[BoxFill2,below=\ra of 5C1.south,minimum width=7mm,minimum height=6mm,anchor=north](5C2){};
\node[BoxFill2,below=\ra of 5C2.south,minimum width=7mm,minimum height=9mm,anchor=north](5C3){};
\node[BoxDash2,below=\ra of 5C3.south,minimum width=7mm,minimum height=6mm,anchor=north](5C4){};
\node[BoxFill2,below=\ra of 5C4.south,minimum width=7mm,minimum height=6mm,anchor=north](5C5){};
\node[BoxDash2,below=\ra of 5C5.south,minimum width=7mm,minimum height=6mm,anchor=north](5C6){};
%
\scoped[on background layer]
\node[BoxFill,inner xsep=4.5mm,fit=(5C1)(5C6),fill=white,draw=BrownLine,line width=0.75pt](BB7){};
\node[below=4pt of  BB7.south west,inner sep=0pt, anchor=north west]{Training Rack n};
%
\node[BoxDash2,rotate=90,minimum height=6mm,minimum width=46mm](RS1)at($(BB2.east)!0.5!(BB4.west)$){Remote Storage};
\node[BoxDash2,rotate=90,minimum height=6mm,minimum width=46mm](RS2)at($(BB4.east)!0.5!(BB6.west)$){Remote Storage};
\node[BoxFill2,rotate=90,minimum height=6mm,minimum width=46mm,
fill=OrangeL!40](RS3)at($(BB6.east)!0.5!(BB7.west)$){Interconnection Fabrics};
\path[red](THE)-|coordinate(S6)(RS1);
\path[red](THE)-|coordinate(S7)($(BB6.north west)!0.5!(BB7.north east)$);
\node[]at(S6){\textbf{Inference Example}};
\node[]at(S7){\textbf{Training Example}};
\end{tikzpicture}
```
:::

The diagram is organized into three categories (Tiny, Inference, and Training), each reflecting a different measurement scope based on system architecture and deployment environment. In TinyML systems, the entire low-power SoC, including compute, memory, and basic interconnects, typically falls within the measurement boundary. Inference nodes introduce more complexity, incorporating multiple SoCs, accelerators, memory, and local storage, while excluding remote storage. Training deployments span multiple racks, where only selected elements, including compute nodes and network switches, are measured, while storage systems, cooling infrastructure, and parts of the interconnect fabric are often excluded.

System-level power measurement offers a more holistic view than measuring individual components in isolation. While component-level metrics (e.g., accelerator or processor power) are valuable for performance tuning, real-world ML workloads involve intricate interactions between compute units, memory systems, and supporting infrastructure. For instance, analysis of Google's TensorFlow Mobile workloads shows that data movement accounts for 57.3% of total inference energy consumption [@BoroumandASPLOS2018], highlighting how memory-bound operations can dominate system power usage.

Shared infrastructure presents additional challenges. In data centers, resources such as cooling systems and power delivery are shared across workloads, complicating attribution of energy use to specific ML tasks. Cooling alone can account for 20–30% of total facility power consumption, making it a major factor in energy efficiency assessments [@barroso2019datacenter]. Even at the edge, components like memory and I/O interfaces may serve both ML and non-ML functions, further blurring measurement boundaries.

Shared infrastructure complexity is further compounded by dynamic power management techniques that modern systems employ to optimize energy efficiency. \index{DVFS!dynamic voltage and frequency scaling}
Dynamic voltage and frequency scaling (DVFS) adjusts processor voltage and clock frequency based on workload demands, enabling 30–50% power reductions during periods of lower computational intensity. Advanced DVFS implementations using on-chip switching regulators can achieve comparable energy savings [@kim2008system]. As a consequence, measured power for the same ML model can vary by 30–50% depending on system load and concurrent activity. This variability affects not only the compute components but also the supporting infrastructure, as reduced processor activity can lower cooling requirements and overall facility power draw.

\index{Power Usage Effectiveness!datacenter cooling metric}
Support infrastructure, particularly cooling systems, is a major component of total energy consumption in large-scale deployments. Data centers must maintain operational temperatures, typically between 20 and 25°C, to ensure system reliability. Cooling overhead is captured in the Power Usage Effectiveness (PUE) metric, the ratio of total facility energy to IT equipment energy, which ranges from 1.1 in highly efficient facilities to over 2.0 in less optimized ones [@barroso2019datacenter]. The interaction between compute workloads and cooling infrastructure creates complex dependencies; for example, power management techniques like DVFS not only reduce direct processor power consumption but also decrease heat generation, creating cascading effects on cooling requirements. Even edge devices require basic thermal management.
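
To make the PUE range concrete, the sketch below scales a measured IT-level energy figure to facility level. It assumes the standard definition of PUE as total facility energy divided by IT equipment energy. The 100 kWh job energy is a hypothetical value, and the function names are illustrative.

```python
def facility_energy_kwh(it_energy_kwh, pue):
    """Scale IT equipment energy to total facility energy."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_kwh * pue


def overhead_fraction(pue):
    """Share of facility energy spent on everything other than IT load."""
    return 1 - 1 / pue


job_kwh = 100.0  # hypothetical IT-level energy for one training job
for pue in (1.1, 2.0):
    total = facility_energy_kwh(job_kwh, pue)
    share = overhead_fraction(pue)
    print(f"PUE {pue}: {total:.0f} kWh at the facility, {share:.0%} overhead")
```

The same job therefore costs nearly twice as much facility energy at the inefficient end of the range, which is why a benchmark result should state whether its boundary stops at the IT equipment.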

### Computational Efficiency vs. Power Consumption {#sec-benchmarking-computational-efficiency-vs-power-consumption-d01e}

\index{Frequency Scaling!cubic power relationship}
The relationship between computational performance and energy efficiency is a central tradeoff in modern ML system design. As systems push for higher performance, they often encounter diminishing returns in energy efficiency due to physical limitations in semiconductor scaling and power delivery [@koomey2011web]. Processor frequency scaling makes the tradeoff concrete: increasing clock frequency by 20% typically yields only modest performance improvements (around 5%) while increasing power consumption by up to 50%. The penalty follows from how dynamic power scales: it grows with frequency times the square of supply voltage, and higher frequencies generally require higher voltage, so power rises roughly with the cube of frequency [@le2010dynamic].
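
The frequency-scaling penalty can be sketched with the textbook dynamic power model, $P \propto C V^2 f$. This is a simplification that ignores static (leakage) power, and the function names are illustrative rather than drawn from any library.

```python
def relative_dynamic_power(freq_scale, voltage_scale):
    """Dynamic power relative to baseline under P ~ C * V^2 * f."""
    return voltage_scale ** 2 * freq_scale


def voltage_scale_for(power_scale, freq_scale):
    """Voltage scaling implied by a given power and frequency scaling."""
    return (power_scale / freq_scale) ** 0.5


freq = 1.2  # a 20% clock increase
print(f"voltage held constant: {relative_dynamic_power(freq, 1.0):.2f}x power")
print(f"voltage tracks clock:  {relative_dynamic_power(freq, freq):.2f}x power")
print(f"voltage for 1.5x power: {voltage_scale_for(1.5, freq):.3f}x")
```

Under this model, a 20% frequency increase costs between 1.2$\times$ power (voltage unchanged) and about 1.73$\times$ (voltage scaled in proportion). A 50% power increase sits between those bounds and corresponds to a voltage increase of roughly 12%.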

In deployment scenarios with strict energy constraints, particularly battery-powered edge devices and mobile applications, optimizing this performance-energy tradeoff becomes essential for practical viability. Model optimization techniques offer promising approaches to achieve better efficiency without material accuracy degradation. Numerical precision optimization techniques, which reduce computational requirements while maintaining model quality, demonstrate this tradeoff effectively. Research shows that reduced-precision computation can maintain model accuracy within 1–2% of the original while delivering 3--4$\times$ improvements in both inference speed and energy efficiency.

These optimization strategies span three interconnected dimensions: accuracy, computational performance, and energy efficiency. Model optimization and compression techniques move a system through this tradeoff space, trading accuracy losses against efficiency gains. The optimal operating point depends heavily on deployment requirements and constraints: mobile applications typically prioritize energy efficiency to extend battery life, while cloud-based services might optimize for accuracy even at higher power consumption costs, benefiting from economies of scale and dedicated cooling infrastructure.

Energy efficiency metrics now occupy a central position in AI system evaluation. Power measurement standards such as MLPerf Power [@tschand2024mlperf] provide standardized frameworks for comparing energy efficiency across hardware platforms and deployment scenarios. These standards enable engineers to systematically balance performance, power consumption, and environmental impact when selecting hardware and optimization strategies.

### Standardized Power Measurement {#sec-benchmarking-standardized-power-measurement-7fae}

Power measurement techniques like SPEC Power [@spec_power_website] have long served general computing [@lange2009identifying], but ML workloads expose a fundamental difficulty: instantaneous power consumption during a single inference can vary by 10$\times$ between the compute-intensive phases of matrix multiplication and the memory-stall phases of weight loading. A transformer attention layer may spike to 400 W while the subsequent data-movement phase drops to 40 W—all within a few milliseconds. This volatility means that any single-point measurement is misleading, and the act of measurement itself (instrumentation overhead, sampling-induced delays) can perturb the very power profile being characterized.

The core challenge is therefore temporal: how do you characterize a quantity that fluctuates faster than most measurement instruments can sample? Dense matrix operations in transformer layers create short, intense power spikes requiring high-frequency sampling (>1 kHz) to capture accurately, while CNN inference tends toward more consistent power draw amenable to lower sampling rates. The measurement window must also account for ML-specific warm-up periods, where initial inferences consume more power due to cache population and pipeline initialization. Sliding-window averages over hundreds of inferences smooth these fluctuations into actionable efficiency numbers, but the window size itself becomes a design parameter that can hide or reveal different aspects of the power profile.
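
The effect of sampling rate and window size can be illustrated on a synthetic trace. The sketch below assumes a hypothetical workload that draws 400 W for 2 ms out of every 10 ms and 40 W otherwise, so the true average is 112 W. The trace, the sampling rates, and the helper names are all assumptions made for illustration.

```python
def power_at(t_us):
    """Synthetic trace: 400 W for the first 2 ms of each 10 ms period, else 40 W."""
    return 400.0 if t_us % 10_000 < 2_000 else 40.0


def sample_trace(duration_us, period_us):
    """Sample the trace at a fixed interval (both in microseconds)."""
    return [power_at(t) for t in range(0, duration_us, period_us)]


def mean(values):
    return sum(values) / len(values)


def sliding_mean(values, window):
    """Mean over each run of `window` consecutive samples."""
    running = sum(values[:window])
    means = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


fast = sample_trace(1_000_000, 100)     # 10 kHz for one second
slow = sample_trace(1_000_000, 10_000)  # 100 Hz, locked to the spike phase

print(f"10 kHz mean: {mean(fast):.1f} W")
print(f"100 Hz mean: {mean(slow):.1f} W")
short = sliding_mean(fast, 10)   # 1 ms window
full = sliding_mean(fast, 100)   # 10 ms window, one full period
print(f"1 ms window:  {min(short):.0f} to {max(short):.0f} W")
print(f"10 ms window: {min(full):.0f} to {max(full):.0f} W")
```

In this contrived case the slow sampler lands on the spike every time and reports 400 W, while the fast sampler recovers the 112 W average. The 1 ms window still swings between 40 W and 400 W, whereas a window covering one full period is flat. That is the sense in which window size can hide or reveal structure in the power profile.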

Memory access patterns compound the measurement problem because ML systems often spend more energy moving data than computing on it. Recommendation models like DLRM, for example, can consume more energy on memory access than computation—a pattern that traditional compute-focused power measurement misses entirely. Capturing both compute and memory subsystem power consumption requires instrumenting the full data path, not just the processor.

Heterogeneous accelerator configurations introduce further complexity. GPUs, TPUs, and NPUs each maintain independent power management schemes, and modern SoCs dynamically switch between compute resources based on workload characteristics. Accurate system-level measurement requires synchronized power capture across all active compute units—a challenge that scales with system size. Multi-GPU configurations must account for gradient synchronization energy alongside computation, and multi-node deployments add non-trivial network infrastructure power. At the other extreme, edge deployments must capture the energy cost of model updates and data preprocessing alongside inference itself.

Batch size creates a non-linear relationship with power consumption that single-point measurements cannot characterize. Larger batches improve compute efficiency (better amortization of memory loads) but increase memory pressure and peak power requirements, meaning the most efficient batch size for throughput may differ from the most efficient batch size for energy. Measurement across multiple batch sizes is essential for a complete efficiency profile. System idle states deserve equal attention, particularly for intermittent edge workloads: a wake-word detection TinyML system that actively processes audio for only a small fraction of operating time may be dominated by idle power consumption rather than inference energy. Finally, sustained ML workloads can cause 20–30°C temperature increases that trigger thermal throttling and alter power consumption patterns—an effect particularly acute in edge devices, where thermal constraints limit sustained performance and make extended benchmarking runs essential for realistic characterization.
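
A batch-size sweep makes the divergence between throughput-optimal and energy-optimal settings visible. The measurements below are hypothetical placeholders standing in for real benchmark runs. The only derived quantity is energy per sample, computed as average power divided by throughput.

```python
# Hypothetical sweep: (batch size, throughput in samples/s, average power in W)
sweep = [
    (1, 200.0, 60.0),
    (8, 1200.0, 150.0),
    (32, 2400.0, 240.0),
    (128, 2800.0, 350.0),
]


def joules_per_sample(throughput, power_w):
    """Watts divided by samples per second gives joules per sample."""
    return power_w / throughput


best_throughput = max(sweep, key=lambda row: row[1])
best_energy = min(sweep, key=lambda row: joules_per_sample(row[1], row[2]))

for batch, throughput, power in sweep:
    print(f"batch {batch:>3}: {joules_per_sample(throughput, power):.3f} J/sample")
print(f"highest throughput at batch {best_throughput[0]}")
print(f"lowest energy per sample at batch {best_energy[0]}")
```

In this made-up sweep the largest batch maximizes throughput, but a smaller batch minimizes energy per sample because power grows faster than throughput at the high end. A single-point measurement at either batch size would miss the other optimum.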

### MLPerf Power Case Study {#sec-benchmarking-mlperf-power-case-study-a554}

MLPerf Power [@tschand2024mlperf] is a standard methodology for measuring energy efficiency in machine learning systems across diverse deployments. At the datacenter level, it measures power usage in large-scale AI workloads, where energy consumption optimization directly impacts operational costs. For edge computing, it evaluates power efficiency in consumer devices like smartphones and laptops, where battery life constraints are critical. In tiny inference scenarios, it assesses energy consumption for ultra-low-power AI systems, particularly IoT sensors and microcontrollers operating with strict power budgets.

The MLPerf Power methodology applies the standardized evaluation principles discussed earlier, adapting them to hardware architectures ranging from general-purpose CPUs to specialized AI accelerators. This consistency enables meaningful cross-platform comparisons while maintaining measurement integrity across computing scales.
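
An energy-efficiency figure of the kind plotted later in this section (normalized samples per joule) can be sketched as follows. The run data are hypothetical, and the function names are illustrative rather than part of any MLPerf tooling.

```python
def samples_per_joule(samples, avg_power_w, duration_s):
    """Samples processed per joule, with energy = average power x time."""
    energy_j = avg_power_w * duration_s
    if energy_j <= 0:
        raise ValueError("energy must be positive")
    return samples / energy_j


def normalize_to_first(values):
    """Express each value relative to the first entry in the series."""
    base = values[0]
    return [v / base for v in values]


# Hypothetical runs: (samples processed, average power in W, duration in s)
runs = [(60_000, 300.0, 60.0), (90_000, 300.0, 60.0), (120_000, 250.0, 60.0)]
efficiency = [samples_per_joule(*run) for run in runs]
normalized = normalize_to_first(efficiency)

for eff, norm in zip(efficiency, normalized):
    print(f"{eff:.2f} samples/J ({norm:.2f}x baseline)")
```

Normalizing to the first submission removes the absolute scale, so trends can be compared across workloads whose raw samples-per-joule values differ by orders of magnitude.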

The benchmark has accumulated thousands of reproducible measurements submitted by industry organizations, demonstrating their latest hardware capabilities and the sector-wide focus on energy-efficient AI technology. Examine the three panels in @fig-power-trends to track how energy efficiency has evolved across system scales through successive MLPerf versions. The gains are not uniform—compare the datacenter panel against the tiny deployment panel to see where the most dramatic efficiency improvements have occurred and where progress has been more incremental.

::: {#fig-power-trends fig-env="figure" fig-pos="htb" fig-cap="**Energy Efficiency Gains**: Successive MLPerf inference benchmark versions show energy efficiency (samples per joule) improving up to 378$\times$ for datacenter workloads and 1070$\times$ for tinyML deployments. Standardized measurement protocols enable meaningful cross-platform comparisons, driving sector-wide progress toward sustainable AI. Source: [@tschand2024mlperf]." fig-alt="Three line charts showing normalized energy efficiency across MLPerf versions: datacenter models up to 378$\times$ gain, edge models up to 4$\times$, and tiny models up to 1070$\times$ improvement."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
%\node[anchor=south west]at(4.4,-5.54){%
%\includegraphics[width=69.7mm,height=40.1mm]{1}};

\makeatletter
\newcommand*\short[1]{\expandafter\@gobbletwo\number\numexpr#1\relax}
\makeatother

\pgfplotsset{myaxis/.style={
  axis line style={draw=none},
  /pgf/number format/.cd,
  1000 sep={},
   legend style={at={(0.22,0.9)}, anchor=north},
   legend cell align=left,
   legend style={fill=BrownL!30,draw=BrownLine,row sep=-1.1pt,
   font=\fontsize{5pt}{5}\selectfont\usefont{T1}{phv}{m}{n}},
   width=85mm,
   height=50mm,
  % axis lines=left,
   axis line style={thick,-latex},
   tick label style={/pgf/number format/assume math mode=true},
   yticklabel style={xshift=1mm,font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
   /pgf/number format/.cd, fixed, fixed zerofill, precision=0},
   xticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
   ylabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},align=center,yshift=-1.2mm},
   xlabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
   y tick style={draw=none},
   x tick style={draw=none,thin},
   tick align=outside,
   major tick length=1mm,
   title style={yshift=-4pt},
   grid=both,
   major grid style={black!60},
   log basis y=10,
   x tick label style={rotate=0, anchor=south,yshift=-3pt},
  % xlabel near ticks
    }}
%LEFT
 \begin{scope}[local bounding box=GR1,shift={(0,0)}]
%value top
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel pos=right,
  xticklabel=\month/\short{\year},
  xtick={2021-04-01,2021-09-01,
  2022-04-01,2022-09-01,
  2023-04-01,2023-09-01,
  2024-03-01,2024-08-01},
  xmin=2021-03-01,
  xmax=2024-09-30,
  ymin=0.75, ymax=464,
  ymode=log,
  ytick={1,10,100},
  yticklabels={10\textsuperscript{0},10\textsuperscript{1},10\textsuperscript{2}},
  ylabel={Normalized Energy Efficiency\\ (Samples/Joule)},
]
\end{axis}
%value bottom
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel=\month/\short{\year},
  xtick={2021-04-01,2021-09-01,
  2022-04-01,2022-09-01,
  2023-04-01,2023-09-01,
  2024-03-01,2024-08-01},
  xticklabels={v1.0,v1.1,v2.0,v2.1,v3.0,v3.1,v4.0,v4.1},
  xmin=2021-03-01,
  xmax=2024-09-30,
  ymin=0.75, ymax=464,
  ymode=log,
  ytick={1,10,100},
  yticklabels={,,},
  x tick label style={rotate=0, anchor=north,yshift=6pt},
  xlabel={MLPerf Inference Benchmark Version},
]
%green-ResNet
\addplot[green!70!black,mark=diamond*,
mark options={line width=1pt},
mark size=1.75pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
1.1, 2021-09-01
1.32, 2022-04-01
1.32, 2022-09-01
1.32, 2023-04-01
1.42, 2023-09-01
1.56, 2024-03-01
2.99, 2024-08-01
};
\addlegendentry{ResNet}
%red-BERT-99.0
\addplot[red!70!black,mark=square*,
mark options={line width=1pt},
mark size=1pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
1.07, 2021-09-01
1.12, 2022-04-01
1.20, 2022-09-01
1.72, 2023-04-01
1.72, 2023-09-01
1.69, 2024-03-01
1.73, 2024-08-01
};
\addlegendentry{BERT-99.0}
%blue-RetinaNet
\addplot[blue!70!black,mark=*,
mark options={line width=1pt},
mark size=1.5pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
1.07, 2021-09-01
1.12, 2022-04-01
1.0, 2022-09-01
2.3, 2023-04-01
2.45, 2023-09-01
2.95, 2024-03-01
2.99, 2024-08-01
};
\addlegendentry{RetinaNet}
%violet-RNN-T
\addplot[violet,mark=square*,
mark options={line width=1pt},
mark size=1pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
1.07, 2021-09-01
1.12, 2022-04-01
1.16, 2022-09-01
1.25, 2023-04-01
1.27, 2023-09-01
1.29, 2024-03-01
};
\addlegendentry{RNN-T}
%orange-GPTJ-99.0
\addplot[orange,mark=triangle*,
mark options={line width=1pt},
mark size=1.7pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2023-09-01
1.56, 2024-03-01
112.73, 2024-08-01
};
\addlegendentry{GPTJ-99.0}
%purple-DLRM-v2-99.0
\addplot[purple,mark=+,
mark options={line width=1pt},
mark size=1.7pt,line width=0.5pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2023-09-01
1.46, 2024-03-01
1.58, 2024-08-01
};
\addlegendentry{DLRM-v2-99.0}
%gray-Llama2-70b-99.9
\addplot[BrownLine,mark=x,
mark options={line width=1pt},
mark size=2pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date

1, 2024-03-01
378, 2024-08-01
};
\addlegendentry{Llama2-70b-99.9}
\node[font=\fontsize{6pt}{6}\selectfont\usefont{T1}{phv}{m}{n},
anchor=south,fill=white,inner sep=1pt]at (axis description cs: 0.22,0.91) {Benchmark};
\end{axis}
\end{scope}
%%%%%%%%%%
%RIGHT GRAPH
%%%%%%%%%%
 \begin{scope}[local bounding box=GR2,shift={(9,0)}]
%value top
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel pos=right,
  xticklabel=\month/\short{\year},
  xtick={2021-04-01,2021-09-01, 2022-04-01,2022-09-01,
  2023-04-01,2023-09-01, 2024-03-01},
  xmin=2021-03-01,
  xmax=2024-04-30,
  ymin=0.85, ymax=12,
  ymode=log,
  ytick={1,5,10},
  yticklabels={10\textsuperscript{0},$5\times10$\textsuperscript{0},10\textsuperscript{1}},
  ylabel={Normalized Energy Efficiency\\ (Samples/Joule)},
]
\end{axis}
%value bottom
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel=\month/\short{\year},
  xtick={2021-04-01,2021-09-01,  2022-04-01,2022-09-01,
  2023-04-01,2023-09-01, 2024-03-01},
  xticklabels={v1.0,v1.1,v2.0,v2.1,v3.0,v3.1,v4.0},
  xmin=2021-03-01,
  xmax=2024-04-30,
  ymin=0.85, ymax=12,
  ymode=log,
  ytick={1,5,10},
  yticklabels={,,},
  x tick label style={rotate=0, anchor=north,yshift=6pt},
  xlabel={MLPerf Inference Benchmark Version},
   legend style={at={(0.18,0.9)}, anchor=north},
]
%green-ResNet
\addplot[green!70!black,mark=diamond*,
mark options={line width=1pt},
mark size=1.75pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
1.1, 2021-09-01
1.32, 2022-04-01
1.42, 2022-09-01
1.42, 2023-04-01
1.42, 2023-09-01
1.41, 2024-03-01
};
\addlegendentry{ResNet}
%red-RNN-T
\addplot[red!70!black,mark=square*,
mark options={line width=1pt},
mark size=1pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
2.41, 2021-09-01
2.41, 2022-04-01
2.41, 2022-09-01
2.41, 2023-04-01
2.88, 2023-09-01
2.88, 2024-03-01
};
\addlegendentry{RNN-T}
%blue-RetinaNet
\addplot[blue!70!black,mark=*,
mark options={line width=1pt},
mark size=1.5pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1.0, 2022-09-01
3.4, 2023-04-01
3.55, 2023-09-01
3.55, 2024-03-01
};
\addlegendentry{RetinaNet}
%violet-BERT-99.0
\addplot[violet,mark=triangle*,
mark options={line width=1pt},
mark size=1.5pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-04-01
2.07, 2021-09-01
3.12, 2022-04-01
3.8, 2022-09-01
3.86, 2023-04-01
3.89, 2023-09-01
3.9, 2024-03-01
};
\addlegendentry{BERT-99.0}

\node[font=\fontsize{6pt}{6}\selectfont\usefont{T1}{phv}{m}{n},
anchor=south,fill=white,inner sep=1pt]at (axis description cs: 0.18,0.91) {Benchmark};
\end{axis}
\end{scope}
%%%%%%%%%%
%BOTTOM GRAPH
%%%%%%%%%%
 \begin{scope}[local bounding box=GR2,shift={(4.5,-5.25)}]
%value top
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel pos=right,
  xticklabel=\month/\short{\year},
  xtick={2021-06-01, 2022-02-01,2022-11-01,2023-06-01,2024-04-01},
  xmin=2021-05-01,
  xmax=2024-04-30,
  ymin=0.6, ymax=1400,
  ymode=log,
  ytick={1,10,100,1000},
  yticklabels={10\textsuperscript{0},10\textsuperscript{1},10\textsuperscript{2},10\textsuperscript{3}},
  ylabel={Normalized Energy Efficiency\\ (Samples/Joule)},
]
\end{axis}
%value bottom
\begin{axis}[myaxis,
  date coordinates in=x,
  xticklabel=\month/\short{\year},
  xtick={2021-06-01, 2022-02-01,2022-11-01,2023-06-01,2024-04-01},
  xticklabels={v0.5,v0.7,v1.0,v1.1,v1.2},
  xmin=2021-05-01,
  xmax=2024-04-30,
  ymin=0.6, ymax=1400,
  ymode=log,
  ytick={1,10,100,1000},
  yticklabels={,,,},
  x tick label style={rotate=0, anchor=north,yshift=6pt},
  xlabel={MLPerf Tiny Benchmark Version},
  legend style={at={(0.82,0.48)}, anchor=north},
]
%green-DSCNN
\addplot[green!70!black,mark=diamond*,
mark options={line width=1pt},
mark size=1.75pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-06-01
200.1, 2022-02-01
392, 2022-11-01
391, 2023-06-01
391, 2024-04-01
};
\addlegendentry{DSCNN}
%red-AutoEncoder
\addplot[red!70!black,mark=*,
mark options={line width=1pt},
mark size=1.5pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-06-01
9.1, 2022-02-01
80, 2022-11-01
80, 2023-06-01
80, 2024-04-01
};
\addlegendentry{AutoEncoder}
%blue-ResNet
\addplot[blue!70!black,mark=triangle*,
mark options={line width=1pt},
mark size=1.5pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-06-01
13.5, 2022-02-01
1070, 2022-11-01
1070, 2023-06-01
1070, 2024-04-01
};
\addlegendentry{ResNet}
%violet-MobileNet
\addplot[violet,mark=square*,
mark options={line width=1pt},
mark size=1pt,line width=1pt,
] table[x=Date, y=Y,  col sep=comma] {
Y,Date
1, 2021-06-01
14.5, 2022-02-01
600, 2022-11-01
600, 2023-06-01
600, 2024-04-01
};
\addlegendentry{MobileNet}

\node[font=\fontsize{6pt}{6}\selectfont\usefont{T1}{phv}{m}{n},
anchor=south,fill=white,inner sep=1pt]at (axis description cs: 0.82,0.49) {Benchmark};
\end{axis}
\end{scope}
\end{tikzpicture}
```
:::

Analysis of these MLPerf Power trends reveals two notable patterns. First, energy efficiency improvements for traditional ML workloads (ResNet, BERT, RNN-T) have plateaued after initial gains; the low-hanging fruit of optimization has been harvested. Second, generative AI applications show dramatic efficiency increases (378$\times$ for Llama2, 113$\times$ for GPTJ), reflecting rapid innovation as researchers optimize these newer, larger models. This dichotomy suggests that established workloads have reached optimization maturity while frontier models still offer substantial efficiency headroom, a pattern likely to repeat as each new model architecture matures.
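
The contrast can be checked directly from the plotted values. The following is a minimal sketch that recomputes the overall gain from the first and last plotted points of several series in the left panel (MLPerf Inference, v1.0 to v4.1). The numbers are copied from the figure; the `PLATEAU_THRESHOLD` cutoff of 5$\times$ is an illustrative assumption, not an MLPerf definition.

```python
# First and last plotted normalized samples/joule for selected series in
# the left panel of the figure above.
series = {
    "ResNet": (1.0, 2.99),
    "BERT-99.0": (1.0, 1.73),
    "RNN-T": (1.0, 1.29),
    "GPTJ-99.0": (1.0, 112.73),
    "Llama2-70b-99.9": (1.0, 378.0),
}

# Illustrative cutoff (an assumption): overall gains below 5x count as a plateau.
PLATEAU_THRESHOLD = 5.0


def overall_gain(first, last):
    """Ratio of the last plotted efficiency to the first."""
    return last / first


def classify(all_series, threshold=PLATEAU_THRESHOLD):
    """Label each series as 'plateaued' or 'rapid' by its overall gain."""
    return {
        name: "plateaued" if overall_gain(first, last) < threshold else "rapid"
        for name, (first, last) in all_series.items()
    }


labels = classify(series)
for name, label in labels.items():
    print(f"{name:>16}: {overall_gain(*series[name]):7.2f}x  {label}")
```

Under this cutoff the three traditional workloads fall on the plateau side and the two generative models on the rapid side, which matches the qualitative reading of the figure.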

Timing protocols and power instrumentation provide the raw data for benchmarking. Raw data alone, however, does not guarantee sound conclusions. Converting measurements into meaningful comparisons requires understanding the systematic sources of error, bias, and misalignment that can make even carefully collected benchmark numbers misleading.

## Benchmarking Best Practices {#sec-benchmarking-benchmarking-limitations-best-practices-9d65}

Training throughput, inference latency, and power efficiency each have established measurement protocols validated through MLPerf. Knowing *what* to measure, however, is insufficient without understanding what benchmarks *cannot* capture—and why this gap has derailed countless deployments.

Every benchmark makes simplifying assumptions that enable standardized comparison but diverge from production reality. Training benchmarks assume fixed datasets and reproducible random seeds; production data drifts continuously. Inference benchmarks assume steady-state operation; production traffic spikes unpredictably. Power benchmarks assume controlled thermal environments; real hardware throttles under sustained load. Four categories of limitations—statistical, deployment-related, system design, and organizational—determine whether benchmark results translate to deployment success.

### Statistical & Methodological Issues {#sec-benchmarking-statistical-methodological-issues-7aa5}

Benchmark results are only as reliable as the measurements that produce them. Three pervasive issues undermine this reliability if left unaddressed.

\index{Benchmark Coverage!incomplete problem representation}
Incomplete problem coverage represents one of the most pervasive limitations. Many benchmarks, while useful for controlled comparisons, fail to capture the full diversity of real-world applications. Common image classification datasets such as CIFAR-10 [@cifar10_website] contain a limited variety of images. Models that perform well on these datasets may struggle when applied to more complex, real-world scenarios with greater variability in lighting, perspective, and object composition. This gap between benchmark tasks and real-world complexity means strong benchmark performance provides limited guarantees about practical deployment success.

Statistical insignificance arises when benchmark evaluations are conducted on too few data samples or trials. For example, testing an optical character recognition (OCR) system on a small dataset may not accurately reflect its performance on large-scale, noisy text documents. Without sufficient trials and diverse input distributions, benchmarking results may be misleading or fail to capture true system reliability. The statistical confidence intervals around benchmark scores often go unreported, obscuring whether measured differences represent genuine improvements or measurement noise.
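
One way to make this concrete is to report an interval alongside every mean. The sketch below uses a normal-approximation 95% confidence interval over repeated trials; the latency values and system names are hypothetical, and for very small trial counts a Student's t interval would be the more careful choice. Overlapping intervals are treated here as a conservative warning sign, not as a formal hypothesis test.

```python
import math
import statistics


def mean_ci(samples, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two trials")
    mean = statistics.fmean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return mean, z * standard_error


def intervals_overlap(a, b):
    """True when the two confidence intervals overlap."""
    mean_a, half_a = mean_ci(a)
    mean_b, half_b = mean_ci(b)
    return abs(mean_a - mean_b) <= half_a + half_b


# Hypothetical per-trial latencies in milliseconds (illustrative only).
system_a = [10.1, 10.4, 9.8, 10.2, 10.0]
system_b = [10.0, 10.3, 9.9, 10.5, 10.1]
system_c = [12.0, 12.2, 11.9, 12.1, 12.3]

for name, trials in [("A", system_a), ("B", system_b), ("C", system_c)]:
    mean, half = mean_ci(trials)
    print(f"system {name}: {mean:.2f} ms +/- {half:.2f} ms")

print("A vs B overlap:", intervals_overlap(system_a, system_b))
print("A vs C overlap:", intervals_overlap(system_a, system_c))
```

In this example, systems A and B differ by less than their combined interval widths, so the apparent difference cannot be distinguished from noise with five trials, while A and C are clearly separated.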

\index{Reproducibility!cross-platform challenge}
Reproducibility remains a major challenge. Benchmark results can vary measurably depending on hardware configurations, software versions, and system dependencies. Small differences in compilers, numerical precision, or library updates can lead to inconsistent performance measurements across environments. MLPerf mitigates this by providing reference implementations, standardized test environments, and strict submission guidelines. Even with these efforts, true consistency across diverse hardware platforms is hard to achieve: the proliferation of optimization libraries, framework versions, and compiler flags creates a vast configuration space where slight variations produce different results.
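
A lightweight habit that helps here is storing an environment record next to every result. The following is a minimal sketch using only the Python standard library; the `capture_environment` helper and the `batch_size` and `precision` fields are hypothetical names chosen for illustration, and a real harness would also record framework, driver, and compiler versions.

```python
import json
import platform
import sys


def capture_environment(extra=None):
    """Collect basic interpreter and machine facts for a benchmark record."""
    record = {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }
    if extra:
        # Run-specific settings supplied by the caller (hypothetical fields).
        record.update(extra)
    return record


env = capture_environment({"batch_size": 32, "precision": "fp16"})
print(json.dumps(env, indent=2, sort_keys=True))
```

Saving this JSON alongside the measured numbers lets a later reader see which configuration produced them before comparing against a different run.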

### Laboratory-to-Deployment Performance Gaps {#sec-benchmarking-laboratorytodeployment-performance-gaps-16c8}

Statistical rigor ensures that benchmark measurements are accurate. Accurate measurements of the wrong thing, however, still lead to deployment failures. Benchmarks must also align with practical deployment objectives.

\index{Laboratory-to-Deployment Gap!performance misalignment}
Misalignment with real-world goals occurs when benchmarks emphasize metrics such as speed, accuracy, and throughput, while practical AI deployments require balancing multiple objectives including power efficiency, cost, and robustness. A model that achieves state-of-the-art accuracy on a benchmark may be impractical for deployment if it consumes excessive energy or requires expensive hardware. Similarly, optimizing for average-case performance on benchmark datasets may neglect tail-latency requirements that determine user experience in production systems. The multi-objective nature of real deployment, encompassing resource constraints, operational costs, maintenance complexity, and business requirements, extends far beyond the single-metric optimization that most benchmarks reward.
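
The gap between average-case and tail behavior is easy to see numerically. The sketch below uses a hypothetical latency sample in which 2% of requests are slow, and a nearest-rank percentile (one of several common percentile definitions, chosen here for simplicity).

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 98 fast, 2 slow.
latencies_ms = [10.0] * 98 + [200.0] * 2

mean_ms = statistics.fmean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.1f} ms, p50 = {p50_ms:.1f} ms, p99 = {p99_ms:.1f} ms")
```

The mean and median both look healthy, yet the 99th percentile is twenty times the median. A benchmark that reports only the average would hide exactly the requests that a latency SLA is written to protect.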

### System Design Challenges {#sec-benchmarking-system-design-challenges-9ed2}

Statistical methodology and deployment alignment address how we measure and what we optimize for. A third category of limitations emerges from the physical systems being measured. Hardware behavior depends on environmental conditions, architectural compatibility, and operational context in ways that complicate fair comparison.

Environmental conditions affect benchmark results in subtle but measurable ways, through both physical conditions (ambient temperature, humidity, altitude) and operational context (background processes, network load, power supply stability). Elevated temperatures trigger thermal throttling that reduces computational speed; background processes compete for resources and alter performance characteristics. Ensuring valid benchmarks requires controlling these factors to the extent possible through temperature-controlled environments, standardized system states, and documented background loads. When full control is impractical, as in distributed or cloud-based benchmarking, conditions should be reported in detail so that others can account for potential variations when interpreting results.
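
A simple check for thermal effects is to compare throughput at the start and end of a sustained run. The sketch below assumes throughput is logged at fixed intervals; the sample values, the three-sample window, and the 10% tolerance are all illustrative assumptions.

```python
import statistics


def sustained_ratio(throughputs, window=3):
    """Mean throughput of the last window divided by that of the first window."""
    if len(throughputs) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    start = statistics.fmean(throughputs[:window])
    end = statistics.fmean(throughputs[-window:])
    return end / start


# Hypothetical samples/second logged at fixed intervals during a long run.
throughput_log = [1000, 1010, 990, 900, 760, 610, 500, 505, 495]

ratio = sustained_ratio(throughput_log)
# Flag the run if sustained throughput drops more than 10% (assumed tolerance).
throttled = ratio < 0.9

print(f"sustained/initial throughput = {ratio:.2f}, throttled = {throttled}")
```

Reporting this ratio alongside peak throughput distinguishes a device that holds its performance from one that only reaches it briefly before throttling.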

\index{Hardware Lottery!architectural bias}
\index{Hooker, Sara!hardware lottery concept}
The hardware lottery[^fn-hardware-lottery] [@hooker2021hardware] presents another critical issue. The success of a machine learning model is often dictated not only by its architecture and training data but also by how well it aligns with the underlying hardware. Some models perform exceptionally well not because they are inherently superior but because they map naturally onto GPU or TPU parallel processing capabilities. Other promising architectures may be systematically overlooked because they do not fit dominant hardware platforms.

[^fn-hardware-lottery]: **Hardware Lottery**: Coined by Sara Hooker in 2021 to describe how algorithmic success depends on alignment with available hardware. The Transformer succeeded partly because its dense matrix multiplications map perfectly to GPU Tensor Cores, while graph neural networks and sparse mixture-of-experts models remain underexplored because they map poorly to current silicon. For benchmarking, this means hardware-specific leaderboards systematically favor hardware-aligned architectures, potentially obscuring algorithms that would dominate on different hardware. \index{Hardware Lottery!benchmark bias}

Hardware compatibility dependence introduces subtle but significant biases into benchmarking results. A model that is highly efficient on a specific GPU may perform poorly on a CPU or a custom AI accelerator. @fig-hw-lottery makes this hardware dependence concrete by comparing model performance across different platforms. Follow the arrow between the CPU and DSP plots: multi-hardware models show comparable results to "MobileNetV3 Large min" on both the CPU `uint8` and GPU configurations, but demonstrate significant accuracy improvements over the MobileNetV3 Large baseline when run on the EdgeTPU and DSP hardware. This reveals that the "best" model depends on the deployment target, a conclusion impossible to reach from single-platform benchmarks.

::: {#fig-hw-lottery fig-env="figure" fig-pos="htb" fig-cap="**Hardware-Dependent Accuracy**: Model performance varies significantly across hardware platforms, indicating that architectural efficiency is not solely determined by design but also by hardware compatibility. Multi-hardware models exhibit comparable accuracy to MobileNetV3 Large on CPU and GPU configurations, yet achieve substantial gains on EdgeTPU and DSP, emphasizing the importance of hardware-aware model optimization for specialized computing environments. Source: [@chu2021discovering]." fig-alt="Five scatter plots comparing model accuracy versus latency across CPU, GPU, EdgeTPU, and DSP platforms, with arrow showing MobileNetV3 gaining on EdgeTPU and DSP versus CPU and GPU."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\pgfplotsset{myaxis/.style={
  /pgf/number format/.cd,
  1000 sep={},
   legend style={at={(1.85,0.97)}, anchor=north},
   legend cell align=left,
   legend style={fill=BrownL!30,draw=BrownLine,row sep=1.1pt,
   font=\fontsize{6pt}{6}\selectfont\usefont{T1}{phv}{m}{n}},
   width=58mm,
   height=50mm,
   axis line style={thick,-latex},
   tick label style={/pgf/number format/assume math mode=true},
   yticklabel style={xshift=0mm,font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
   /pgf/number format/.cd, fixed, fixed zerofill, precision=2},
   xticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
   ylabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},align=center,yshift=-1mm},
   xlabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
   tick style={draw=black!60,thin},
   tick align=outside,
   tick pos=bottom,
   major tick length=1mm,
   title style={yshift=-4pt},
   grid=none,
   major grid style={black!60},
   x tick label style={rotate=0, anchor=north,yshift=2pt},
   ylabel={Top-1 ImageNet Acc},
   cycle list={
     {myblue,mark=*,mark size=1.5pt,line width=1pt},
     {myolive,mark=*,mark size=1.5pt,line width=1pt},
     {mygreen,mark=*,mark size=1.5pt,line width=1pt},
     {myred,mark=*,mark size=1.5pt,line width=1pt},
     {mypurple,mark=*,mark size=1.5pt,line width=1pt},
     {myorange,mark=*,mark size=1.5pt,line width=1pt},
     {black,mark=triangle*,mark size=2.5pt,line width=1pt},
     {mybrown,mark=triangle*,mark size=2.5pt,line width=1pt}
  }
    }}

%LEFT
 \begin{scope}[local bounding box=GR1,shift={(0,0)}]
\begin{axis}[myaxis,
  xmin=7,
  xmax=115,
  xtick={25,50,75,100},
  ymin=0.6912, ymax=0.783,
  ytick={0.70,0.72,...,0.78},
  xlabel={Pixel4 CPU Float latency},
]
%blue
\addplot+[] coordinates {(24.2,0.711)(37.1,0.736)(55,0.752)};
%olive
\addplot+[] coordinates {(18,0.703)(25.3,0.733)(39,0.753)};
%green
\addplot+[] coordinates {(14,0.731)(20.3,0.754)(30.5,0.766)};
%red
\addplot+[] coordinates {(12,0.695)(17.3,0.725)(27,0.748)};
%purple
\addplot+[] coordinates {(48,0.743)(60,0.762)(109.5,0.779)};
%orange
\addplot+[] coordinates {(20.5,0.7245)(28,0.75)(41.5,0.765)};
%black
\addplot+[] coordinates {(20.5,0.7345)(25,0.748)(35.5,0.758)};
%brown
\addplot+[] coordinates {(26,0.748)(31,0.759)(45.5,0.769)};
\coordinate(X)at(axis cs: 20.3,0.754);
\end{axis}
\end{scope}
%above center
\begin{scope}[local bounding box=GR2,shift={(5.7,0)}]
\begin{axis}[myaxis,
  xmin=5.4,
  xmax=34.4,
  xtick={10,20,30},
  ymin=0.691, ymax=0.782,
  ytick={0.70,0.72,...,0.78},
  xlabel={Pixel4 CPU Uint8 latency}
]
%blue
\addplot+[] coordinates {(9.0,0.711)(12.8,0.736)(18.2,0.751)};
%olive
\addplot+[] coordinates {(8.5,0.703)(11.5,0.733)(16.6,0.753)};
%green
\addplot+[] coordinates {(9.5,0.731)(13.3,0.753)(17.9,0.7655)};
%red
\addplot+[] coordinates {(6.8,0.695)(8.7,0.726)(12.6,0.749)};
%purple
\addplot+[] coordinates {(16.1,0.743)(19.5,0.762)(33.0,0.7785)};
%orange
\addplot+[] coordinates {(11.3,0.7245)(14.8,0.749)(20.2,0.765)};
%black
\addplot+[] coordinates {(10.2,0.7345)(11.5,0.749)(14.8,0.758)};
%brown
\addplot+[] coordinates {(12.3,0.748)(13.9,0.758)(18.4,0.769)};
\end{axis}
\end{scope}

%above right
\begin{scope}[local bounding box=GR3,shift={(11.4,0)}]
\begin{axis}[myaxis,
  xticklabel style={xshift=0mm,font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
  /pgf/number format/.cd, fixed, fixed zerofill, precision=1},
  xmin=2.2,
  xmax=12.9,
  xtick={2.5,5.0,7.5,10.0,12.5},
  ymin=0.691, ymax=0.782,
  ytick={0.70,0.72,...,0.78},
  xlabel={Pixel4 GPU Adreno 640 latency}
]
%blue
\addplot+[] coordinates {(3.7,0.711)(4.8,0.7355)(7.1,0.751)};
%olive
\addplot+[] coordinates {(3.4,0.703)(4.4,0.7323)(5.7,0.7525)};
%green
\addplot+[] coordinates {(4.75,0.731)(5.62,0.753)(7.29,0.7655)};
%red
\addplot+[] coordinates {(2.7,0.695)(3.37,0.725)(4.6,0.748)};
%purple
\addplot+[] coordinates {(6.1,0.7427)(7.5,0.7615)(12.4,0.7781)};
%orange
\addplot+[] coordinates {(4.86,0.7245)(5.9,0.749)(7.82,0.764)};
%black
\addplot+[] coordinates {(3.92,0.7345)(4.4,0.748)(5.58,0.758)};
%brown
\addplot+[] coordinates {(4.73,0.747)(5.39,0.758)(6.64,0.768)};
\end{axis}
\end{scope}
%below left
\begin{scope}[local bounding box=GR4,shift={(0,-5)}]
\begin{axis}[myaxis,
  xticklabel style={xshift=0mm,font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
  /pgf/number format/.cd, fixed, fixed zerofill, precision=1},
  xmin=1.85,
  xmax=3.59,
  xtick={2.0,2.5,3.0,3.5},
  ymin=0.691, ymax=0.782,
  ytick={0.70,0.72,...,0.78},
  xlabel={Pixel4 EdgeTPU latency}
]
%blue
\addplot+[] coordinates {(1.92,0.711)(2.38,0.7359)(2.845,0.7514)};
%olive
\addplot+[] coordinates {(2.03,0.703)(2.3,0.7325)(2.93,0.7525)};
%green
\addplot+[] coordinates {(1.942,0.6947)(2.105,0.7253)(2.58,0.749)};
%red
\addplot+[] coordinates {(1.942,0.6947)(2.105,0.7253)(2.58,0.749)};
%purple
\addplot+[] coordinates {(2.34,0.7425)(2.67,0.7615)(3.495,0.77844)};
%orange
\addplot+[] coordinates {(2.6,0.7245)(3.09,0.7484)(3.42,0.764)};
%black
\addplot+[] coordinates {(2.08,0.734)(2.21,0.748)(2.44,0.7577)};
%brown
\addplot+[] coordinates {(2.315,0.747)(2.4,0.7575)(2.9,0.7676)};
\end{axis}
\end{scope}
%below right
\begin{scope}[local bounding box=GR5,shift={(5.7,-5)}]
\begin{axis}[myaxis,
  xmin=2.35,
  xmax=6.35,
  xtick={3,4,5,6},
  ymin=0.691, ymax=0.782,
  ytick={0.70,0.72,...,0.78},
  xlabel={Pixel4 DSP Qualcomm Snapdragon 855 latency}
]
%blue
\addplot+[] coordinates {(2.52,0.711)(3.05,0.736)(3.72,0.751)};
\addlegendentry{Mobilenet V1}
%olive
\addplot+[] coordinates {(3.3,0.703)(3.84,0.733)(4.97,0.7525)};
\addlegendentry{Mobilenet V2}
%green
\addplot+[] coordinates {(3.95,0.731)(4.5,0.753)(5.15,0.7652)};
\addlegendentry{Mobilenet V3 Large}
%red
\addplot+[] coordinates {(2.92,0.6945)(3.29,0.725)(3.81,0.7488)};
\addlegendentry{Mobilenet V3 Large min}
%purple
\addplot+[] coordinates {(3.82,0.7425)(4.29,0.7615)(6.14,0.7781)};
\addlegendentry{Mobilenet-EdgeTPU}
%orange
\addplot+[] coordinates {(3.54,0.7245)(3.885,0.7485)(5.06,0.764)};
\addlegendentry{ProxylessNAS-Mobile}
%black
\addplot+[] coordinates {(3.08,0.7341)(3.377,0.748)(4.05,0.762)};
\addlegendentry{Multi-MAX}
%brown
\addplot+[] coordinates {(3.6,0.747)(3.84,0.759)(4.52,0.7675)};
\addlegendentry{Multi-AVG}
\coordinate(Y)at(axis cs: 4.5,0.753);
\end{axis}
\end{scope}
\draw[VioletLine!60,-{Triangle[width=8pt,length=13pt]}, line width=3pt,
shorten <=2pt](X)--(Y);
\end{tikzpicture}
```
:::

Without careful benchmarking across diverse hardware configurations, the field risks favoring architectures that "win" the hardware lottery rather than selecting models based on their intrinsic strengths. This bias can shape research directions, influence funding allocation, and impact the design of next-generation AI systems. In extreme cases, it may even stifle innovation by discouraging exploration of alternative architectures that do not align with current hardware trends.
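
The platform dependence can be reproduced from the plotted points themselves. The sketch below copies the operating points for three model families from the CPU float and DSP panels of @fig-hw-lottery and asks which family reaches a target accuracy at the lowest latency on each platform. The 0.75 accuracy target and the `fastest_at_accuracy` helper are assumptions made for illustration.

```python
# (latency, top-1 accuracy) points copied from two panels of the figure;
# latency units follow the figure's axes.
points = {
    "cpu_float": {
        "Mobilenet V3 Large": [(14, 0.731), (20.3, 0.754), (30.5, 0.766)],
        "Multi-MAX": [(20.5, 0.7345), (25, 0.748), (35.5, 0.758)],
        "Multi-AVG": [(26, 0.748), (31, 0.759), (45.5, 0.769)],
    },
    "dsp": {
        "Mobilenet V3 Large": [(3.95, 0.731), (4.5, 0.753), (5.15, 0.7652)],
        "Multi-MAX": [(3.08, 0.7341), (3.377, 0.748), (4.05, 0.762)],
        "Multi-AVG": [(3.6, 0.747), (3.84, 0.759), (4.52, 0.7675)],
    },
}


def fastest_at_accuracy(families, min_accuracy):
    """Return (family, latency) of the lowest-latency point meeting the target."""
    candidates = [
        (latency, family)
        for family, family_points in families.items()
        for latency, accuracy in family_points
        if accuracy >= min_accuracy
    ]
    if not candidates:
        return None
    latency, family = min(candidates)
    return family, latency


# Assumed accuracy target of 0.75 top-1.
winners = {hw: fastest_at_accuracy(fams, 0.75) for hw, fams in points.items()}
for hardware, winner in winners.items():
    print(hardware, "->", winner)
```

With this target, Mobilenet V3 Large is the fastest qualifying model on the CPU float panel, while Multi-AVG is the fastest on the DSP panel. The same selection rule picks a different winner once the hardware changes.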

### Organizational & Strategic Issues {#sec-benchmarking-organizational-strategic-issues-d25a}

The preceding limitations arise from technical challenges: statistical noise, deployment misalignment, environmental variance, and hardware compatibility. A fourth category emerges from human factors—and these may be the hardest to mitigate because they involve incentives rather than instrumentation. Competitive pressures and research incentives create systematic biases in how benchmarks are used and interpreted. These organizational dynamics require governance mechanisms and community standards to maintain benchmark integrity.

#### Benchmark Engineering {#sec-benchmarking-benchmark-engineering-3af4}

\index{Benchmark Engineering!intentional optimization bias}
While the hardware lottery is an unintended consequence of hardware trends, benchmark engineering is an intentional practice where models or systems are explicitly optimized to excel on specific benchmark tests. This practice can lead to misleading performance claims and results that do not generalize beyond the benchmarking environment.

Benchmark engineering occurs when AI developers fine-tune hyperparameters, preprocessing techniques, or model architectures specifically to maximize benchmark scores rather than improve real-world performance. The distinction between legitimate optimization and benchmark engineering is often blurry: when does "tuning for ImageNet" become "overfitting to ImageNet"? For example, an object detection model might be carefully optimized to achieve record-low latency on a benchmark but fail when deployed in dynamic, real-world environments with varying lighting, motion blur, and occlusions. Similarly, a language model might be tuned to excel on benchmark datasets but struggle when processing conversational speech with informal phrasing and code-switching.

The pressure to achieve high benchmark scores is often driven by competition, marketing, and research recognition. Benchmarks are frequently used to rank AI models and systems, creating an incentive to optimize specifically for them. While this can drive technical advancements, it also risks prioritizing benchmark-specific optimizations at the expense of broader generalization—precisely the Goodhart's Law dynamic introduced in @sec-benchmarking-machine-learning-benchmarking-framework-70b8 and illustrated with the BLEU-score example in @sec-benchmarking-ml-measurement-challenges-60ea.

#### Bias and Over-Optimization {#sec-benchmarking-bias-overoptimization-6c65}

\index{Benchmark Overfitting!over-optimization to test sets}
Several strategies can ensure that benchmarks remain useful and fair. Transparency is paramount: benchmark submissions should include detailed documentation on any optimizations applied, ensuring that improvements are clearly distinguished from benchmark-specific tuning. Researchers and developers should report both benchmark performance and real-world deployment results to provide a complete picture of a system's capabilities. Diversifying evaluation methodologies provides additional protection. Instead of relying on a single static benchmark, AI systems should be evaluated across multiple, continuously updated benchmarks that reflect real-world complexity. This reduces the risk of models being overfitted to a single test set and encourages general-purpose improvements rather than narrow optimizations.

Standardization and third-party verification can also help mitigate bias. By establishing industry-wide benchmarking standards and requiring independent third-party audits of results, the AI community can improve the reliability and credibility of benchmarking outcomes. Third-party verification ensures that reported results are reproducible across different settings and helps catch benchmark-specific tuning, whether intentional or not. Complementing controlled evaluations, application-specific testing remains essential: AI models should be assessed not only on benchmark datasets but also in practical deployment environments. An autonomous driving model, for instance, should be tested in a variety of weather conditions and urban settings rather than being judged solely on controlled benchmark datasets. Finally, benchmarks should test AI models on multiple hardware configurations to ensure that performance is not being driven solely by compatibility with a specific platform, reducing the risk of the hardware lottery.
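
When results from several benchmarks are combined into one score, the choice of aggregate affects how much narrow tuning is rewarded. The sketch below compares an arithmetic mean with a geometric mean over hypothetical speedups; the geometric mean is a common convention for averaging ratios and is introduced here as an illustration, not as something this chapter's benchmarks prescribe.

```python
import statistics

# Hypothetical speedups over a shared baseline on four benchmarks.
tuned_for_one = [4.0, 1.0, 1.0, 1.0]  # large gain on a single benchmark
broad = [1.5, 1.5, 1.5, 1.5]          # modest gain on every benchmark

summary = {}
for name, speedups in [("tuned_for_one", tuned_for_one), ("broad", broad)]:
    summary[name] = {
        "arithmetic": statistics.fmean(speedups),
        "geometric": statistics.geometric_mean(speedups),
        "worst_case": min(speedups),
    }
    print(name, summary[name])
```

The arithmetic mean favors the system tuned for a single benchmark, while the geometric mean and the worst-case score both favor the system that improves everywhere. Reporting the worst case alongside any aggregate makes narrow optimization harder to hide.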

#### Benchmark Evolution {#sec-benchmarking-benchmark-evolution-b523}

A persistent challenge in benchmarking is that benchmarks are rarely static. As AI systems evolve, so must the benchmarks that evaluate them. What defines "good performance" today may be less relevant tomorrow as models, hardware, and application requirements change. While benchmarks are essential for tracking progress, they can also become outdated, leading to over-optimization for old metrics rather than real-world performance improvements.

\index{SuperGLUE!successor benchmark}
\index{HELM!holistic LLM evaluation}
This evolution is evident in the history of AI benchmarks. Early model benchmarks, for instance, focused heavily on image classification and object detection, as these were some of the first widely studied deep learning tasks. However, as AI expanded into natural language processing, recommendation systems, and generative AI, it became clear that these early benchmarks no longer reflected the most important challenges in the field. In response, new benchmarks emerged to measure language understanding [@wang2018glue; @wang2019superglue] and generative AI [@liang2022helm].

Benchmark evolution extends beyond the addition of new tasks to encompass new dimensions of performance measurement. While traditional AI benchmarks emphasized accuracy and throughput, modern applications demand evaluation across multiple criteria: fairness, robustness, scalability, and energy efficiency. @fig-sciml-graph makes these disparate requirements concrete by mapping scientific applications across data rate and computation time. The visualization reveals a striking pattern: Large Hadron Collider sensors must process data at rates approaching 10$^{14}$ bytes per second with nanosecond-scale computation times, while mobile applications operate at 10$^{4}$ bytes per second with longer computational windows, a span of roughly ten orders of magnitude in data rate and several orders of magnitude in computation time. This range of requirements necessitates specialized benchmarks. For example, edge AI applications require benchmarks like MLPerf that specifically evaluate performance under resource constraints, and scientific application domains need their own "Fast ML for Science" benchmarks [@duarte2022fastml].

::: {#fig-sciml-graph fig-env="figure" fig-pos="htb" fig-cap="**Performance Spectrum**: Scientific applications and edge devices demand vastly different computational resources, spanning multiple orders of magnitude in data rates and latency requirements. Consequently, traditional benchmarks focused solely on accuracy are insufficient; specialized evaluation metrics and benchmarks like MLPerf become essential for optimizing AI systems across diverse deployment scenarios. Source: [@duarte2022fastml]." fig-alt="Log-scale scatter plot of data rate versus computation time, showing scientific applications from LHC sensors at 10^14 B/s and nanoseconds to mobile devices at 10^4 B/s and seconds."}
```{.tikz}
\scalebox{0.9}{%
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\pgfplotsset{
  errorplot/.style n args={1}{
    scatter,
    line width=0.75pt,
    only marks,
    mark=none,
    error bars/.cd,
      x dir=both, x explicit,
      y dir=both, y explicit relative,
    error bar style={#1, line width=0.75pt, solid},
    error mark options={line width=0.75pt, mark size=3pt, rotate=90}
  }
}
\begin{axis}[
  ymin=2, ymax=14.3,
  ytick={2,4,6,8,10,12,14},
  yticklabels={10\textsuperscript{2},10\textsuperscript{4},10\textsuperscript{6},
  10\textsuperscript{8},10\textsuperscript{10},10\textsuperscript{12},10\textsuperscript{14}},
  xmin=2, xmax=9.0,
  xtick={2,3,4,5,6,7,8,9},
  xticklabels={10\textsuperscript{-9},10\textsuperscript{-7},10\textsuperscript{-5},
  10\textsuperscript{-3},10\textsuperscript{-1},10\textsuperscript{1},10\textsuperscript{3},10\textsuperscript{5}},
   xlabel={Computation time [s]},
   ylabel={Data rate [B/s]},
   width=120mm, height=120mm,
   legend style={at={(0.7,0.3)},anchor=south west},
   grid=both
]
%LHC sensor
\addplot+[errorplot={RedLine}]
  coordinates {(2.6,13.72) +- (0.1,0.022)
}node[RedLine,pos=1,right=8pt,anchor=west]{LHC sensor};
%X-ray diffraction
\addplot+[errorplot={VioletLine}]
  coordinates {(3.82,7.52) +- (0.33,0.069)
 } node[VioletLine,pos=1,right=22pt,anchor=west]{X-ray diffraction};
%Internet-of-things
\addplot+[errorplot={OliveLine}]
   coordinates {(5.49,5.50) +- (0.5,0.18)
} node[OliveLine,pos=1,right=22pt,anchor=west]{Internet-of-things};
%Mobile devices
\addplot+[errorplot={cyan!90!black}]
   coordinates {( 5.87,4.23) +- (0.1,0.044)
}node[cyan!90!black,pos=1,right=7pt,anchor=west]{Mobile devices};
%Plasma control
\addplot+[errorplot={green!70!black}]
  coordinates { (4.0,9.51) +- (0.15,0.) [meta=a]
}node[green!70!black,pos=1,above right=12pt,anchor=north west]{Plasma control};
%LHC trigger
\addplot+[errorplot={BrownLine}]
  coordinates { (3.42,9.11) +- (0.42,0.) }
node[BrownLine,pos=1, right=17pt,anchor= west]{LHC trigger};
%Beam control
\addplot+[errorplot={OrangeLine}]
  coordinates { (4.92,4.69) +- (0.42,0.) [meta=c]
}node[OrangeLine,pos=0.1,below=2pt,anchor=north east]{Beam control};
%
\addplot+[line width=1.15pt,
  scatter,
  only marks,mark size=3.25pt,
  scatter src=explicit symbolic,
  scatter/classes={
   a={mark=+,blue}, b={mark=+,red}, c={mark=+,purple},
   d={mark=+,orange!70!black},e={mark=+,violet!60!black},f={mark=+,GreenD}
  }
]
table[meta=label, row sep=crcr]{
  x    y    label \\
  3.49 8.90 a     \\
  3.34 9.89 b     \\
  2.99 9.97 c     \\
  5.00 6.72 d     \\
  4.330 8.73 f\\
  4.50 6.50 e\\
};
\node[blue,below left=2pt and -11pt]at(axis cs:3.49,8.93){DUNE readout};
\node[red,below left=0.5pt and 0pt]at(axis cs:3.34,9.89){EIC trigger};
\node[purple,above right=2.5pt and -19pt]at(axis cs:2.99,9.97){Qubit Readout};
\node[orange!70!black,above right=1.5pt and 1pt]at(axis cs:5.00,6.72){Neuro};
\node[violet!60!black,below left=0.5pt and 0pt]at(axis cs:4.50,6.50 ){Magnet quench};
\node[GreenD,below right=0.5pt and 0pt]at(axis cs:4.330,8.78){Electron microscopy};
%
\coordinate(A)at(axis cs:2,14.3);
\coordinate(B)at(axis cs:5.33,14.3);
\coordinate(C)at(axis cs:5.33,4.69);
\coordinate(D)at(axis cs:2,4.69);
\scoped[on background layer]
\filldraw[cyan!5](A)--(B)--(C)--(D)--cycle;
\node[align=center]at(axis description cs: 0.8,0.92){\textbf{Fast ML for Science}\\benchmark tasks};
\end{axis}
\end{tikzpicture}}
```
:::

The need for evolving benchmarks also presents a challenge: stability versus adaptability. On the one hand, benchmarks must remain stable for long enough to allow meaningful comparisons over time. If benchmarks change too frequently, it becomes difficult to track long-term progress and compare new results with historical performance. On the other hand, failing to update benchmarks leads to stagnation, where models are optimized for outdated tasks rather than advancing the field. Striking the right balance between benchmark longevity and adaptation is an ongoing challenge for the AI community.

Evolving benchmarks remains essential for meaningful progress measurement. Without updates, benchmarks become detached from real-world needs, and researchers optimize for artificial test cases rather than practical challenges. The transition from ImageNet-era accuracy benchmarks to multi-dimensional evaluations spanning fairness, robustness, and energy efficiency illustrates this evolution in practice.

### MLPerf as Industry Standard {#sec-benchmarking-mlperf-industry-standard-05a1}

MLPerf synthesizes the principles discussed throughout this chapter into a single evolving framework: reference implementations and strict submission rules enforce reproducibility, deployment-specific suites (Inference, Mobile, Client, Tiny) align with the three-dimensional evaluation framework, and regular task updates (including generative AI and energy-efficient computing) prevent benchmark stagnation. This combination of standardization, deployment breadth, and adaptability has made MLPerf the authoritative benchmark for guiding both AI research and production hardware decisions.

### Benchmark Gaming: The Cat and Mouse Game {#sec-benchmarking-gaming}

In the Hennessy & Patterson tradition of quantitative systems, we must acknowledge that benchmarks are not just measurements; they are targets. As Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." In the high-stakes world of AI hardware, this manifests as **Benchmark Gaming**\index{Benchmark Gaming!distorting performance metrics}: the practice of optimizing hardware or compilers specifically for the benchmark's unique characteristics, rather than for real-world performance.

Common "cheating" techniques in ML benchmarking include:

*   **Precision Dropping**: Compilers may silently reduce precision (e.g., from FP32 to BF16) only during the benchmark run to inflate throughput, even if the user did not request it.
*   **Operator Removal**: A compiler might identify that a benchmark only cares about top-1 accuracy and "optimize out" the activation functions or layer norms if they do not affect that specific metric, yielding unrealistic speedups.
*   **Weight Pre-Loading**: Hardcoding the benchmark model's weights into the chip's on-chip SRAM, bypassing the "Memory Wall" bottlenecks that real production models must face.

MLPerf prevents this gaming through its **Reference vs. Submission** validation. Every submitter must run the *exact same* model structure and reach a *verifiable accuracy target* (e.g., 75.9% on ImageNet) to qualify. If a compiler "cheats" by dropping precision or removing operators, the accuracy check fails, and the result is disqualified. This "Accuracy Guardrail" transforms a simple speed test into a rigorous engineering benchmark, forcing vendors to optimize for the **Silicon Contract** rather than gaming the numbers.
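
The guardrail logic can be sketched in a few lines. This is an illustrative sketch only, not MLPerf's actual compliance tooling: the `Submission` record, the `qualifies` and `leaderboard` helpers, and the sample throughput and accuracy numbers are all hypothetical. Only the 75.9% target comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    """Hypothetical record of one benchmark submission."""
    name: str
    throughput: float  # samples per second
    accuracy: float    # top-1 accuracy in percent

def qualifies(sub, target_accuracy=75.9):
    """A result counts only if it reaches the accuracy target."""
    return sub.accuracy >= target_accuracy

def leaderboard(subs, target_accuracy=75.9):
    """Rank qualifying submissions by throughput; drop the rest."""
    valid = [s for s in subs if qualifies(s, target_accuracy)]
    return sorted(valid, key=lambda s: s.throughput, reverse=True)

# Hypothetical submissions: the faster one lost accuracy to reduced precision.
subs = [
    Submission("honest-fp16", throughput=30_000, accuracy=76.1),
    Submission("silent-low-precision", throughput=45_000, accuracy=74.8),
]
ranked = leaderboard(subs)
print([s.name for s in ranked])
```

The faster submission never appears in the ranking: speed is compared only among results that first clear the quality bar.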

Yet even the most rigorous system benchmarks validate only one dimension of deployment readiness. A system achieving record throughput and efficiency on MLPerf says nothing about whether the model it runs is accurate on real-world inputs, or whether the data it was trained on represents the population it will serve. Hardware that delivers promised TFLOPS is necessary but insufficient; the model running on that hardware must preserve the quality users depend on, and the data that shaped that model must represent the world it will encounter. Completing the validation stack requires turning from hardware to the model and data dimensions of our three-dimensional framework.

## Model and Data Evaluation {#sec-benchmarking-model-data-benchmarking-e0ca}

System benchmarks can confirm that hardware delivers promised training throughput, inference latency, and power efficiency. Hardware validation alone, however, cannot ensure deployment success. The optimization pipeline from Part III also included model compression (@sec-model-compression) and data selection (@sec-data-selection), each requiring its own validation — a compressed model running on accelerated hardware trained on biased data will fail despite excellent system benchmarks. The remaining two dimensions of the framework address this gap: model benchmarks verify that compression preserved accuracy and critical model properties, while data benchmarks verify that training data enables robust generalization.

### Model Benchmarking {#sec-benchmarking-model-benchmarking-4847}

\index{Model Benchmarking!multi-dimensional evaluation}
Model benchmarks validate whether compression techniques from @sec-model-compression preserved the properties that matter for deployment. This extends beyond top-line accuracy. A pruned model might maintain ImageNet accuracy while losing robustness to adversarial inputs. A quantized model might preserve average-case performance while degrading on rare but critical edge cases. A distilled model might match the teacher's accuracy while losing calibration. Historically, benchmarks focused almost exclusively on accuracy, but compression makes multi-dimensional evaluation essential.

\index{ImageNet!ILSVRC challenge progression}
Recall from @fig-imagenet-gpus that ImageNet error rates plummeted alongside GPU adoption. @fig-imagenet-challenge adds the architectural milestones to that same progression, tracing top-5 error from 28.2% in 2010 to 3.57% by 2015 on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [@russakovsky2015imagenet]. The introduction of AlexNet[^fn-bench-alexnet] in 2012 reduced the error rate from 25.8% to 16.4%. Subsequent models (ZFNet, VGGNet, GoogleNet, and ResNet[^fn-bench-resnet]) continued this trend. This progression established the baselines against which model compression techniques are evaluated: a pruned ResNet must demonstrate how much accuracy it sacrifices for a given efficiency gain.

[^fn-bench-alexnet]: **AlexNet**: The 8-layer CNN (60M parameters) that cut ImageNet top-5 error from 25.8% to 16.4% in 2012, trained on two GTX 580 GPUs with 3 GB memory each. AlexNet established the benchmarking paradigm that persists today: accuracy on a fixed dataset as the primary metric, with hardware configuration as a secondary specification. Every subsequent ImageNet result is implicitly compared against this baseline. \index{AlexNet!benchmarking paradigm}

[^fn-bench-resnet]: **ResNet**: Introduced by Kaiming He et al. in 2015, ResNet used skip connections to train networks of 152+ layers and achieved 3.57% top-5 ImageNet error (ensemble), surpassing the estimated 5.1% human error rate. ResNet-50 became the de facto MLPerf Training reference model because its moderate size (25.6M parameters) and well-understood compute profile (3.8 GFLOPs per image) make it sensitive to both hardware and software optimizations without requiring multi-node setups. \index{ResNet!MLPerf reference model}

::: {#fig-imagenet-challenge fig-env="figure" fig-pos="htb" fig-cap="**ImageNet Challenge Progression**: Neural networks have reduced error rates from 28.2% in 2010 to 3.57% by 2015, highlighting the impact of architectural advancements on classification accuracy. These milestones establish the baselines against which compression techniques are evaluated." fig-alt="Line graph showing ImageNet top-5 error decreasing from 28.2% in 2010 to 3.57% in 2015, with model labels marking AlexNet, ZFNet, VGGNet, GoogleNet, and ResNet milestones."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ IMAGENET CHALLENGE PROGRESSION FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-imagenet-challenge — line plot of ImageNet top-5 error rate
# │          from 2010 (28.2%) to 2015 (3.57%) with model labels
# │
# │ Goal: Visualize the historical progression of ImageNet benchmarks.
# │ Show: The transition from accuracy-only to multi-dimensional evaluation.
# │ How: Plot top-5 error rates for architectural milestones (AlexNet to ResNet).
# │
# │ Imports: mlsys.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
from mlsys import viz

fig, ax, COLORS, plt = viz.setup_plot()
years = [2010, 2011, 2012, 2013, 2014, 2014, 2015]
models = ["Baseline", "Baseline", "AlexNet", "ZFNet", "VGGNet", "GoogleNet", "ResNet"]
errors = [28.2, 25.8, 16.4, 11.7, 7.3, 6.7, 3.57]

ax.plot(years, errors, color=COLORS['BlueLine'], linewidth=1.5, zorder=1)
ax.scatter(years, errors, color=COLORS['RedLine'], s=50, zorder=2, edgecolors='white')

offsets = [(5, 8), (5, 8), (5, 8), (5, 8), (-40, 8), (5, -15), (0, 8)]
for year, model, error, (ox, oy) in zip(years, models, errors, offsets):
    ax.annotate(model, (year, error), textcoords='offset points',
                xytext=(ox, oy), fontsize=9, ha='left' if ox >= 0 else 'right', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))

ax.set_ylim(0, 30)
ax.set_xlabel('Year')
ax.set_ylabel('Top-5 Error (%)')
plt.show()
```
:::

#### Accuracy Metrics and Their Blind Spots {#sec-benchmarking-accuracy-metrics-blind-spots-4dcd}

The most common model metrics—accuracy, precision, recall, F1—each reveal different aspects of model behavior while hiding others, and understanding their blind spots is essential for compression validation.

Top-k accuracy measures whether the correct label appears in the model's top k predictions. Top-1 accuracy is strict; top-5 is lenient. The gap between them reveals model uncertainty: a model with 75% top-1 but 95% top-5 accuracy "knows" the answer is among a few candidates but struggles to commit. For deployment, the acceptable gap depends on whether downstream systems can use ranked predictions or require single answers.
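
The top-1 versus top-k gap can be computed directly from per-class scores. The sketch below uses only the standard library; the `top_k_accuracy` helper and the four-example score matrix are made up for illustration.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Illustrative scores for four examples over three classes.
scores = [
    [0.1, 0.6, 0.3],  # true label 1: correct at top-1
    [0.5, 0.1, 0.4],  # true label 2: missed at top-1, found at top-2
    [0.2, 0.7, 0.1],  # true label 2: missed at both
    [0.8, 0.1, 0.1],  # true label 0: correct at top-1
]
labels = [1, 2, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1 = {top1:.2f}, top-2 = {top2:.2f}, gap = {top2 - top1:.2f}")
```

Reporting the gap alongside both numbers shows how often the model has the right answer among its candidates but ranks it below the top.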

Precision and recall matter when classes are imbalanced or errors have asymmetric costs. A fraud detection model with 99% accuracy might have 10% recall on actual fraud (catching only 1 in 10 fraudulent transactions), a catastrophic failure despite high accuracy. Precision (of predicted positives, how many are correct?) and recall (of actual positives, how many were found?) expose these failures that accuracy hides.
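
A short calculation shows how the fraud scenario arises. The confusion-matrix counts below are hypothetical, chosen so that accuracy is 99% while recall is 10%; the `classification_metrics` helper is an illustrative name, not a library function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical traffic: 10,000 transactions, 100 of them fraudulent.
# The model flags 20 transactions and is right about 10 of them.
fraud = classification_metrics(tp=10, fp=10, fn=90, tn=9_890)
for name, value in fraud.items():
    print(f"{name:>9}: {value:.3f}")
```

Because 99% of transactions are legitimate, the true negatives dominate accuracy; the 90 missed fraud cases barely move it but define recall.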

\index{Gender Shades!demographic fairness study}
\index{Buolamwini, Joy!facial recognition bias research}
Perhaps most insidiously, aggregate metrics hide subgroup failures. A model achieving 95% overall accuracy might achieve 60% on a critical demographic subgroup. The Gender Shades project [@buolamwini2018gender] revealed commercial facial recognition systems performing significantly worse on darker-skinned individuals, a disparity invisible to aggregate benchmarks. Disaggregated evaluation across deployment-relevant subgroups is essential; @sec-responsible-engineering examines fairness evaluation systematically.
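
Disaggregated evaluation amounts to grouping the same predictions by a subgroup attribute before averaging. The sketch below uses synthetic data constructed to reproduce the 95% overall and 60% subgroup figures from the paragraph above; the group names and the `accuracy_by_group` helper are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    """Accuracy computed separately for each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

# Synthetic evaluation set: 900 examples from group A, 100 from group B.
groups = ["A"] * 900 + ["B"] * 100
labels = [1] * 1000
predictions = [1] * 890 + [0] * 10 + [1] * 60 + [0] * 40

overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
per_group = accuracy_by_group(predictions, labels, groups)
print(f"overall: {overall:.3f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.3f}")
```

The minority group contributes only 10% of the examples, so its 40 errors are nearly invisible in the aggregate number.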

#### Calibration: When Confidence Scores Matter {#sec-benchmarking-calibration-confidence-scores-matter-3669}

For many deployment scenarios, *how confident* the model is matters as much as *what* it predicts. A well-calibrated[^fn-calibration] model's confidence scores correspond to actual correctness probability: when it says "90% confident," it should be correct 90% of the time.

\index{Calibration!definition and etymology}

[^fn-calibration]: **Calibration**: From Arabic *qalib* (a mold for casting metal) via Latin *calibrare*, originally describing the adjustment of measuring instruments against known standards. In ML, calibration ensures predicted probabilities match empirical frequencies. The etymology is apt: just as an uncalibrated instrument produces precise but inaccurate measurements, an uncalibrated model produces confident but unreliable predictions, causing downstream systems that threshold on confidence scores to make systematically wrong decisions. \index{Calibration!etymology}

Compression frequently degrades calibration even when preserving accuracy—a critical concern when validating the model compression techniques from @sec-model-compression. A quantized model might maintain 94% accuracy while becoming overconfident, predicting 90%+ confidence on examples it gets wrong. This occurs because quantization affects the softmax distribution's shape, compressing probability mass toward the top prediction. Post-hoc calibration techniques (temperature scaling, Platt scaling) can partially correct this, but only if calibration is measured.

\index{Expected Calibration Error!confidence-accuracy gap}
Calibration failures create downstream problems. An overconfident model lets errors bypass human review (predicted 95% confidence but wrong 30% of the time). An underconfident model triggers unnecessary review and fails to automate decisions it could handle (predicted 70% confidence but correct 95% of the time). Expected Calibration Error (ECE) measures the gap between confidence and accuracy across confidence bins; reliability diagrams visualize this correspondence.
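
ECE can be sketched as a weighted average of per-bin gaps between mean confidence and observed accuracy. The version below assumes equal-width bins with left-closed intervals, which is one common convention rather than the only one; the function name and toy inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# Ten predictions, seven of them correct.
outcomes = [1] * 7 + [0] * 3
overconfident = expected_calibration_error([0.95] * 10, outcomes)
calibrated = expected_calibration_error([0.70] * 10, outcomes)
print(f"overconfident: {overconfident:.3f}, calibrated: {calibrated:.3f}")
```

Both toy models have identical accuracy; only the confidence they attach to it differs, which is exactly what accuracy-only benchmarks cannot see.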

#### Compression Validation: The Efficiency-Quality Frontier {#sec-benchmarking-compression-validation-efficiencyquality-frontier-e9c4}

Model compression (@sec-model-compression) trades model capacity for efficiency. Validation must determine whether compression achieved an acceptable trade-off or damaged capabilities that matter.

\index{Pareto Frontier!accuracy-efficiency tradeoff}
Pareto frontier[^fn-pareto] evaluation determines whether a compressed model represents a good trade-off. Plotting accuracy against the target efficiency metric (latency, model size, energy) reveals the trade-off frontier. Models on the Pareto frontier cannot improve one metric without degrading the other; models below the frontier are dominated by better alternatives.
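
A dominance check makes the definition concrete. In this sketch every candidate is a `(name, latency_ms, accuracy)` tuple, where lower latency and higher accuracy are better; the model names and all numbers are hypothetical.

```python
def pareto_frontier(models):
    """Return the models that no other model dominates, sorted by latency."""
    frontier = []
    for name, latency, accuracy in models:
        dominated = any(
            # The other model is at least as good on both metrics...
            (o_lat <= latency and o_acc >= accuracy)
            # ...and strictly better on at least one.
            and (o_lat < latency or o_acc > accuracy)
            for _, o_lat, o_acc in models
        )
        if not dominated:
            frontier.append((name, latency, accuracy))
    return sorted(frontier, key=lambda m: m[1])

# Hypothetical candidates: (name, latency in ms, top-1 accuracy in %).
candidates = [
    ("fp32", 12.0, 71.8),
    ("int8", 4.0, 70.9),
    ("pruned-50", 8.0, 70.5),   # slower and less accurate than int8
    ("pruned-int8", 3.0, 68.0),
]
frontier = pareto_frontier(candidates)
print([name for name, _, _ in frontier])
```

The pairwise comparison is quadratic in the number of candidates, which is fine for the handful of variants a compression study typically produces.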

[^fn-pareto]: **Pareto Frontier**: Named after economist Vilfredo Pareto (1896), the frontier contains all solutions where improving one objective requires degrading another. In compression benchmarking, the frontier's shape carries diagnostic information. With accuracy on the vertical axis and cost (size, latency, energy) on the horizontal axis, a flat region means efficiency gains come cheaply (compress here), while a steep region means further compression costs disproportionate accuracy (stop here). Points below the frontier are strictly dominated and represent wasted capacity. \index{Pareto Frontier!compression trade-off}

Different compression techniques fail in different ways. Quantization (reducing numerical precision) typically preserves average-case performance while degrading on inputs near decision boundaries—exactly the edge cases that often matter most. Pruning (removing weights or structures) loses capacity for rare features—potentially fine for common cases but catastrophic for tail scenarios. Distillation (training smaller models to mimic larger ones) can lose calibration even when matching accuracy. Validation must probe these specific failure modes, not just measure aggregate accuracy.

Acceptable degradation depends on deployment context. A 2% accuracy drop might be acceptable for a recommendation system (users tolerate imperfect suggestions) but unacceptable for medical diagnosis (each error has significant consequences). Define accuracy thresholds before compression, then validate against them. Our *MobileNet INT8 compression* lighthouse illustrates the complete validation protocol.

```{python}
#| label: mobilenet-int8-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILENET INT8 COMPRESSION METRICS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "MobileNet INT8 Compression" — complete validation
# │          protocol for INT8 quantization of MobileNetV2
# │
# │ Goal: Demonstrate the necessity of multi-dimensional compression benchmarks.
# │ Show: That INT8 quantization preserves accuracy while yielding 4× size reduction.
# │ How: Compare accuracy, ECE, and model size for FP32 vs. INT8 MobileNetV2.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: mv2_acc_fp32_str, mv2_params_m_str, mv2_size_fp32_mb_str,
# │          mv2_size_int8_mb_str, mv2_acc_int8_str, mv2_acc_drop_str,
# │          mv2_top5_fp32_str, mv2_top5_int8_str, mv2_ece_fp32_str,
# │          mv2_ece_int8_str, mv2_edge_fp32_str, mv2_edge_int8_str,
# │          mv2_edge_drop_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check

class MobileNetINT8Calc:
    """MobileNetV2 FP32 vs INT8: aggregate accuracy holds but calibration and edge cases degrade."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    acc_fp32 = 71.8
    params_m = 3.5
    size_fp32_mb = 14.0
    size_int8_mb = 3.5                                                          # 14 / 4
    acc_int8 = 70.9
    top5_fp32 = 91.0
    top5_int8 = 90.4
    ece_fp32 = 0.031
    ece_int8 = 0.089
    edge_fp32 = 68.2
    edge_int8 = 61.4
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    acc_drop = acc_fp32 - acc_int8
    edge_drop = edge_fp32 - edge_int8
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    mv2_acc_fp32_str = fmt(acc_fp32, precision=1, commas=False)
    mv2_params_m_str = fmt(params_m, precision=1, commas=False)
    mv2_size_fp32_mb_str = fmt(size_fp32_mb, precision=1, commas=False)
    mv2_size_int8_mb_str = fmt(size_int8_mb, precision=1, commas=False)
    mv2_acc_int8_str = fmt(acc_int8, precision=1, commas=False)
    mv2_acc_drop_str = fmt(acc_drop, precision=1, commas=False)                # 0.9
    mv2_top5_fp32_str = fmt(top5_fp32, precision=1, commas=False)
    mv2_top5_int8_str = fmt(top5_int8, precision=1, commas=False)
    mv2_ece_fp32_str = fmt(ece_fp32, precision=3, commas=False)
    mv2_ece_int8_str = fmt(ece_int8, precision=3, commas=False)
    mv2_edge_fp32_str = fmt(edge_fp32, precision=1, commas=False)
    mv2_edge_int8_str = fmt(edge_int8, precision=1, commas=False)
    mv2_edge_drop_str = fmt(edge_drop, precision=0, commas=False)              # ~7%

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
mv2_acc_fp32_str = MobileNetINT8Calc.mv2_acc_fp32_str
mv2_params_m_str = MobileNetINT8Calc.mv2_params_m_str
mv2_size_fp32_mb_str = MobileNetINT8Calc.mv2_size_fp32_mb_str
mv2_size_int8_mb_str = MobileNetINT8Calc.mv2_size_int8_mb_str
mv2_acc_int8_str = MobileNetINT8Calc.mv2_acc_int8_str
mv2_acc_drop_str = MobileNetINT8Calc.mv2_acc_drop_str
mv2_top5_fp32_str = MobileNetINT8Calc.mv2_top5_fp32_str
mv2_top5_int8_str = MobileNetINT8Calc.mv2_top5_int8_str
mv2_ece_fp32_str = MobileNetINT8Calc.mv2_ece_fp32_str
mv2_ece_int8_str = MobileNetINT8Calc.mv2_ece_int8_str
mv2_edge_fp32_str = MobileNetINT8Calc.mv2_edge_fp32_str
mv2_edge_int8_str = MobileNetINT8Calc.mv2_edge_int8_str
mv2_edge_drop_str = MobileNetINT8Calc.mv2_edge_drop_str
```

::: {.callout-lighthouse title="MobileNet INT8 Compression"}
Returning to our MobileNet lighthouse example, consider the complete validation protocol for INT8 quantization:

**Pre-compression baseline**: MobileNetV2 achieves `{python} mv2_acc_fp32_str`% top-1 accuracy on ImageNet at `{python} mv2_params_m_str`M parameters (`{python} mv2_size_fp32_mb_str` MB FP32).

**Post-compression metrics** (INT8 quantization to `{python} mv2_size_int8_mb_str` MB):

| **Metric**             |                      **FP32** |                      **INT8** | **Acceptable?**                        |
|:-----------------------|------------------------------:|------------------------------:|:---------------------------------------|
| **Top-1 accuracy**     |  `{python} mv2_acc_fp32_str`% |  `{python} mv2_acc_int8_str`% | ✓ (`{python} mv2_acc_drop_str`% drop)  |
| **Top-5 accuracy**     | `{python} mv2_top5_fp32_str`% | `{python} mv2_top5_int8_str`% | ✓                                      |
| **Calibration ECE**    |   `{python} mv2_ece_fp32_str` |   `{python} mv2_ece_int8_str` | ⚠ (degraded)                           |
| **Edge-case accuracy** | `{python} mv2_edge_fp32_str`% | `{python} mv2_edge_int8_str`% | ⚠ (`{python} mv2_edge_drop_str`% drop) |

**Understanding ECE (Expected Calibration Error)**: ECE measures whether predicted confidence matches actual accuracy. When a model predicts "90% confident," it should be correct 90% of the time. Interpretation thresholds:

- ECE < 0.05: Well-calibrated; confidence scores are reliable for threshold-based decisions
- 0.05 ≤ ECE ≤ 0.10: Moderate calibration; use confidence scores with caution
- ECE > 0.10: Poorly calibrated; confidence scores are unreliable

The INT8 model's ECE of `{python} mv2_ece_int8_str` sits near the top of the moderate range: confidence scores are becoming unreliable for automated decision thresholds.

**Edge-case definition**: Images with >50% occlusion, <100 lux lighting, or >30° rotation from training distribution (approximately 5% of real-world inputs).

**What this reveals**: Average-case accuracy looks acceptable (a `{python} mv2_acc_drop_str` percentage point drop), but calibration degraded significantly and edge-case accuracy dropped `{python} mv2_edge_drop_str` percentage points. If the deployment context uses confidence thresholds (e.g., "only act if confidence > 85%") or encounters many edge cases (unusual lighting, partial occlusions), INT8 MobileNet may fail despite passing aggregate benchmarks.

**The fix**: Apply temperature scaling post-hoc to restore calibration. Temperature scaling learns a single scalar $T$ to divide logits before softmax: $\text{softmax}(z_i / T)$. Typical values: $T = 1.5$–$2.5$ for quantized models. In parallel, add edge-case examples to the test set to monitor that specific failure mode continuously.
:::
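
The temperature-scaling fix from the callout can be sketched without any ML framework. This minimal version assumes a held-out validation set of logits and labels and picks $T$ by grid search over the validation negative log-likelihood; practical implementations usually optimize $T$ with a gradient-based method instead. The toy logits and helper names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T (T > 1 softens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at temperature T."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels):
    """Pick T from a grid spanning 0.50 to 4.00 by minimizing validation NLL."""
    grid = [0.5 + 0.05 * i for i in range(71)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

# Toy validation set: the model always favors class 0 by a wide margin,
# but class 0 is the true label only 7 times out of 10 (overconfident).
val_logits = [[4.0, 0.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

T = fit_temperature(val_logits, val_labels)
before = softmax(val_logits[0])[0]
after = softmax(val_logits[0], T)[0]
print(f"T = {T:.2f}, top confidence {before:.2f} -> {after:.2f}")
```

On this toy data the fitted temperature lands near 2.6, pulling the top confidence from about 0.96 down to about 0.70, in line with the observed 70% accuracy. Dividing all logits by the same positive scalar never changes which class ranks first, so accuracy is unaffected; only the confidence values move.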

The Lottery Ticket Hypothesis (@sec-model-compression-lottery-ticket-hypothesis-1b3d) provides concrete benchmarking data illustrating what Pareto-efficient compression looks like. Through iterative pruning, researchers discovered that sparse subnetworks ("winning tickets") can match dense model performance: ResNet-18 subnetworks at 10–20% of original size achieve 93.2% accuracy versus 94.1% for the full model on CIFAR-10—a 0.9 percentage point drop for 80–90% size reduction [@frankle2019lottery]. BERT-base winning tickets retain 97% of original performance with 90% fewer parameters, requiring 5--8$\times$ less training time to converge.

These numbers reveal the shape of compression trade-offs: the ResNet result shows diminishing returns (the least important 80–90% of parameters contribute only 0.9 percentage points of accuracy), while BERT demonstrates that aggressive pruning can preserve nearly all capability for the right architecture. Compression validation should establish similar trade-off curves for each specific model and task, identifying where the model sits on the Pareto frontier and whether further compression yields meaningful efficiency gains or merely degrades quality.

#### Large Language Model Benchmarks {#sec-benchmarking-large-language-model-benchmarks-4ec7}

The compression evaluation framework applies well to models with well-defined accuracy metrics: classification accuracy, detection mAP, segmentation IoU. Large language models present unique benchmarking challenges because their outputs have no single ground truth. Evaluation must assess open-ended generation quality, factual accuracy, reasoning capability, and safety, all dimensions that resist simple quantification.

\index{MMLU!multitask language understanding}
Knowledge benchmarks like MMLU[^fn-mmlu] (Massive Multitask Language Understanding) evaluate factual knowledge across 57 subjects. Scores range from 25% (random guessing) to 90%+ for frontier models; human experts average 89.8%. However, MMLU's multiple-choice format tests recognition rather than generation, potentially overestimating real-world capability since a model might select the correct answer from options while being unable to produce it unprompted.

[^fn-mmlu]: **MMLU (Massive Multitask Language Understanding)**: Introduced by Hendrycks et al. with 15,908 multiple-choice questions across 57 subjects. MMLU's benchmarking limitation is its format: multiple-choice recognition is fundamentally easier than open-ended generation, so models scoring 80%+ on MMLU may perform substantially worse on equivalent free-form questions, creating a gap between benchmark scores and production capability. \index{MMLU!format limitation}

Holistic benchmarks like HELM[^fn-helm] (Holistic Evaluation of Language Models) address single-metric limitations by evaluating across 7 dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. This reveals trade-offs invisible to accuracy-only evaluation; a model achieving high accuracy may exhibit poor calibration or elevated toxicity. The same multi-dimensional principle from classification (accuracy alone is insufficient) applies with greater force to generative models.

[^fn-helm]: **HELM (Holistic Evaluation of Language Models)**: Stanford's 2022 evaluation framework testing 30+ models across 7 dimensions (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). HELM's contribution is methodological: by evaluating models that score similarly on accuracy but diverge on calibration or toxicity, it demonstrates that single-metric leaderboards systematically hide failure modes that matter for production deployment. \index{HELM!multi-dimensional evaluation}

Generation-specific metrics capture properties absent from discriminative benchmarks:

\index{Perplexity!language model prediction measure}
- **Perplexity**[^fn-perplexity] measures how well a model predicts held-out text (lower is better). A perplexity of 10 means the model is "10-way confused" on average. Useful for comparing models on the same corpus, but does not directly measure generation quality.

\index{Perplexity!etymology}

[^fn-perplexity]: **Perplexity**: From Latin *perplexus* (entangled); in information theory, $2^{H}$, where $H$ is the model's average cross-entropy in bits per token on the evaluation text. The systems consequence: perplexity correlates with KV cache memory pressure because high-perplexity (uncertain) models assign probability mass across more tokens, requiring the serving system to track more candidates during beam search or nucleus sampling. Lower perplexity enables more aggressive memory optimization. \index{Perplexity!KV cache implication}

\index{First-Token Latency!LLM responsiveness}
- **First-token latency** (time to first generated token) determines user-perceived responsiveness for interactive applications. It is dominated by prompt processing, so it grows with input length.

```{python}
#| label: llm-throughput-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LLM THROUGHPUT CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Prose bullet on "Tokens per second" in the LLM benchmarking
# │          metrics discussion
# │
# │ Goal: Convert abstract token rates into tangible wall-clock times.
# │ Show: That a 4× throughput gain reduces response time from 30s to 7.5s.
# │ How: Calculate response duration for a 750-token payload across token rates.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: slow_str, fast_str, response_tokens_str, slow_toks_str,
# │          fast_toks_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt_percent, fmt, check

class LLMThroughputCalc:
    """25 vs 100 tok/s on a 750-token response: 30 s vs 7.5 s — a 4× user-perceived difference."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    response_tokens = 750                                                       # ~500 words
    slow_toks = 25
    fast_toks = 100
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    slow_sec = response_tokens / slow_toks
    fast_sec = response_tokens / fast_toks
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    slow_str = fmt(slow_sec, precision=0, commas=False)
    fast_str = fmt(fast_sec, precision=1, commas=False)
    response_tokens_str = fmt(response_tokens, precision=0, commas=False)
    slow_toks_str = fmt(slow_toks, precision=0, commas=False)
    fast_toks_str = fmt(fast_toks, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
slow_str = LLMThroughputCalc.slow_str
fast_str = LLMThroughputCalc.fast_str
response_tokens_str = LLMThroughputCalc.response_tokens_str
slow_toks_str = LLMThroughputCalc.slow_toks_str
fast_toks_str = LLMThroughputCalc.fast_toks_str
```

- **Tokens per second** measures generation throughput. Modern LLMs achieve 20–100 tokens/second depending on model size and hardware. For a response of ~`{python} response_tokens_str` tokens, `{python} slow_toks_str` vs `{python} fast_toks_str` tokens/second means `{python} slow_str` seconds vs `{python} fast_str` seconds.
- **Time-to-first-token and inter-token latency** capture different bottlenecks and call for different optimizations. Batching improves throughput but typically increases first-token latency, a trade-off invisible if only one metric is measured.
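
Two of the metrics above reduce to short formulas. The sketch below computes perplexity from the probabilities a model assigned to the true tokens (using natural logarithms, which gives the same value as the base-2 form), and estimates response time as time to first token plus steady-state decoding. The 0.5-second first-token latency is an assumed value for illustration; the token count and rates match the example in the list.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def response_time(n_tokens, ttft_s, tokens_per_s):
    """Time to first token plus steady-state decoding of the remaining tokens."""
    return ttft_s + (n_tokens - 1) / tokens_per_s

# A model that gives every true token probability 0.1 has perplexity 10.
print(f"perplexity: {perplexity([0.1] * 20):.2f}")

# 750-token response with an assumed 0.5 s time to first token.
for rate in (25, 100):
    seconds = response_time(750, ttft_s=0.5, tokens_per_s=rate)
    print(f"{rate} tok/s -> {seconds:.2f} s")
```

Separating the two latency terms shows why a throughput improvement alone does not fix a slow first token: the first term is untouched by faster decoding.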

\index{Benchmark Contamination!LLM memorization risk}
Benchmark contamination is a unique LLM failure mode. Models trained on web-scale corpora may encounter benchmark questions during pretraining, inflating scores through memorization rather than capability [@xu2024benchmarking]. Studies estimate 4–15% contamination rates for popular benchmarks, with contaminated examples showing 10–20% higher accuracy. Mitigation strategies include temporal holdouts (benchmarks from content published after training cutoff), dynamic benchmarks (continuously generated evaluation instances), and contamination detection (testing whether models recall exact benchmark phrasing).
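
One simple form of contamination detection is an exact n-gram overlap check between benchmark items and a sample of the training corpus. The sketch below assumes whitespace tokenization and an 8-token window as the default; both are illustrative choices rather than a standard, and the function names are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0, []
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    # An item is flagged when any of its n-grams appears verbatim in the corpus.
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_items), flagged


# Toy data with a short window so the overlap is visible.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
items = [
    "Question: the quick brown fox jumps over what animal",
    "What is the capital of France and why",
]
rate, flagged = contamination_rate(items, corpus, n=4)
```

Exact matching of this kind misses paraphrased or translated leaks, so it gives a lower bound on contamination rather than a full audit.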

### Data Benchmarking {#sec-benchmarking-data-benchmarking-22f9}

\index{Data Benchmarking!training set validation}
Model benchmarks validate whether compression preserved model quality. Model quality, however, depends entirely on the data used to train and evaluate it, and this dependency creates the most insidious failure mode in ML deployment. A perfectly preserved model trained on biased or unrepresentative data will still fail in production. Data benchmarks validate whether the efficiency strategies from @sec-data-selection—active learning, curriculum design, data augmentation, and synthetic data generation—produced training sets that enable reliable deployment. This is often the last validation to fail and the hardest to diagnose: a model achieving excellent accuracy on held-out test data may collapse on production inputs that the training data never adequately represented.

Contemporary AI development reveals that data quality often determines performance boundaries more than model architecture. This recognition elevated data benchmarking from afterthought to critical discipline. For data selection metrics (PPD, DUE) and benchmarks (DataPerf), see @sec-data-selection.

#### Coverage Metrics {.unnumbered}

The first question data benchmarking must answer is whether the training data represents the inputs the model will encounter. A model cannot learn patterns it has never seen, and the ways training data can *fail* to represent deployment reality are often subtle.

\index{Class Imbalance!coverage failure mode}
Consider class balance: a fraud detection dataset with 99% legitimate transactions and 1% fraud might produce a model that achieves 99% accuracy by simply labeling everything legitimate. The model is useless, but the accuracy metric looks excellent. Imbalance ratios above 10:1 typically require mitigation through oversampling, class weighting, or threshold adjustment. More insidious is subgroup imbalance *within* classes: a dataset might have balanced positive and negative examples overall, but negative examples might be drawn predominantly from one demographic group, creating disparities invisible to aggregate class balance metrics.
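
The fraud example can be reproduced in a few lines. The sketch below builds a toy label set with the 99:1 split described above and scores a classifier that labels every transaction legitimate; the helper names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy_and_recall(y_true, y_pred, positive):
    """Overall accuracy plus recall on the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos


labels = ["legit"] * 990 + ["fraud"] * 10
always_legit = ["legit"] * len(labels)  # degenerate majority-class predictor

ratio = imbalance_ratio(labels)
acc, fraud_recall = accuracy_and_recall(labels, always_legit, positive="fraud")
```

Here `ratio` is 99.0, accuracy is 0.99, and recall on the fraud class is 0.0. Reporting recall per class alongside accuracy is what exposes the failure.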

Feature coverage presents an even harder challenge because it requires domain knowledge about what variations matter. A computer vision model trained exclusively on daytime images will fail on nighttime inputs; a natural language model trained on formal text will fail on colloquial language. Unlike class balance, which can be computed from labels alone, feature coverage requires understanding the deployment context. *What lighting conditions will the camera encounter? What dialects will users speak? What edge cases exist in production that test sets never capture?* These questions have no algorithmic answer—they demand collaboration between ML engineers and domain experts who understand the deployment environment.

For applications affecting people, demographic representation becomes a coverage dimension with ethical implications. Training data must represent the deployment population across relevant dimensions: age, gender, ethnicity, geography, language. A facial recognition system trained predominantly on one demographic group will systematically underperform on others, even if aggregate accuracy metrics look acceptable. The challenge is that demographic metadata is often unavailable or unreliable, making representation gaps difficult to detect and measure.
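
When group metadata is available, disaggregating accuracy by group is a direct way to check whether an acceptable aggregate hides a disparity. The sketch below uses made-up predictions and two hypothetical groups, "A" and "B", purely for illustration.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Toy evaluation set: 90 examples from group A, 10 from group B.
y_true = [1] * 100
y_pred = [1] * 88 + [0] * 2 + [1] * 5 + [0] * 5
groups = ["A"] * 90 + ["B"] * 10

aggregate = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)
```

In this toy case the aggregate accuracy is 0.93, yet group B sits at 0.5 while group A is near 0.98. The aggregate is dominated by the larger group, which is why it looks acceptable.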

#### Quality Metrics {.unnumbered}

Even when training data covers the right inputs, the labels themselves may be unreliable. Studies consistently find 3–6% label error rates in major datasets, including ImageNet [@northcutt2021pervasive]. These errors are not merely noise—they become learned ground truth. A model trained on data where wolves are occasionally labeled as dogs will learn that some wolves *are* dogs. The benchmark will report this as correct behavior because the model matches the (incorrect) labels.

For small datasets, manual audit of a random sample can estimate label accuracy. For large datasets, confident learning techniques identify likely mislabeled examples by finding cases where model predictions systematically disagree with labels. The intuition is that when a model confidently predicts a different label than the ground truth, either the model has learned something incorrect or the label is wrong. But detection is only the first step; correction requires human review, and scaling human review to millions of examples presents its own challenges.
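
The thresholding idea behind this kind of detection can be sketched as follows. This is a simplified illustration and not the full confident learning procedure: it assumes out-of-sample predicted probabilities are already available, sets each class threshold to the average confidence the model assigns to that class on examples carrying that label, and flags an example when the model's confident guess differs from the given label. The function name is hypothetical.

```python
def flag_label_issues(probs, labels):
    """Indices of examples whose confident prediction disagrees with the label.

    probs: list of per-class probability lists; labels: list of int class ids.
    """
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class k over
    # the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)

    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: p[k])
            if guess != y:
                flagged.append(i)
    return flagged


# Toy two-class data: example 2 is labeled 0 but confidently predicted as 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
suspects = flag_label_issues(probs, labels)
```

The flagged indices are candidates for human review, not confirmed errors, since a confident disagreement can also mean the model is wrong.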

\index{Cohen's Kappa!inter-annotator agreement}
Inter-annotator agreement provides a different lens on label quality by measuring consistency across human labelers. Cohen's kappa or Fleiss' kappa quantify agreement beyond what chance would produce. When agreement falls below 0.6 on tasks with clear ground truth, something is wrong: either the labeling guidelines are ambiguous, the task is inherently subjective, or labeler quality varies significantly. Agreement below 0.4 is problematic for any supervised learning application because the training signal itself is incoherent.
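
Cohen's kappa for two annotators is the observed agreement corrected by the agreement expected from each annotator's label frequencies, $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below is a minimal pure-Python version on toy labels.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick class c independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Toy labels: the raters agree on 8 of 10 items.
rater_a = ["dog"] * 5 + ["wolf"] * 5
rater_b = ["dog"] * 4 + ["wolf"] * 5 + ["dog"]
kappa = cohens_kappa(rater_a, rater_b)
```

Raw agreement here is 80%, but with two balanced classes half of that is expected by chance, so kappa comes out at 0.6.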

The distinction between random and systematic errors matters enormously for their downstream effects. Random label noise partially averages out during training: if different examples are mislabeled in different directions, the model learns the central tendency. Systematic errors (consistently mislabeling a particular subclass), in contrast, are learned as ground truth. A dataset where all wolves photographed in snow are labeled "dogs" will produce a model that calls snowy wolves dogs, and no amount of additional data fixes this without correcting the systematic error at its source.

#### Distribution Alignment {.unnumbered}

\index{Distribution Shift!train-to-production alignment}
The final category of data benchmarking asks whether models will generalize from training conditions to deployment reality. This is where the gap between benchmark performance and production performance most frequently emerges.

The standard assumption underlying held-out evaluation, that test data comes from the same distribution as training data, is routinely violated in practice. Test sets constructed years after training data may reflect distribution drift as the world changes. Test sets from different geographic regions may reflect population shift. A model achieving 95% accuracy on a held-out test set may drop to 70% when deployed to a region or time period the test set did not represent. Standard held-out evaluation overestimates deployment performance whenever the i.i.d. (independent and identically distributed) assumption fails.

The true test is train-to-production alignment, and this is far harder to measure because production data differs from training data in ways that held-out test sets often fail to capture. Production images come from different cameras with different characteristics. Production users come from different populations with different behaviors. Production inputs include edge cases that curated test sets systematically exclude. The WILDS[^fn-wilds] benchmark [@koh2021wilds] was designed specifically to evaluate models under realistic distribution shifts: hospital systems with different patient populations, wildlife cameras at different locations, satellite imagery from different time periods. The results reveal a stark reality: models achieving 90%+ accuracy on in-distribution test sets can drop to roughly 70% under these realistic shifts.

[^fn-wilds]: **WILDS**: Stanford's 2021 benchmark of 10 datasets with real-world distribution shifts: hospital patient population changes (Camelyon17), wildlife camera location shifts (iWildCam), and satellite imagery temporal drift (PovertyMap). WILDS quantifies the deployment gap: models achieving 97% in-distribution accuracy can drop to 70% under these realistic shifts, demonstrating that standard held-out evaluation systematically overestimates production performance when the i.i.d. assumption fails. \index{WILDS!distribution shift benchmark}

\index{Kolmogorov-Smirnov Test!covariate shift detection}
Given these challenges, shift detection methods become essential for production monitoring. Statistical tests like the Kolmogorov-Smirnov test or Maximum Mean Discrepancy (MMD) on input features can detect covariate shift—when the distribution of inputs changes even if the relationship between inputs and outputs remains stable. Monitoring model confidence distributions can detect when the model encounters inputs unlike anything in training. The goal is early detection: identifying distribution shift before it causes catastrophic performance degradation, enabling intervention through model updates, data collection, or deployment constraints.
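
For a single numeric feature, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a reference window and a production window. The sketch below computes it in pure Python and compares it with the standard large-sample critical value; the toy data and the choice of a 0.05 significance level are assumptions for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, production):
    """Maximum absolute difference between the two empirical CDFs."""
    ref, prod = sorted(reference), sorted(production)
    d = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        d = max(d, abs(cdf_ref - cdf_prod))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


# Toy feature values: production is the reference shifted by 30 units.
reference = list(range(100))
production = [x + 30 for x in range(100)]

d_stat = ks_statistic(reference, production)
threshold = ks_critical_value(len(reference), len(production))
shift_detected = d_stat > threshold
```

In this toy case the statistic is 0.3 against a threshold of about 0.19, so the shift is flagged. With many features monitored at once, the significance level would need a multiple-comparison adjustment, which this sketch omits.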

These distribution alignment challenges highlight a persistent tension in ML development: should we fix the data and iterate on models, or fix the model and iterate on data? @fig-model-vs-data places these two paradigms side by side, revealing exactly where the feedback loop differs. In the model-centric diagram, the iteration cycle targets the architecture while the data remains static; in the data-centric diagram, the architecture stays fixed while the cycle targets data quality. Research increasingly demonstrates that methodical dataset enhancement can yield superior performance gains compared to model refinements alone—challenging the conventional emphasis on architectural innovation.

::: {#fig-model-vs-data fig-env="figure" fig-pos="htb" fig-cap="**Development Paradigms**: Model-centric AI prioritizes architectural innovation with fixed datasets, while data-centric AI systematically improves dataset quality (annotations, diversity, and bias) with consistent model architectures to achieve performance gains. Modern research indicates that strategic data enhancement often yields greater improvements than solely refining model complexity." fig-alt="Side-by-side diagrams: model-centric AI shows data cylinders feeding CPU with feedback loop to model, data-centric AI shows feedback loop to data instead. Double arrow indicates complementary approaches."}
```{.tikz}

\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}]

\tikzset{mycylinder/.style={cylinder, shape border rotate=90, aspect=1.3,
draw, fill=white, minimum width=25mm,minimum height=11mm,line width=1pt},
Line/.style={line width=1.0pt,black!50}
}

\begin{scope}[node distance=-0.15,local bounding box = DATA]
\node[mycylinder,fill=red!30] (A) {};
\node[mycylinder, above=of A,fill=red!50] (C) {};
\node[mycylinder, above=of C,fill=red!10] (B) {};
 \end{scope}

%Padlock
\begin{scope}[scale=0.3,shift={(1.3,8.5)}]
\draw[fill=black](0,0)--(2.7,0)--++(270:1.6)to[out=270,in=0](1.85,-2.45)
            --++(180:1.1)to[out=180,in=270](0,-1.3)--cycle;
\draw[draw=none,fill=white](1.32,-0.9)+(230:0.3)
           arc[start angle=230, end angle=-50, radius=0.3]--++(280:0.75)--++(180:0.62)--cycle;
\path[red](0.27,0)circle(1pt)coordinate(K1);
\path[red](0.57,0)circle(1pt)coordinate(K2);
\path[red](2.10,0)circle(1pt)coordinate(K3);
\path[red](2.4,0)circle(1pt)coordinate(K4);

\path[green](K1)--++(90:0.6)coordinate(KK1);
\path[green](K2)--++(90:0.5)coordinate(KK2);
\path[green](K4)--++(90:0.6)coordinate(KK4);
\path[green](K3)--++(90:0.5)coordinate(KK3);
\draw[fill=black](K1)--(KK1)to[out=90,in=90,distance=37](KK4)--(K4)
--(K3)--(KK3)to[out=90,in=90,distance=29](KK2)--(K2)--cycle;
\end{scope}

%CPU
\definecolor{CPU}{RGB}{0,120,176}
\begin{scope}[local bounding box = CPU,shift={(4.5,1.1)}]
\node[fill=CPU,minimum width=66, minimum height=66,
            rounded corners=8,outer sep=2pt] (C1) {};
\node[fill=white,minimum width=54, minimum height=54] (C2) {};
\node[fill=CPU!40,minimum width=44, minimum height=44] (C3) {CPU};

\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=15,
           inner sep=0pt,anchor=south](GO\y)at($(C1.north west)!\x!(C1.north east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=15,
           inner sep=0pt,anchor=north](DO\y)at($(C1.south west)!\x!(C1.south east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=15, minimum height=3,
           inner sep=0pt,anchor=east](LE\y)at($(C1.north west)!\x!(C1.south west)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=15, minimum height=3,
           inner sep=0pt,anchor=west](DE\y)at($(C1.north east)!\x!(C1.south east)$){};
}
\end{scope}
\node[below=0.25of CPU](MO){Model};
\path[red](MO)-|coordinate(D)(DATA);
\node[]at(D){Data};
\draw[Line,-latex,shorten <=4pt,shorten >=4pt](DATA)--(CPU);
\draw[Line,-latex](CPU.east)--++(0:1)--++(90:2.6)-|
node[above,text=black](TXT){Systematically enhance the model}(CPU);
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,
yshift=2mm,
fill=BackColor,fit=(D)(DATA)(CPU)(TXT),line width=0.75pt](BB){};
\node[below=4pt of  BB.north,inner sep=0pt,xshift=3,
anchor=north,fill=BackColor]{\textbf{Model-centric AI}};
%%%%%%%%%%%%%%%%%
%right
%%%%%%%%%%%%%%%%%%%%
\begin{scope}[node distance=-0.15,shift={(13,0)},local bounding box = 2DATA]
\node[mycylinder,fill=red!30] (2A) {};
\node[mycylinder, above=of 2A,fill=red!50] (2C) {};
\node[mycylinder, above=of 2C,fill=red!10] (2B) {};
 \end{scope}

%CPU
\definecolor{CPU}{RGB}{0,120,176}
\begin{scope}[local bounding box = 2CPU,shift={(17.5,1.1)}]
\node[fill=CPU,minimum width=66, minimum height=66,
            rounded corners=8,outer sep=2pt] (2C1) {};
\node[fill=white,minimum width=54, minimum height=54] (2C2) {};
\node[fill=CPU!40,minimum width=44, minimum height=44] (2C3) {CPU};

\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=15,
           inner sep=0pt,anchor=south](2GO\y)at($(2C1.north west)!\x!(2C1.north east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=15,
           inner sep=0pt,anchor=north](2DO\y)at($(2C1.south west)!\x!(2C1.south east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=15, minimum height=3,
           inner sep=0pt,anchor=east](2LE\y)at($(2C1.north west)!\x!(2C1.south west)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=15, minimum height=3,
           inner sep=0pt,anchor=west](2DE\y)at($(2C1.north east)!\x!(2C1.south east)$){};
}
\end{scope}
%Padlock
\begin{scope}[scale=0.3,shift={(60.3,8.5)}]
\draw[fill=black](0,0)--(2.7,0)--++(270:1.6)to[out=270,in=0](1.85,-2.45)
            --++(180:1.1)to[out=180,in=270](0,-1.3)--cycle;
\draw[draw=none,fill=white](1.32,-0.9)+(230:0.3)
           arc[start angle=230, end angle=-50, radius=0.3]--++(280:0.75)--++(180:0.62)--cycle;
\path[red](0.27,0)circle(1pt)coordinate(K1);
\path[red](0.57,0)circle(1pt)coordinate(K2);
\path[red](2.10,0)circle(1pt)coordinate(K3);
\path[red](2.4,0)circle(1pt)coordinate(K4);

\path[green](K1)--++(90:0.6)coordinate(KK1);
\path[green](K2)--++(90:0.5)coordinate(KK2);
\path[green](K4)--++(90:0.6)coordinate(KK4);
\path[green](K3)--++(90:0.5)coordinate(KK3);
\draw[fill=black](K1)--(KK1)to[out=90,in=90,distance=37](KK4)--(K4)
            --(K3)--(KK3)to[out=90,in=90,distance=29](KK2)--(K2)--cycle;
\end{scope}

\node[below=0.25of 2CPU](2MO){Model};
\path[red](2MO)-|coordinate(2D)(2DATA);
\node[]at(2D){Data};
\draw[Line,-latex,shorten <=4pt,shorten >=4pt](2DATA)--(2CPU);
\draw[Line,-latex](2CPU.east)--++(0:1)coordinate(DE)--++(90:2.6)-|
           node[above,text=black,pos=0.25](2TXT){Systematically enhance the data}(2DATA);
%
\scoped[on background layer]
\node[draw=GreenLine,inner xsep=5mm,inner ysep=5mm,
           yshift=2mm, fill=GreenL!50,fit=(2D)(2DATA)(2CPU)(2TXT)(DE),line width=0.75pt](2BB){};
\node[below=4pt of  2BB.north,inner sep=0pt,xshift=3,
           anchor=north]{\textbf{Data-centric AI}};
%%%%
\node[double arrow, fill=red!80!black!90, xshift=48,
           minimum width = 20pt, double arrow head extend=2pt,
           minimum height=30mm](DA) at(BB.east){};
\node[below=0.2of DA]{Complementary};
 \end{tikzpicture}

```
:::

\index{Data-Centric AI!dataset enhancement paradigm}
\index{DataComp!data-centric benchmark}
Data quality's primacy in AI development reflects an important shift in understanding: *better* datasets, not just *larger* ones, produce more reliable and generalizable AI systems. Initiatives like DataPerf and DataComp[^fn-datacomp] have emerged to systematically evaluate how dataset improvements affect model performance. For instance, DataComp [@gadre2024datacomp] demonstrated that models trained on a carefully curated 30% subset of data achieved better results than those trained on the complete dataset, challenging the assumption that more data automatically leads to better performance [@northcutt2021pervasive].

[^fn-datacomp]: **DataComp**: Introduced in 2023, DataComp inverts the standard benchmark by fixing the model and training code, letting participants compete on dataset curation alone. Results showed that a carefully filtered 30% subset outperformed training on the full unfiltered pool, quantifying a systems insight: for many workloads, engineering the data pipeline yields greater performance gains per dollar than scaling compute. \index{DataComp!data-centric benchmarking}

A persistent challenge in data benchmarking emerges from dataset saturation. When models achieve near-perfect accuracy on benchmarks like ImageNet, practitioners must distinguish whether performance gains represent genuine advances in AI capability or merely optimization to existing test sets. As the timeline in @fig-dataset-saturation illustrates, AI systems have surpassed human performance across multiple benchmarks: image recognition crossed the human baseline around 2015, speech recognition around 2017, handwriting recognition and reading comprehension by 2018, and language understanding by 2019. Each crossing renders the corresponding benchmark less useful as a differentiator.

::: {#fig-dataset-saturation fig-env="figure" fig-pos="htb" fig-cap="**Dataset Saturation**: AI systems surpass human performance on five benchmark capabilities: handwriting recognition, speech recognition, image recognition, reading comprehension, and language understanding, each crossing the human baseline between 1998 and 2020. This saturation underscores the need for dynamic benchmarks that remain challenging as model capabilities improve. Source: [@kiela2021dynabench]." fig-alt="Line chart showing five AI capabilities crossing human performance baseline from 1998 to 2020: handwriting, speech, image recognition, reading comprehension, and language understanding."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
%\node[anchor=south west]at(-0.93,-0.76){%
%\includegraphics[scale=0.7]{1}};

\begin{axis}[clip=false,
   axis line style={draw=none},
  /pgf/number format/.cd,
 1000 sep={},
  width=155mm,%155.9mm,
  height=80mm,%58.0mm,
  axis x line*=bottom,
  legend style={at={(0.16,0.98)}, anchor=north},
  legend cell align=left,
  title style={yshift=-2pt,font=\fontsize{9pt}{9}\selectfont\usefont{T1}{phv}{m}{n}},
  ylabel style={align=center,font=\footnotesize\usefont{T1}{phv}{m}{n}},
  xmin=1997,
  xmax=2022,
  xtick={2000,2005,2010,2015,2020},
   x tick label style={rotate=0, anchor=north},
  ymin=-100, ymax=24,
  ytick={-100,-80,...,20},
  yticklabels={$-$100,$-$80,$-$60,$-$40,$-$20,0,+20},
  ylabel={Test score of the AI relative\\ to human performance},
  title={Language and image recognition capabilities of AI systems have improved rapidly},
  grid=both,
  major grid style={black!60},
  minor grid style={draw=none},
  minor x tick num=4,
  minor x tick  style={thin,black!60},
  tick label style={/pgf/number format/assume math mode=true},
  ticklabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
  xticklabel style={yshift=-3pt},
]
%Handwriting recognition
\addplot[OrangeLine,mark=*,
mark size=2pt,line width=1.5pt,
]
coordinates{
(1998,-100)(1998,-80)(2002,-48)(2003,-27)(2006,-25)(2010,-20)(2012,-5)(2013,-1)(2018,2)
}node[pos=0.67,above=3mm]{Handwriting recognition};
%Speech recognition
\addplot[RedLine,mark=*,
mark size=2.0pt,line width=1.5pt,
] coordinates{
(1998,-100)(2011,-66)(2013,-53)(2014,-28)(2015,-26)(2015,-9)(2016,-5)(2016,-1.2)(2017,0.5)(2018,2)
}node[pos=0.17,above=3mm]{Speech recognition};
%Image recognition
\addplot[GreenD,mark=*,
mark size=2.0pt,line width=1.5pt,
] coordinates{
(2009,-100)(2012,-44)(2014,-11.5)(2014,-7)(2015,1)(2016,6)(2018,11.5)(2019,9)(2020,16)
}node[pos=0.13,left=2mm,anchor=north east]{Image recognition};
%Reading comprehension
\addplot[cyan!90!black,mark=*,
mark size=2.0pt,line width=1.5pt,
] coordinates{
(2016,-100)(2016,-34)(2017,-30)(2017,-9)(2018,6)(2019,18)(2020,19)
}node[pos=0.23,left=2mm,anchor=north east,align=right]{Reading\\ comprehension};
%Language understanding
\addplot[red,mark=*,
mark size=2.0pt,line width=1.5pt,
] coordinates{
(2018,-100)(2018,-68)(2019,-64)(2019,-25)(2019,0)(2019,4)(2020,8)(2020,12)
}node[pos=0.23,right=2mm,anchor=north west,align=left]{Language\\ understanding};
%
\draw[font=\fontsize{7pt}{9}\selectfont\usefont{T1}{phv}{m}{n},latex-](axis cs:1996.5,-104)to[bend right=25]++(320:9mm)
node[align=left,below right=2mm and 1mm,anchor=west]{The capability of each AI system is normalized\\
to an initial performance $-$100};
\draw[font=\fontsize{7pt}{9}\selectfont\usefont{T1}{phv}{m}{n},latex-](axis cs:1996.8,2)to[bend left=25]++(30:6mm)
node[align=left, right=1mm,anchor=west]{Human performance, as the benchmark, is set to zero};
%
\draw[red,line width=2pt,{Triangle[width=6pt,length=5pt]}-{Triangle[width=5pt,length=6pt]}](axis cs:2021,-1)--
node[font=\fontsize{7pt}{9}\selectfont\usefont{T1}{phv}{m}{n},right,text=black]{AI systems perform worse}
(axis cs:2021,-19);
\draw[red,line width=2pt,{Triangle[width=6pt,length=5pt]}-{Triangle[width=5pt,length=6pt]}](axis cs:2021,1)--
node[align=left,font=\fontsize{7pt}{9}\selectfont\usefont{T1}{phv}{m}{n},right,text=black]{AI systems perform better than\\
the humans who did these tests}
(axis cs:2021,19);
\coordinate(A)at(axis cs:1994,-100);
\coordinate(B)at(axis cs:2027.5,-100);
\coordinate(C)at(axis cs:2027.5,0);
\coordinate(C1)at(axis cs:2027.5,25);
\coordinate(D)at(axis cs:1994,0);
\coordinate(D1)at(axis cs:1994,25);
\scoped[on background layer]
\fill[fill=magenta!5](A)--(B)--(C)--(D)--cycle;
\scoped[on background layer]
\fill[fill=green!5](D)--(C)--(C1)--(D1)--cycle;
\end{axis}
\end{tikzpicture}
```
:::

#### Dataset Saturation and Dynamic Benchmarks {.unnumbered}

@fig-dataset-saturation raises a critical methodological question: when models surpass human performance on benchmarks, does this reflect genuine capability advances or optimization to static evaluation sets? MNIST illustrates the concern: certain test images, though nearly illegible to humans, were assigned specific labels during dataset creation in 1994. Models correctly predicting these labels may be memorizing dataset artifacts rather than learning digit recognition. The question "Are we done with ImageNet?" [@beyer2020we] generalizes this concern.

Dynamic benchmarking approaches like Dynabench[^fn-dynabench] [@kiela2021dynabench] address saturation by continuously evolving test data based on model performance, ensuring that benchmarks remain challenging as capabilities improve. However, dynamic benchmarks complement rather than replace the coverage, quality, and distribution metrics described above: they prevent saturation but do not diagnose its causes.

[^fn-dynabench]: **Dynabench**: Facebook AI Research's 2021 platform for dynamic benchmark generation, where humans craft adversarial inputs that fool current best models. Dynabench addresses the saturation problem (95%+ accuracy on static benchmarks may reflect memorization), but introduces its own trade-off: because the test set keeps changing, scores from different rounds are not directly comparable, which complicates longitudinal performance tracking. Static and dynamic benchmarks serve complementary diagnostic roles. \index{Dynabench!dynamic evaluation}

### Holistic System-Model-Data Evaluation {#sec-benchmarking-holistic-systemmodeldata-evaluation-8fda}

We have now examined all three dimensions of our benchmarking framework: system benchmarks that validate hardware performance, model benchmarks that verify compression preserved quality, and data benchmarks that assess training set representativeness. Each dimension, evaluated independently, might pass with excellent scores. Yet AI benchmarking has traditionally evaluated these dimensions as separate entities, and this isolation creates blind spots where failures hide. Real-world AI performance emerges from the interaction of all three dimensions, and optimizing one can expose weaknesses in another.

Consider a concrete failure cascade: a team achieves excellent MLPerf Inference scores by deploying an INT8-quantized model on optimized hardware. System benchmarks pass. But the quantized model was validated only on ImageNet-distributed test data; deployment reveals accuracy degradation on factory-floor images with different lighting characteristics. Model quality benchmarks would have caught the quantization sensitivity. Further investigation shows the training data contained no images with industrial lighting—a data quality gap that no amount of system or model optimization can address.

This interdependence means that benchmark results from one dimension can be invalidated by failures in another:

- **System success + Model failure**: Hardware delivers promised throughput, but compression degraded accuracy below deployment thresholds
- **System success + Data failure**: Fast inference on representative inputs, but training data bias causes failures on demographic subgroups
- **Model success + System failure**: Accurate predictions, but latency variance under load violates SLA requirements
- **Model success + Data failure**: High accuracy on held-out test set, but distribution shift in production causes silent degradation

This interdependence is precisely the AI Triad introduced in @sec-introduction (@fig-ai-triad): System corresponds to Machine, Model corresponds to Algorithm, and Data remains Data. Holistic evaluation requires not just passing benchmarks in each dimension, but verifying that assumptions made in one dimension hold across the others. The Part III optimization pipeline (data → model → hardware) creates implicit dependencies that benchmarking must validate explicitly.

The D·A·M taxonomy provides a diagnostic framework for systematically identifying which axis limits performance. @tbl-dam-bottleneck formalizes this approach by crossing each D·A·M axis with the three fundamental bottleneck types (see @sec-dam-taxonomy for the full diagnostic guide, @tbl-dam-tooling for profiling utilities, and @tbl-dam-scorecard for efficiency grading rubrics).

| **Component** | **Compute-Bound**                                    | **Memory-Bound**                                     | **I/O-Bound**                                            |
|:--------------|:-----------------------------------------------------|:-----------------------------------------------------|:---------------------------------------------------------|
| **Data**      | Preprocessing too slow (augmentation, tokenization)  | Dataset exceeds RAM (spills to disk)                 | Storage cannot feed GPU (disk throughput limit)          |
| **Algorithm** | Model too large for hardware (FLOPs exceed capacity) | Activations exceed memory (batch size limited)       | Gradient sync slower than compute (distributed training) |
| **Machine**   | GPU utilization saturated (need faster accelerator)  | Memory bandwidth saturated (need more HBM bandwidth) | Network/PCIe bandwidth saturated (need faster links)     |

: **D·A·M $\times$ Bottleneck Diagnostic Matrix.** Each cell describes a performance constraint symptom, and the row identifies which D·A·M axis to address. When performance stalls, ask: *"Where is the flow blocked? Check the D·A·M."* {#tbl-dam-bottleneck}

The diagnostic power of this matrix becomes clear when benchmarks reveal unexpected results, particularly when performance falls short of expectations. If system benchmarks show low GPU utilization despite adequate hardware, the Machine row's Compute-Bound symptom ("GPU utilization saturated") does *not* apply, so the bottleneck likely lies elsewhere. For example, a team observing only 30% GPU utilization during training might initially suspect an inefficient model architecture (Algorithm row), but profiling reveals that image augmentation runs on CPU and cannot keep up with GPU consumption (Data row, Compute-Bound column: "Preprocessing too slow"). Systematic diagnosis using this matrix prevents the common mistake of optimizing the wrong component.
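
A first-pass version of this diagnosis can be sketched from two timings per training step. The block below assumes that the time the GPU spends waiting on the input pipeline and the time it spends computing have been measured separately and do not overlap, which is a simplification; the function name and the numbers are hypothetical.

```python
def diagnose_training_step(data_wait_s, gpu_compute_s):
    """Return (GPU busy fraction, D·A·M row to inspect first) for one step.

    Assumes input-pipeline wait and GPU compute are measured separately
    and do not overlap.
    """
    total = data_wait_s + gpu_compute_s
    if total <= 0:
        raise ValueError("step time must be positive")
    gpu_busy = gpu_compute_s / total
    # A GPU that mostly waits on inputs points at the Data row, not the model.
    suspect = "Data" if data_wait_s > gpu_compute_s else "Algorithm/Machine"
    return gpu_busy, suspect


# Hypothetical step: 0.7 s waiting on augmentation, 0.3 s of GPU compute.
gpu_busy, suspect = diagnose_training_step(data_wait_s=0.7, gpu_compute_s=0.3)
```

With these assumed timings the GPU is busy 30% of the step and the function points at the Data row. A real profiler trace is still needed to separate preprocessing cost from storage throughput within that row.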

Yet validation under controlled laboratory conditions differs profoundly from validation under production reality. In the laboratory, data distributions stay fixed, request patterns remain uniform, and systems run in isolation. In production, all three assumptions break simultaneously—data drifts, traffic spikes unpredictably, and system components interact in ways that isolated benchmarks cannot capture. The final dimension of benchmarking asks whether systems validated in the lab survive contact with the real world.

## Production Considerations {#sec-benchmarking-production-considerations-084b}

A system that passes all three benchmark categories can still fail in production. The three-dimensional framework validated hardware performance, model quality, and data representativeness under controlled conditions—but production violates those conditions continuously. This gap between benchmark success and deployment success motivates a final benchmarking concern: validating systems under conditions that match operational reality.

### From Laboratory to Production {#sec-benchmarking-laboratory-production-d22a}

Laboratory benchmarks establish *what* a system is capable of under ideal conditions. Production validation determines *whether* that system is performing correctly right now, under real conditions.

This distinction matters because laboratory benchmarks assume conditions that production systematically violates. Silent degradation poses the most insidious challenge: models can produce plausible but incorrect outputs without obvious error signals, and a recommendation system returning "reasonable" but suboptimal suggestions has no built-in error indicator. Dynamic workloads present a different failure mode: a system benchmarked at steady 1,000 QPS may fail when flash traffic events spike to 10,000 QPS, revealing that benchmark "throughput" assumed uniform request arrival rather than bursty production patterns. Data distribution shift compounds these problems over time, as production data evolves and diverges from training distributions—an image classifier trained on professional photos degrades gradually as users submit smartphone images with different lighting, angles, and compression artifacts. Finally, production imposes multi-objective constraints that benchmarks treat independently: accuracy, latency, cost, and resource utilization must all be satisfied simultaneously, and optimizing any one at the expense of others leads to deployment failure.
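
The effect of arrival pattern on tail latency can be illustrated with a toy single-server queue. The sketch below assumes FIFO service, a fixed 800 µs service time, and two hypothetical traces with the same average rate of 1,000 requests per second: one evenly spaced, one arriving in bursts of 100. Times are integer microseconds so the arithmetic is exact.

```python
def fifo_latencies(arrival_times_us, service_us):
    """Per-request latency for a single FIFO server with fixed service time."""
    latencies, server_free_at = [], 0
    for t in arrival_times_us:
        start = max(t, server_free_at)  # wait if the server is still busy
        server_free_at = start + service_us
        latencies.append(server_free_at - t)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


SERVICE_US = 800
# Same average rate over one second: 1,000 requests in both traces.
uniform = [i * 1_000 for i in range(1_000)]
bursty = [(i // 100) * 100_000 for i in range(1_000)]  # 100 at once, every 100 ms

p99_uniform_us = percentile(fifo_latencies(uniform, SERVICE_US), 99)
p99_bursty_us = percentile(fifo_latencies(bursty, SERVICE_US), 99)
```

Under these assumptions the uniform trace has a p99 latency of 800 µs, while the bursty trace reaches 79,200 µs at the same average load. A throughput number measured with evenly spaced requests says little about this tail.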

### Bridging Benchmark to Deployment {#sec-benchmarking-bridging-benchmark-deployment-7427}

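Before deployment, validate your benchmarking conclusions against production-representative conditions. The table below pairs common benchmark assumptions with the production realities that violate them, and the *pre-deployment benchmark checklist* that follows turns those pairings into concrete validation steps: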

| **Benchmark Assumption**       | **Production Reality**       | **Validation Approach**                |
|:-------------------------------|:-----------------------------|:---------------------------------------|
| **Uniform request arrival**    | Bursty traffic patterns      | Load test with production trace replay |
| **Clean, preprocessed inputs** | Variable quality inputs      | Evaluate on production data sample     |
| **Warm system state**          | Cold starts, cache misses    | Measure cold-start performance         |
| **Isolated execution**         | Resource contention          | Benchmark under realistic system load  |
| **Fixed model version**        | A/B testing, gradual rollout | Establish baseline for comparison      |

::: {.callout-important title="Pre-Deployment Benchmark Checklist"}
Before deploying a model based on benchmark results:

1. **Replay production traces**: Use logged request patterns to validate throughput/latency under realistic conditions
2. **Test with production data**: Sample recent production inputs (respecting privacy) to verify accuracy holds
3. **Stress test edge cases**: Identify worst-case inputs and verify graceful degradation
4. **Establish monitoring baselines**: Document expected metric ranges for anomaly detection
5. **Define rollback criteria**: Specify quantitative thresholds that trigger automatic rollback
:::
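
The last checklist item, quantitative rollback criteria, can be sketched as a small data structure plus a check. This is a minimal illustration rather than a production mechanism: the class `RollbackCriteria`, the function `violated_criteria`, the metric names, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    """Hypothetical thresholds agreed on before deployment."""
    min_accuracy: float         # fraction of correct predictions
    max_p99_latency_ms: float   # tail-latency ceiling
    max_error_rate: float       # fraction of failed requests


def violated_criteria(metrics: dict, criteria: RollbackCriteria) -> list:
    """Return the name of every metric that breaches its threshold."""
    violations = []
    if metrics["accuracy"] < criteria.min_accuracy:
        violations.append("accuracy")
    if metrics["p99_latency_ms"] > criteria.max_p99_latency_ms:
        violations.append("p99_latency_ms")
    if metrics["error_rate"] > criteria.max_error_rate:
        violations.append("error_rate")
    return violations


criteria = RollbackCriteria(
    min_accuracy=0.90, max_p99_latency_ms=100.0, max_error_rate=0.01
)
# Made-up canary measurements: accurate, but too slow in the tail.
canary = {"accuracy": 0.93, "p99_latency_ms": 180.0, "error_rate": 0.002}

violations = violated_criteria(canary, criteria)
should_rollback = bool(violations)
print(violations, should_rollback)
```

Writing the thresholds down as data before launch is the point of the exercise: the rollback decision then depends on measurements rather than on judgment made under pressure.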

### Production Monitoring as Continuous Benchmarking {#sec-benchmarking-production-monitoring-continuous-benchmarking-6ac8}

Production monitoring extends benchmarking from a one-time gate to a continuous process. The same principles apply (standardized metrics, reproducible measurement, statistical rigor), but the context shifts from "will this work?" to "is this working?"

The practices that implement continuous monitoring (A/B testing frameworks, canary deployment strategies, shadow scoring, and continuous validation pipelines) are examined comprehensively in @sec-ml-operations. That chapter extends the benchmarking principles established here into the dynamic operational contexts that characterize real-world ML system deployment, establishing infrastructure for detecting silent failures, tracking performance degradation, and validating system behavior under production conditions. Where this chapter asks "how fast is my system under controlled conditions?", @sec-ml-operations asks "is my system performing correctly right now?", marking the transition from offline evaluation to continuous production verification.
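
As a small illustration of treating monitoring as continuous benchmarking, the sketch below derives an expected range from a baseline period and flags later windows that fall outside it. The three-standard-deviation rule, the function names, and the accuracy values are all assumptions chosen for illustration. A real deployment would pick its own statistic and thresholds.

```python
import statistics


def baseline_range(samples, k=3.0):
    """Expected range: baseline mean plus or minus k sample standard deviations."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return mean - k * std, mean + k * std


def out_of_range(window, low, high):
    """Flag a monitoring window whose mean falls outside the expected range."""
    return not (low <= statistics.mean(window) <= high)


# Made-up daily accuracy measured during a pre-launch baseline period.
baseline_acc = [0.91, 0.92, 0.93, 0.92, 0.91, 0.93, 0.92]
low, high = baseline_range(baseline_acc)

healthy_window = [0.92, 0.91, 0.93]
degraded_window = [0.85, 0.84, 0.86]

print(out_of_range(healthy_window, low, high))
print(out_of_range(degraded_window, low, high))
```

The healthy window stays inside the documented range while the degraded one is flagged, which is the kind of signal a silently degrading model would otherwise never raise on its own.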

With benchmarking principles, methodologies, and production considerations established, we can now identify the most common misconceptions that lead practitioners astray—and the pitfalls that turn benchmark success into deployment failure.

## Fallacies and Pitfalls {#sec-benchmarking-fallacies-pitfalls-9781}

```{python}
#| label: fallacies-pitfalls-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACIES AND PITFALLS SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacies and Pitfalls section — provides quantitative backing
# │          for all examples in the chapter's F&P discussion
# │
# │ Goal: Provide quantitative backing for benchmarking misconceptions.
# │ Show: How incorrect benchmarks lead to real-world engineering failures.
# │ How: Pre-compute comparative stats for accuracy, latency, and throughput.
# │
# │ Imports: (none — uses only Python builtins and f-strings)
# │ Exports: benchmark_accuracy_pct_str, production_accuracy_range_str,
# │          accuracy_drop_pct_str, benchmark_latency_mean_ms_str,
# │          production_p99_range_str, latency_multiplier_str,
# │          pre_quant_latency_ms_str, post_quant_latency_ms_str,
# │          ranking_improvement_str, energy_increase_pct_str,
# │          ood_degradation_pct_str, accuracy_improvement_pct_str,
# │          rec_accuracy_pct_str, rec_p99_latency_ms_str,
# │          slo_p99_requirement_ms_str, high_throughput_qps_str,
# │          low_throughput_qps_str, high_power_w_str, low_power_w_str,
# │          power_ratio_str, throughput_loss_pct_str, battery_multiplier_str,
# │          imagenet_error_2010_pct_str, imagenet_error_2015_pct_str,
# │          imagenet_competition_end_year_str, imagenet_teams_above_95_str,
# │          imagenet_total_teams_str, mnist_accuracy_pct_str,
# │          edge_memory_constraint_x_str, edge_power_constraint_x_str,
# │          isolated_throughput_qps_str, production_throughput_range_str,
# │          production_utilization_pct_str, throughput_degradation_pct_str,
# │          model_inference_range_str, e2e_latency_range_str,
# │          availability_nines_str, downtime_minutes_month_str
# └─────────────────────────────────────────────────────────────────────────────

class FallaciesPitfallsSetup:
    """Quantitative backing for all Fallacy/Pitfall items in the benchmarking chapter."""
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # Fallacy 1: Benchmark vs. production accuracy gap
    benchmark_accuracy_pct = 92
    production_accuracy_low_pct = 78
    production_accuracy_high_pct = 82
    # Fallacy 1: Latency gap (mean vs. p99)
    benchmark_latency_mean_ms = 15
    production_p99_low_ms = 150
    production_p99_high_ms = 200
    # Pitfall 1 (Goodhart): Quantization speedup
    pre_quant_latency_ms = 12
    post_quant_latency_ms = 8
    ranking_improvement = 15
    energy_increase_pct = 40
    ood_degradation_pct = 25
    accuracy_improvement_pct = 2.1
    # Fallacy 2: Single-metric trade-offs
    rec_accuracy_pct = 94
    rec_p99_latency_ms = 180
    slo_p99_requirement_ms = 100
    high_throughput_qps = 1200
    low_throughput_qps = 1000
    high_power_w = 420
    low_power_w = 180
    battery_multiplier = 2.3
    # Pitfall 2: Benchmark saturation (ImageNet)
    imagenet_error_2010_pct = 28.2
    imagenet_error_2015_pct = 3.57
    imagenet_competition_end_year = 2017
    imagenet_teams_above_95 = 29
    imagenet_total_teams = 38
    mnist_accuracy_pct = 99.8
    edge_memory_constraint_x = 10
    edge_power_constraint_x = 100
    # Pitfall 3: Research vs. production
    isolated_throughput_qps = 800
    production_throughput_low_qps = 400
    production_throughput_high_qps = 500
    production_utilization_pct = 90
    model_inference_low_ms = 5
    model_inference_high_ms = 10
    e2e_latency_low_ms = 50
    e2e_latency_high_ms = 100
    availability_nines = 99.9
    downtime_minutes_month = 43
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    accuracy_drop_pct = benchmark_accuracy_pct - production_accuracy_high_pct
    latency_multiplier = production_p99_low_ms / benchmark_latency_mean_ms
    power_ratio = high_power_w / low_power_w
    throughput_loss_pct = round((1 - low_throughput_qps / high_throughput_qps) * 100)
    throughput_degradation_pct = round((1 - production_throughput_high_qps / isolated_throughput_qps) * 100)
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    benchmark_accuracy_pct_str = f"{benchmark_accuracy_pct}"
    production_accuracy_range_str = f"{production_accuracy_low_pct}-{production_accuracy_high_pct}"
    accuracy_drop_pct_str = f"{accuracy_drop_pct}"
    benchmark_latency_mean_ms_str = f"{benchmark_latency_mean_ms}"
    production_p99_range_str = f"{production_p99_low_ms}-{production_p99_high_ms}"
    latency_multiplier_str = f"{latency_multiplier:.0f}"
    pre_quant_latency_ms_str = f"{pre_quant_latency_ms}"
    post_quant_latency_ms_str = f"{post_quant_latency_ms}"
    ranking_improvement_str = f"{ranking_improvement}"
    energy_increase_pct_str = f"{energy_increase_pct}"
    ood_degradation_pct_str = f"{ood_degradation_pct}"
    accuracy_improvement_pct_str = f"{accuracy_improvement_pct}"
    rec_accuracy_pct_str = f"{rec_accuracy_pct}"
    rec_p99_latency_ms_str = f"{rec_p99_latency_ms}"
    slo_p99_requirement_ms_str = f"{slo_p99_requirement_ms}"
    high_throughput_qps_str = f"{high_throughput_qps:,}"
    low_throughput_qps_str = f"{low_throughput_qps:,}"
    high_power_w_str = f"{high_power_w}"
    low_power_w_str = f"{low_power_w}"
    power_ratio_str = f"{power_ratio:.1f}"
    throughput_loss_pct_str = f"{throughput_loss_pct}"
    battery_multiplier_str = f"{battery_multiplier}"
    imagenet_error_2010_pct_str = f"{imagenet_error_2010_pct}"
    imagenet_error_2015_pct_str = f"{imagenet_error_2015_pct}"
    imagenet_competition_end_year_str = f"{imagenet_competition_end_year}"
    imagenet_teams_above_95_str = f"{imagenet_teams_above_95}"
    imagenet_total_teams_str = f"{imagenet_total_teams}"
    mnist_accuracy_pct_str = f"{mnist_accuracy_pct}"
    edge_memory_constraint_x_str = f"{edge_memory_constraint_x}"
    edge_power_constraint_x_str = f"{edge_power_constraint_x}"
    isolated_throughput_qps_str = f"{isolated_throughput_qps}"
    production_throughput_range_str = f"{production_throughput_low_qps}-{production_throughput_high_qps}"
    production_utilization_pct_str = f"{production_utilization_pct}"
    throughput_degradation_pct_str = f"{throughput_degradation_pct}"
    model_inference_range_str = f"{model_inference_low_ms}-{model_inference_high_ms}"
    e2e_latency_range_str = f"{e2e_latency_low_ms}-{e2e_latency_high_ms}"
    availability_nines_str = f"{availability_nines}"
    downtime_minutes_month_str = f"{downtime_minutes_month}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
benchmark_accuracy_pct_str = FallaciesPitfallsSetup.benchmark_accuracy_pct_str
production_accuracy_range_str = FallaciesPitfallsSetup.production_accuracy_range_str
accuracy_drop_pct_str = FallaciesPitfallsSetup.accuracy_drop_pct_str
benchmark_latency_mean_ms_str = FallaciesPitfallsSetup.benchmark_latency_mean_ms_str
production_p99_range_str = FallaciesPitfallsSetup.production_p99_range_str
latency_multiplier_str = FallaciesPitfallsSetup.latency_multiplier_str
pre_quant_latency_ms_str = FallaciesPitfallsSetup.pre_quant_latency_ms_str
post_quant_latency_ms_str = FallaciesPitfallsSetup.post_quant_latency_ms_str
ranking_improvement_str = FallaciesPitfallsSetup.ranking_improvement_str
energy_increase_pct_str = FallaciesPitfallsSetup.energy_increase_pct_str
ood_degradation_pct_str = FallaciesPitfallsSetup.ood_degradation_pct_str
accuracy_improvement_pct_str = FallaciesPitfallsSetup.accuracy_improvement_pct_str
rec_accuracy_pct_str = FallaciesPitfallsSetup.rec_accuracy_pct_str
rec_p99_latency_ms_str = FallaciesPitfallsSetup.rec_p99_latency_ms_str
slo_p99_requirement_ms_str = FallaciesPitfallsSetup.slo_p99_requirement_ms_str
high_throughput_qps_str = FallaciesPitfallsSetup.high_throughput_qps_str
low_throughput_qps_str = FallaciesPitfallsSetup.low_throughput_qps_str
high_power_w_str = FallaciesPitfallsSetup.high_power_w_str
low_power_w_str = FallaciesPitfallsSetup.low_power_w_str
power_ratio_str = FallaciesPitfallsSetup.power_ratio_str
throughput_loss_pct_str = FallaciesPitfallsSetup.throughput_loss_pct_str
battery_multiplier_str = FallaciesPitfallsSetup.battery_multiplier_str
imagenet_error_2010_pct_str = FallaciesPitfallsSetup.imagenet_error_2010_pct_str
imagenet_error_2015_pct_str = FallaciesPitfallsSetup.imagenet_error_2015_pct_str
imagenet_competition_end_year_str = FallaciesPitfallsSetup.imagenet_competition_end_year_str
imagenet_teams_above_95_str = FallaciesPitfallsSetup.imagenet_teams_above_95_str
imagenet_total_teams_str = FallaciesPitfallsSetup.imagenet_total_teams_str
mnist_accuracy_pct_str = FallaciesPitfallsSetup.mnist_accuracy_pct_str
edge_memory_constraint_x_str = FallaciesPitfallsSetup.edge_memory_constraint_x_str
edge_power_constraint_x_str = FallaciesPitfallsSetup.edge_power_constraint_x_str
isolated_throughput_qps_str = FallaciesPitfallsSetup.isolated_throughput_qps_str
production_throughput_range_str = FallaciesPitfallsSetup.production_throughput_range_str
production_utilization_pct_str = FallaciesPitfallsSetup.production_utilization_pct_str
throughput_degradation_pct_str = FallaciesPitfallsSetup.throughput_degradation_pct_str
model_inference_range_str = FallaciesPitfallsSetup.model_inference_range_str
e2e_latency_range_str = FallaciesPitfallsSetup.e2e_latency_range_str
availability_nines_str = FallaciesPitfallsSetup.availability_nines_str
downtime_minutes_month_str = FallaciesPitfallsSetup.downtime_minutes_month_str
```

Benchmarking creates false confidence when standardized measurement obscures deployment realities. Teams assume controlled evaluations predict production performance, but real systems face variability, resource constraints, and multi-objective trade-offs that benchmarks cannot capture. These fallacies waste engineering effort and produce systems optimized for evaluation rather than deployment.

\index{Benchmarking!fallacy of direct performance translation}
**Fallacy:** *Benchmark performance directly translates to real-world application performance.*

The seductive clarity of benchmark rankings leads teams to select systems as though leaderboard position predicts production behavior. It rarely does. As @sec-benchmarking-ml-measurement-challenges-60ea demonstrates, ML systems exhibit inherent variability from data quality issues, distribution shifts, and resource constraints absent in controlled evaluation. A language model achieving `{python} benchmark_accuracy_pct_str`% benchmark accuracy drops to `{python} production_accuracy_range_str`% accuracy in production when processing user-generated text with spelling errors, informal language, and domain-specific terminology. An inference system with `{python} benchmark_latency_mean_ms_str` ms mean latency on MLPerf experiences `{python} production_p99_range_str` ms p99 latency in production (at least `{python} latency_multiplier_str`$\times$ the benchmark mean) due to concurrent load, garbage collection pauses, and network variability. Teams relying solely on benchmark rankings systematically underestimate deployment complexity, leading to failed launches and costly re-engineering.
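
Part of the latency gap is definitional: a mean and a p99 summarize different parts of the same distribution. The sketch below uses a synthetic sample (95 fast requests and 5 slow ones, values invented for illustration) and a nearest-rank percentile helper, `nearest_rank_percentile`, written here rather than taken from any library.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Smallest sample with at least pct percent of the samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic latencies: most requests are fast, a few hit a slow path.
latencies_ms = [10.0] * 95 + [160.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
print(f"p99 / mean = {p99_ms / mean_ms:.1f}x")
```

In this sample the mean is 17.5 ms while the p99 is 160 ms, a gap of roughly 9$\times$ produced by only 5% of requests. A benchmark that reports the mean and an SLO written against the p99 are therefore not measuring the same thing, even on identical traffic.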

**Pitfall:** *Optimizing exclusively for benchmark metrics without considering broader system requirements.*

Benchmark leaderboards incentivize aggressive optimization, but the optimizations that climb rankings often degrade the very characteristics production demands. As discussed in @sec-benchmarking-organizational-strategic-issues-d25a, this exemplifies Goodhart's Law: when benchmark scores become optimization targets, they cease to be meaningful measures of system quality. A team reduces inference latency from `{python} pre_quant_latency_ms_str` ms to `{python} post_quant_latency_ms_str` ms through aggressive quantization, improving its MLPerf ranking by `{python} ranking_improvement_str` positions while degrading calibration such that prediction confidence scores become unreliable for downstream decision-making. Another team achieves a `{python} accuracy_improvement_pct_str`% ImageNet accuracy improvement through extensive hyperparameter tuning, but the optimized model consumes `{python} energy_increase_pct_str`% more energy and exhibits `{python} ood_degradation_pct_str`% worse performance on out-of-distribution images from production cameras. Organizations rewarding benchmark rankings over deployment success systematically produce systems that excel in evaluation but fail in production.

\index{Single-Metric Evaluation!fallacy}
**Fallacy:** *Single-metric evaluation provides sufficient insight into system performance.*

A single number is seductively simple: this system is "94% accurate" or "1,200 QPS fast." But production success requires balancing multiple competing objectives that any single metric obscures. As established in @sec-benchmarking-inference-metrics-78d4, modern inference systems demand evaluation across accuracy, latency, throughput, energy, and robustness dimensions. A recommendation model achieving `{python} rec_accuracy_pct_str`% accuracy with `{python} rec_p99_latency_ms_str` ms p99 latency fails service-level objectives requiring p99 < `{python} slo_p99_requirement_ms_str` ms despite excellent accuracy. Conversely, a system optimized for `{python} high_throughput_qps_str` QPS throughput achieves this rate while consuming `{python} high_power_w_str` W versus `{python} low_power_w_str` W for a slightly slower system at `{python} low_throughput_qps_str` QPS (`{python} power_ratio_str`$\times$ power difference). For battery-powered edge devices, the `{python} throughput_loss_pct_str`% throughput loss enables `{python} battery_multiplier_str`$\times$ longer operation time. Different stakeholders prioritize different metrics: ML engineers focus on accuracy, infrastructure teams on throughput and cost, product managers on latency percentiles. Single-metric optimization systematically produces systems that excel on one dimension while failing deployment requirements on others.
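
The throughput-versus-power example can be restated as energy per query, a derived metric that neither headline number shows on its own. The sketch reuses the power and throughput figures from the paragraph above. The function name `joules_per_query` is hypothetical.

```python
def joules_per_query(power_w: float, throughput_qps: float) -> float:
    """Watts are joules per second, so dividing by queries per second gives J/query."""
    return power_w / throughput_qps


fast = joules_per_query(power_w=420, throughput_qps=1_200)
slow = joules_per_query(power_w=180, throughput_qps=1_000)

print(f"high-throughput system: {fast:.2f} J/query")
print(f"lower-power system:     {slow:.2f} J/query")
print(f"energy-per-query ratio: {fast / slow:.2f}x")
print(f"power ratio:            {420 / 180:.2f}x")
```

The two ratios answer different questions. Battery runtime at a constant draw scales with the power ratio (about 2.3$\times$), whereas the energy cost of serving a fixed number of queries scales with the energy-per-query ratio (about 1.9$\times$). Which one matters depends on whether the device runs continuously or processes a bounded workload.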

\index{Benchmark Saturation!outdated benchmarks pitfall}
**Pitfall:** *Using outdated benchmarks that no longer reflect current challenges and requirements.*

Benchmarks have inertia: teams continue reporting on established benchmarks long after those benchmarks cease to provide meaningful discrimination. Saturation occurs when multiple approaches achieve near-identical performance, eliminating useful comparison. ImageNet top-5 classification error decreased from `{python} imagenet_error_2010_pct_str`% in 2010 to `{python} imagenet_error_2015_pct_str`% by 2015, with the competition ending in `{python} imagenet_competition_end_year_str` when `{python} imagenet_teams_above_95_str` of `{python} imagenet_total_teams_str` teams exceeded 95% accuracy; further optimization beyond this threshold provides marginal value for most applications. Similarly, simple models achieve `{python} mnist_accuracy_pct_str`% accuracy on MNIST, yet teams still report improvements at the third decimal place. As discussed in @sec-benchmarking-statistical-methodological-issues-7aa5, statistical confidence intervals around these measurements often exceed the claimed improvements. Changing deployment contexts compound the problem: benchmarks designed for server hardware become misleading for edge devices with `{python} edge_memory_constraint_x_str`$\times$ tighter memory constraints and `{python} edge_power_constraint_x_str`$\times$ smaller power budgets. Effective benchmarking requires retiring saturated benchmarks and developing evaluation frameworks matching current deployment realities.
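
The claim that confidence intervals can exceed reported improvements is easy to check with a rough calculation. The sketch below uses the normal-approximation (Wald) interval for a proportion, which is crude when accuracy is this close to 100% but adequate for an order-of-magnitude estimate. It assumes a 10,000-example test set (the size of the standard MNIST test split) and a 95% interval ($z = 1.96$). The function name is hypothetical.

```python
import math


def accuracy_ci_half_width(accuracy: float, n_test: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) interval for measured accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test)


half_width = accuracy_ci_half_width(accuracy=0.998, n_test=10_000)
print(f"95% interval: 99.8% +/- {100 * half_width:.3f} percentage points")
```

Under these assumptions the interval is roughly ±0.09 percentage points, so a reported gain of a few hundredths of a point cannot be distinguished from sampling noise on a test set of this size.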

\index{Amdahl's Law!optimization ceiling for ML pipelines}
**Pitfall:** *Applying research-oriented benchmarks to evaluate production system performance.*

Research benchmarks exist to compare algorithms under controlled conditions; production systems exist to serve users under chaotic ones. Applying the former to evaluate the latter systematically overestimates performance, because research benchmarks assume unlimited computational resources, optimal data quality, and idealized conditions absent in production. As established in @sec-benchmarking-laboratorytodeployment-performance-gaps-16c8, production systems face concurrent user loads, varying input quality, network latency, and system failures that degrade performance. A system achieving `{python} isolated_throughput_qps_str` QPS throughput in isolated benchmarks sustains only `{python} production_throughput_range_str` QPS under production load with `{python} production_utilization_pct_str`% utilization (at least `{python} throughput_degradation_pct_str`% degradation) due to queue contention and garbage collection pauses. Research benchmarks report model inference time (`{python} model_inference_range_str` ms) while production end-to-end latency includes preprocessing, queuing, and postprocessing overhead totaling `{python} e2e_latency_range_str` ms. Production systems require `{python} availability_nines_str`% availability (about `{python} downtime_minutes_month_str` minutes of downtime per month) and graceful degradation under failures, characteristics research benchmarks ignore. Effective production evaluation requires operational metrics: sustained throughput under load, recovery time from failures, and complete latency breakdown.
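
Two of the operational figures above follow from short arithmetic, sketched here under stated assumptions: a 30-day month for the downtime budget, and a pairing of the low ends and the high ends of the two latency ranges. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability target."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def model_share(model_ms: float, end_to_end_ms: float) -> float:
    """Fraction of end-to-end latency spent in model inference."""
    return model_ms / end_to_end_ms


budget = downtime_budget_minutes(99.9)
print(f"99.9% availability allows {budget:.1f} minutes of downtime per 30 days")

# Pair the low ends (5 ms of 50 ms) and the high ends (10 ms of 100 ms).
print(f"model share of latency: {model_share(5, 50):.0%} and {model_share(10, 100):.0%}")
```

With these pairings the model accounts for about a tenth of end-to-end latency, so a benchmark that times only inference leaves roughly 90% of what the user experiences unmeasured.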

## Summary {#sec-benchmarking-summary-5b23}

Benchmarking completes Part III's optimization pipeline by validating whether the efficiency gains from data selection (@sec-data-selection), model compression (@sec-model-compression), and hardware acceleration (@sec-hardware-acceleration) deliver in practice. Working backward through the optimization stack (hardware first, then model quality, then data representativeness), the three-dimensional framework catches failures at each layer before they cascade to production.

The validation sequence reflects how problems manifest: hardware issues surface immediately (wrong throughput, thermal throttling), model quality issues emerge under evaluation (accuracy degradation, calibration loss), and data issues often reveal themselves only in production (distribution shift, demographic bias). System benchmarks like MLPerf Training and Inference validate hardware claims with standardized workloads. Model quality benchmarks verify that compression preserved critical properties beyond top-line accuracy. Data benchmarks expose representativeness gaps that no amount of hardware optimization can compensate for.

::: {.callout-takeaways title="Measuring What Matters"}

* **Benchmarking is three-dimensional**: System, model, and data benchmarks each test different failure modes, and the full validation sequence must address all three.
* **Benchmarks are proxies, not truth**: Standardized results like MLPerf provide comparative baselines, but production performance depends on your specific data distribution, load patterns, and SLA constraints.
* **Granularity determines what you can diagnose**: Micro-benchmarks pinpoint which kernel is slow but miss system bottlenecks; macro-benchmarks measure a full model but not the pipeline around it; end-to-end benchmarks capture production behavior but obscure root causes. Effective evaluation combines all three levels.
* **The tail determines the user experience**: Average latency obscures performance failures. Benchmarking for interactive systems must report p95 and p99 tail latencies to ensure SLO compliance under load.
* **Amdahl's Law sets the optimization ceiling**: Model speedup is limited by the non-model fraction of the pipeline. If preprocessing consumes 50% of the latency, even an infinite-speed model can only achieve 2$\times$ total system improvement.
* **Precision is an energy lever**: Relative to FP32, INT8 quantization provides 4$\times$ memory reduction but can deliver 10--20$\times$ energy reduction by shifting the balance from energy-intensive DRAM access to efficient integer arithmetic.
* **Compression validation requires more than accuracy**: INT8 quantization may preserve top-line accuracy while degrading calibration and edge-case robustness—failures invisible to aggregate metrics but critical for deployment.

:::
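
The Amdahl's Law takeaway can be verified in a few lines. The sketch assumes the pipeline splits cleanly into a model fraction that is accelerated and a remainder that is not. The function name `amdahl_speedup` is chosen here for illustration.

```python
def amdahl_speedup(model_fraction: float, model_speedup: float) -> float:
    """End-to-end speedup when only the model fraction of latency is accelerated."""
    return 1.0 / ((1.0 - model_fraction) + model_fraction / model_speedup)


# Preprocessing takes 50% of latency, so the model fraction is 0.5.
for s in (2, 10, float("inf")):
    print(f"model {s}x faster -> system {amdahl_speedup(0.5, s):.2f}x faster")
```

Even a 10$\times$ faster model yields only about a 1.8$\times$ system speedup in this setting, and the limit for an infinitely fast model is exactly 2$\times$, which is why the latency breakdown should be measured before choosing what to optimize.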

*Benchmarking discipline separates engineering from guesswork.* The practitioners who rigorously validate their optimizations (measuring wall-clock latency rather than trusting FLOP counts, profiling tail latencies rather than averages, testing on production-representative data rather than convenient benchmarks) build systems that perform as expected when deployed. As AI systems become increasingly influential in critical applications, this measurement rigor determines whether optimization claims translate into real-world impact.

::: {.callout-chapter-connection title="From Lab to Live"}

We have validated our optimizations in the lab, but a benchmark is a map, not the territory. Production reality includes traffic bursts, data drift, and cascading failures that no static benchmark can capture. In Part IV, we leave the controlled environment of the benchmark for the chaotic reality of production, beginning with @sec-model-serving, where systems must survive contact with the real world.

:::

<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::

```{=latex}
\part{key:vol1_deploy}
```

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:benchmarking")
```