---
title: "Accuracy & Validation"
subtitle: "How Well Do MLSYSIM Predictions Match Real Hardware?"
---
MLSYSIM is a **first-order analytical model**. It predicts performance from closed-form equations, not from cycle-accurate simulation or empirical measurement. This page documents where those predictions are accurate, where they diverge, and what drives the gap.

::: {.callout-note}
## What "first-order" means

A first-order model captures the **dominant** system behavior without modeling second-order effects such as cache hierarchy dynamics, memory fragmentation, NIC DMA contention, or driver overhead. Expect predictions within **15--30%** of measured throughput for well-optimized workloads on modern hardware. Use MLSYSIM to reason about bottlenecks and compare configurations, not to produce production SLA estimates.

For the formal treatment of roofline modeling and arithmetic intensity, see the [Hardware Acceleration slides (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"}.
:::

---

## Design Philosophy

MLSYSIM follows the same trade-off that Hennessy & Patterson's MIPS simulator made for computer architecture: sacrifice cycle accuracy for **taxonomic completeness** and **execution speed**. A gem5 simulation of a single LLaMA-70B inference step can take hours; MLSYSIM solves it in milliseconds. The goal is not to replace empirical benchmarks but to enable rapid, principled reasoning about system design trade-offs before committing to expensive experiments.

This philosophy maps directly to the core analytical models in the [Math Foundations](math.qmd) page. Each model implements a specific approximation grounded in published research, with a characterized accuracy envelope documented below.

---

## The Accuracy Hierarchy

Not all MLSYSIM predictions are created equal. The table below ranks prediction types from most to least accurate, so you know how much to trust each output.

| Prediction Type | Typical Accuracy | Why |
|:---|:---:|:---|
| **KV-cache sizing** | Exact | Definitional formula, not approximated |
| **Checkpoint sizing** | Exact | Direct calculation: $N \times \text{bytes\_per\_param}$ |
| **Bottleneck classification** (compute vs. memory) | >95% correct | The roofline ridge point is a structural property of the hardware |
| **Relative configuration ranking** (which config is faster?) | >90% correct | Errors cancel when comparing two configurations on the same hardware |
| **Scaling efficiency direction** (how does MFU change with DP/TP/PP?) | ±10% on MFU | Communication models are bandwidth-optimal lower bounds |
| **Single-node latency and throughput** (absolute values) | ±15--30% | Sensitive to the efficiency parameter η |
| **TCO and carbon estimates** | ±20% | Dominated by PUE and grid carbon intensity assumptions |
| **ITL in production serving** | −25% to −50% | Missing KV-cache paging, batch scheduling, and quantization kernel overhead |

: {tbl-colwidths="[28,15,57]"}

::: {.callout-tip}
## The golden rule
Trust MLSYSIM for **direction** and **classification**. Be cautious with **absolute numbers**. The model tells you *which* resource is the bottleneck and *which* configuration is better. It does not promise the exact millisecond.
:::

---

## Validation Against Published Benchmarks

The table below compares MLSYSIM roofline predictions against publicly reported results from **MLPerf Inference v4.0** (March 2024) and community vLLM benchmarks.

| Workload | Hardware | Predicted | Measured | Error | Source |
|:---|:---|:---:|:---:|:---:|:---|
| ResNet-50 (BS=1) | A100 SXM4 | ~0.42 ms | ~0.38 ms | +11% | MLPerf Inference v4.0 |
| ResNet-50 (BS=64) | A100 SXM4 | ~8.1 ms | ~7.5 ms | +8% | MLPerf Inference v4.0 |
| BERT-Large (BS=1) | H100 SXM5 | ~2.1 ms | ~1.9 ms | +11% | MLPerf Inference v4.0 |
| Llama2-70B TTFT | H100 SXM5 | ~45 ms (2K ctx) | ~40--50 ms | ±10% | vLLM benchmarks |
| Llama2-70B ITL | H100 SXM5 | ~4.2 ms/token | ~5--8 ms/token | −16% to −48% | vLLM benchmarks |

: {tbl-colwidths="[18,14,14,14,8,32]"}

::: {.callout-warning}
## ITL underprediction is expected

MLSYSIM predicts the **roofline lower bound** for inter-token latency. It does not model quantization kernel overhead, KV-cache paging latency (PagedAttention), or batch scheduling overhead in production serving systems. Real ITL is typically 1.5--2× the roofline bound. Use ITL predictions as the theoretical floor, not as a production estimate.

For the full treatment of serving-system overheads (continuous batching, PagedAttention, SLO compliance), see the [Inference at Scale slides (Vol II, Ch 9)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf){target="_blank"}.
:::
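
A rough sketch of where such a floor comes from, assuming decode is memory-bound so that each generated token streams the model weights and the current KV cache from HBM once. The helper name `itl_floor_ms` and every number below are illustrative placeholders, not MLSYSIM's API or a measured device; the 1.5--2× band is the rule of thumb stated in the callout above.

```python
def itl_floor_ms(weight_bytes: float, kv_cache_bytes: float,
                 mem_bw_bytes_per_s: float) -> float:
    """Memory-bound decode floor in ms per token (hypothetical helper)."""
    return (weight_bytes + kv_cache_bytes) / mem_bw_bytes_per_s * 1e3


# Illustrative placeholders: 16 GB of weights, 1 GB of KV cache, 2 TB/s of HBM bandwidth.
floor_ms = itl_floor_ms(16e9, 1e9, 2e12)

# Rule of thumb from the text: production ITL is typically 1.5-2x the floor.
likely_low_ms, likely_high_ms = 1.5 * floor_ms, 2.0 * floor_ms
print(f"floor = {floor_ms:.2f} ms/token, "
      f"plan for {likely_low_ms:.2f} to {likely_high_ms:.2f} ms/token")
```

Reporting the floor together with the widened band keeps the optimistic bias visible instead of leaving it buried in a single number.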

### Interpreting the error column

All errors are relative to the measured value: $\text{Error} = (\text{Predicted} - \text{Measured}) / \text{Measured}$. A positive error means MLSYSIM **overpredicts** latency (conservative). A negative error means MLSYSIM **underpredicts** (optimistic). The sign matters: overprediction is safer for capacity planning; underprediction is dangerous for SLA commitments.
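
As a quick check, the signed-error definition can be applied to three rows of the validation table. This is a minimal sketch; `signed_error` is a name chosen here for illustration, and the inputs are the rounded values from the table.

```python
def signed_error(predicted: float, measured: float) -> float:
    """(Predicted - Measured) / Measured; positive means latency is overpredicted."""
    return (predicted - measured) / measured


# (predicted ms, measured ms), copied from the validation table.
rows = {
    "ResNet-50 (BS=1)": (0.42, 0.38),
    "ResNet-50 (BS=64)": (8.1, 7.5),
    "BERT-Large (BS=1)": (2.1, 1.9),
}

for name, (pred, meas) in rows.items():
    err = signed_error(pred, meas)
    verdict = "overpredicts (conservative)" if err > 0 else "underpredicts (optimistic)"
    print(f"{name}: {err:+.0%} {verdict}")
```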

---

## Where MLSYSIM is Most Accurate

### Bottleneck classification (roofline)

The model is most reliable for determining *which resource* limits performance. If MLSYSIM predicts a memory bottleneck, the actual workload will almost always be memory-bound too, even if the exact latency differs. This classification is typically >95% correct across documented workloads because the roofline ridge point is a structural property of the hardware that does not depend on software efficiency.

The underlying roofline model is documented in the [Hardware Acceleration slides (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} and applied to real H100 workloads in the [Compute Infrastructure slides (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"}.
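
The classification itself reduces to one comparison between a workload's arithmetic intensity and the hardware ridge point. The sketch below assumes the textbook roofline definitions; `classify_bottleneck` and the round hardware numbers are hypothetical and do not come from MLSYSIM or any datasheet.

```python
def classify_bottleneck(flops: float, bytes_moved: float,
                        peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> str:
    """Return 'compute' or 'memory' by comparing intensity to the ridge point."""
    intensity = flops / bytes_moved                 # FLOP per byte (workload)
    ridge = peak_flops_per_s / mem_bw_bytes_per_s   # FLOP per byte (hardware)
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 1 PFLOP/s peak, 2 TB/s bandwidth, so ridge = 500 FLOP/byte.
PEAK, BW = 1e15, 2e12

# Low-intensity workload (2 FLOP/byte), e.g. batch-1 token generation.
print(classify_bottleneck(2e9, 1e9, PEAK, BW))
# High-intensity workload (1000 FLOP/byte), e.g. a large-batch matrix multiply.
print(classify_bottleneck(1e12, 1e9, PEAK, BW))
```

Workloads far from the ridge keep their label even when the latency estimate is off by tens of percent, which is why classification is more robust than absolute prediction.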

### Scaling efficiency direction (distributed training)

The model correctly predicts how scaling efficiency changes as you vary DP/TP/PP configuration. The *relative* ranking of configurations is reliable even when absolute MFU values are off by ±10%. The ranking holds because the communication cost models (Ring AllReduce, Tree AllReduce, Hierarchical AllReduce) implement bandwidth-optimal lower bounds from Patarasuk & Yuan (2009), so the relative costs scale correctly.

See the [Distributed Training slides (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} for the 3D parallelism framework and scaling efficiency analysis.
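
For intuition, the bandwidth term of a ring AllReduce is $2\,(N-1)/N \cdot S / \text{BW}$ for $N$ ranks, message size $S$, and per-link bandwidth BW. The sketch below assumes that textbook form and ignores the per-step latency term; the function name and the numbers are illustrative, and MLSYSIM's own implementation may differ in detail.

```python
def ring_allreduce_seconds(message_bytes: float, n_ranks: int,
                           link_bw_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for ring AllReduce (latency term omitted)."""
    if n_ranks < 2:
        return 0.0  # nothing to reduce across a single rank
    return 2 * (n_ranks - 1) / n_ranks * message_bytes / link_bw_bytes_per_s


# Illustrative placeholders: 1 GB of gradients over 100 GB/s links.
for n in (2, 8, 64):
    t = ring_allreduce_seconds(1e9, n, 100e9)
    print(f"{n:>3} ranks: {t * 1e3:.2f} ms")
```

Because the factor $2\,(N-1)/N$ approaches 2 as $N$ grows, the bound flattens with scale, so measured overheads shift absolute MFU more than they change which configuration wins.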

### KV-cache and checkpoint sizing

The KV-cache formula ($\text{KV} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times S \times B \times \text{bpe}$) is **definitional**, not approximated. It computes the exact number of bytes the K and V tensors occupy. Similarly, checkpoint sizing ($N \times \text{bytes\_per\_param}$) is a direct count. Memory feasibility checks (`feasible: True/False`) are accurate because they compare against the same HBM capacity reported in manufacturer datasheets.
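
Both sizing formulas translate directly into integer arithmetic. The sketch below is a standalone illustration, not MLSYSIM's API: the function names, the model shape (80 layers, 8 KV heads, head dimension 128), and the HBM capacity are placeholder assumptions.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """2 (K and V) x L x H_kv x d_head x S x B x bpe."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def checkpoint_bytes(n_params: int, bytes_per_param: int) -> int:
    """N x bytes_per_param."""
    return n_params * bytes_per_param


def feasible(required_bytes: int, hbm_bytes: int) -> bool:
    """True when the requirement fits in the stated HBM capacity."""
    return required_bytes <= hbm_bytes


# Placeholder shape: 2K context, batch 1, 2-byte elements.
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    seq_len=2048, batch=1, bytes_per_elem=2)
weights = checkpoint_bytes(n_params=70_000_000_000, bytes_per_param=2)

print(f"KV cache: {kv / 2**30:.3f} GiB, weights: {weights / 1e9:.0f} GB")
print("feasible:", feasible(kv + weights, hbm_bytes=640_000_000_000))
```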

### Carbon and TCO estimates

Sustainability and economics predictions are accurate to within 20% for standard cloud deployments. The main source of error is the assumed PUE (power usage effectiveness), which varies from ~1.1 (hyperscaler) to ~1.6 (enterprise). The carbon accounting methodology aligns with the lifecycle framework in the [Sustainable AI slides (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"}.
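
To see how much the PUE assumption alone can move the result, the sketch below assumes operational carbon is IT energy × PUE × grid carbon intensity. That form, the helper name, and the energy and grid values are assumptions made for illustration, not MLSYSIM's documented formula.

```python
def operational_carbon_kg(it_energy_kwh: float, pue: float,
                          grid_kg_co2_per_kwh: float) -> float:
    """Facility energy (IT energy x PUE) times grid carbon intensity."""
    return it_energy_kwh * pue * grid_kg_co2_per_kwh


IT_ENERGY_KWH = 10_000.0   # placeholder job energy
GRID = 0.4                 # placeholder kg CO2 per kWh
ASSUMED_PUE = 1.35         # midpoint of the 1.1-1.6 range quoted above

predicted = operational_carbon_kg(IT_ENERGY_KWH, ASSUMED_PUE, GRID)
for true_pue in (1.1, 1.6):
    actual = operational_carbon_kg(IT_ENERGY_KWH, true_pue, GRID)
    error = (predicted - actual) / actual
    print(f"true PUE {true_pue}: signed error {error:+.1%}")
```

Under these assumptions a midpoint PUE lands roughly 23% high for a hyperscaler and 16% low for an enterprise facility, which is the same order as the 20% envelope quoted above.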

---

## Where MLSYSIM Diverges From Measurement {#sec-divergence}

| Source of Error | Typical Impact | When It Matters |
|:---|:---:|:---|
| `efficiency=0.5` default | ±20% on latency | Any roofline prediction |
| No cache hierarchy model | 5--30% on small batches | Batch size 1--4, small models |
| No NVLink contention model | 5--15% on TP overhead | Tensor parallel with TP > 4 |
| No pipeline schedule optimization | 10--20% on PP efficiency | Interleaved 1F1B schedules |
| No quantization kernel overhead | −30% on INT8 ITL | Quantized serving |
| No memory fragmentation | −10% to −20% on KV-cache capacity | Long-context serving |

: {tbl-colwidths="[30,18,52]"}

### The efficiency parameter (η)

MLSYSIM uses `efficiency` (η, default 0.5) as a single scalar representing hardware utilization. This is the largest source of absolute error. The table below provides calibrated ranges from published benchmarks.

| Workload Type | Recommended η | Reference |
|:---|:---:|:---|
| Training (fp16/bf16, Megatron-LM) | 0.35--0.55 | Shoeybi et al. (2019) |
| Inference (fp16, vLLM/TRT-LLM) | 0.25--0.45 | MLPerf Inference v4.0 |
| Inference (int8, quantized) | 0.20--0.40 | Community benchmarks |
| Edge inference (TFLite, ONNX RT) | 0.15--0.30 | MLPerf Tiny |

: {tbl-colwidths="[35,20,45]"}

When you use the default `efficiency=0.5`, you are modeling a well-optimized training job on datacenter hardware. For inference, pass `efficiency=0.35` for more conservative estimates. For edge devices, use `efficiency=0.2`.
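
The sensitivity to η is easy to see in a toy roofline. The sketch below assumes η scales only the compute peak and leaves memory bandwidth untouched; that choice, the function name, and the hardware numbers are assumptions made here, since this page does not specify how MLSYSIM applies the parameter internally.

```python
def roofline_latency_s(flops: float, bytes_moved: float,
                       peak_flops_per_s: float, mem_bw_bytes_per_s: float,
                       efficiency: float = 0.5) -> float:
    """Latency is the slower of derated compute time and memory time."""
    compute_s = flops / (efficiency * peak_flops_per_s)
    memory_s = bytes_moved / mem_bw_bytes_per_s
    return max(compute_s, memory_s)


# Placeholder compute-bound workload on a hypothetical 1 PFLOP/s, 2 TB/s device.
FLOPS, BYTES, PEAK, BW = 1e12, 1e9, 1e15, 2e12

baseline = roofline_latency_s(FLOPS, BYTES, PEAK, BW, efficiency=0.5)
for eta in (0.5, 0.35, 0.2):
    t = roofline_latency_s(FLOPS, BYTES, PEAK, BW, efficiency=eta)
    print(f"eta={eta}: {t * 1e3:.2f} ms ({t / baseline:.2f}x the default)")
```

For a compute-bound workload, latency scales as 1/η, so moving from the training default to an inference or edge value changes the absolute prediction far more than any other single knob.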

For a rigorous treatment of precision engineering and its impact on performance, see the [Performance Engineering slides (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"}.

### What is *not* modeled

MLSYSIM deliberately omits the following effects. Each omission is a design decision, not an oversight:

1. **Cache hierarchy behavior.** L1/L2/SRAM tiling effects can swing latency by 5--30% for small batch sizes. Modeling them would require operator-level simulation, which conflicts with the goal of millisecond solve times.

2. **NVLink contention under tensor parallelism.** For TP ≤ 4 (within a single NVSwitch domain), contention is negligible. For TP > 4 (cross-switch), contention can add 5--15% overhead that MLSYSIM does not capture.

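3. **Pipeline schedule variants.** MLSYSIM implements the standard pipeline bubble formula $\text{bubble} = (P - 1) / (V \cdot M + P - 1)$, where $P$ is the number of pipeline stages, $M$ the number of microbatches, and $V$ the number of virtual stages per device ($V = 1$ gives the GPipe case). Advanced schedules (zero-bubble, interleaved 1F1B with $V > 1$) can reduce bubbles by 10--20% beyond what the formula predicts.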

4. **Quantization kernel efficiency.** INT8/INT4 kernels achieve lower utilization than FP16 Tensor Core kernels due to dequantization overhead and less mature compiler support. MLSYSIM treats precision as a pure ops-per-byte ratio.

5. **Memory fragmentation and PagedAttention overhead.** The KV-cache formula gives the *dense* allocation size. In practice, PagedAttention introduces fragmentation (typically 5--10%) and paging latency.
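
The bubble formula from item 3 can be evaluated directly. In this sketch, $P$ is taken to be the number of pipeline stages, $M$ the number of microbatches, and $V$ the number of virtual stages per device; those readings follow common usage and are an assumption here, as are the function name and the example values.

```python
def bubble_fraction(p_stages: int, microbatches: int, virtual_stages: int = 1) -> float:
    """bubble = (P - 1) / (V * M + P - 1)."""
    return (p_stages - 1) / (virtual_stages * microbatches + p_stages - 1)


# Example configuration: 8 pipeline stages.
for m in (8, 32, 128):
    print(f"P=8, M={m:>3}, V=1: bubble = {bubble_fraction(8, m):.1%}")

# Same stages and microbatches with two virtual stages per device.
print(f"P=8, M= 32, V=2: bubble = {bubble_fraction(8, 32, virtual_stages=2):.1%}")
```

More microbatches shrink the bubble quickly, so the estimate is most sensitive to the schedule when $M$ is small relative to $P$.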

---

## Comparison to Related Tools

| Tool | Type | Accuracy | Solve Time | Best For |
|:---|:---|:---:|:---:|:---|
| **MLSYSIM** | First-order analytical | ±15--30% | Milliseconds | Bottleneck analysis, HW comparison, education |
| **MLPerf** | Empirical benchmark | Ground truth | Days--weeks | Published industry comparisons |
| **vLLM benchmark_serving.py** | Empirical profiling | Exact (that config) | Hours | Production serving tuning |
| **PyTorch Profiler** | Empirical profiling | Exact (that run) | Minutes | Kernel-level optimization |
| **Megatron estimator** | Heuristic model | ±5--10% | Seconds | Megatron-specific training configs |
| **gem5** | Cycle-accurate simulation | ±1--5% | Days | Hardware research (100--1000× slower) |

: {tbl-colwidths="[18,17,13,13,39]"}

Among the tools in this table, MLSYSIM is the only one that covers the full stack (single-node roofline through fleet-scale TCO) in a single, composable framework. Use it when you want to **compare options before running experiments**: "Will the H100 or MI300X be better for this serving workload?" or "Does PP=4 or PP=8 give better scaling efficiency?" For production SLAs, validate with empirical benchmarks.

For the full benchmarking methodology taxonomy (system, model, and data benchmarks), statistical rigor requirements (percentiles, confidence intervals), and MLPerf submission anatomy, see the [Benchmarking slides (Vol I, Ch 12)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf){target="_blank"}.

---

## When to Use MLSYSIM (and When Not To)

::: {.panel-tabset}

### Use MLSYSIM when...

- **Comparing hardware** for a known workload ("H100 vs. MI300X for Llama-3 70B?")
- **Choosing parallelism strategy** ("DP=8 or TP=4×PP=2?")
- **Estimating memory feasibility** ("Does Llama-3 405B fit in 8×H100 HBM at FP16?")
- **Budgeting TCO and carbon** before procurement decisions
- **Teaching** the Iron Law, roofline model, and systems reasoning
- **Rapid prototyping** system designs before committing to cluster time

### Do not use MLSYSIM for...

- **Production SLA guarantees** (use empirical benchmarks instead)
- **Kernel-level optimization** (use PyTorch Profiler or Nsight Compute)
- **Quantized model accuracy** (MLSYSIM models throughput, not model quality)
- **Exact latency targets** where ±30% error is unacceptable
- **Novel hardware** with no published specs in the [Silicon Zoo](zoo/hardware.qmd)

:::

---

## Slide Deck Reference

The theory behind MLSYSIM's analytical models is covered across several lecture decks. Use this table to find the right slides for each validation domain.

| Validation Domain | MLSYSIM Solver | Companion Slides |
|:---|:---|:---|
| Roofline model, arithmetic intensity, bottleneck classification | SingleNodeModel | [Hardware Acceleration (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} |
| Statistical rigor, MLPerf methodology, benchmark anti-patterns | All solvers | [Benchmarking (Vol I, Ch 12)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf){target="_blank"} |
| H100 roofline landscape, GPU specs, TCO analysis | SingleNodeModel, EconomicsModel | [Compute Infrastructure (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"} |
| 3D parallelism, scaling efficiency, communication overhead | DistributedModel | [Distributed Training (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} |
| Serving latency, KV-cache, continuous batching, SLO compliance | ServingModel | [Inference at Scale (Vol II, Ch 9)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf){target="_blank"} |
| Precision engineering, operator fusion, profiling methodology | SingleNodeModel | [Performance Engineering (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"} |
| Energy, carbon lifecycle, PUE, energy roofline | SustainabilityModel | [Sustainable AI (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} |

: {tbl-colwidths="[30,20,50]"}

---

## Citing Sources

The hardware specifications in the [Silicon Zoo](zoo/hardware.qmd) are sourced from official manufacturer datasheets. See the `source_url` and `last_verified` metadata fields in `mlsysim/hardware/registry.py` for the specific document and verification date for each entry.

For the MLPerf comparison data on this page, see [MLPerf Inference v4.0 Results](https://mlcommons.org/benchmarks/inference-datacenter/) (MLCommons, March 2024).

---

*If you observe a significant discrepancy between MLSYSIM predictions and measured results on your hardware, please [open an issue](https://github.com/harvard-edge/cs249r_book/issues) with the workload, hardware, and measured numbers. Discrepancies often reveal bugs or missing constants in the model.*