---
title: "Accuracy & Validation"
subtitle: "How Well Do MLSYSIM Predictions Match Real Hardware?"
---

MLSYSIM is a **first-order analytical model**. It predicts performance from closed-form equations, not from cycle-accurate simulation or empirical measurement. This page documents where those predictions are accurate, where they diverge, and what drives the gap.

::: {.callout-note}
## What "first-order" means

A first-order model captures the **dominant** system behavior without modeling second-order effects such as cache hierarchy dynamics, memory fragmentation, NIC DMA contention, or driver overhead. Expect predictions within **15--30%** of measured throughput for well-optimized workloads on modern hardware. Use MLSYSIM to reason about bottlenecks and compare configurations, not to produce production SLA estimates.

For the formal treatment of roofline modeling and arithmetic intensity, see the [Hardware Acceleration slides (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"}.
:::

---

## Design Philosophy
MLSYSIM follows the same trade-off that Hennessy & Patterson's MIPS simulator made for computer architecture: sacrifice cycle accuracy for **taxonomic completeness** and **execution speed**. A gem5 simulation of a single LLaMA-70B inference step can take hours; MLSYSIM solves it in milliseconds. The goal is not to replace empirical benchmarks but to enable rapid, principled reasoning about system design trade-offs before committing to expensive experiments.

This philosophy maps directly to the core analytical models in the [Math Foundations](math.qmd) page. Each model implements a specific approximation grounded in published research, with a characterized accuracy envelope documented below.

---

## The Accuracy Hierarchy
Not all MLSYSIM predictions are created equal. The table below ranks prediction types from most to least accurate, so you know how much to trust each output.

| Prediction Type | Typical Accuracy | Why |
|:---|:---:|:---|
| **KV-cache sizing** | Exact | Definitional formula, not approximated |
| **Checkpoint sizing** | Exact | Direct calculation: $N \times \text{bytes\_per\_param}$ |
| **Bottleneck classification** (compute vs. memory) | >95% correct | The roofline ridge point is a structural property of the hardware |
| **Relative configuration ranking** (which config is faster?) | >90% correct | Errors cancel when comparing two configurations on the same hardware |
| **Scaling efficiency direction** (how does MFU change with DP/TP/PP?) | ±10% on MFU | Communication models are bandwidth-optimal lower bounds |
| **Single-node throughput** (absolute latency) | ±15--30% | Sensitive to the efficiency parameter η |
| **TCO and carbon estimates** | ±20% | Dominated by PUE and grid carbon intensity assumptions |
| **ITL in production serving** | −25% to −50% | Missing KV-cache paging, batch scheduling, and quantization kernel overhead |

: {tbl-colwidths="[28,15,57]"}

::: {.callout-tip}
## The golden rule
Trust MLSYSIM for **direction** and **classification**. Be cautious with **absolute numbers**. The model tells you *which* resource is the bottleneck and *which* configuration is better. It does not promise the exact millisecond.
:::

---

## Validation Against Published Benchmarks
The table below compares MLSYSIM roofline predictions against publicly reported results from **MLPerf Inference v4.0** (July 2024) and community benchmarks.

| Workload | Hardware | Predicted | Measured | Error | Source |
|:---|:---|:---:|:---:|:---:|:---|
| ResNet-50 (BS=1) | A100 SXM4 | ~0.42 ms | ~0.38 ms | +11% | MLPerf Inference v4.0 |
| ResNet-50 (BS=64) | A100 SXM4 | ~8.1 ms | ~7.5 ms | +8% | MLPerf Inference v4.0 |
| BERT-Large (BS=1) | H100 SXM5 | ~2.1 ms | ~1.9 ms | +11% | MLPerf Inference v4.0 |
| Llama2-70B TTFT | H100 SXM5 | ~45 ms (2K ctx) | ~40--50 ms | ±10% | vLLM benchmarks |
| Llama2-70B ITL | H100 SXM5 | ~4.2 ms/token | ~5--8 ms/token | −25% | vLLM benchmarks |

: {tbl-colwidths="[18,14,14,14,8,32]"}

::: {.callout-warning}
## ITL underprediction is expected

MLSYSIM predicts the **roofline lower bound** for inter-token latency. It does not model quantization kernel overhead, KV-cache paging latency (PagedAttention), or batch scheduling overhead in production serving systems. Real ITL is typically 1.5--2× the roofline bound. Use ITL predictions as the theoretical floor, not as a production estimate.

For the full treatment of serving-system overheads (continuous batching, PagedAttention, SLO compliance), see the [Inference at Scale slides (Vol II, Ch 9)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf){target="_blank"}.
:::

### Interpreting the error column

All errors are relative to the measured value: $\text{Error} = (\text{Predicted} - \text{Measured}) / \text{Measured}$. A positive error means MLSYSIM **overpredicts** latency (conservative). A negative error means MLSYSIM **underpredicts** (optimistic). The sign matters: overprediction is safer for capacity planning; underprediction is dangerous for SLA commitments.

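As a quick check of the sign convention, the sketch below recomputes the Error column for three rows of the validation table. The helper name `signed_relative_error` is hypothetical and is not part of MLSYSIM; the latencies are the approximate values quoted in the table, so the results only reproduce the rounded percentages.

```python
def signed_relative_error(predicted: float, measured: float) -> float:
    """Return (predicted - measured) / measured as a fraction."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return (predicted - measured) / measured


# (workload, predicted ms, measured ms), copied from the validation table.
rows = [
    ("ResNet-50 (BS=1), A100", 0.42, 0.38),
    ("ResNet-50 (BS=64), A100", 8.1, 7.5),
    ("BERT-Large (BS=1), H100", 2.1, 1.9),
]

for name, predicted, measured in rows:
    error = signed_relative_error(predicted, measured)
    # Positive error: predicted latency is higher than measured (conservative).
    verdict = "overpredicts" if error > 0 else "underpredicts"
    print(f"{name}: {error:+.0%} ({verdict} latency)")
```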

---

## Where MLSYSIM is Most Accurate

### Bottleneck classification (roofline)

The model is most reliable for determining *which resource* limits performance. If MLSYSIM predicts a memory bottleneck, the actual workload will almost always be memory-bound too, even if the exact latency differs. This classification is typically >95% correct across documented workloads because the roofline ridge point is a structural property of the hardware that does not depend on software efficiency.

The underlying roofline model is documented in the [Hardware Acceleration slides (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} and applied to real H100 workloads in the [Compute Infrastructure slides (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"}.

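To see why the classification tolerates large latency errors, it can help to compute a ridge point by hand. The sketch below is a generic roofline calculation, not MLSYSIM's API: the function names are hypothetical, and the H100 SXM figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s HBM bandwidth) are rounded datasheet values assumed for illustration. A workload whose arithmetic intensity sits two orders of magnitude below the ridge stays memory-bound even if achieved efficiency is off by a factor of two.

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory time are equal."""
    return peak_flops / mem_bandwidth


def classify(flops: float, bytes_moved: float,
             peak_flops: float, mem_bandwidth: float) -> str:
    """Label a workload by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Assumed, rounded H100 SXM values (illustrative only).
PEAK_FLOPS = 989e12      # FLOP/s, dense BF16
MEM_BANDWIDTH = 3.35e12  # bytes/s, HBM3

print(f"ridge point = {ridge_point(PEAK_FLOPS, MEM_BANDWIDTH):.0f} FLOP/byte")

# Rough assumption for batch-1 decode of a 16-bit model: about 2 FLOPs per
# parameter per token against 2 bytes read per parameter (intensity near 1).
params = 70e9
print(classify(flops=2 * params, bytes_moved=2 * params,
               peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH))
```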

### Scaling efficiency direction (distributed training)

The model correctly predicts how scaling efficiency changes as you vary DP/TP/PP configuration. The *relative* ranking of configurations is reliable even when absolute MFU values are off by ±10%. This is because the communication cost models (Ring AllReduce, Tree AllReduce, Hierarchical AllReduce) implement bandwidth-optimal lower bounds from Patarasuk & Yuan (2009), so the relative costs scale correctly.

See the [Distributed Training slides (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} for the 3D parallelism framework and scaling efficiency analysis.

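The bandwidth-optimal bound for a ring AllReduce fits in a few lines. This is a sketch of the textbook bandwidth term only (per-step latency is ignored), with hypothetical names; it is not necessarily how MLSYSIM implements its communication model. The gradient size and per-device link bandwidth are assumed values. Note how the cost approaches twice the message transfer time as the device count grows, which is why relative comparisons between configurations stay stable.

```python
def ring_allreduce_seconds(message_bytes: float, num_devices: int,
                           link_bandwidth: float) -> float:
    """Bandwidth term of ring AllReduce: each device moves 2*(N-1)/N of the message."""
    if num_devices < 1:
        raise ValueError("num_devices must be >= 1")
    if num_devices == 1:
        return 0.0  # nothing to reduce across devices
    volume = 2 * (num_devices - 1) / num_devices * message_bytes
    return volume / link_bandwidth


# Assumed example: 7e9 parameters of 16-bit gradients, 50 GB/s per-device links.
grad_bytes = 7e9 * 2
for n in (2, 8, 64):
    t = ring_allreduce_seconds(grad_bytes, n, link_bandwidth=50e9)
    print(f"N={n:>2}: {t * 1e3:.0f} ms")
```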

### KV-cache and checkpoint sizing

The KV-cache formula ($\text{KV} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times S \times B \times \text{bpe}$) is **definitional**, not approximated. It computes the exact number of bytes the K and V tensors occupy. Similarly, checkpoint sizing ($N \times \text{bytes\_per\_param}$) is a direct count. Memory feasibility checks (`feasible: True/False`) are accurate because they compare against the same HBM capacity reported in manufacturer datasheets.

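Both sizing formulas are simple enough to evaluate directly. The sketch below uses hypothetical helper names (not MLSYSIM's API) and an assumed Llama-2-70B-style shape of 80 layers, 8 KV heads, and head dimension 128, at 16-bit precision. The 80 GB capacity is likewise an assumed device size, and the feasibility check here ignores activations and runtime overhead.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """KV = 2 * L * H_kv * d_head * S * B * bpe (the factor 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def checkpoint_bytes(num_params: int, bytes_per_param: int) -> int:
    """Checkpoint size as a direct count: N * bytes_per_param."""
    return num_params * bytes_per_param


# Assumed model shape and context length (illustrative only).
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    seq_len=4096, batch=1, bytes_per_elem=2)
weights = checkpoint_bytes(num_params=70_000_000_000, bytes_per_param=2)

HBM_BYTES = 80 * 10**9  # assumed 80 GB accelerator
print(f"KV cache: {kv / 2**30:.2f} GiB")
print(f"weights:  {weights / 10**9:.0f} GB")
print(f"feasible on one device: {kv + weights <= HBM_BYTES}")
```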

### Carbon and TCO estimates

Sustainability and economics predictions are accurate to within 20% for standard cloud deployments. The main source of error is the assumed PUE (power usage effectiveness), which varies from ~1.1 (hyperscaler) to ~1.6 (enterprise). The carbon accounting methodology aligns with the lifecycle framework in the [Sustainable AI slides (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"}.

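A small sensitivity check shows how far the PUE assumption alone can move a carbon estimate. The sketch covers operational carbon only (embodied carbon is left out), uses a hypothetical function name, and assumes an arbitrary job size and grid intensity; none of it is taken from MLSYSIM's implementation. Relative to a midpoint PUE of 1.35, the two ends of the quoted range land a little under 20% away in each direction.

```python
def operational_carbon_kg(it_energy_kwh: float, pue: float,
                          grid_kg_per_kwh: float) -> float:
    """Facility energy (IT energy * PUE) times grid carbon intensity."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_kwh * pue * grid_kg_per_kwh


# Assumed job: 10,000 kWh of IT energy on a 0.4 kgCO2e/kWh grid.
IT_KWH, GRID = 10_000.0, 0.4
baseline = operational_carbon_kg(IT_KWH, pue=1.35, grid_kg_per_kwh=GRID)

for pue in (1.1, 1.35, 1.6):
    kg = operational_carbon_kg(IT_KWH, pue, GRID)
    print(f"PUE {pue:.2f}: {kg:,.0f} kgCO2e ({kg / baseline - 1:+.1%} vs. midpoint)")
```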

---

## Where MLSYSIM Diverges From Measurement {#sec-divergence}

| Source of Error | Typical Impact | When It Matters |
|:---|:---:|:---|
| `efficiency=0.5` default | ±20% on latency | Any roofline prediction |
| No cache hierarchy model | 5--30% on small batches | Batch size 1--4, small models |
| No NVLink contention model | 5--15% on TP overhead | Tensor parallel with TP > 4 |
| No pipeline schedule optimization | 10--20% on PP efficiency | Interleaved 1F1B schedules |
| No quantization kernel overhead | −30% on INT8 ITL | Quantized serving |
| No memory fragmentation | −10--20% on KV-cache capacity | Long-context serving |

: {tbl-colwidths="[30,18,52]"}

### The efficiency parameter (η)

MLSYSIM uses `efficiency` (η, default 0.5) as a single scalar representing hardware utilization. This is the largest source of absolute error. The table below provides calibrated ranges from published benchmarks.

| Workload Type | Recommended η | Reference |
|:---|:---:|:---|
| Training (fp16/bf16, Megatron-LM) | 0.35--0.55 | Shoeybi et al. (2019) |
| Inference (fp16, vLLM/TRT-LLM) | 0.25--0.45 | MLPerf Inference v4.0 |
| Inference (int8, quantized) | 0.20--0.40 | Community benchmarks |
| Edge inference (TFLite, ONNX RT) | 0.15--0.30 | MLPerf Tiny |

: {tbl-colwidths="[35,20,45]"}

When you use the default `efficiency=0.5`, you are modeling a well-optimized training job on datacenter hardware. For inference, pass `efficiency=0.35` for more conservative estimates. For edge devices, use `efficiency=0.2`.

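The effect of η on a compute-bound prediction can be illustrated with a one-line model. The sketch assumes η simply scales peak compute throughput, which may differ from how MLSYSIM applies the parameter internally; `compute_time_s` and the workload numbers are hypothetical. Under that assumption, moving from the default to the inference or edge settings stretches predicted latency in proportion to 1/η.

```python
def compute_time_s(flops: float, peak_flops: float, efficiency: float = 0.5) -> float:
    """Compute-bound time when only a fraction `efficiency` of peak is achieved."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return flops / (efficiency * peak_flops)


# Assumed workload: 1e15 FLOPs on a device with 1e15 FLOP/s peak.
FLOPS, PEAK = 1e15, 1e15
default = compute_time_s(FLOPS, PEAK)  # eta = 0.5

for eta in (0.5, 0.35, 0.2):
    t = compute_time_s(FLOPS, PEAK, eta)
    print(f"eta={eta:.2f}: {t:.2f} s ({t / default:.2f}x the default)")
```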

For a rigorous treatment of precision engineering and its impact on performance, see the [Performance Engineering slides (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"}.

### What is *not* modeled

MLSYSIM deliberately omits the following effects. Each omission is a design decision, not an oversight:

1. **Cache hierarchy behavior.** L1/L2/SRAM tiling effects can swing latency by 5--30% for small batch sizes. Modeling them would require operator-level simulation, which conflicts with the goal of millisecond solve times.

2. **NVLink contention under tensor parallelism.** For TP ≤ 4 (within a single NVSwitch domain), contention is negligible. For TP > 4 (cross-switch), contention can add 5--15% overhead that MLSYSIM does not capture.

3. **Pipeline schedule variants.** MLSYSIM implements the standard GPipe bubble formula: $\text{bubble} = (P - 1) / (V \cdot M + P - 1)$. Advanced schedules (zero-bubble, interleaved 1F1B with $V > 1$) can reduce bubbles by 10--20% beyond what the formula predicts.

4. **Quantization kernel efficiency.** INT8/INT4 kernels achieve lower utilization than FP16 Tensor Core kernels due to dequantization overhead and less mature compiler support. MLSYSIM treats precision as a pure ops-per-byte ratio.

5. **Memory fragmentation and PagedAttention overhead.** The KV-cache formula gives the *dense* allocation size. In practice, PagedAttention introduces fragmentation (typically 5--10%) and paging latency.

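The bubble formula from item 3 is easy to evaluate for a few configurations. The sketch reads $P$ as the number of pipeline stages, $M$ as the number of microbatches, and $V$ as the number of virtual stages per device; that reading is an interpretation, since this page does not define the symbols, and `bubble_fraction` is a hypothetical helper rather than an MLSYSIM function.

```python
def bubble_fraction(stages: int, microbatches: int, virtual_stages: int = 1) -> float:
    """bubble = (P - 1) / (V * M + P - 1)."""
    if min(stages, microbatches, virtual_stages) < 1:
        raise ValueError("all arguments must be >= 1")
    return (stages - 1) / (virtual_stages * microbatches + stages - 1)


# More microbatches (M) or more virtual stages (V) shrink the bubble;
# more pipeline stages (P) grow it.
for p, m, v in [(4, 8, 1), (4, 32, 1), (8, 32, 1), (8, 32, 2)]:
    print(f"P={p}, M={m}, V={v}: bubble = {bubble_fraction(p, m, v):.1%}")
```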

---

## Comparison to Related Tools

| Tool | Type | Accuracy | Solve Time | Best For |
|:---|:---|:---:|:---:|:---|
| **MLSYSIM** | First-order analytical | ±15--30% | Milliseconds | Bottleneck analysis, HW comparison, education |
| **MLPerf** | Empirical benchmark | Ground truth | Days--weeks | Published industry comparisons |
| **vLLM benchmark_serving.py** | Empirical profiling | Exact (that config) | Hours | Production serving tuning |
| **PyTorch Profiler** | Empirical profiling | Exact (that run) | Minutes | Kernel-level optimization |
| **Megatron estimator** | Heuristic model | ±5--10% | Seconds | Megatron-specific training configs |
| **gem5** | Cycle-accurate simulation | ±1--5% | Days | Hardware research (100--1000× slower) |

: {tbl-colwidths="[18,17,13,13,39]"}

MLSYSIM occupies a unique position: it is the only tool that covers the full stack (single-node roofline through fleet-scale TCO) in a single, composable framework. Use it when you want to **compare options before running experiments**: "Will the H100 or MI300X be better for this serving workload?" or "Does PP=4 or PP=8 give better scaling efficiency?" For production SLAs, validate with empirical benchmarks.

For the full benchmarking methodology taxonomy (system, model, and data benchmarks), statistical rigor requirements (percentiles, confidence intervals), and MLPerf submission anatomy, see the [Benchmarking slides (Vol I, Ch 12)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf){target="_blank"}.

---

## When to Use MLSYSIM (and When Not To)
::: {.panel-tabset}

### Use MLSYSIM when...

- **Comparing hardware** for a known workload ("H100 vs. MI300X for Llama-3 70B?")
- **Choosing parallelism strategy** ("DP=8 or TP=4×PP=2?")
- **Estimating memory feasibility** ("Does Llama-3 405B fit in 8×H100 HBM at FP16?")
- **Budgeting TCO and carbon** before procurement decisions
- **Teaching** the Iron Law, roofline model, and systems reasoning
- **Rapid prototyping** system designs before committing to cluster time

### Do not use MLSYSIM for...

- **Production SLA guarantees** (use empirical benchmarks instead)
- **Kernel-level optimization** (use PyTorch Profiler or Nsight Compute)
- **Quantized model accuracy** (MLSYSIM models throughput, not model quality)
- **Exact latency targets** where ±30% error is unacceptable
- **Novel hardware** with no published specs in the [Silicon Zoo](zoo/hardware.qmd)

:::

---

## Slide Deck Reference
The theory behind MLSYSIM's analytical models is covered across several lecture decks. Use this table to find the right slides for each validation domain.

| Validation Domain | MLSYSIM Solver | Companion Slides |
|:---|:---|:---|
| Roofline model, arithmetic intensity, bottleneck classification | SingleNodeModel | [Hardware Acceleration (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} |
| Statistical rigor, MLPerf methodology, benchmark anti-patterns | All solvers | [Benchmarking (Vol I, Ch 12)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf){target="_blank"} |
| H100 roofline landscape, GPU specs, TCO analysis | SingleNodeModel, EconomicsModel | [Compute Infrastructure (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"} |
| 3D parallelism, scaling efficiency, communication overhead | DistributedModel | [Distributed Training (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} |
| Serving latency, KV-cache, continuous batching, SLO compliance | ServingModel | [Inference at Scale (Vol II, Ch 9)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf){target="_blank"} |
| Precision engineering, operator fusion, profiling methodology | SingleNodeModel | [Performance Engineering (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"} |
| Energy, carbon lifecycle, PUE, energy roofline | SustainabilityModel | [Sustainable AI (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} |

: {tbl-colwidths="[30,20,50]"}

---

## Citing Sources

The hardware specifications in the [Silicon Zoo](zoo/hardware.qmd) are sourced from official manufacturer datasheets. See the `source_url` and `last_verified` metadata fields in `mlsysim/hardware/registry.py` for the specific document and verification date for each entry.

For the MLPerf comparison data on this page, see [MLPerf Inference v4.0 Results](https://mlcommons.org/benchmarks/inference-datacenter/) (MLCommons, July 2024).

---

*If you observe a significant discrepancy between MLSYSIM predictions and measured results on your hardware, please [open an issue](https://github.com/harvard-edge/cs249r_book/issues) with the workload, hardware, and measured numbers. Discrepancies often reveal bugs or missing constants in the model.*