---
title: "Where to Invest: Sensitivity Analysis"
subtitle: "dT/dBW = -0.88 vs. dT/dFLOPS = -0.06. One number tells you where to spend your budget."
description: "Use partial derivatives of latency to identify the binding constraint for any model-hardware pair. Then invert the Roofline to derive minimum hardware specs from an SLA."
categories: ["analysis", "advanced"]
---
## The Question
Your team has budget for one hardware upgrade. Do you buy more FLOPS or more
bandwidth? Intuition says "more compute is always better" --- but for LLM inference,
bandwidth is **15x more valuable** than FLOPS. This tutorial shows you how to compute
that number analytically, and then invert the analysis to derive minimum hardware from
an SLA.
::: {.callout-note}
## Prerequisites
Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd) and
[Tutorial 1: The Memory Wall](01_memory_wall.qmd). You should understand
memory-bound vs. compute-bound regimes and the ridge point concept.
:::
::: {.callout-note}
## What You Will Learn
- **Compute** partial derivatives of latency with respect to each hardware parameter
- **Identify** the binding constraint for any model-hardware pair
- **Quantify** the asymmetry between bandwidth and FLOPS sensitivity
- **Derive** minimum hardware specs from a latency SLA using inverse Roofline
:::
::: {.callout-tip}
## Background: Sensitivity Analysis
In optimization, the **binding constraint** is the resource that actually limits
performance --- the one holding with equality at the solution. Sensitivity analysis
perturbs each hardware parameter by a fixed percentage and measures how much latency
changes. The result is a set of numerical partial derivatives:
$\frac{\Delta T / T}{\Delta x / x}$ for each parameter $x$. The parameter with the
largest absolute sensitivity is the binding constraint --- the one most worth investing in.
:::
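To make the definition concrete, here is a minimal finite-difference sketch against a toy two-term Roofline latency model --- not the engine's internals, and with illustrative A100-order constants (312 TFLOP/s fp16, 2.0 TB/s HBM). In the pure Roofline, the bandwidth elasticity of a memory-bound workload approaches -1; the -0.88 the engine reports below presumably reflects the additional terms in its fuller serving model.

```python
# Toy Roofline latency: the slower of the compute and memory terms.
def latency(flops, bytes_moved, peak_flops, bw):
    return max(flops / peak_flops, bytes_moved / bw)

# Illustrative workload: one batch-1 fp16 decode step of a 70B model
# (~2 FLOPs and ~2 bytes per parameter).
FLOPS_PER_STEP = 2 * 70e9
BYTES_PER_STEP = 2 * 70e9
HW = {"peak_flops": 312e12, "bw": 2.0e12}  # A100-order datasheet numbers

def elasticity(param, delta=0.10):
    """(dT/T) / (dx/x) via a +delta finite-difference perturbation."""
    base = latency(FLOPS_PER_STEP, BYTES_PER_STEP, **HW)
    hw = dict(HW)
    hw[param] *= 1 + delta
    perturbed = latency(FLOPS_PER_STEP, BYTES_PER_STEP, **hw)
    return ((perturbed - base) / base) / delta

print(elasticity("bw"))          # ~ -0.91: bandwidth is the binding constraint
print(elasticity("peak_flops"))  #    0.00: extra compute changes nothing here
```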
---
## 1. Setup
```{python}
#| echo: false
#| output: false
import mlsysim # installed via `pip install mlsysim` (see workflow)
```
```python
import mlsysim
from mlsysim import SensitivitySolver, SynthesisSolver, ServingModel
from mlsysim.core.constants import Q_
from mlsysim.show import table, info
```
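A quick sanity check of the `Q_` quantity constructor used throughout (assuming the pint-style string parsing and `.to()` conversion that the later cells rely on):

```python
print(Q_("50 ms").to("s"))        # 0.05 second
print(Q_("2.0 TB/s").to("GB/s"))  # 2000.0 gigabyte / second
```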
---
## 2. Sensitivity Analysis: Llama-3 70B on A100
We analyze **Llama-3.1-70B** inference on an **NVIDIA A100** --- a common deployment
scenario where procurement decisions have real budget implications.
```{python}
from mlsysim import SensitivitySolver, SynthesisSolver, ServingModel
from mlsysim.core.constants import Q_
from mlsysim.show import table, info
model = mlsysim.Models.Language.Llama3_70B
hardware = mlsysim.Hardware.Cloud.A100
# Compute partial derivatives of latency w.r.t. each hardware parameter
solver = SensitivitySolver()
res = solver.solve(model=model, hardware=hardware, precision="fp16")
info("Configuration",
Model=model.name,
Hardware=hardware.name,
Baseline_latency=res.baseline_latency.to('ms'),
Perturbation=f"{res.perturbation_pct}%")
rows = [[param, f"{sensitivity:+.4f}"] for param, sensitivity in res.sensitivities.items()]
table(["Parameter", "Sensitivity"], rows)
```
Each sensitivity value is an elasticity: the fractional change in latency per fractional
change in the parameter. A sensitivity of **-0.88** on `memory_bandwidth` means a 10%
bandwidth increase yields roughly an 8.8% latency decrease; a sensitivity near **-0.06**
on `peak_flops` means 10% more compute buys well under 1%.
---
## 3. The Binding Constraint
```{python}
info("Binding Constraint",
Constraint=res.binding_constraint,
Interpretation=f"{res.binding_constraint} is the hardware knob most worth turning for {model.name} on {hardware.name}")
```
For a 70B-parameter model at batch size 1, every decode step must stream the entire model
from HBM. The arithmetic intensity is approximately 1 FLOP/byte --- far below the A100's
ridge point. The system is deeply memory-bound, and the sensitivity analysis confirms it
quantitatively.
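The back-of-envelope arithmetic behind that claim, with datasheet-order A100 constants for illustration:

```python
params = 70e9
flops_per_step = 2 * params        # ~2 FLOPs per weight (multiply-accumulate)
bytes_per_step = 2 * params        # fp16: 2 bytes per weight, streamed once
ai = flops_per_step / bytes_per_step
print(f"decode arithmetic intensity ~ {ai:.0f} FLOP/byte")   # 1

ridge = 312e12 / 2.0e12            # A100: fp16 dense peak / HBM bandwidth
print(f"A100 ridge point ~ {ridge:.0f} FLOP/byte")           # ~156
```

At 1 FLOP/byte against a ~156 FLOP/byte ridge, the workload sits two orders of magnitude inside the memory-bound regime.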
---
## 4. The 15x Asymmetry
Let us make the asymmetry concrete. How much improvement does each dollar of upgrade buy?
```{python}
sens_bw = abs(res.sensitivities.get("memory_bandwidth", 0))
sens_flops = abs(res.sensitivities.get("peak_flops", 0))

if sens_flops > 0:
    ratio = sens_bw / sens_flops
    info("Sensitivity Asymmetry",
         Bandwidth_sensitivity=f"{sens_bw:.4f}",
         FLOPS_sensitivity=f"{sens_flops:.4f}",
         Ratio=f"{ratio:.1f}x",
         Verdict=f"A dollar spent on bandwidth improvement is ~{ratio:.0f}x more impactful than the same dollar spent on more FLOP/s")
else:
    info("Sensitivity Asymmetry",
         Bandwidth_sensitivity=f"{sens_bw:.4f}",
         FLOPS_sensitivity=f"{sens_flops:.4f}",
         Verdict="FLOPS has zero sensitivity --- purely memory-bound")
```
::: {.callout-important}
## Key Insight
**Sensitivity analysis reveals that bandwidth is ~15x more valuable than FLOPS for LLM
inference.** The partial derivative dT/dBW = -0.88 means a 10% bandwidth increase yields
8.8% latency reduction, while dT/dFLOPS = -0.06 means 10% more FLOPS yields only 0.6%
improvement. This is not intuition --- it is a quantitative measurement that should drive
every hardware procurement decision. The binding constraint, not the headline spec, determines
where your budget creates value.
:::
::: {.callout-warning}
## Fallacy: Investing in the Highest-Spec Number Maximizes Performance
GPU vendors advertise peak FLOP/s prominently because the number is large and impressive.
But for memory-bound workloads, a 10% bandwidth increase yields **15x** more improvement
than a 10% compute increase. The datasheet headline and the binding constraint are often
different parameters --- sensitivity analysis tells you which one actually matters.
:::
---
## 5. Inverse Roofline: From SLA to Hardware
Sensitivity analysis tells you which parameter is worth improving. The natural follow-up
is: given a performance target, *how much* improvement do you actually need?
The `SynthesisSolver` inverts the Roofline model. Instead of "given hardware, what is
the latency?", it asks: **"given a latency SLA, what hardware do I need?"**
Suppose your deployment requires an inter-token latency (ITL) of 50 ms or less:
```{python}
synth = SynthesisSolver()
specs = synth.solve(
    model=model,
    target_latency=Q_("50 ms"),
    precision="fp16",
)

info("Inverse Roofline: Required Hardware",
     Target_SLA="50 ms ITL",
     Min_memory_BW=specs.required_bw.to('TB/s'),
     Min_compute=specs.required_flops.to('TFLOPs/s'),
     Min_memory=specs.required_memory.to('GB'))
```
The synthesis tells us we need approximately 2.8 TB/s of memory bandwidth --- **1.4x**
what the A100 provides. This immediately narrows the hardware search to H100-class or
newer GPUs.
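You can sanity-check that number by hand. In the memory-bound regime, latency is just bytes-per-step over bandwidth, so the minimum bandwidth is bytes-per-step over the latency target:

```python
bytes_per_step = 2 * 70e9            # fp16 weights streamed once per decode step
target_itl_s = 50e-3                 # the 50 ms SLA
min_bw = bytes_per_step / target_itl_s
print(f"minimum bandwidth ~ {min_bw / 1e12:.1f} TB/s")   # 2.8 TB/s
```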
---
## 6. Generational Comparison: Does the Binding Constraint Shift?
The most important insight from sensitivity analysis is that **hardware upgrades can shift
the binding constraint**. Let us compare across three GPU generations:
```{python}
gpus = [
    ("A100", mlsysim.Hardware.Cloud.A100),
    ("H100", mlsysim.Hardware.Cloud.H100),
    ("H200", mlsysim.Hardware.Cloud.H200),
]

rows = []
for name, hw in gpus:
    r = solver.solve(model=model, hardware=hw, precision="fp16")
    s_bw = r.sensitivities.get("memory_bandwidth", 0)
    s_fl = r.sensitivities.get("peak_flops", 0)
    lat = r.baseline_latency.to("ms").magnitude
    rows.append([name, f"{s_bw:+.4f}", f"{s_fl:+.4f}", r.binding_constraint, f"{lat:.2f} ms"])

table(["GPU", "BW Sens", "FLOPS Sens", "Binding", "Latency"], rows)
```
If all three GPUs show `memory_bandwidth` as the binding constraint, it confirms that
the memory wall persists across generations. Compute has grown faster than bandwidth,
so the problem is getting *worse*, not better. If the binding constraint **shifts** on
newer hardware, it signals a qualitative regime change --- your optimization strategy
must change accordingly.
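A rough way to see the trend is to compare ridge points across generations, using datasheet-order constants (fp16 dense peak over HBM bandwidth; illustrative numbers, not read from the Silicon Zoo):

```python
specs = {                      # (peak fp16 FLOP/s, HBM bandwidth B/s), datasheet-order
    "A100": (312e12, 2.0e12),
    "H100": (989e12, 3.35e12),
    "H200": (989e12, 4.8e12),
}
for name, (flops, bw) in specs.items():
    print(f"{name}: ridge ~ {flops / bw:.0f} FLOP/byte")
# A100 ~156, H100 ~295, H200 ~206 --- batch-1 decode (~1 FLOP/byte)
# sits two orders of magnitude below the ridge on every generation.
```

The ridge nearly doubled from A100 to H100 and only partially retreats with H200's extra bandwidth --- consistent with a memory wall that persists across generations.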
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Predict before you compute.**
Before running any code, predict: which parameter has the highest sensitivity for
ResNet-50 at batch size 256 on an H100? (Hint: CNNs at large batch sizes have very
high arithmetic intensity.) Write your prediction, then verify with
`solver.solve(model=mlsysim.Models.ResNet50, hardware=mlsysim.Hardware.Cloud.H100)`.
Were you right?
**Exercise 2: Inverse solve for a tighter SLA.**
Use `SynthesisSolver` to find the minimum hardware specs for a 100 ms TTFT SLA on
Llama-3 70B. What bandwidth does this require? Does any hardware in the Silicon Zoo
meet this spec? What does this tell you about the feasibility of sub-100ms TTFT for
70B-parameter models?
**Exercise 3: The crossover model size.**
Run the sensitivity analysis on three models of increasing size: `mlsysim.Models.Language.Llama3_8B`,
`mlsysim.Models.Language.Llama3_70B`, and `mlsysim.Models.Language.GPT3` (175B). At what model size does
the binding constraint shift from bandwidth to compute, if at all? What does the trend
tell you about the direction of the memory wall?
**Self-check:** If a 10% bandwidth increase yields 8.8% latency reduction, and a 10%
FLOPS increase yields 0.6% latency reduction, how much bandwidth increase would you need
to match the effect of doubling FLOPS?
:::
---
## Key Takeaways
::: {.callout-tip}
## Summary
- **Sensitivity analysis** computes numerical partial derivatives of latency, revealing
which hardware parameter is worth investing in
- **Bandwidth is ~15x more valuable** than FLOPS for LLM inference at batch size 1
- **Inverse Roofline synthesis** translates SLA requirements into minimum hardware specs,
enabling data-driven procurement shortlisting
- **Generational comparison** shows whether the binding constraint persists or shifts
across hardware generations
:::
---
## Next Steps
- **[GPU vs. Wafer-Scale](10_gpu_vs_wafer.qmd)** --- See how a fundamentally different architecture changes which wall binds
- **[Full-Stack Audit](12_full_stack_audit.qmd)** --- Compose all solvers into a complete systems analysis
- **[The Memory Wall](01_memory_wall.qmd)** --- Revisit the foundational tutorial on memory-bound vs. compute-bound
- **[Silicon Zoo](../zoo/hardware.qmd)** --- Browse all vetted hardware specs