mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-08 09:57:21 -05:00
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
270 lines
10 KiB
Plaintext
---
title: "Where to Invest: Sensitivity Analysis"
subtitle: "dT/dBW = -0.88 vs. dT/dFLOPS = -0.06. One number tells you where to spend your budget."
description: "Use partial derivatives of latency to identify the binding constraint for any model-hardware pair. Then invert the Roofline to derive minimum hardware specs from an SLA."
categories: ["analysis", "advanced"]
---

## The Question

Your team has budget for one hardware upgrade. Do you buy more FLOPS or more
bandwidth? Intuition says "more compute is always better" --- but for LLM inference,
bandwidth is **15x more valuable** than FLOPS. This tutorial shows you how to compute
that number analytically, and then invert the analysis to derive minimum hardware from
an SLA.

::: {.callout-note}
## Prerequisites

Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd) and
[Tutorial 1: The Memory Wall](01_memory_wall.qmd). You should understand
memory-bound vs. compute-bound regimes and the ridge point concept.
:::

::: {.callout-note}
## What You Will Learn

- **Compute** partial derivatives of latency with respect to each hardware parameter
- **Identify** the binding constraint for any model-hardware pair
- **Quantify** the asymmetry between bandwidth and FLOPS sensitivity
- **Derive** minimum hardware specs from a latency SLA using inverse Roofline
:::

::: {.callout-tip}
## Background: Sensitivity Analysis

In optimization, the **binding constraint** is the resource that actually limits
performance --- the one holding with equality at the solution. Sensitivity analysis
perturbs each hardware parameter by a fixed percentage and measures how much latency
changes. The result is a set of numerical partial derivatives:
$\frac{\Delta T / T}{\Delta x / x}$ for each parameter $x$. The parameter with the
largest absolute sensitivity is the binding constraint --- the one most worth investing in.
:::
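
The finite-difference elasticity described above can be sketched with a toy Roofline latency model. This is a plain-Python illustration with made-up figures and helper names --- not the engine's `SensitivitySolver`, which applies the same perturbation to its full serving model:

```python
# Toy finite-difference elasticity on a plain Roofline latency model:
#   T = max(bytes / BW, flops / FLOPS)
# All figures are illustrative, not engine output.

def latency(bw, flops, bytes_moved, flops_done):
    """Roofline latency: the slower of the memory and compute times."""
    return max(bytes_moved / bw, flops_done / flops)

def elasticity(param, perturb=0.10, **cfg):
    """(dT/T) / (dx/x) for one hardware parameter, via a +10% perturbation."""
    t0 = latency(**cfg)
    cfg_up = dict(cfg)
    cfg_up[param] *= (1 + perturb)
    return (latency(**cfg_up) - t0) / t0 / perturb

# Batch-1 LLM decode: ~140 GB streamed and ~140 GFLOPs per token
# (70B params, fp16) -- arithmetic intensity near 1 FLOP/byte.
cfg = dict(bw=2.0e12, flops=312e12, bytes_moved=140e9, flops_done=140e9)

print(f"{elasticity('bw', **cfg):+.2f}")     # -0.91: bandwidth binds
print(f"{elasticity('flops', **cfg):+.2f}")  # +0.00: compute is slack
```

As the perturbation shrinks toward zero, this converges to the analytic partial derivative; the engine's reported values differ slightly because its latency model is richer than a bare `max()`.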

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
```

```python
import mlsysim
from mlsysim import SensitivitySolver, SynthesisSolver, ServingModel
from mlsysim.core.constants import Q_
```

---

## 2. Sensitivity Analysis: Llama-3 70B on A100

We analyze **Llama-3.1-70B** inference on an **NVIDIA A100** --- a common deployment
scenario where procurement decisions have real budget implications.

```{python}
from mlsysim import SensitivitySolver, SynthesisSolver, ServingModel
from mlsysim.core.constants import Q_
from mlsysim.show import table, info

model = mlsysim.Models.Language.Llama3_70B
hardware = mlsysim.Hardware.Cloud.A100

# Compute partial derivatives of latency w.r.t. each hardware parameter
solver = SensitivitySolver()
res = solver.solve(model=model, hardware=hardware, precision="fp16")

info("Configuration",
     Model=model.name,
     Hardware=hardware.name,
     Baseline_latency=res.baseline_latency.to('ms'),
     Perturbation=f"{res.perturbation_pct}%")

rows = [[param, f"{sensitivity:+.4f}"] for param, sensitivity in res.sensitivities.items()]
table(["Parameter", "Sensitivity"], rows)
```

Each sensitivity value is the elasticity: "If I increase this parameter by 10%, latency
changes by this fraction." A sensitivity of **-0.88** on `memory_bandwidth` means a 10%
bandwidth increase yields roughly an 8.8% latency decrease. A sensitivity near **-0.06** on
`peak_flops` means more compute does almost nothing.
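
As a quick first-order check, an elasticity can be turned into a predicted latency directly. The baseline latency below is a hypothetical value chosen only to make the arithmetic concrete:

```python
# First-order reading of an elasticity s: a fractional increase d in the
# parameter changes latency by roughly s * d. Baseline is hypothetical.
s_bw, s_flops = -0.88, -0.06
t0_ms = 35.0                     # assumed baseline decode latency, ms

def predicted_latency(t0, s, d):
    """Linearized latency after a fractional parameter increase d."""
    return t0 * (1 + s * d)

print(predicted_latency(t0_ms, s_bw, 0.10))     # ~31.92 ms: 8.8% faster
print(predicted_latency(t0_ms, s_flops, 0.10))  # ~34.79 ms: 0.6% faster
```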

---

## 3. The Binding Constraint

```{python}
info("Binding Constraint",
     Constraint=res.binding_constraint,
     Interpretation=f"{res.binding_constraint} is the hardware knob most worth turning for {model.name} on {hardware.name}")
```

For a 70B-parameter model at batch size 1, every decode step must stream the entire model
from HBM. The arithmetic intensity is approximately 1 FLOP/byte --- far below the A100's
ridge point. The system is deeply memory-bound, and the sensitivity analysis confirms it
quantitatively.
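
The 1 FLOP/byte figure follows from simple counting. This is back-of-envelope arithmetic, not engine output, and the ridge-point value is an approximate A100 spec:

```python
# Batch-1 decode: each generated token reads every weight once and does
# one multiply-accumulate with it. Back-of-envelope, not engine output.
params = 70e9           # Llama-3 70B parameters
bytes_per_param = 2     # fp16 weight
flops_per_param = 2     # one multiply + one add per weight per token

bytes_moved = params * bytes_per_param   # ~140 GB streamed from HBM
flops_done = params * flops_per_param    # ~140 GFLOPs of compute

intensity = flops_done / bytes_moved
print(intensity)        # 1.0 FLOP/byte -- far below an A100 ridge of ~156
```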

---

## 4. The 15x Asymmetry

Let us make the asymmetry concrete. How much improvement does each dollar of upgrade buy?

```{python}
sens_bw = abs(res.sensitivities.get("memory_bandwidth", 0))
sens_flops = abs(res.sensitivities.get("peak_flops", 0))

if sens_flops > 0:
    ratio = sens_bw / sens_flops
    info("Sensitivity Asymmetry",
         Bandwidth_sensitivity=f"{sens_bw:.4f}",
         FLOPS_sensitivity=f"{sens_flops:.4f}",
         Ratio=f"{ratio:.1f}x",
         Verdict=f"A dollar spent on bandwidth improvement is ~{ratio:.0f}x more impactful than the same dollar spent on more FLOP/s")
else:
    info("Sensitivity Asymmetry",
         Bandwidth_sensitivity=f"{sens_bw:.4f}",
         FLOPS_sensitivity=f"{sens_flops:.4f}",
         Verdict="FLOPS has zero sensitivity --- purely memory-bound")
```

::: {.callout-important}
## Key Insight

**Sensitivity analysis reveals that bandwidth is ~15x more valuable than FLOPS for LLM
inference.** The partial derivative dT/dBW = -0.88 means a 10% bandwidth increase yields
8.8% latency reduction, while dT/dFLOPS = -0.06 means 10% more FLOPS yields only 0.6%
improvement. This is not intuition --- it is a quantitative measurement that should drive
every hardware procurement decision. The binding constraint, not the headline spec, determines
where your budget creates value.
:::

::: {.callout-warning}
## Fallacy: Investing in the Highest-Spec Number Maximizes Performance

GPU vendors advertise peak FLOP/s prominently because the number is large and impressive.
But for memory-bound workloads, a 10% bandwidth increase yields **15x** more improvement
than a 10% compute increase. The datasheet headline and the binding constraint are often
different parameters --- sensitivity analysis tells you which one actually matters.
:::

---

## 5. Inverse Roofline: From SLA to Hardware

Sensitivity analysis tells you which parameter is worth improving. The natural follow-up
is: given a performance target, *how much* improvement do you actually need?

The `SynthesisSolver` inverts the Roofline model. Instead of "given hardware, what is
the latency?", it asks: **"given a latency SLA, what hardware do I need?"**

Suppose your deployment requires an inter-token latency (ITL) of 50 ms or less:

```{python}
synth = SynthesisSolver()
specs = synth.solve(
    model=model,
    target_latency=Q_("50 ms"),
    precision="fp16"
)

info("Inverse Roofline: Required Hardware",
     Target_SLA="50 ms ITL",
     Min_memory_BW=specs.required_bw.to('TB/s'),
     Min_compute=specs.required_flops.to('TFLOPs/s'),
     Min_memory=specs.required_memory.to('GB'))
```

The synthesis tells us we need approximately 2.8 TB/s of memory bandwidth --- **1.4x**
what the A100 provides. This immediately narrows the hardware search to H100-class or
newer GPUs.
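
That 2.8 TB/s figure is easy to sanity-check by hand: in the memory-bound regime, ITL is roughly the bytes streamed per decode step divided by bandwidth, so inverting for a target ITL gives the bandwidth floor. Back-of-envelope arithmetic with approximate numbers, not the solver's full model:

```python
# Invert the memory-bound Roofline: required BW = bytes per step / target ITL.
params = 70e9                 # Llama-3 70B parameters
bytes_per_step = params * 2   # fp16 weights streamed once per token

target_itl = 50e-3            # the 50 ms SLA, in seconds
required_bw = bytes_per_step / target_itl

print(required_bw / 1e12)     # 2.8 TB/s -- ~1.4x an A100's ~2.0 TB/s
```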

---

## 6. Generational Comparison: Does the Binding Constraint Shift?

The most important insight from sensitivity analysis is that **hardware upgrades can shift
the binding constraint**. Let us compare across three GPU generations:
```{python}
|
|
gpus = [
|
|
("A100", mlsysim.Hardware.Cloud.A100),
|
|
("H100", mlsysim.Hardware.Cloud.H100),
|
|
("H200", mlsysim.Hardware.Cloud.H200),
|
|
]
|
|
|
|
rows = []
|
|
for name, hw in gpus:
|
|
r = solver.solve(model=model, hardware=hw, precision="fp16")
|
|
s_bw = r.sensitivities.get("memory_bandwidth", 0)
|
|
s_fl = r.sensitivities.get("peak_flops", 0)
|
|
lat = r.baseline_latency.to("ms").magnitude
|
|
rows.append([name, f"{s_bw:+.4f}", f"{s_fl:+.4f}", r.binding_constraint, f"{lat:.2f}ms"])
|
|
|
|
table(["GPU", "BW Sens", "FLOPS Sens", "Binding", "Latency"], rows)
|
|
```

If all three GPUs show `memory_bandwidth` as the binding constraint, it confirms that
the memory wall persists across generations. Compute has grown faster than bandwidth,
so the problem is getting *worse*, not better. If the binding constraint **shifts** on
newer hardware, it signals a qualitative regime change --- your optimization strategy
must change accordingly.
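
The "compute grew faster than bandwidth" claim can be eyeballed from approximate public datasheet specs (fp16 dense TFLOP/s and TB/s). Treat these figures as ballpark numbers for illustration, not values from the hardware zoo:

```python
# Ridge point = peak FLOPS / peak BW: the arithmetic intensity a kernel
# needs before compute, not memory, becomes the bottleneck.
# Approximate public datasheet specs, not engine data.
specs = {
    "A100": (312e12, 2.0e12),
    "H100": (989e12, 3.35e12),
    "H200": (989e12, 4.8e12),
}
for name, (flops, bw) in specs.items():
    print(f"{name}: ridge ~ {flops / bw:.0f} FLOP/byte")
# Batch-1 decode sits near 1 FLOP/byte on every generation; the gap
# between that and the ridge is the memory wall, and it has not closed.
```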

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Before running any code, predict: which parameter has the highest sensitivity for
ResNet-50 at batch size 256 on an H100? (Hint: CNNs at large batch sizes have very
high arithmetic intensity.) Write your prediction, then verify with
`solver.solve(model=mlsysim.Models.ResNet50, hardware=mlsysim.Hardware.Cloud.H100)`.
Were you right?

**Exercise 2: Inverse solve for a tighter SLA.**
Use `SynthesisSolver` to find the minimum hardware specs for a 100 ms TTFT SLA on
Llama-3 70B. What bandwidth does this require? Does any hardware in the Silicon Zoo
meet this spec? What does this tell you about the feasibility of sub-100ms TTFT for
70B-parameter models?

**Exercise 3: The crossover model size.**
Run the sensitivity analysis on three models of increasing size: `mlsysim.Models.Llama3_8B`,
`mlsysim.Models.Llama3_70B`, and `mlsysim.Models.GPT3` (175B). At what model size does
the binding constraint shift from bandwidth to compute, if at all? What does the trend
tell you about the direction of the memory wall?

**Self-check:** If a 10% bandwidth increase yields 8.8% latency reduction, and a 10%
FLOPS increase yields 0.6% latency reduction, how much bandwidth increase would you need
to match the effect of doubling FLOPS?
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Sensitivity analysis** computes numerical partial derivatives of latency, revealing
  which hardware parameter is worth investing in
- **Bandwidth is ~15x more valuable** than FLOPS for LLM inference at batch size 1
- **Inverse Roofline synthesis** translates SLA requirements into minimum hardware specs,
  enabling data-driven procurement shortlisting
- **Generational comparison** shows whether the binding constraint persists or shifts
  across hardware generations
:::

---

## Next Steps

- **[GPU vs. Wafer-Scale](10_gpu_vs_wafer.qmd)** --- See how a fundamentally different architecture changes which wall binds
- **[Full-Stack Audit](12_full_stack_audit.qmd)** --- Compose all solvers into a complete systems analysis
- **[The Memory Wall](01_memory_wall.qmd)** --- Revisit the foundational tutorial on memory-bound vs. compute-bound
- **[Silicon Zoo](../zoo/hardware.qmd)** --- Browse all vetted hardware specs