mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-09 02:11:56 -05:00)
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
  Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
281 lines · 10 KiB · Plaintext

---
title: "Two Phases, One Request"
subtitle: "The same model on the same GPU hits two different ceilings — and that changes everything."
description: "Discover why LLM inference has two distinct performance regimes (prefill and decode) with different bottlenecks. The foundation for all LLM serving analysis."
categories: ["node", "intermediate"]
---

## The Question

A CNN processes one image in one pass. An LLM generates text one token at a time — but the *first* token and the *hundredth* token are bottlenecked by completely different hardware resources. **Why does the same model on the same GPU have two different speed limits?**

Understanding this two-phase structure is what separates a systems engineer who can *predict* serving costs from one who has to *discover* them in production.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd) and [Tutorial 1: The Memory Wall](01_memory_wall.qmd). You should understand memory-bound vs. compute-bound regimes and the roofline model.
:::

::: {.callout-note}
## What You Will Learn

- **Distinguish** the two phases of LLM inference: prefill (TTFT) and decode (ITL)
- **Explain** why prefill is compute-bound and decode is memory-bound
- **Predict** which hardware spec (FLOP/s or bandwidth) matters for each phase
- **Compare** GPUs based on their serving characteristics, not just peak specs
:::

::: {.callout-tip}
## Background: How LLM Inference Works

Unlike a CNN that processes a fixed input in one forward pass, an LLM generates output **autoregressively** — one token at a time:

1. **Prefill (Time to First Token — TTFT):** The model processes the entire input prompt in a single forward pass. All prompt tokens are processed in parallel, saturating the GPU's compute units. This is **compute-bound** — optimizing TTFT means getting more TFLOP/s.

2. **Decode (Inter-Token Latency — ITL):** Each subsequent token requires a full forward pass through the model, but processes only *one* token of new input. The model weights (8 billion params × 2 bytes per FP16 param = **16 GB**) must be loaded from HBM for each token, yet only a tiny amount of arithmetic is performed. This is **memory-bound** — optimizing ITL means getting more GB/s of HBM bandwidth.

The same GPU, the same model, two completely different bottlenecks.
:::
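
Before touching the solver, you can sanity-check both claims with nothing but the numbers above. The sketch below is plain arithmetic, not `ServingModel` output: it assumes roughly 2 FLOPs per parameter per token and that each phase sits exactly on its ceiling, so expect the solver's answers to differ in detail.

```python
# Back-of-envelope check (pure Python, no mlsysim): Llama-3 8B on an H100.
params     = 8e9            # parameters
weight_b   = 2 * params     # FP16 weight bytes, ~16 GB
peak_flops = 989e12         # H100 FP16, FLOP/s
peak_bw    = 3.35e12        # H100 HBM3, bytes/s
prompt_len = 2048

# Prefill: ~2 FLOPs per parameter per token, all prompt tokens in one pass.
ttft_est = (2 * params * prompt_len) / peak_flops     # compute-bound estimate
# Decode: one full weight read from HBM per generated token.
itl_est  = weight_b / peak_bw                         # memory-bound estimate

print(f"TTFT estimate ~ {ttft_est * 1e3:.0f} ms")     # roughly tens of ms
print(f"ITL  estimate ~ {itl_est * 1e3:.1f} ms")      # roughly a few ms
```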

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
```

```python
import mlsysim
from mlsysim import ServingModel
```

In the previous tutorials, you used `Engine.solve`, which models inference as a single forward pass. But LLM serving is not a single pass — it has two distinct phases with different bottlenecks. The `ServingModel` models each phase separately, giving you TTFT (time to first token) and ITL (inter-token latency) instead of a single latency number.

---

## 2. First Serving Prediction

```{python}
from mlsysim import ServingModel

# Llama-3 8B: 8B parameters, 32 layers, 4096 hidden_dim
model = mlsysim.Models.Llama3_8B

# NVIDIA H100: 989 TFLOP/s (FP16), 3.35 TB/s HBM3, 80 GB
hardware = mlsysim.Hardware.Cloud.H100

solver = ServingModel()
result = solver.solve(
    model=model,
    hardware=hardware,
    seq_len=2048,      # 2K token context window
    batch_size=1,      # single user
    precision="fp16"
)

from mlsysim.show import table, info

info("Phase Analysis",
     TTFT_prefill=result.ttft.to('ms'),
     ITL_per_token=result.itl.to('ms'))

info("Memory Budget",
     Model_weights=result.model_weights_size,
     KV_cache=result.kv_cache_size,
     Memory_utilization=f"{result.memory_utilization:.1%}")
```

Two numbers, two different stories:

- **TTFT** is tens of milliseconds — dominated by the 989 TFLOP/s compute ceiling
- **ITL** is a few milliseconds — dominated by the 3.35 TB/s bandwidth ceiling

Why the asymmetry? Prefill processes all 2048 prompt tokens in parallel — that is 2048× more arithmetic per weight load than decode, which processes one token at a time. Prefill's arithmetic intensity is ~2048 FLOP/byte, well above the ridge point. Decode's intensity is ~1 FLOP/byte, far below it. The same weights, loaded the same way, but two completely different operating regimes.
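
You can put numbers on that comparison with the same back-of-envelope model as before (roughly 2 FLOPs per parameter per token, FP16 weights). The ridge point below is simply the H100's peak FLOP/s divided by its bandwidth, not a value read from the solver, so treat this as an illustrative sketch.

```python
# Arithmetic intensity of each phase vs. the H100 ridge point (sketch).
peak_flops = 989e12                      # FP16 FLOP/s
peak_bw    = 3.35e12                     # HBM bytes/s
ridge      = peak_flops / peak_bw        # ~295 FLOP/byte

# Per pass: ~2 * params * tokens FLOPs against ~2 * params bytes of weight traffic.
prefill_intensity = (2 * 2048) / 2       # ~2048 FLOP/byte -> compute-bound
decode_intensity  = (2 * 1) / 2          # ~1 FLOP/byte    -> memory-bound

print(f"ridge point:           ~{ridge:.0f} FLOP/byte")
print(f"prefill (2048 tokens): ~{prefill_intensity:.0f} FLOP/byte")
print(f"decode (1 token):      ~{decode_intensity:.0f} FLOP/byte")
```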

---

## 3. Why They Respond to Different Optimizations

Now let's see how this asymmetry plays out across GPU generations. If TTFT and ITL are in different regimes, they should respond to different hardware specs:

```{python}
gpus = [
    ("A100", mlsysim.Hardware.Cloud.A100),
    ("H100", mlsysim.Hardware.Cloud.H100),
    ("H200", mlsysim.Hardware.Cloud.H200),
]

rows = []
for name, hw in gpus:
    r = solver.solve(model=model, hardware=hw, seq_len=2048, batch_size=1, precision="fp16")
    rows.append([
        name,
        hw.compute.peak_flops.to("TFLOPs/s"),
        hw.memory.bandwidth.to("TB/s"),
        r.ttft.to('ms'),
        r.itl.to('ms'),
    ])

table(["GPU", "TFLOP/s", "BW (TB/s)", "TTFT (ms)", "ITL (ms)"], rows)
```

Compare the ratios:

- **A100 → H100**: FLOP/s increases 3.2×, TTFT improves ~3×. Bandwidth increases 1.7×, ITL improves ~1.7×.
- **H100 → H200**: FLOP/s stays similar, TTFT stays similar. Bandwidth increases ~1.4×, ITL improves ~1.4×.

Each metric tracks its own ceiling. TTFT scales with compute. ITL scales with bandwidth.
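
You can predict those ratios from peak specs alone, assuming each phase stays pinned to its ceiling. The spec values below are commonly quoted datasheet FP16 and HBM figures, not numbers pulled from `mlsysim`, so treat the output as a sanity check rather than the solver's answer.

```python
# Expected TTFT / ITL improvements from peak specs alone (sketch).
specs = {                      # (FP16 TFLOP/s, HBM TB/s), datasheet-style values
    "A100": (312, 2.0),
    "H100": (989, 3.35),
    "H200": (989, 4.8),
}
for old, new in [("A100", "H100"), ("H100", "H200")]:
    f0, b0 = specs[old]
    f1, b1 = specs[new]
    print(f"{old} -> {new}: expected TTFT speedup ~{f1 / f0:.1f}x, "
          f"expected ITL speedup ~{b1 / b0:.1f}x")
```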

---

## 4. The Asymmetry: Where Quantization Helps

Quantization (reducing numerical precision) shrinks the model weights. Since decode must load all weights from HBM at every step, fewer bytes means faster decode. But prefill is compute-bound — fewer bytes doesn't help if computation is the bottleneck.

```{python}
rows = []
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(model=model, hardware=hardware, seq_len=2048, batch_size=1, precision=prec)
    rows.append([prec, r.ttft.to('ms'), r.itl.to('ms'), r.model_weights_size])

table(["Precision", "TTFT (ms)", "ITL (ms)", "Weights"], rows)
```
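
A quick cross-check on the ITL column: if decode were purely weight traffic, its floor would be weight bytes divided by bandwidth. This sketch ignores KV-cache reads and any dequantization overhead, so the solver's numbers will sit somewhat above it.

```python
# Ideal memory-bound ITL per precision for an 8B-parameter model on an H100 (sketch).
params  = 8e9
peak_bw = 3.35e12                                     # HBM3 bytes/s
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for prec, b in bytes_per_param.items():
    weights_gb = params * b / 1e9
    itl_floor  = params * b / peak_bw
    print(f"{prec}: {weights_gb:.0f} GB of weights -> ITL floor ~{itl_floor * 1e3:.1f} ms")
```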

::: {.callout-important}
## Key Insight

**LLM serving is not one problem — it is two problems in sequence.** Prefill (TTFT) is compute-bound and scales with FLOP/s. Decode (ITL) is memory-bound and scales with bandwidth. This means:

- **Quantization** is a decode optimization (reduces bytes loaded per step)
- **More TFLOP/s** is a prefill optimization (processes prompt tokens faster)
- **The right GPU** depends on which phase dominates your latency budget

A chatbot (short prompts, long responses) is ITL-dominated → buy bandwidth. A summarization service (long documents, short outputs) is TTFT-dominated → buy compute.
:::
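
The dividing line between those two cases is easy to estimate: total request time is roughly TTFT plus the number of generated tokens times ITL. The sketch below uses made-up TTFT/ITL values purely to show the bookkeeping; plug in real `ServingModel` results when you work through Exercise 2.

```python
# Which phase dominates a request? total ~= TTFT + n_output_tokens * ITL (sketch).
def phase_split(ttft_ms, itl_ms, n_out):
    decode_ms = n_out * itl_ms
    total_ms = ttft_ms + decode_ms
    return ttft_ms / total_ms, decode_ms / total_ms

# Illustrative numbers only; replace with ServingModel output.
p, d = phase_split(ttft_ms=35, itl_ms=5, n_out=1000)    # short prompt, long answer
print(f"chat-style request:    {p:.0%} prefill / {d:.0%} decode")
p, d = phase_split(ttft_ms=140, itl_ms=5, n_out=20)     # long prompt, short answer
print(f"summary-style request: {p:.0%} prefill / {d:.0%} decode")
```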

::: {.callout-tip}
## Going Further: Speculative Decoding

This two-phase asymmetry also explains why **speculative decoding** works: a small draft model generates candidate tokens cheaply, then the large model verifies them in a single parallel pass (like prefill). It converts the large model's spare compute into reduced memory loads — attacking the decode bottleneck at the algorithmic level.
:::
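
How much that buys you depends on how often the draft's guesses are accepted. A common back-of-envelope model (a sketch that ignores draft-model cost and verification overhead) assumes each drafted token is accepted independently with probability `a`:

```python
# Expected tokens produced per memory-bound pass of the large model, if the
# draft proposes k tokens and each is accepted independently with probability a.
def expected_tokens(k, a):
    # 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a:.0%}: ~{expected_tokens(k=4, a=a):.1f} tokens per pass")
```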

---

## 5. Putting It Together: SLA-Based Hardware Selection

If your production SLA is TTFT < 200 ms and ITL < 50 ms/token, which GPUs qualify?

```{python}
gpus_all = [
    ("T4", mlsysim.Hardware.Cloud.T4),
    ("A100", mlsysim.Hardware.Cloud.A100),
    ("H100", mlsysim.Hardware.Cloud.H100),
    ("H200", mlsysim.Hardware.Cloud.H200),
]

TTFT_SLA = 200   # ms
ITL_SLA = 50     # ms

rows = []
for name, hw in gpus_all:
    r = solver.solve(model=model, hardware=hw, seq_len=4096, batch_size=1, precision="fp16")
    ttft = r.ttft.to("ms").magnitude
    itl = r.itl.to("ms").magnitude
    ttft_ok = ttft <= TTFT_SLA
    itl_ok = itl <= ITL_SLA
    rows.append([
        name,
        f"{ttft:.1f} ms",
        f"{itl:.2f} ms",
        "✓" if ttft_ok else "✗",
        "✓" if itl_ok else "✗",
        "PASS" if ttft_ok and itl_ok else "FAIL",
    ])

table(["GPU", "TTFT", "ITL", "TTFT OK?", "ITL OK?", "Verdict"], rows)
```

This is the analysis every ML engineer should run before choosing serving infrastructure. The answer depends not just on the GPU, but on the model size, context length, batch size, and precision — all of which the `ServingModel` captures analytically.

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Before running any code: for Llama-3 70B (~9× larger than 8B), predict whether TTFT or ITL will be more affected by the model size increase. Will both grow by ~9×? Write your reasoning, then solve with `mlsysim.Models.Llama3_70B` on the H100 and compare.

**Exercise 2: The chatbot vs. summarizer trade-off.**
A chatbot receives 50-token prompts and generates 500-token responses. A summarizer receives 4000-token documents and generates 100-token summaries. For each use case, calculate: what fraction of total request time is TTFT vs. ITL? Which GPU spec matters more for each?

**Exercise 3: Find the phase crossover.**
Sweep `seq_len` from 128 to 32768 for Llama-3 8B on the H100. At what context length does TTFT exceed the total decode time for a 256-token response (i.e., 256 × ITL)? This is where the dominant phase shifts from decode to prefill.

**Self-check:** Your boss says "We need a faster GPU for our chatbot." Which metric matters more: TTFT or ITL? What hardware spec should you prioritize?
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Prefill (TTFT)** is compute-bound — it scales with TFLOP/s
- **Decode (ITL)** is memory-bound — it scales with HBM bandwidth (GB/s)
- **Quantization** primarily accelerates decode (fewer bytes per weight load), not prefill
- **Hardware selection** depends on which phase dominates your workload
- **`ServingModel`** separates these two regimes analytically, enabling SLA-based hardware decisions
:::

---

## Next Steps

- **[KV-Cache: The Hidden Tax](03_kv_cache.qmd)** — Discover what limits how many concurrent users you can serve
- **[The Memory Wall](01_memory_wall.qmd)** — Why GPU generations don't deliver the speedup you expect
- **[Quantization: Not a Free Lunch](05_quantization.qmd)** — When reducing precision helps and when it doesn't
- **[The $9M Question](08_nine_million_dollar.qmd)** — How chain-of-thought reasoning multiplies serving costs
|