* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
  Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Update every reference to the now-deleted hello_world / llm_serving /
  sustainability / 11_full_stack_audit tutorials to point at the current
  filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
  shadowing the module-level import and triggering F823 (referenced
  before assignment) on the earlier `isinstance(..., Fleet)` check;
  a minimal sketch of the pattern follows this list.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
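
For readers unfamiliar with F823, here is a minimal sketch of the shadowing
pattern, using a standard-library stand-in rather than the actual
`sim/simulations.py` code:

```python
from collections import OrderedDict  # stands in for the module-level Fleet import

def check(obj):
    # Because of the import at the bottom of this function, Python compiles
    # `OrderedDict` as a local name for the whole function body, so this
    # line raises UnboundLocalError at runtime; ruff flags it statically
    # as F823 (referenced before assignment).
    if isinstance(obj, OrderedDict):
        return True
    from collections import OrderedDict  # redundant local import: the bug
    return False

# The fix applied here: delete the redundant local import and rely on the
# module-level one.
```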
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
  (`f"\n[…]"` instead of an embedded literal newline) so the file imports
  on every supported Python version; an illustration follows below.
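
A plausible reconstruction of the failure mode (hypothetical code, not the
actual file contents): Python 3.12's PEP 701 parser accepts newlines inside an
f-string replacement field, but Python <= 3.11 rejects them.

```python
total = 3
# Broken on Python <= 3.11: a newline inside the {...} replacement field
# is a SyntaxError there (only 3.12+ accepts it):
#
#   banner = f"{
#       total} results"
#
# Portable fix: keep the f-string on one line and escape the newline.
banner = f"\n{total} results"
print(banner)
```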
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 `.qmd` files under docs/ with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
---
title: "KV-Cache: The Hidden Tax"
subtitle: "At 128K context, the cache alone fills an 80 GB GPU — room for exactly one user."
description: "Discover that KV-cache memory — not model weights, not compute — determines how many users you can serve concurrently. Sweep batch size and context length to find the real OOM boundary."
categories: ["node", "intermediate"]
---

## The Question

You deploy Llama-3 8B on an H100. The model weights take 16 GB. You have 64 GB left.
Surely you can serve dozens of users concurrently?

**Not if they have long contexts.** Every active user requires a KV-cache that grows
linearly with sequence length. At 128K context, a single user's cache can consume the
entire remaining memory. This tutorial shows you exactly where the real memory wall lives
and how to push it back.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 1: The Memory Wall](01_memory_wall.qmd) and
[Tutorial 2: Two Phases, One Request](02_two_phases.qmd). You should understand
memory-bound vs. compute-bound regimes and the two-phase LLM serving model.
:::

::: {.callout-note}
## What You Will Learn

- **Calculate** the KV-cache size for any model, sequence length, and batch size
- **Identify** the OOM boundary where KV-cache exhausts GPU memory
- **Explain** why context length — not model size — is the binding memory constraint in serving
- **Compare** static batching vs. paged attention for maximizing concurrent users
:::

::: {.callout-tip}
## Background: What Is the KV-Cache?

During LLM decoding, every attention layer stores **Key** and **Value** matrices for all
tokens generated so far. If you have studied data structures, this is **memoization** applied
to the attention mechanism: store computed results instead of recomputing them. The names
come from a database-style lookup: the **Query** is what you search for, the **Key** is what
you match against, and the **Value** is what you retrieve. Without this cache, the model would
need to recompute attention over the entire context at every step — quadratic cost. The
KV-cache trades memory for compute:

| Factor | Effect on KV-Cache |
|:-------|:-------------------|
| More layers | Linear growth (one K + one V per layer) |
| Longer context | Linear growth (one entry per token) |
| More users (batch) | Linear growth (independent cache per user) |
| Lower precision | Proportional reduction (INT8 = half of FP16) |

The formula: `KV-cache = 2 x layers x kv_heads x head_dim x seq_len x batch x bytes_per_element`
(a worked sketch follows this callout). At short contexts this is negligible. At long
contexts it dominates everything.

**Note on GQA (Grouped Query Attention):** Modern architectures like Llama-3 use GQA, where
`kv_heads < num_heads`. Llama-3 8B has 32 attention heads but only 8 KV-heads, reducing
KV-cache by 4× compared to standard multi-head attention. Using `num_heads` instead of
`kv_heads` in the formula is a common source of 4× overestimates.
:::
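
To make the formula concrete, here is a small stand-alone calculator. This is a
sketch of the formula above, not part of the `mlsysim` API; the Llama-3 8B shape
(32 layers, 8 KV-heads, head_dim 128) is taken from the GQA note:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-3 8B in FP16: 2 x 32 x 8 x 128 x 2 = 131,072 bytes = 128 KiB per token
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1, batch=1)
print(per_token)

# Plugging in num_heads (32) instead of kv_heads (8) gives the classic 4x overestimate:
print(kv_cache_bytes(32, 32, 128, 1, 1) // per_token)  # 4
```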

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import ServingModel, ContinuousBatchingModel
```

---

## 2. Single-User Baseline: Where Does the Memory Go?

Let's start with a single user at a modest 2K context and see how memory breaks down:

```{python}
from mlsysim import ServingModel

model = mlsysim.Models.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100
solver = ServingModel()

# Single user, 2K context — the easy case
r = solver.solve(model=model, hardware=hardware, seq_len=2048, batch_size=1, precision="fp16")

from mlsysim.show import table, info

info("Memory Breakdown",
     Model_weights=r.model_weights_size,
     KV_cache_1_user=r.kv_cache_size,
     Total_memory=r.total_memory_required,
     Memory_utilization=f"{r.memory_utilization:.1%}",
     KV_as_pct_of_total=f"{r.kv_cache_size / r.total_memory_required * 100:.1f}%")
```

At 2K context with one user, the KV-cache is tiny — a rounding error compared to the model
weights. This is why many engineers assume memory pressure comes from model size. They are
about to be surprised.
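
A quick back-of-envelope check of that claim, using the formula from the
background callout (hand arithmetic only; the solver adds overheads, so its
totals may differ slightly):

```python
per_token = 2 * 32 * 8 * 128 * 2      # 131,072 bytes = 128 KiB per token
cache_gib = per_token * 2048 / 2**30  # one user at 2K context
print(f"{cache_gib:.2f} GiB")         # 0.25 GiB, ~1.6% of the 16 GB of weights
```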

---

## 3. Batch Size Sweep: The Concurrency Wall

Now let's add users. Each concurrent user needs their own KV-cache. Watch memory utilization
climb:

```{python}
rows = []
for batch in [1, 4, 8, 16, 32, 64, 128]:
    r = solver.solve(
        model=model, hardware=hardware,
        seq_len=2048, batch_size=batch, precision="fp16"
    )
    rows.append([batch, r.kv_cache_size, r.total_memory_required,
                 f"{r.memory_utilization:.1%}", "OK" if r.feasible else "OOM"])

table(["Batch", "KV-Cache", "Total", "Util", "Feasible"], rows)
```

At 2K context, you can fit many users. The KV-cache per user is small enough that batch
size scales comfortably. But this picture changes dramatically when we extend the context.
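
You can predict where the sweep's OOM line falls with one division. A sketch
that ignores activations and allocator overhead, so it gives an optimistic
upper bound compared to what the engine reports:

```python
hbm_gib = 80            # H100 capacity
weights_gib = 16        # Llama-3 8B in FP16
kv_per_user_gib = 0.25  # 128 KiB/token x 2048 tokens (previous section)

max_users = int((hbm_gib - weights_gib) / kv_per_user_gib)
print(max_users)        # 256: an upper bound on concurrent users at 2K context
```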

---

## 4. Context Length Sweep: The Real Memory Wall

Fix batch size at 8 users and sweep context length from 512 tokens to 128K. This is where
the hidden tax reveals itself:

```{python}
rows = []
for ctx in [512, 2048, 4096, 8192, 16384, 32768, 65536, 131072]:
    r = solver.solve(
        model=model, hardware=hardware,
        seq_len=ctx, batch_size=8, precision="fp16"
    )
    rows.append([ctx, r.kv_cache_size, r.model_weights_size,
                 r.total_memory_required, f"{r.memory_utilization:.1%}",
                 "OK" if r.feasible else "OOM"])

table(["Context", "KV-Cache", "Weights", "Total", "Util", "Status"], rows)
```

::: {.callout-important}
## Key Insight

**KV-cache grows linearly with sequence length and batch size. It is the hidden memory
consumer that determines your maximum concurrent users — not model size, not compute, but
cache state.** At 2K context, the cache is negligible. At 128K context, a single user's
cache can exceed the model weights. The same 80 GB GPU that serves 64 users at short
context can serve exactly one user at long context. The "context length" on the model card
is not a feature — it is a memory bill.
:::
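
The 128K crossover in that insight is one division away (same hand-arithmetic
assumptions as before):

```python
weights_bytes = 16 * 2**30     # ~16 GiB of FP16 weights
per_token_bytes = 128 * 2**10  # 128 KiB of KV-cache per token (Llama-3 8B, FP16)

print(weights_bytes // per_token_bytes)  # 131072 tokens: at 128K context, a
                                         # single user's cache equals the weights
```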

Now let's see what happens when we try to serve even a single user at 128K:

```{python}
# Single user at 128K context — the extreme case
r_long = solver.solve(
    model=model, hardware=hardware,
    seq_len=131072, batch_size=1, precision="fp16"
)

info("Single User @ 128K Context",
     Context="131,072 tokens (128K)",
     Model_weights=r_long.model_weights_size,
     KV_cache=r_long.kv_cache_size,
     Total=r_long.total_memory_required,
     Feasible=str(r_long.feasible),
     KV_as_pct_of_total=f"{r_long.kv_cache_size / r_long.total_memory_required * 100:.0f}%")
```

---

## 5. Paged Attention: Pushing Back the Wall

So the KV-cache fills memory fast, and at long contexts you hit OOM with just a handful of
users. Is the only option to buy more memory? No — the allocation strategy itself is
wasting space. Most sequences do not actually use the maximum context length, yet static
batching reserves memory for the worst case.

Static batching allocates contiguous memory for the maximum sequence length, wasting space
on incomplete sequences. **PagedAttention** (from vLLM) allocates KV-cache in small,
fixed-size pages — exactly like how an operating system uses virtual memory paging to
avoid physical memory fragmentation. Just as the OS maps virtual pages to physical frames
on demand, PagedAttention maps KV-cache blocks to GPU memory on demand, eliminating
fragmentation and fitting more concurrent requests:

```{python}
from mlsysim import ContinuousBatchingModel

cb_solver = ContinuousBatchingModel()

configs = [
    ("Static (baseline)", 32, 2048),
    ("Paged (16 tok)", 32, 16),
    ("Paged (64 tok)", 32, 64),
]

rows = []
for label, max_b, page in configs:
    cb = cb_solver.solve(
        model=model, hardware=hardware,
        seq_len=4096, max_batch_size=max_b,
        page_size=page, precision="fp16"
    )
    rows.append([label, cb.max_active_requests,
                 f"{cb.throughput_tokens_per_sec:.0f} t/s",
                 f"{cb.memory_fragmentation_pct:.1f}%",
                 f"{cb.speedup_vs_static:.1f}x"])

table(["System", "Max Users", "Throughput", "Frag", "Speedup"], rows)
```

Paged attention reduces fragmentation from ~50% to single digits, allowing more concurrent
requests from the same memory budget. This is why vLLM and TensorRT-LLM default to paged
KV-cache management in production.
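
Where does the ~50% figure come from? A sketch of the reservation math under a
simple assumption (requests generate about half the maximum context on average;
the lengths and page size here are illustrative):

```python
import math

max_ctx, page = 4096, 16
actual_len = 2005  # tokens a typical request actually used (illustrative)

# Static batching reserves max_ctx tokens per request up front.
static_frag = 1 - actual_len / max_ctx
print(f"static: {static_frag:.0%} of the reservation sits idle")          # ~51%

# Paging allocates ceil(actual_len / page) pages; slack is under one page.
pages = math.ceil(actual_len / page)
paged_frag = (pages * page - actual_len) / (pages * page)
print(f"paged:  {paged_frag:.2%} idle (bounded by one page per request)")  # ~0.5%
```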

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Llama-3 70B has 80 layers (vs. 32 for the 8B model) and 8 KV-heads with 128 head_dim.
Before running any code, predict: at seq_len=4096 and FP16, what batch size will cause OOM
on an 80 GB H100? Write your prediction, then sweep batch sizes with
`mlsysim.Models.Llama3_70B` to find the actual limit. How close were you?

**Exercise 2: Maximum users at 128K context.**
Using the H200 (141 GB HBM3e), calculate the maximum number of concurrent users you can
serve with Llama-3 8B at 128K context in FP16. Then try INT8. How many additional users
does quantization buy you?

**Exercise 3: Paged vs. static at long context.**
Run the `ContinuousBatchingModel` for Llama-3 8B at seq_len=32768 with max_batch_size=16.
Compare page_size=16 vs. page_size=256. Which gives better throughput? Why does page size
matter more at long context?

**Self-check:** If a model has 32 layers, 8 KV-heads, 128 head_dim, and uses FP16
(2 bytes), how many bytes does the KV-cache consume per token per user?
(Answer: 2 x 32 x 8 x 128 x 2 = 131,072 bytes = 128 KB per token.)
:::
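
You can verify the self-check answer in one line:

```python
print(2 * 32 * 8 * 128 * 2)  # 131072 bytes = 128 KiB per token per user
```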

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **KV-cache size scales linearly** with layers, KV-heads, sequence length, and batch size
- **At short context, cache is negligible** — model weights dominate and you can serve many users
- **At long context, cache dominates** — a single 128K user's cache can exceed model weights
- **The OOM boundary depends on context length x batch size**, not just model size
- **Paged attention reduces fragmentation**, fitting more concurrent requests in the same memory
:::

---

## Next Steps

- **[Quantization: Not a Free Lunch](05_quantization.qmd)** — Learn when reducing precision shrinks the KV-cache effectively vs. when it doesn't help
- **[Two Phases, One Request](02_two_phases.qmd)** — Revisit the prefill/decode split now that you understand the cache pressure
- **[Where to Invest](09_sensitivity.qmd)** — Use sensitivity analysis to quantify whether more memory or more bandwidth helps more
- **[Silicon Zoo](../zoo/hardware.qmd)** — Compare HBM capacity across H100, H200, MI300X, and see which GPUs tolerate long context