Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-08 18:01:20 -05:00)
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
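A minimal sketch of the pattern (hypothetical values; not the actual `gemini_design_loop.py` contents):
```python
step, score = 3, 0.87  # hypothetical values for illustration

# Broken shape (sketch): splitting the f-string across physical lines, e.g.
#   print(f"
#   [step {step}] score={score:.2f}")
# fails to parse on the Pythons the package supports, so the module cannot be imported.

# Portable shape used by the fix: keep the literal on one line and write the
# line break as an explicit \n escape.
print(f"\n[step {step}] score={score:.2f}")
```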
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/*.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.

---
title: "The $9M Question"
subtitle: "K=8 reasoning steps multiply your serving bill by 7.6x — a simple algorithm choice becomes a capital decision."
description: "Use InferenceScalingModel and EconomicsModel to quantify how chain-of-thought reasoning multiplies infrastructure cost from $1.2M to $9.1M annually."
categories: ["ops", "intermediate"]
---

## The Question

Your team wants to add chain-of-thought (CoT) reasoning to the production serving
pipeline. The accuracy improvement is clear: K=8 reasoning steps measurably improve
answer quality on hard queries. But what does K=8 *cost*? Not in tokens — in dollars,
GPUs, and annual infrastructure budget. Is this an algorithm decision or a capital
expenditure decision?

::: {.callout-note}
## Prerequisites
Complete [Tutorial 2: Two Phases, One Request](02_two_phases.qmd) and
[Tutorial 3: The KV Cache Wall](03_kv_cache.qmd). You should understand TTFT, ITL, and the
two-phase serving model.
:::

::: {.callout-note}
## What You Will Learn

- **Calculate** per-query latency and energy cost across reasoning depths K=1 to K=16
- **Compose** `InferenceScalingModel` with `EconomicsModel` for annualized fleet cost
- **Quantify** the cost multiplier of K=8 reasoning at 100 QPS fleet scale
- **Evaluate** a routing strategy that sends easy queries to a cheap model and hard queries to an expensive reasoning model
:::

::: {.callout-tip}
## Background: Inference-Time Compute Scaling

Standard LLM inference generates one answer directly: prefill the prompt (TTFT), then
decode tokens one at a time (ITL). Chain-of-thought reasoning changes this: the model
generates K intermediate "thinking" steps, each producing dozens of tokens, before
the final answer. The cost model becomes:

$$T_{\text{reason}} = \text{TTFT} + K \times T_{\text{step}}$$

where $T_{\text{step}} = \text{tokens\_per\_step} \times \text{ITL}$. Each step is
memory-bound (decoding), so the cost scales linearly with K but at the *decode* rate —
the expensive, memory-bandwidth-limited phase. A seemingly algorithmic choice (add more
reasoning) translates directly into GPU-hours and dollars.
:::
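
To make the formula concrete before touching the solver, here is a tiny back-of-envelope sketch in plain Python. The TTFT, ITL, and tokens-per-step values (and the `t_reason_ms` helper) are illustrative assumptions, not numbers computed by `mlsysim`:

```python
# Back-of-envelope version of T_reason = TTFT + K * tokens_per_step * ITL.
# All numbers below are illustrative assumptions, not solver output.
ttft_ms = 100.0        # one prefill per query
itl_ms = 30.0          # per-token decode latency
tokens_per_step = 50   # "dozens of tokens" per reasoning step

def t_reason_ms(k: int) -> float:
    t_step_ms = tokens_per_step * itl_ms
    return ttft_ms + k * t_step_ms

for k in (1, 4, 8, 16):
    print(f"K={k:>2}: {t_reason_ms(k):8.0f} ms  ({t_reason_ms(k) / t_reason_ms(1):.1f}x baseline)")
```

Because the single TTFT is paid once rather than K times, the K=8 multiplier lands just below 8x; the sweep in Section 3 produces the same shape with solver-computed numbers.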

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import Engine
```

---

## 2. Baseline Serving: Single-Query Cost

First, establish the baseline: a GPT-4-scale model served on an H100, no reasoning.
This gives us the per-query TTFT and ITL that everything else builds on.

```{python}
from mlsysim import Models, Hardware, ServingModel
from mlsysim.show import table, info

model = Models.GPT4
hardware = Hardware.Cloud.H100

serving = ServingModel()
baseline = serving.solve(
    model=model, hardware=hardware,
    seq_len=2048, batch_size=1, precision="fp16"
)

info("Baseline Serving",
     Model=f"{model.name} ({model.parameters.to('Gcount'):.0f} params)",
     Hardware=hardware.name,
     TTFT=baseline.ttft.to('ms'),
     ITL=baseline.itl.to('ms'),
     Memory_Used=f"{baseline.memory_utilization:.0%}",
     Feasible=f"{baseline.feasible}")
```

The ITL — the per-token decode latency — is the critical number. Every reasoning
step generates dozens of tokens at this rate. That is where the cost accumulates.

---

## 3. CoT Sweep: K=1 to K=16

Now sweep reasoning depth using the `InferenceScalingModel`. Each step generates
dozens of tokens of intermediate reasoning. Watch how the cost multiplier grows.

```{python}
from mlsysim import InferenceScalingModel

cot_solver = InferenceScalingModel()
K_values = [1, 4, 8, 16]

baseline_time = None
results = {}
rows = []

for K in K_values:
    result = cot_solver.solve(
        model=model, hardware=hardware,
        reasoning_steps=K, context_length=2048, precision="fp16"
    )
    total_ms = result.total_reasoning_time.to("ms").magnitude
    energy_j = result.energy_per_query.to("J").magnitude

    if baseline_time is None:
        baseline_time = total_ms

    multiplier = total_ms / baseline_time
    results[K] = result

    rows.append([K, f"{total_ms:.1f}ms", result.tokens_generated, f"{energy_j:.1f} J", f"{multiplier:.1f}x"])

table(["K", "Total Time", "Tokens", "Energy (J)", "Multiplier"], rows)
```

K=8 does not cost exactly 8x the baseline. The actual multiplier reflects the
structure of the cost: one TTFT (fixed) plus K decode phases (scaling). Since
TTFT is a small fraction of total time for high K, the multiplier approaches K
as K grows.
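
Written out from the cost model in the Background note, the multiplier is:

$$\frac{T_{\text{reason}}(K)}{T_{\text{reason}}(1)} = \frac{\text{TTFT} + K \times T_{\text{step}}}{\text{TTFT} + T_{\text{step}}}$$

The fixed TTFT in both numerator and denominator is why K=8 lands a bit below 8x.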

---

## 4. The $9M Question: Annualized Fleet Cost

Per-query cost is interesting. Fleet-level cost is what matters. Let's compute
the annual infrastructure cost of serving 100 queries per second at K=1 vs. K=8.

```{python}
from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_

econ = EconomicsModel()
target_qps = 100

fleet_results = {}
fleet_objects = {}
rows = []
for K in K_values:
    r = results[K]
    qt_s = r.total_reasoning_time.to("s").magnitude

    # Each GPU serves one query at a time (batch_size=1)
    qps_per_gpu = 1.0 / qt_s if qt_s > 0 else 0
    gpus_needed = int(target_qps / qps_per_gpu) + 1

    # Build fleet: 8 GPUs per node
    fleet = Fleet(
        name=f"K={K} Serving",
        node=Node(
            name="H100 Node",
            accelerator=hardware,
            accelerators_per_node=8,
            intra_node_bw=Q_("900 GB/s"),
        ),
        count=max((gpus_needed + 7) // 8, 1),
        fabric=NetworkFabric(
            name="IB NDR",
            bandwidth=Q_("400 Gbps").to("GB/s"),
        )
    )

    tco = econ.solve(fleet=fleet, duration_days=365)
    fleet_results[K] = tco
    fleet_objects[K] = fleet

    rows.append([K, f"{qt_s * 1000:.1f}ms", f"{qps_per_gpu:.2f}", fleet.total_accelerators, f"${tco.tco_usd:,.0f}"])

table(["K", "Query (ms)", "QPS/GPU", "GPUs", "Annual TCO ($)"], rows)
```

The jump from K=1 to K=8 is not just a latency increase — it propagates through
the entire infrastructure stack: more GPUs, more power, more cooling, more
network fabric, more capital expenditure.
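
Under the sizing rule in the code above (one in-flight query per GPU, so a GPU sustains $1/T_{\text{query}}$ QPS; $\text{QPS}_{\text{target}}$ and $T_{\text{query}}$ are shorthand for `target_qps` and the per-query time), the propagation is mechanical:

$$\text{GPUs}(K) \approx \text{QPS}_{\text{target}} \times T_{\text{query}}(K), \qquad \text{TCO}_{\text{annual}}(K) \propto \text{GPUs}(K)$$

so a roughly 7-8x latency multiplier becomes a roughly 7-8x fleet and, to first order, a 7-8x annual bill.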

::: {.callout-important}
## Key Insight

**A seemingly algorithmic decision — "add more reasoning steps" — is actually
an infrastructure spending decision.** K=8 chain-of-thought reasoning multiplies
per-query latency by approximately 7-8x, which means you need 7-8x more GPUs to
maintain the same QPS. Annual TCO scales proportionally. The decision to add CoT
reasoning is not a model architecture choice — it is a capital expenditure decision
that belongs in the CFO's budget, not just the ML engineer's notebook.
:::

---

## 5. The Routing Argument

The $9M annual TCO reframes the conversation from model architecture to capital planning.
But it also raises an obvious question: must every query pay the full reasoning cost?

In production, the answer is no — and the optimization is **routing**. Smart routing
sends easy queries to a fast, cheap model and only routes hard queries to the expensive
reasoning pipeline. Let's model a 70/30 split.

```{python}
# Scenario: 70% of queries go to Llama-3 70B (no reasoning, K=1)
# 30% go to GPT-4 with K=8 reasoning
model_cheap = Models.Llama3_70B

# Cheap path: Llama-3 70B, K=1
r_cheap = cot_solver.solve(
    model=model_cheap, hardware=hardware,
    reasoning_steps=1, context_length=2048, precision="fp16"
)

# Expensive path: GPT-4, K=8 (already computed)
r_expensive = results[8]

# Weighted average query time
qt_cheap = r_cheap.total_reasoning_time.to("s").magnitude
qt_expensive = r_expensive.total_reasoning_time.to("s").magnitude
qt_blended = 0.70 * qt_cheap + 0.30 * qt_expensive

# Compare: all queries to GPT-4 K=8 vs. routed
qps_blended = 1.0 / qt_blended if qt_blended > 0 else 0
gpus_blended = int(target_qps / qps_blended) + 1

fleet_routed = Fleet(
    name="Routed Serving",
    node=Node(name="H100", accelerator=hardware,
              accelerators_per_node=8, intra_node_bw=Q_("900 GB/s")),
    count=max((gpus_blended + 7) // 8, 1),
    fabric=NetworkFabric(name="IB", bandwidth=Q_("400 Gbps").to("GB/s")),
)

tco_routed = econ.solve(fleet=fleet_routed, duration_days=365)
tco_all_k8 = fleet_results[8]

savings = tco_all_k8.tco_usd - tco_routed.tco_usd
pct_savings = savings / tco_all_k8.tco_usd * 100 if tco_all_k8.tco_usd > 0 else 0

table(
    ["Strategy", "GPUs", "Annual TCO ($)"],
    [
        ["All queries -> GPT-4 K=8", fleet_objects[8].total_accelerators, f"${tco_all_k8.tco_usd:,.0f}"],
        ["70/30 routed", fleet_routed.total_accelerators, f"${tco_routed.tco_usd:,.0f}"],
    ]
)
info(Savings=f"${savings:,.0f} ({pct_savings:.0f}%)")
```

Routing reduces the fleet but does not eliminate the infrastructure commitment. The 30%
of queries that still need full reasoning require dedicated GPU capacity, and the routing
classifier itself introduces latency and complexity. The $9M is the ceiling; routing
contains it but does not make the capital planning question go away.
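
The floor is easy to see with the same one-query-per-GPU sizing rule. Writing $T_{\text{cheap}}$ and $T_{K=8}$ for the two paths' per-query times:

$$\text{GPUs}_{\text{routed}} \approx \text{QPS} \times \bigl(0.7\,T_{\text{cheap}} + 0.3\,T_{K=8}\bigr) \;\ge\; 0.3 \times \text{GPUs}_{\text{all-}K=8}$$

so even if the cheap path were free, the hard 30% of traffic alone would still pin the fleet at roughly a third of the all-reasoning deployment.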

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
If K=8 costs approximately 7.6x the baseline, predict: does K=16 cost exactly 16x?
More? Less? Write your reasoning (consider the fixed TTFT cost), then check the
actual numbers from the sweep table. Explain the gap.

**Exercise 2: Replace H100 with B200.**
Use `Hardware.Cloud.B200` (roughly 2x the memory bandwidth of H100) and re-run the
K=8 analysis. Predict first: will the *absolute* cost multiplier (K=8 vs. K=1) be the
same, larger, or smaller on the B200? Will the *fleet size* for 100 QPS change? Run
the numbers and explain.

**Exercise 3: At what K does 70B + reasoning exceed GPT-4 + no reasoning?**
Use `Models.Llama3_70B` with increasing K values and `Models.GPT4` with K=1. Find
the K value at which the 70B model's per-query latency exceeds GPT-4's K=1 latency.
What does this tell you about the trade-off between model size and reasoning depth?

**Self-check:** If ITL is 5ms and each reasoning step generates 50 tokens, what is
the total decode time for K=8? (Answer: 8 x 50 x 5ms = 2000ms = 2 seconds of pure
decode time, plus TTFT.)
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **K reasoning steps multiply per-query latency** by approximately K (slightly less due to fixed TTFT)
- **The cost multiplier propagates through the stack**: more latency means more GPUs, more power, more cost
- **Annual TCO scales linearly with fleet size**: K=8 reasoning can turn a $1M serving bill into $9M
- **Routing is the production answer**: send easy queries to cheap models, hard queries to expensive reasoning
- **Algorithm choices are infrastructure decisions**: adding CoT reasoning belongs in the budget planning process
:::

---

## Next Steps

- **[Geography is a Systems Variable](07_geography.qmd)** -- See how location changes the carbon cost of your serving fleet
- **[Scaling to 1000 GPUs](06_scaling_1000_gpus.qmd)** -- Discover reliability costs when training the models that do the reasoning
- **[Quantization: Not a Free Lunch](05_quantization.qmd)** -- Learn when reducing precision helps with serving costs
- **[Sensitivity Analysis](09_sensitivity.qmd)** -- Sweep parameters to find which lever matters most for your deployment