mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-08 02:28:25 -05:00
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero example on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
290 lines
11 KiB
Plaintext
---
title: "Quantization: Not a Free Lunch"
subtitle: "INT4 gives ~4x speedup for decode. For training, it gives almost none."
description: "Discover that quantization only helps when you are memory-bound. Compare the effect of INT4 on LLM decode (dramatic) vs. training (negligible). The regime determines whether the optimization works."
categories: ["algorithm", "intermediate"]
---

## The Question

INT4 quantization reduces model size by 4x relative to FP16. Intuitively, 4x fewer bytes
should mean 4x faster, and blogs claim massive speedups. **Does INT4 always give 4x speedup?**

You already have the tools to predict the answer. Before reading further, think about
what Tutorial 1 taught you: *which ceiling determines performance?* If quantization
reduces bytes but not FLOPs, when would it help, and when would it do nothing?

This tutorial makes the prediction, runs the experiment, and reveals that the answer
depends entirely on the regime.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 1: The Memory Wall](01_memory_wall.qmd) and
[Tutorial 2: Two Phases, One Request](02_two_phases.qmd). You should understand
memory-bound vs. compute-bound regimes and why LLM decode is memory-bound.
:::

::: {.callout-note}
## What You Will Learn

- **Predict** whether quantization will help a given workload based on its regime
- **Measure** the speedup from INT8 and INT4 for both memory-bound and compute-bound workloads
- **Explain** why the same optimization yields a 4x speedup in one case and almost none in another
- **Evaluate** the accuracy-compression trade-off using `CompressionModel`
:::

::: {.callout-tip}
## Background: What Quantization Actually Does

Quantization reduces the number of bytes per parameter:

| Precision | Bytes/Param | Relative to FP16 |
|:----------|:------------|:-----------------|
| FP16      | 2           | 1x (baseline)    |
| INT8      | 1           | 0.5x             |
| INT4      | 0.5         | 0.25x            |

For **memory-bound** workloads, latency scales with the bytes loaded from HBM per step,
so halving the bytes halves the latency. For **compute-bound** workloads, latency is set by
FLOP/s. Fewer bytes do not change the number of FLOPs, so latency stays the same.

The key question is always: *which ceiling am I hitting?*
:::

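The two-ceiling reasoning in the background note can be sketched in a few lines of plain Python. This is a minimal roofline sketch, not MLSys·im's internal model: the helper names (`roofline_latency_s`, `speedup_from_halving_bytes`) and the accelerator numbers (100 TFLOP/s, 1 TB/s) are hypothetical, chosen only to keep the arithmetic easy to follow. Under these assumptions, halving the bytes doubles the speed of the low-intensity step and leaves the high-intensity step unchanged.

```python
def roofline_latency_s(flops, bytes_moved, peak_flops, bandwidth):
    """Latency under a two-ceiling roofline: the slower of compute and memory."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / bandwidth
    regime = "memory-bound" if memory_s >= compute_s else "compute-bound"
    return max(compute_s, memory_s), regime

# Hypothetical accelerator (illustrative numbers, not a real device).
PEAK_FLOPS = 100e12   # 100 TFLOP/s
BANDWIDTH = 1e12      # 1 TB/s

def speedup_from_halving_bytes(flops, bytes_moved):
    """Return the original regime and the speedup from moving half the bytes."""
    before, regime = roofline_latency_s(flops, bytes_moved, PEAK_FLOPS, BANDWIDTH)
    after, _ = roofline_latency_s(flops, bytes_moved / 2, PEAK_FLOPS, BANDWIDTH)
    return regime, before / after

# Low arithmetic intensity (1 FLOP/byte): memory time dominates.
mem_regime, mem_speedup = speedup_from_halving_bytes(flops=2e9, bytes_moved=2e9)
# High arithmetic intensity (5000 FLOP/byte): compute time dominates.
cmp_regime, cmp_speedup = speedup_from_halving_bytes(flops=1e13, bytes_moved=2e9)

print(mem_regime, f"{mem_speedup:.1f}x")
print(cmp_regime, f"{cmp_speedup:.1f}x")
```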
---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import ServingModel, SingleNodeModel, CompressionModel
```

---

## 2. Memory-Bound Case: LLM Decode at Batch 1

LLM decoding at batch 1 is the textbook memory-bound workload. Each token generation
must reload the entire model from HBM. Fewer bytes per parameter means fewer bytes to
load, which in turn means lower inter-token latency (ITL):

```{python}
from mlsysim import ServingModel
from mlsysim.show import table, info

model = mlsysim.Models.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100
solver = ServingModel()

rows = []
baseline_itl = None  # FP16 ITL, set on the first loop iteration
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(
        model=model, hardware=hardware,
        seq_len=2048, batch_size=1, precision=prec
    )
    itl_ms = r.itl.to("ms").magnitude
    if baseline_itl is None:
        baseline_itl = itl_ms
    # Speedup is measured relative to the FP16 baseline.
    speedup = baseline_itl / itl_ms
    rows.append([prec, r.itl.to('ms'), r.model_weights_size, f"{speedup:.1f}x"])

table(["Precision", "ITL", "Weights", "Speedup vs FP16"], rows)
```

INT8 gives roughly 2x speedup and INT4 roughly 4x. The speedup tracks the byte
reduction almost exactly because the workload is purely memory-bound: every byte
you eliminate directly reduces the time to load model weights.

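The same ratios fall out of a back-of-the-envelope estimate: in the memory-bound regime, inter-token latency is bounded below by the time needed to stream every weight byte from HBM once. The sketch below is a rough cross-check, not the simulator's calculation. It assumes about 8e9 parameters and the H100 SXM's published peak bandwidth of roughly 3.35 TB/s; neither value is read from `mlsysim`, and the helper `decode_itl_ms` is hypothetical. The absolute numbers may differ from the table above, since a simulator can account for effects this estimate ignores (for example KV-cache reads or achievable rather than peak bandwidth), but the 2x and 4x ratios depend only on bytes per parameter.

```python
# Assumed inputs (not read from mlsysim): ~8e9 parameters and the H100 SXM's
# published peak HBM bandwidth of about 3.35 TB/s.
PARAMS = 8e9
HBM_BANDWIDTH = 3.35e12  # bytes per second
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def decode_itl_ms(params, bytes_per_param, bandwidth):
    """Lower bound on inter-token latency: time to stream all weights once."""
    return params * bytes_per_param / bandwidth * 1e3

estimates = {
    prec: decode_itl_ms(PARAMS, nbytes, HBM_BANDWIDTH)
    for prec, nbytes in BYTES_PER_PARAM.items()
}

for prec, itl in estimates.items():
    print(f"{prec}: {itl:.2f} ms ({estimates['fp16'] / itl:.1f}x vs fp16)")
```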
---

## 3. Compute-Bound Case: Training at Large Batch

Now let's try the same optimization on a compute-bound workload: ResNet-50 training
at batch 256 on the A100.

```{python}
from mlsysim import SingleNodeModel

train_model = mlsysim.Models.ResNet50
train_hw = mlsysim.Hardware.Cloud.A100
train_solver = SingleNodeModel()

rows = []
baseline_lat = None  # FP16 latency, set on the first loop iteration
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(
        model=train_model, hardware=train_hw,
        batch_size=256, precision=prec
    )
    lat_ms = p.latency.to("ms").magnitude
    if baseline_lat is None:
        baseline_lat = lat_ms
    # Guard against division by zero if the solver reports zero latency.
    speedup = baseline_lat / lat_ms if lat_ms > 0 else 0
    rows.append([prec, p.latency.to('ms'), f"{p.throughput:.0f} img/s", f"{speedup:.1f}x"])

table(["Precision", "Latency", "Throughput", "Speedup"], rows)
```

The speedup is negligible. Why? At batch 256, ResNet-50 training is
**compute-bound**: the bottleneck is arithmetic throughput (FLOP/s), not memory bandwidth.
Reducing bytes per parameter does not change the number of FLOPs in the forward and backward
passes. The GPU is already saturated with compute, so loading weights faster does not help.

::: {.callout-warning}
## Nuance: INT8 Tensor Cores

In practice, GPUs with low-precision Tensor Cores (INT8 on both the A100 and H100; INT4
support varies by architecture) also gain *higher compute throughput* at lower precision.
For example, the A100 delivers 624 TOPS at INT8 vs. 312 TFLOP/s at FP16 (dense), a 2x
compute boost. This means quantization simultaneously changes *both* the memory ceiling
(fewer bytes) and the compute ceiling (more integer ops per second). For workloads near the
ridge point, this dual effect can shift the regime classification itself.
MLSys·im's first-order model captures the memory effect; the compute boost is a
second-order effect that depends on hardware-specific Tensor Core support.
:::

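To see how much the second-order effect can matter, the sketch below compares the two views for a workload sitting just above the FP16 ridge point. It is an illustration under stated assumptions, not a model of any real kernel: the bandwidth is rounded to 2 TB/s, the workload (200 ops per FP16 byte) is hypothetical, and it assumes every operation can actually run on INT8 Tensor Cores, which is generally not the case for training. With these inputs the memory-only view predicts no speedup, while the dual-effect view predicts about 2x.

```python
A100_BANDWIDTH = 2.0e12   # bytes/s; assumed, rounded to ~2 TB/s
PEAK_FP16 = 312e12        # dense FP16 Tensor Core FLOP/s (figure quoted above)
PEAK_INT8 = 624e12        # dense INT8 Tensor Core ops/s (figure quoted above)

def latency_s(ops, bytes_moved, peak_ops, bandwidth):
    """Two-ceiling roofline latency: the slower of compute and memory."""
    return max(ops / peak_ops, bytes_moved / bandwidth)

# Hypothetical workload just above the FP16 ridge point
# (312e12 / 2e12 = 156 ops per byte).
FP16_BYTES = 1e9
OPS = 200 * FP16_BYTES    # 200 ops per FP16 byte

fp16 = latency_s(OPS, FP16_BYTES, PEAK_FP16, A100_BANDWIDTH)
# First-order view: INT8 halves the bytes; the compute ceiling is unchanged.
int8_memory_only = latency_s(OPS, FP16_BYTES / 2, PEAK_FP16, A100_BANDWIDTH)
# Dual effect: INT8 halves the bytes and doubles the compute ceiling.
int8_dual = latency_s(OPS, FP16_BYTES / 2, PEAK_INT8, A100_BANDWIDTH)

speedup_memory_only = fp16 / int8_memory_only
speedup_dual = fp16 / int8_dual
print(f"memory effect only: {speedup_memory_only:.1f}x")
print(f"memory + compute:   {speedup_dual:.1f}x")
```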
---

## 4. The Reveal: Same Optimization, Two Regimes

Let's put the results side by side to make the contrast stark:

```{python}
rows = []

# Memory-bound: LLM decode
decode_row = ["Llama-3 8B decode"]
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(model=model, hardware=hardware, seq_len=2048, batch_size=1, precision=prec)
    decode_row.append(r.itl.to('ms'))
decode_row.append("Memory-bound")
rows.append(decode_row)

# Compute-bound: training
train_row = ["ResNet-50 train bs=256"]
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(model=train_model, hardware=train_hw, batch_size=256, precision=prec)
    train_row.append(p.latency.to('ms'))
train_row.append("Compute-bound")
rows.append(train_row)

table(["Workload", "FP16", "INT8", "INT4", "Regime"], rows)
```

::: {.callout-important}
## Key Insight

**Quantization reduces bytes loaded from memory. If you are memory-bound, fewer bytes
means proportional speedup. If you are compute-bound, fewer bytes means nothing, because
compute, not memory, is the ceiling.** The regime determines whether the optimization works.
INT4 gives ~4x for LLM decode (memory-bound) and ~1x, that is, no speedup, for large-batch
training (compute-bound). The same technique, applied to different regimes, yields completely
different results. Always check the regime before choosing an optimization.
:::

---

## 5. The Accuracy Tax: CompressionModel

Quantization is not free: it trades accuracy for speed. The `CompressionModel` quantifies
this trade-off:

```{python}
from mlsysim import CompressionModel

comp_solver = CompressionModel()

rows = []
for bits in [16, 8, 4]:
    c = comp_solver.solve(
        model=model, hardware=hardware,
        method="quantization", target_bitwidth=bits
    )
    rows.append([
        bits,
        c.compressed_size_gb,
        f"{c.compression_ratio:.1f}x",
        f"{c.estimated_accuracy_delta:+.1%}",
        f"{c.memory_savings_pct:.1f}%"
    ])

table(["Bits", "Compressed", "Compression", "Accuracy", "Savings"], rows)
```

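The size columns in a table like this follow from simple arithmetic on the bit width. The sketch below reproduces only that arithmetic; it is not `CompressionModel`'s implementation, the helper `quantized_size` is hypothetical, the 8e9 parameter count is an assumption, and it ignores quantization metadata such as per-group scales and zero points. It says nothing about accuracy, which has to be measured or estimated separately.

```python
PARAMS = 8e9          # assumed parameter count for an 8B model
BASELINE_BITS = 16    # FP16 baseline

def quantized_size(params, bits, baseline_bits=BASELINE_BITS):
    """Return (size in GB, compression ratio, memory savings in percent)."""
    size_gb = params * bits / 8 / 1e9
    ratio = baseline_bits / bits
    savings_pct = (1 - bits / baseline_bits) * 100
    return size_gb, ratio, savings_pct

for bits in (16, 8, 4):
    size_gb, ratio, savings = quantized_size(PARAMS, bits)
    print(f"{bits:>2} bits: {size_gb:.1f} GB, {ratio:.1f}x, {savings:.1f}% saved")
```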
INT8 typically has minimal accuracy loss (< 1%). INT4 can degrade accuracy by 2-5%, depending
on the model and calibration method. The decision is not "always quantize"; it is "quantize
when you are memory-bound AND the accuracy cost is acceptable for your application."

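The two-part rule above can be written down as a small predicate. This is a sketch of the decision logic only: the function `should_quantize` and its parameters are hypothetical, and the accuracy numbers passed in are placeholders rather than measurements. The regime test compares arithmetic intensity against the hardware ridge point, as in Tutorial 1.

```python
def should_quantize(arithmetic_intensity, ridge_point, accuracy_drop, tolerance):
    """Quantize only if memory-bound AND the accuracy cost is acceptable."""
    memory_bound = arithmetic_intensity < ridge_point
    acceptable = accuracy_drop <= tolerance
    return memory_bound and acceptable

# Illustrative inputs only; the accuracy figures are placeholders.
decode_ok = should_quantize(5, 150, accuracy_drop=0.03, tolerance=0.05)
decode_strict = should_quantize(5, 150, accuracy_drop=0.03, tolerance=0.01)
training = should_quantize(400, 150, accuracy_drop=0.03, tolerance=0.05)

print(decode_ok)      # memory-bound, cost within tolerance
print(decode_strict)  # memory-bound, but cost exceeds tolerance
print(training)       # compute-bound, so quantization will not help
```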
::: {.callout-warning}
## When NOT to Quantize

- **Training**: You are compute-bound at large batch sizes. Quantization does not help and
  introduces gradient noise that can harm convergence.
- **High-accuracy applications**: Medical, financial, and safety-critical systems may not
  tolerate even 1% accuracy loss.
- **Already compute-bound inference**: If your inference workload runs at large batch sizes
  (e.g., offline batch processing), you are likely compute-bound already.
:::

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Llama-3 70B has roughly 9x as many parameters as Llama-3 8B, so each decoded token must
stream far more weight bytes at batch 1. Before running code, predict: will INT4 give a
*larger* or *smaller* speedup for the 70B model compared to the 8B? Write your prediction,
then verify with `mlsysim.Models.Llama3_70B`. (Hint: think about what determines speedup in
the memory-bound regime.)

**Exercise 2: Find the crossover batch size.**
At some batch size, LLM inference transitions from memory-bound to compute-bound. At that
point, quantization stops helping. Sweep batch sizes from 1 to 256 for Llama-3 8B on the
H100 and compare FP16 vs. INT4 ITL. At what batch size does the INT4 speedup drop below
2x? Below 1.5x?

**Exercise 3: Accuracy-compression frontier.**
Use `CompressionModel` to compare quantization (INT8, INT4) vs. **pruning**, a technique
that removes parameters entirely (setting them to zero), reducing both model size and
computation. Try sparsity levels of 0.5, 0.75, and 0.9 for Llama-3 8B. Build a table
showing compression ratio vs. accuracy delta for each method. Which method gives the best
compression-to-accuracy trade-off?

**Self-check:** A workload has arithmetic intensity of 5 FLOP/byte and the hardware ridge
point is 150 FLOP/byte. Is this workload memory-bound or compute-bound? Will quantization
help? (Answer: Memory-bound. Yes, quantization will help proportionally.)
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Quantization primarily reduces bytes loaded from memory**: it helps memory-bound workloads proportionally and compute-bound workloads negligibly (though dedicated INT8/INT4 Tensor Cores also increase compute throughput)
- **LLM decode at batch 1** is the ideal case for quantization: ~2x for INT8, ~4x for INT4
- **Large-batch training** is compute-bound: quantization provides almost no speedup
- **The regime determines the outcome**: always check whether you are memory-bound or compute-bound before applying quantization
- **Accuracy is the tax**: INT8 costs < 1%, INT4 costs 2-5%, which is acceptable for some applications but not for others
:::

---

## Next Steps

- **[KV-Cache: The Hidden Tax](03_kv_cache.qmd)**: Quantization also shrinks the KV-cache, allowing more concurrent users
- **[The Memory Wall](01_memory_wall.qmd)**: Revisit the memory wall to see how quantization shifts the bandwidth bottleneck
- **[Starving the GPU](04_starving_the_gpu.qmd)**: Another case where the bottleneck is not where you expect
- **[Where to Invest](09_sensitivity.qmd)**: Quantify exactly how much quantization buys you compared to hardware upgrades