* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- The hero example on docs/index.qmd and getting-started.qmd now reflects the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Update every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to point to the current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check; a minimal
sketch of the pattern follows this list.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
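
For illustration, the F823 shadowing pattern fixed in `sim/simulations.py` boils down to the following (placeholder names, not the actual mlsysim code):

```python
class Fleet:  # stands in for the real Fleet class
    pass

def simulate(target):
    # BUG pattern (before the fix): a redundant local import further down, e.g.
    #     from mypkg.sim import Fleet
    # makes `Fleet` a local name for the whole function, so an earlier
    # isinstance() check raises UnboundLocalError at runtime (ruff flags F823).
    #
    # Fix: delete the redundant local import and rely on the module-level name,
    # as below.
    return "fleet" if isinstance(target, Fleet) else "node"

print(simulate(Fleet()))  # -> fleet
```
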
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
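
The f-string fix follows this shape (illustrative values, not the script's actual strings):

```python
# Before: an f-string broken across lines with a raw newline inside the quotes,
#     print(f"
#     Design point: {latency} ms")
# fails to parse on the Python versions the project supports.
# After: write the newline as an escape so the literal stays on one line.
latency = 4.8
print(f"\nDesign point: {latency} ms")
```
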
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`. A before/after sketch follows this list.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
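
Before/after sketch of the docs-page change (paths illustrative, not copied from the repo):

```python
# Before: hand-rolled loader pinned to a source path that no longer exists
# after the restructure.
#     import importlib.util
#     spec = importlib.util.spec_from_file_location(
#         "mlsysim", "<repo>/mlsysim/__init__.py")
#     mlsysim = importlib.util.module_from_spec(spec)
#     spec.loader.exec_module(mlsysim)
#
# After: rely on the package already installed in the docs build environment
# via `pip install ".[docs]"`.
import mlsysim
```
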
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
---
title: "The Memory Wall"
subtitle: "Why 3.2× more FLOP/s gives only 1.7× speedup — and how to know in advance."
description: "Compare A100 and H100 GPUs to discover that for memory-bound workloads, bandwidth — not compute — determines performance. The most important fallacy in ML systems."
categories: ["node", "beginner"]
---

## The Question

NVIDIA's H100 has **3.2× more FLOP/s** than the A100. So upgrading should give you a 3.2×
speedup, right?

**Wrong.** For the workloads that matter most in production — LLM inference, recommendation
models, any memory-bound task — you get closer to **1.7×**. This tutorial shows you exactly
why, and teaches you to predict the actual speedup before spending a dollar on hardware.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd). You should understand
memory-bound vs. compute-bound and the ridge point concept.
:::

::: {.callout-note}
## What You Will Learn

- **Calculate** the actual speedup between two GPUs for a given workload
- **Explain** why the binding constraint determines which spec matters
- **Predict** whether a hardware upgrade will help a specific model
- **Apply** the roofline model to hardware procurement decisions
:::

::: {.callout-tip}
## Background: The Two Specs That Matter

GPU vendors advertise peak FLOP/s prominently. But every GPU also has a memory bandwidth
spec (in TB/s) that is equally important. Which spec determines your actual performance
depends entirely on which **regime** your workload is in:

| Regime | Binding Constraint | Speedup Scales With |
|:-------|:-------------------|:--------------------|
| Memory-bound | HBM bandwidth (TB/s) | Bandwidth ratio between GPUs |
| Compute-bound | Peak arithmetic (FLOP/s) | FLOP/s ratio between GPUs |

The key numbers for this tutorial:

| Spec | A100 | H100 | Ratio |
|:-----|:-----|:-----|:------|
| Peak FP16 | 312 TFLOP/s | 989 TFLOP/s | **3.2×** |
| HBM Bandwidth | 2.0 TB/s | 3.35 TB/s | **1.7×** |

If your workload is memory-bound, the speedup ceiling is 1.7×, regardless of the 3.2× compute improvement.
:::
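
Written out, the ceiling for a memory-bound workload is simply the bandwidth ratio of the two parts:

$$
\text{speedup}_{\text{memory-bound}} \;\le\; \frac{BW_{\text{H100}}}{BW_{\text{A100}}} = \frac{3.35\ \text{TB/s}}{2.0\ \text{TB/s}} \approx 1.7\times
$$
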

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import Engine
```

---

## 2. Side-by-Side Hardware Comparison

Let's load both GPUs from the Silicon Zoo and confirm the specs:

```{python}
from mlsysim.show import table

a100 = mlsysim.Hardware.Cloud.A100
h100 = mlsysim.Hardware.Cloud.H100

flops_a = a100.compute.peak_flops.to("TFLOPs/s").magnitude
flops_h = h100.compute.peak_flops.to("TFLOPs/s").magnitude
bw_a = a100.memory.bandwidth.to("TB/s").magnitude
bw_h = h100.memory.bandwidth.to("TB/s").magnitude

table(
    ["Spec", "A100", "H100", "Ratio"],
    [
        ["Peak FP16 (TFLOP/s)", flops_a, flops_h, f"{flops_h/flops_a:.1f}x"],
        ["HBM BW (TB/s)", bw_a, bw_h, f"{bw_h/bw_a:.1f}x"],
        ["Ridge (FLOP/byte)", flops_a*1e12/(bw_a*1e12), flops_h*1e12/(bw_h*1e12), ""],
    ],
)
```

The FLOP/s ratio is 3.2× but the bandwidth ratio is only 1.7×. The ridge point also
shifts: the H100 has a *higher* ridge, meaning more workloads fall into the memory-bound
regime on the H100 than on the A100.
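
To see why the ridge moves, divide peak compute by bandwidth for each part (the same arithmetic the "Ridge" row above performs):

$$
\text{ridge}_{\text{A100}} = \frac{312\ \text{TFLOP/s}}{2.0\ \text{TB/s}} \approx 156\ \tfrac{\text{FLOP}}{\text{byte}},
\qquad
\text{ridge}_{\text{H100}} = \frac{989\ \text{TFLOP/s}}{3.35\ \text{TB/s}} \approx 295\ \tfrac{\text{FLOP}}{\text{byte}}
$$

Any workload whose arithmetic intensity sits below the ridge is memory-bound, so a higher ridge widens the memory-bound region.
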

---

## 3. The Fallacy: LLM Inference Speedup

Let's test the "3.2× speedup" claim with a workload that dominates production
today — Llama-3 8B inference at batch size 1:

```{python}
model = mlsysim.Models.Llama3_8B

# Solve on both GPUs
prof_a100 = Engine.solve(model=model, hardware=a100, batch_size=1, precision="fp16")
prof_h100 = Engine.solve(model=model, hardware=h100, batch_size=1, precision="fp16")

lat_a = prof_a100.latency.to("ms").magnitude
lat_h = prof_h100.latency.to("ms").magnitude
speedup = lat_a / lat_h

table(
    ["", "A100", "H100", "Speedup"],
    [
        ["Bottleneck", prof_a100.bottleneck, prof_h100.bottleneck, ""],
        ["Latency", lat_a * mlsysim.core.constants.ureg.ms, lat_h * mlsysim.core.constants.ureg.ms, f"{speedup:.1f}x"],
        ["Throughput", prof_a100.throughput, prof_h100.throughput, ""],
    ],
)
```

Both GPUs report **memory-bound**. The actual speedup is approximately **1.7×** — matching
the bandwidth ratio, not the FLOP/s ratio. The H100's extra compute (3.2× the peak FLOP/s,
against a realized 1.7× speedup) is entirely wasted for this workload.

**Sanity check:** Llama-3 8B at FP16 = 8B params × 2 bytes = 16 GB of weights. On the A100
(2.0 TB/s), minimum decode latency ≈ 16 GB ÷ 2.0 TB/s = **8.0 ms**. On the H100
(3.35 TB/s), it is 16 GB ÷ 3.35 TB/s = **4.8 ms**. Speedup = 8.0 / 4.8 = **1.67×** —
matching the bandwidth ratio, as expected for a memory-bound workload.
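
The same check takes a few lines of plain Python, independent of the engine (a rough weights-only bound, mirroring the arithmetic above):

```python
# Decode-latency floor = bytes that must stream from HBM / HBM bandwidth.
weights_bytes = 8e9 * 2                        # 8B params x 2 bytes (FP16) = 16 GB
lat_a100_ms = weights_bytes / 2.00e12 * 1e3    # 2.0 TB/s  -> ~8.0 ms
lat_h100_ms = weights_bytes / 3.35e12 * 1e3    # 3.35 TB/s -> ~4.8 ms
print(f"A100 {lat_a100_ms:.1f} ms, H100 {lat_h100_ms:.1f} ms, "
      f"speedup {lat_a100_ms / lat_h100_ms:.2f}x")  # -> ~1.67x
```
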

---

## 4. When DOES 3.2× Matter? The Batch Size Crossover

The FLOP/s advantage only kicks in when you cross into the compute-bound regime.
Let's sweep batch size on both GPUs and find the crossover:

```{python}
rows = []
for batch in [1, 4, 16, 32, 64, 128, 256]:
    pa = Engine.solve(model=model, hardware=a100, batch_size=batch, precision="fp16")
    ph = Engine.solve(model=model, hardware=h100, batch_size=batch, precision="fp16")

    la = pa.latency.to("ms").magnitude
    lh = ph.latency.to("ms").magnitude
    sp = la / lh if lh > 0 else 0

    rows.append([batch, pa.bottleneck, ph.bottleneck, f"{sp:.1f}x"])

table(["Batch", "A100 Bottleneck", "H100 Bottleneck", "Speedup"], rows)
```

We can visualize where Llama-3 8B sits on the H100's roofline model. Note the high ridge point:

```{python}
from mlsysim.viz.plots import plot_roofline

# Plot the H100 roofline and see where Llama-3 8B (batch 1) falls
fig, ax = plot_roofline(h100, workloads=[model])
fig.show()
```

::: {.callout-important}
## Key Insight

**The binding constraint determines which hardware spec matters.** When you are memory-bound,
speedup scales with the bandwidth ratio (1.7×). When you are compute-bound, speedup scales
with the FLOP/s ratio (up to 3.2×). The transition happens at different batch sizes on each
GPU because the H100's higher ridge point means it *stays memory-bound longer*. If you are
making a procurement decision, the first question is not "how many FLOP/s?" but "which regime
will my production workload operate in?"
:::
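
A compact way to carry this logic around is the first-order roofline time bound (a sketch of the reasoning, not necessarily the exact expression `Engine.solve` implements):

$$
t \;\approx\; \max\!\left(\frac{\text{FLOPs}}{\text{peak FLOP/s}},\ \frac{\text{bytes moved}}{\text{bandwidth}}\right),
\qquad
\text{speedup} = \frac{t_{\text{A100}}}{t_{\text{H100}}}
$$

Whichever term wins the max is the binding constraint, and only improvements to that term show up as speedup.
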

---

## 5. The Procurement Table: Three Generations

Let's extend the analysis across three GPU generations to see the trend:

```{python}
gpus = [
    ("V100", mlsysim.Hardware.Cloud.V100),
    ("A100", mlsysim.Hardware.Cloud.A100),
    ("H100", mlsysim.Hardware.Cloud.H100),
]

rows = []
for name, hw in gpus:
    p = Engine.solve(model=model, hardware=hw, batch_size=1, precision="fp16")
    flops = hw.compute.peak_flops.to("TFLOPs/s").magnitude
    bw = hw.memory.bandwidth.to("TB/s").magnitude
    ridge = flops * 1e12 / (bw * 1e12)
    rows.append([name, flops, bw, ridge, p.latency, p.bottleneck])

table(["GPU", "TFLOP/s", "BW (TB/s)", "Ridge", "Latency", "Bottleneck"], rows)
```

Across three generations, compute has grown faster than bandwidth. The ridge point keeps
rising, which means **more workloads are memory-bound on newer hardware**. This is the
memory wall — and it is getting worse, not better.

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
The B200 has ~8 TB/s HBM3e bandwidth and ~2250 TFLOP/s (FP16 dense). Before running any
code, predict: *write the speedup as a ratio (e.g., 2.3×)* for Llama-3 8B going from
H100 → B200 at batch size 1. Record your reasoning in one sentence. Then verify with
`mlsysim.Hardware.Cloud.B200`. How close were you?

**Exercise 2: Find the crossover batch size.**
For the A100, at what exact batch size does Llama-3 8B transition from memory-bound to
compute-bound? Write a loop that sweeps batch sizes from 1 to 512 in steps of 1 and
prints the first compute-bound batch size. Do the same for the H100. Why is the crossover
different?

**Exercise 3: Validate against published benchmarks.**
Look up the MLPerf Inference results for LLM workloads on A100 vs. H100 (available at
mlcommons.org). Compare the measured throughput ratio to our analytical prediction from
Section 3. What accounts for the difference? (Hint: real systems include software
optimizations like FlashAttention and continuous batching that our first-order model
does not capture. The gap between analytical prediction and measured performance is
itself informative.)

**Self-check:** If GPU-A has 500 TFLOP/s and 2 TB/s bandwidth, and GPU-B has 1000 TFLOP/s
and 4 TB/s bandwidth, what speedup do you expect for a memory-bound workload? For a
compute-bound workload? *(Write each answer as a ratio.)*
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **The memory wall is real**: HBM bandwidth has grown slower than compute across GPU generations
- **Speedup depends on regime**: memory-bound workloads scale with bandwidth ratio, not FLOP/s ratio
- **The ridge point rises each generation**: more production workloads are memory-bound on newer GPUs
- **Procurement decisions require regime analysis**: always check which wall binds before comparing specs
- **The roofline model predicts this**: `Engine.solve` tells you the regime before you spend a dollar
:::

---

## Next Steps

- **[Two Phases, One Request](02_two_phases.qmd)** — Discover that LLM serving hits *both* ceilings in a single request
- **[Quantization: Not a Free Lunch](05_quantization.qmd)** — Learn when reducing precision helps (memory-bound) vs. when it doesn't (compute-bound)
- **[Where to Invest](09_sensitivity.qmd)** — Use sensitivity analysis to quantify exactly how much each spec matters
- **[Silicon Zoo](../zoo/hardware.qmd)** — Browse all GPU specs including V100, A100, H100, H200, B200, MI300X, and TPUs