---
title: "Hello, Roofline"
subtitle: "Five lines of code to predict whether your model is memory-bound or compute-bound."
description: "Learn to use MLSys·im's analytical roofline model to predict ML model performance on any hardware. The foundation for all ML systems reasoning."
categories: ["start", "beginner"]
---
## The Question
You have a model and a GPU. Before you write any code, train anything, or rent any
cloud instance — **can you predict which hardware resource will be the bottleneck?**
This tutorial teaches the single most important skill in ML systems: using the
**roofline model** to answer that question in under a second.
::: {.callout-note}
## Prerequisites
None. This is the starting point for all MLSys·im tutorials.
:::
::: {.callout-tip}
## Key Terms for These Tutorials
If you are new to ML, here are the essential terms used throughout this tutorial series:
| Term | Meaning |
|:-----|:--------|
| **Model** | A mathematical function with learned **parameters** (weights) that maps inputs to outputs — e.g., an image to a label |
| **Parameters** | The numbers a model learns during training. A "25M parameter" model stores 25 million numbers |
| **Inference** | Running a trained model on new input to get a prediction (as opposed to *training*, which learns the parameters) |
| **CNN** | Convolutional Neural Network — a model architecture for images (e.g., ResNet-50) |
| **LLM** | Large Language Model — a model that generates text one **token** (roughly one word) at a time (e.g., GPT-4, Llama-3) |
| **FP16** | 16-bit floating point ("half precision") — uses 2 bytes per parameter. ML often uses reduced precision for speed |
| **FLOP/s** | Floating-point operations per second — a measure of compute speed. **TFLOP/s** = trillion FLOP/s |
| **HBM** | High Bandwidth Memory — the fast DRAM attached to a GPU (e.g., HBM2e on A100, HBM3 on H100) |
| **Batch size** | How many inputs are processed together in one pass. Larger batches amortize the cost of loading weights |
See the [Glossary](../glossary.qmd) for a complete list of terms.
:::
::: {.callout-note}
## What You Will Learn
- **Identify** the performance bottleneck (memory-bound vs. compute-bound) for any model-hardware pair
- **Predict** how batch size shifts the operating point along the roofline
- **Interpret** the ridge point as the boundary between two performance regimes
- **Use** `Engine.solve` as the foundational API for all MLSys·im analyses
:::
::: {.callout-tip}
## Background: The Roofline Model
The roofline model (Williams, Waterman, and Patterson, 2009) is the foundational
analytical tool for predicting hardware bottlenecks. Every accelerator has two speed limits:
1. **Compute ceiling** — how fast it can do arithmetic (measured in FLOP/s)
2. **Memory bandwidth ceiling** — how fast it can load data from memory (measured in bytes/s)
The roofline model reduces to four lines of algebra:
$$T_{\text{compute}} = \frac{\text{FLOPs}}{\text{Peak FLOP/s}}
\qquad
T_{\text{memory}} = \frac{\text{Bytes}}{\text{Peak BW}}$$
$$T = \max(T_{\text{compute}},\; T_{\text{memory}})$$
$$\text{Ridge point} = \frac{\text{Peak FLOP/s}}{\text{Peak BW}} \quad [\text{FLOP/byte}]$$
Your model's **arithmetic intensity** (FLOPs ÷ Bytes) determines which ceiling you hit. If it is below the ridge point, you are **memory-bound** (starved for data). Above it, you are **compute-bound** (saturating the arithmetic units). This single classification drives every optimization decision downstream.
**Important caveat:** The roofline is an *upper bound*. Real performance is always below it due to scheduling overhead, memory access patterns, and imperfect utilization. Achieving 40–60% of the roofline ceiling is considered good in practice. The model's value is not in predicting exact latency — it is in identifying the **binding constraint** (which resource limits you).
:::
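The algebra is small enough to sketch in plain Python. The helper below is a back-of-envelope aid for the worked examples later in this tutorial — not the MLSys·im API, whose reported numbers may differ slightly:
```python
# Back-of-envelope roofline: latency is set by whichever ceiling binds.
def roofline(flops, bytes_moved, peak_flops, peak_bw):
    """Return (latency in seconds, bottleneck label) for one workload."""
    t_compute = flops / peak_flops      # time if arithmetic were the only limit
    t_memory = bytes_moved / peak_bw    # time if data movement were the only limit
    if t_memory >= t_compute:
        return t_memory, "memory-bound"
    return t_compute, "compute-bound"
```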
::: {.callout-note}
## Conventions Used in These Tutorials
- **FLOP counting:** We count one multiply-accumulate (MAC) as **1 FLOP**, consistent with the MLSys Zoo constants. Industry and vendor datasheets typically count 1 MAC = 2 FLOPs. This factor of 2 shifts the ridge point: the A100's ridge is ~78 FLOP/byte in our convention but ~156 in the 2-FLOP convention. Always check which convention a paper uses before comparing numbers.
- **Peak specs:** We use vendor-published peak Tensor Core throughput and peak HBM bandwidth. Real sustained performance is typically 70–90% of these peaks.
- **Units:** `Q_` creates physical quantities with units (e.g., `Q_("2 TB/s")`). The `~` in format strings like `:~.2f` shows abbreviated unit names.
:::
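Based on the `Q_("2 TB/s")` syntax above, `Q_` appears to be a pint-style quantity constructor — an assumption, not a documented guarantee. If so, you can reproduce the unit-tagged arithmetic outside MLSys·im:
```python
# Assumption: Q_ behaves like pint's Quantity; MLSys·im may wrap its own registry.
from pint import UnitRegistry

Q_ = UnitRegistry().Quantity
t_memory = (Q_("50 MB") / Q_("2.0 TB/s")).to("ms")
print(f"{t_memory:~.3f}")  # "0.025 ms" — the `~` abbreviates the unit name
```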
---
## 1. Setup
```{python}
#| echo: false
#| output: false
# Hidden setup — the package is pip-installed in the docs build environment
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```
After `pip install mlsysim`, the import is two lines:
```python
import mlsysim
from mlsysim import Engine
```
---
## 2. Pick a Model and a GPU
Pull vetted specifications from the **MLSys Zoo** — no need to search datasheets.
```{python}
from mlsysim.show import table, info
# ResNet-50: 25M parameters, ~4.1 GFLOP per inference (counting multiply-accumulate as 1 FLOP)
model = mlsysim.Models.ResNet50
# NVIDIA A100: 312 TFLOP/s (FP16), 2.0 TB/s HBM2e, 80 GB
hardware = mlsysim.Hardware.Cloud.A100
info("Model",
Name=f"{model.name} ({model.architecture})",
Parameters=model.parameters,
FLOPs_per_inf=model.inference_flops)
info("Hardware",
Name=hardware.name,
Peak_FP16=hardware.compute.peak_flops.to('TFLOPs/s'),
HBM_BW=hardware.memory.bandwidth.to('TB/s'))
```
---
## 3. Solve: One Line, One Answer
The `Engine.solve` method applies the roofline model — it calculates which of the two
speed limits you hit first, and returns latency, throughput, and the bottleneck classification.
```{python}
# One line: model + hardware + config → performance prediction
profile = Engine.solve(
    model=model,
    hardware=hardware,
    batch_size=1,        # Single image inference
    precision="fp16"     # Half-precision (16-bit floating point)
)
info(Bottleneck=profile.bottleneck,
     Latency=profile.latency.to('ms'),
     Throughput=f"{profile.throughput:.0f} images/sec")
```
At batch size 1, ResNet-50 performs ~4.1 GFLOP but must load ~50 MB of weights (25M params × 2 bytes). That gives an arithmetic intensity of ~82 FLOP/byte — close to the A100's ridge point of ~78 FLOP/byte. At this operating point, the two ceilings are nearly balanced, and the bottleneck label depends on exact assumptions. The important takeaway: **most of the A100's 312 TFLOP/s is idle** — you need larger batches to exploit it.
**Sanity check:** We can verify this with the equation from the Background. Note: we use our 1-FLOP-per-MAC convention here, so the A100's peak is 156 TFLOP/s (the vendor-reported 312 TFLOP/s uses the 2-FLOP convention):
- $T_{\text{memory}} = 50\;\text{MB} \div 2.0\;\text{TB/s} = 0.025\;\text{ms}$
- $T_{\text{compute}} = 4.1\;\text{GFLOP} \div 156\;\text{TFLOP/s} = 0.026\;\text{ms}$
- $T = \max(0.025, 0.026) = 0.026\;\text{ms}$ → the two ceilings are nearly equal ✓
ResNet-50 at batch 1 sits right at the ridge point. When $T_{\text{compute}} \approx T_{\text{memory}}$, the regime label is ambiguous — and that is the point: the ridge is a *boundary*, not a wall. Small differences in convention or measurement can flip the label. `Engine.solve` handles the convention internally, so its reported latency may differ slightly from this back-of-envelope estimate. The skill that matters is computing the ratio and knowing *where you stand*.
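Here is the same sanity check in plain Python, reusing the `roofline` helper sketched in the Background callout (all numbers in the 1-FLOP-per-MAC convention):
```python
weight_bytes = 25e6 * 2                  # 25M fp16 parameters ≈ 50 MB
peak_flops, peak_bw = 156e12, 2.0e12     # A100 in the 1-FLOP convention
latency, regime = roofline(4.1e9, weight_bytes, peak_flops, peak_bw)
print(f"{latency * 1e3:.3f} ms ({regime})")   # ≈ 0.026 ms — at the boundary,
                                              # so the label can flip either way
print(f"AI = {4.1e9 / weight_bytes:.0f} vs ridge = {peak_flops / peak_bw:.0f}")
```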
**Computing arithmetic intensity from first principles:** This is the skill that lets you
reason about *any* model, not just ones in the Zoo. The formula is FLOPs ÷ Bytes. Compare two very different workloads:
- **ResNet-50 (batch 1):** 4.1 GFLOP ÷ 50 MB = **82 FLOP/byte** → near the A100 ridge (78) — balanced
- **LLM decode (batch 1):** Each token does ~1 FLOP (one MAC) per parameter but loads 2 bytes per parameter = **0.5 FLOP/byte** → deeply memory-bound (you will explore this in [Tutorial 2](02_two_phases.qmd))
When you encounter an unfamiliar model, compute this ratio first. It tells you the regime
before you touch any code.
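A sketch of that ratio for the two workloads above (the decode figures are the rough per-parameter counts quoted in the bullet, in the 1-FLOP convention):
```python
ridge = 78                  # A100 ridge point, 1-FLOP convention (FLOP/byte)
ai_resnet = 4.1e9 / 50e6    # ResNet-50 batch 1: 82 FLOP/byte
ai_decode = 1 / 2           # LLM decode batch 1: ~1 FLOP, 2 bytes per parameter
for name, ai in [("ResNet-50 (b=1)", ai_resnet), ("LLM decode (b=1)", ai_decode)]:
    print(f"{name}: {ai:4.1f} FLOP/byte = {ai / ridge:.3f}x the ridge")
```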
---
## 4. Sweep Batch Size: Watch the Regime Shift
Now let's increase the batch size and see when the bottleneck changes. More images per
batch means more computation per weight load — which increases arithmetic intensity.
```{python}
rows = []
for batch in [1, 4, 16, 32, 64, 128, 256]:
    p = Engine.solve(
        model=model,
        hardware=hardware,
        batch_size=batch,
        precision="fp16"
    )
    rows.append([batch, p.bottleneck, f"{p.throughput:.0f}/s", p.latency.to('ms')])
table(["Batch", "Bottleneck", "Throughput", "Latency"], rows)
```
We can visualize this transition on a roofline plot. Notice where the model sits relative to the ridge point (the crossover between the memory-bound and compute-bound regimes).
```{python}
from mlsysim.viz.plots import plot_roofline
# The plot_roofline function takes the hardware node and a list of workloads
fig, ax = plot_roofline(hardware, workloads=[model])
fig.show()
```
::: {.callout-important}
## Key Insight
**The roofline model lets you predict performance without running a single experiment.**
The answer is determined by two ratios: your workload's arithmetic intensity and the
hardware's ridge point. Batch size is the primary knob that moves you along the roofline —
at small batches you are memory-bound, at large batches compute-bound. The ridge point is
the most efficient operating point. Every optimization decision starts with knowing which
side of the ridge you are on.
:::
::: {.callout-warning}
## Pitfall: Assuming Peak FLOP/s Determines Inference Speed
A common mistake is selecting hardware based on peak FLOP/s alone. At batch size 1,
`Engine.solve` reports ResNet-50 on the A100 as memory-bound — the compute ceiling is
barely exercised. For a workload well below the ridge, a GPU with half the FLOP/s but
the same bandwidth would deliver essentially *identical* inference latency. Always
check the regime before comparing specs.
:::
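To put numbers on the pitfall: ResNet-50 at batch 1 sits near the ridge in the weights-only estimate, so the cleanest demonstration uses a workload far below it. Reusing the `roofline` helper with a hypothetical 8B-parameter fp16 decode step (~8 GFLOP over ~16 GB of weights per token — illustrative figures, not Zoo constants):
```python
full = roofline(8e9, 16e9, 156e12, 2.0e12)   # A100-class ceilings
half = roofline(8e9, 16e9, 78e12, 2.0e12)    # half the FLOP/s, same bandwidth
print(full)   # (0.008, 'memory-bound')
print(half)   # (0.008, 'memory-bound') — identical latency: bandwidth binds
```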
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Predict before you compute.**
Before running any code: will ResNet-50 at `batch_size=64` be memory-bound or compute-bound
on the A100? *Write your answer as one of: "memory-bound" or "compute-bound", plus one
sentence of reasoning.* Then verify with `Engine.solve(...)`.
Were you right? What would you need to know to predict correctly?
**Exercise 2: Change the hardware.**
Run the same batch size sweep on the H100 (`mlsysim.Hardware.Cloud.H100`). The H100 has
3.2× more FLOP/s than the A100 but only 1.7× more bandwidth. How does the ridge point
shift? At what batch size does the crossover happen on the H100 vs. the A100?
**Exercise 3: Change the model.**
Replace ResNet-50 with Llama-3 8B (`mlsysim.Models.Llama3_8B`). At batch size 1, is it
memory-bound or compute-bound? Does the answer surprise you? Why do large language models
behave differently from CNNs at the same batch size?
**Self-check:** If a model's arithmetic intensity is 50 FLOP/byte and the hardware's ridge
point is 156 FLOP/byte, is the model memory-bound or compute-bound?
:::
---
## Key Takeaways
::: {.callout-tip}
## Summary
- **The roofline model** predicts performance by comparing arithmetic intensity to the hardware's ridge point
- **Memory-bound** means the GPU is waiting for data; **compute-bound** means it is saturating arithmetic units
- **Batch size** is the primary knob for shifting between regimes — larger batches increase arithmetic intensity
- **The ridge point** ($\text{Peak FLOP/s} \div \text{Peak BW}$) is the crossover — the most efficient operating point
- **`Engine.solve`** is the foundational API: model + hardware + config → bottleneck, latency, throughput
:::
---
## Next Steps
- **[The Memory Wall](01_memory_wall.qmd)** — Discover why upgrading from A100 to H100 doesn't give the speedup you expect
- **[Two Phases, One Request](02_two_phases.qmd)** — Learn why LLM serving has two different bottlenecks in the same request
- **[Silicon Zoo](../zoo/hardware.qmd)** — Browse all vetted hardware specifications
- **[Math Foundations](../math.qmd)** — The complete equations behind the roofline model