mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-08 02:28:25 -05:00
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Update every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
---
title: "GPU vs. Wafer-Scale"
subtitle: "Cerebras eliminates the memory wall --- then hits a completely different one."
description: "Compare conventional GPU inference to Cerebras weight-streaming silicon. The binding constraint shifts from HBM bandwidth to injection bandwidth --- a qualitative regime change, not just a quantitative improvement."
categories: ["analysis", "advanced"]
---
## The Question

Can a fundamentally different architecture change *which* wall binds? GPUs are
**weight-stationary**: weights live in HBM, and the bottleneck is HBM bandwidth.
The Cerebras WSE-3 takes the opposite approach: it is **activation-stationary**,
holding activations on 44 GB of on-wafer SRAM and streaming weights from external
MemoryX nodes. Does this eliminate the memory wall --- or just move it somewhere else?
::: {.callout-note}
## Prerequisites

Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd),
[Tutorial 1: The Memory Wall](01_memory_wall.qmd), and
[Tutorial 9: Sensitivity Analysis](09_sensitivity.qmd). You should understand
roofline analysis, binding constraints, and sensitivity-based investment decisions.
:::
::: {.callout-note}
## What You Will Learn

- **Compare** GPU and Cerebras architectures on the same workload using different solvers
- **Identify** that the binding constraint **shifts** from HBM bandwidth to injection bandwidth
- **Compute** the optimal batch size B* where injection and compute overlap perfectly
- **Explain** why this is a qualitative regime change, not just a quantitative speedup
:::
::: {.callout-tip}
## Background: Two Philosophies of Memory

Conventional GPUs use a two-level memory hierarchy: fast but small on-chip SRAM
(registers, L1/L2 cache) and large but slower off-chip HBM. The fundamental insight
of wafer-scale computing is: what if you made the chip large enough that SRAM alone
could hold the working set? The Cerebras WSE-3 is an entire silicon wafer — 46,225 mm²
vs. ~800 mm² for an H100 die — with 44 GB of on-wafer SRAM distributed across 900,000
cores.

**GPU (weight-stationary):** Model weights live in HBM. At each decode step, the entire
model streams from HBM to the compute units. Activations are small and transient.
Bottleneck: HBM bandwidth.

**Cerebras WSE-3 (activation-stationary):** Activations and KV-cache live on the 44 GB
of on-wafer SRAM. But 44 GB cannot hold a 350 GB model, so weights must stream in
layer-by-layer from external **MemoryX nodes** — dedicated memory boxes connected to the
wafer via a high-bandwidth interconnect. Bottleneck: injection bandwidth from MemoryX.

Same model, same math, completely different performance physics.
:::
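The scale claims in the callout are easy to sanity-check with plain arithmetic. A quick sketch using only the figures quoted above (wafer and die areas, SRAM capacity, 350 GB of fp16 weights):

```python
# Sanity check of the scale claims, using the figures quoted in the callout.
wafer_mm2 = 46_225        # WSE-3 wafer area
h100_mm2 = 800            # approximate H100 die area
sram_gb = 44              # on-wafer SRAM
gpt3_weights_gb = 350     # GPT-3 175B at fp16 (2 bytes/param)

area_ratio = wafer_mm2 / h100_mm2          # ~58x more silicon
fit_fraction = sram_gb / gpt3_weights_gb   # only ~13% of the weights fit

print(f"Wafer vs. die area: {area_ratio:.0f}x")
print(f"Fraction of GPT-3 weights that fit on-wafer: {fit_fraction:.0%}")
```

The second number is the whole motivation for weight streaming: the working set of activations fits on-wafer, but the weights never will.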
---

## 1. Setup
```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
```
```python
import mlsysim
from mlsysim import SingleNodeModel, WeightStreamingModel, SensitivitySolver
```

---
## 2. GPU Baseline: H100 Inference

We use **GPT-3 (175B)** --- a model large enough that architectural differences in how
weights reach compute become the dominant factor. At batch size 1, each decode step must
reload the entire model from HBM.
```{python}
from mlsysim import SingleNodeModel, WeightStreamingModel, SensitivitySolver
from mlsysim.show import table, info, banner

model = mlsysim.Models.Language.GPT3
gpu_hw = mlsysim.Hardware.Cloud.H100

gpu_solver = SingleNodeModel()
gpu_result = gpu_solver.solve(
    model=model, hardware=gpu_hw,
    batch_size=1, precision="fp16"
)

info("GPU Baseline",
     Model=f"{model.name} ({model.parameters.to('Gparam'):.0f})",
     Hardware=gpu_hw.name,
     Bottleneck=gpu_result.bottleneck,
     Latency=gpu_result.latency.to('ms'),
     HBM_BW=gpu_hw.memory.bandwidth.to('TB/s'),
     Peak_FLOPS=gpu_hw.compute.peak_flops.to('TFLOPs/s'))
```
At batch size 1, GPT-3 requires 2 FLOPs per parameter per token but must load all 175B
parameters (350 GB at fp16) from HBM. The arithmetic intensity is approximately
1 FLOP/byte --- far below the H100's ridge point. The 3.35 TB/s HBM bandwidth, not the
989 TFLOP/s compute, determines the decode latency.
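This claim can be checked with back-of-envelope arithmetic, independent of the solver. A sketch using only the numbers quoted in the paragraph:

```python
# Back-of-envelope roofline check for batch-1 GPT-3 decode on an H100,
# using the figures quoted in the text above.
params = 175e9
bytes_per_step = params * 2      # fp16: 2 bytes/param -> 350 GB loaded per token
flops_per_token = params * 2     # ~2 FLOPs per parameter per token
hbm_bw = 3.35e12                 # bytes/s
peak_flops = 989e12              # FLOP/s

intensity = flops_per_token / bytes_per_step    # ~1 FLOP/byte
ridge = peak_flops / hbm_bw                     # ridge point, ~295 FLOP/byte
mem_bound_ms = bytes_per_step / hbm_bw * 1e3    # bandwidth-limited latency floor

print(f"intensity {intensity:.0f} FLOP/B vs. ridge {ridge:.0f} FLOP/B")
print(f"memory-bound decode latency >= {mem_bound_ms:.0f} ms/token")
```

At 1 FLOP/byte against a ~295 FLOP/byte ridge point, the workload sits deep in the memory-bound region, so the ~104 ms bandwidth floor is the latency.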
---
## 3. Cerebras Path: Weight Streaming on WSE-3

Now analyze the same model on the Cerebras CS-3. Instead of loading weights from HBM,
the WSE-3 streams them from MemoryX nodes over a dedicated interconnect.
```{python}
ws_hw = mlsysim.Hardware.Cloud.Cerebras_CS3
ws_solver = WeightStreamingModel()

ws_result = ws_solver.solve(
    model=model, hardware=ws_hw,
    seq_len=2048, batch_size=1, precision="fp16"
)

info("Cerebras WSE-3",
     Hardware=ws_hw.name,
     Feasible=ws_result.feasible,
     Bottleneck=ws_result.bottleneck,
     Throughput=f"{ws_result.throughput_tokens_per_sec:.0f} tokens/sec",
     Layer_compute_time=ws_result.layer_compute_time.to('ms'),
     Layer_injection_time=ws_result.layer_injection_time.to('ms'),
     Optimal_batch_size=ws_result.optimal_batch_size,
     SRAM_utilization=f"{ws_result.wafer_memory_utilization:.1%}")
```
The WSE-3 reports two times per layer: how long the wafer takes to **compute** the layer's
output, and how long it takes to **inject** the layer's weights from MemoryX. The bottleneck
is whichever is slower.
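That per-layer accounting fits in a few lines. A minimal sketch of the timing model (the bandwidth, per-layer weight size, and compute time below are illustrative assumptions, not solver output):

```python
# Minimal weight-streaming timing model: each layer costs the slower of
# computing its output and injecting its weights. Numbers are illustrative.
def layer_time_ms(weight_bytes, inject_bw_bps, compute_ms):
    """Per-layer time is max(injection time, compute time), in ms."""
    inject_ms = weight_bytes / inject_bw_bps * 1e3
    return max(inject_ms, compute_ms), inject_ms

# Illustrative: 1 GB of weights per layer streamed at 1.2 TB/s, with
# 0.5 ms of compute per layer -> this configuration is injection-bound.
t, inject_ms = layer_time_ms(1e9, 1.2e12, compute_ms=0.5)
print(f"inject {inject_ms:.2f} ms/layer -> layer time {t:.2f} ms")
```

Total decode time is then this per-layer max summed over all layers, which is exactly the comparison made in the next section.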
---

## 4. Side-by-Side: Where the Wall Shifts
```{python}
gpu_lat_ms = gpu_result.latency.to('ms').magnitude

# Cerebras total decode: max(inject, compute) per layer * num_layers
ws_layer_time = max(
    ws_result.layer_injection_time.to('ms').magnitude,
    ws_result.layer_compute_time.to('ms').magnitude
)
ws_total_ms = ws_layer_time * model.layers
speedup = gpu_lat_ms / ws_total_ms if ws_total_ms > 0 else 0

table(
    ["Metric", "H100 (GPU)", "CS-3 (WSE)"],
    [
        ["Bottleneck", gpu_result.bottleneck, ws_result.bottleneck],
        ["Total decode time (ms)", f"{gpu_lat_ms:.2f}", f"{ws_total_ms:.2f}"],
        ["Speedup", "1.0x", f"{speedup:.1f}x"],
        ["Optimal batch B*", "N/A", ws_result.optimal_batch_size],
    ]
)
```
The GPU and WSE-3 hit **fundamentally different walls**:

- **GPU**: Limited by HBM bandwidth (~3.35 TB/s)
- **WSE-3**: Limited by MemoryX injection bandwidth (~1.2 TB/s)

This means the optimization strategies are completely different. For the GPU, you optimize
by reducing bytes loaded (quantization, smaller models). For the WSE-3, you optimize by
overlapping injection with compute (increasing batch size toward B*).
::: {.callout-important}
## Key Insight

**The binding constraint is not a property of the model --- it is a property of the
model-architecture pair.** GPUs are bound by HBM bandwidth. Cerebras WSE-3 eliminates
the HBM wall entirely (weights never touch HBM) but introduces an injection bandwidth
wall from MemoryX. This is a **qualitative regime change**: the wall *shifted*, it did
not disappear. When evaluating any novel architecture, the question is not "is it faster?"
but "which wall does it move, and what new wall does it create?"
:::
---

## 5. The SRAM Ceiling: Finding B*
The WSE-3 has a unique optimization knob: batch size controls whether compute or injection
dominates. At the optimal batch size B*, the two pipelines overlap perfectly. But
activations must fit in 44 GB of on-wafer SRAM --- this is the SRAM ceiling.
```{python}
rows = []
for batch in [1, 2, 4, 8, 16, 32, 64, 128]:
    r = ws_solver.solve(
        model=model, hardware=ws_hw,
        seq_len=2048, batch_size=batch, precision="fp16"
    )
    rows.append([
        batch, r.bottleneck,
        f"{r.throughput_tokens_per_sec:.0f}/s",
        f"{r.wafer_memory_utilization:.1%}",
        "YES" if r.feasible else "OOM"
    ])

table(["Batch", "Bottleneck", "Throughput", "SRAM Util", "Feasible"], rows)
```
Watch for where the bottleneck transitions from injection-bound to compute-bound. At that
transition (B*), neither pipeline is idle and throughput is maximized. Beyond B*,
SRAM fills up and the configuration eventually becomes infeasible (OOM).
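The crossover can also be reasoned about analytically. In the simplest model, per-layer injection time is batch-independent while per-layer compute time grows roughly linearly with batch size, so B* sits where the two curves cross. A toy sketch under those assumptions (all numbers illustrative, not solver output):

```python
# Toy B* model: fixed per-layer injection time vs. compute time that grows
# ~linearly with batch size. B* is the crossover. Illustrative numbers only.
inject_ms = 0.8          # assumed per-layer injection time
compute_ms_b1 = 0.05     # assumed per-layer compute time at batch 1

def bottleneck(batch):
    """Which pipeline dominates the per-layer time at this batch size."""
    return "compute" if batch * compute_ms_b1 > inject_ms else "injection"

b_star = inject_ms / compute_ms_b1   # crossover batch size
for b in [1, 4, 16, 64]:
    print(f"batch {b:>3}: {bottleneck(b)}-bound")
print(f"B* = {b_star:.0f}")
```

In this toy setting B* = 16; below it injection dominates, above it compute does. The real solver adds the SRAM-capacity constraint, which is why large batches eventually go OOM.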
---

## 6. Sensitivity Confirmation: Different Walls, Different Levers
Use the `SensitivitySolver` on the GPU to confirm that the binding constraint is
bandwidth, then contrast with the Cerebras architecture conceptually.
```{python}
sens_solver = SensitivitySolver()
gpu_sens = sens_solver.solve(
    model=model, hardware=gpu_hw, precision="fp16"
)

banner(f"GPU Sensitivity ({gpu_hw.name})")
info(Baseline_latency=gpu_sens.baseline_latency.to('ms'),
     Binding_constraint=gpu_sens.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in gpu_sens.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)

banner("Cerebras WSE-3")
info(Binding_constraint="injection bandwidth (MemoryX -> wafer)",
     Optimization_lever="increase batch size to overlap inject/compute")

print()
print("Different architectures -> different walls -> different strategies.")
```
::: {.callout-warning}
## The deeper lesson

When evaluating novel architectures (wafer-scale, photonic, analog, neuromorphic), do not
ask "Is it faster?" Ask: **"Which wall does it move, and what new wall does it create?"**
Every architecture eliminates one bottleneck by introducing another.
:::

---
## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Does the Cerebras advantage grow or shrink for smaller models? Before running code,
predict whether the WSE-3 speedup over H100 will be larger or smaller for
`mlsysim.Models.Llama3_8B` (8B parameters) compared to GPT-3 (175B). Then verify
with both solvers. Explain your finding in terms of injection bandwidth utilization.

**Exercise 2: The SRAM ceiling.**
At what model size does the 44 GB SRAM ceiling become the binding constraint on Cerebras?
Try `mlsysim.Models.Llama3_70B` at increasing sequence lengths (512, 1024, 2048, 4096,
8192). At what point does SRAM utilization exceed 100% (OOM)? What does this mean for
serving long-context models on wafer-scale silicon?

**Exercise 3: TCO comparison.**
If an H100 costs ~$30,000 and a Cerebras CS-3 costs ~$2,000,000, how many H100s would
you need to match the Cerebras throughput for GPT-3 inference? Use the throughput numbers
from this tutorial to compute the fleet size, then compare the total hardware cost.
Which is more cost-effective at 100 queries per second?

**Self-check:** If the WSE-3 injection bandwidth is 1.2 TB/s and GPT-3 weights are
350 GB (fp16), what is the minimum per-layer injection time for a 96-layer model?
:::
---

## Key Takeaways
::: {.callout-tip}
## Summary

- **Weight streaming** inverts the GPU memory hierarchy: activations stay on-wafer (SRAM),
  weights stream in from external memory nodes
- **The binding constraint shifts** from HBM bandwidth (GPU) to injection bandwidth
  (WSE-3) --- a qualitative change in system physics
- **Optimal batch size B*** exists for weight-streaming architectures, perfectly overlapping
  injection with compute
- **Architecture evaluation** requires asking "which wall moves?" not "which is faster?"
:::
---

## Next Steps

- **[Sensitivity Analysis](09_sensitivity.qmd)** --- Dive deeper into partial derivatives and inverse synthesis
- **[Full-Stack Audit](12_full_stack_audit.qmd)** --- Compose all solvers into a complete systems analysis
- **[The Memory Wall](01_memory_wall.qmd)** --- Revisit the foundational GPU memory wall tutorial
- **[Silicon Zoo](../zoo/hardware.qmd)** --- Compare the Cerebras CS-3, GPU fleet, and other accelerators