Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-08 02:28:25 -05:00)
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. The docs site now matches what the
engine actually computes, internal navigation has no unresolved targets, and the
Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero example on docs/index.qmd and getting-started.qmd now reflects the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
---
title: "Full-Stack Audit: LLaMA-70B Training"
subtitle: "One model, six domains, twelve walls --- a complete systems analysis in 60 seconds."
description: "Compose 6+ solvers across all six taxonomy domains to produce a holistic training analysis. Discover that the binding constraint is compute, but checkpoint overhead is the hidden cost."
categories: ["capstone", "advanced"]
---

## The Question

What does a **complete** systems analysis look like? No single solver captures the full
picture. Training a 70B-parameter model on 512 H100 GPUs involves compute walls, memory
walls, communication overhead, checkpoint I/O, energy costs, and carbon emissions ---
simultaneously. This tutorial traces all six taxonomy domains and exercises 12 of the 22
systems walls through a single workload.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd),
[Tutorial 1: The Memory Wall](01_memory_wall.qmd),
[Tutorial 6: Scaling to 1000 GPUs](06_scaling_1000_gpus.qmd), and
[Tutorial 9: Sensitivity Analysis](09_sensitivity.qmd). You should understand
roofline analysis, distributed training, and binding constraint identification.
:::

::: {.callout-note}
## What You Will Learn

- **Compose** six solver families across all taxonomy domains into a holistic analysis
- **Identify** which of the 22 systems walls bind for a real training workload
- **Quantify** the hidden costs: checkpoint overhead, carbon, water, and TCO
- **Produce** a summary table mapping domain -> solver -> binding wall
:::

::: {.callout-tip}
## Solver Quick Reference

This capstone uses solvers from all six domains. If you arrived via an accelerated
learning path, here is what each solver does:

| Solver | Domain | What It Computes |
|:-------|:-------|:-----------------|
| `SingleNodeModel` | Node | Roofline bottleneck, latency, throughput |
| `DataModel` | Data | Whether the data pipeline can sustain GPU demand |
| `ScalingModel` | Algorithm | Compute-optimal training budget (Chinchilla) |
| `DistributedModel` | Fleet | Communication overhead and scaling efficiency |
| `ReliabilityModel` | Fleet | Cluster MTBF and optimal checkpoint intervals |
| `EconomicsModel` | Ops | CapEx, OpEx, and total cost of ownership (TCO) |
| `SustainabilityModel` | Ops | Energy, carbon footprint, and water usage |
| `SensitivitySolver` | Analysis | Partial derivatives identifying the binding constraint |
| `SynthesisSolver` | Analysis | Minimum hardware specs from a latency target |
:::

::: {.callout-tip}
## Background: The Six Taxonomy Domains

The MLSys wall taxonomy organizes 22 systems walls into six domains:

| Domain | Walls | What It Covers |
|:-------|:------|:---------------|
| Node | 1--3 | Compute, memory capacity, memory bandwidth |
| Data | 8--10 | Storage throughput, data pipeline stalls |
| Algorithm | 11--13 | Scaling laws, compute-optimal training |
| Fleet | 14--16 | Communication, synchronization, reliability |
| Ops | 17--20 | TCO, energy, carbon, water, safety |
| Analysis | 21--22 | Sensitivity, inverse synthesis |

No single solver spans all six. The insight emerges from **composition**.
:::

---

## 1. Setup: Build the Fleet

We construct a 512-GPU training cluster: 64 DGX H100 nodes, 8 GPUs per node,
NVLink intra-node, InfiniBand NDR inter-node, powered by Quebec's hydroelectric grid.

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
```

```{python}
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.infra.registry import Grids
from mlsysim.core.constants import Q_, NVLINK_H100_BW, INFINIBAND_NDR_BW

model = mlsysim.Models.Language.Llama3_70B
h100 = mlsysim.Hardware.Cloud.H100

# Build the DGX H100 node: 8 GPUs connected by NVLink 4.0
node = Node(
    name="DGX H100",
    accelerator=h100,
    accelerators_per_node=8,
    intra_node_bw=NVLINK_H100_BW
)

# Build the cluster fabric: InfiniBand NDR (400 Gbps)
fabric = NetworkFabric(
    name="InfiniBand NDR",
    topology="fat-tree",
    bandwidth=INFINIBAND_NDR_BW
)

# Build the fleet: 64 nodes = 512 GPUs, Quebec grid
fleet = Fleet(
    name="Training Cluster",
    node=node,
    count=64,
    fabric=fabric,
    region=Grids.Quebec
)

from mlsysim.show import table, info, banner

info("Fleet Configuration",
     Model=f"{model.name} ({model.parameters.to('Bparam'):.1f~})",
     Fleet=f"{fleet.count} nodes x {node.accelerators_per_node} GPUs = {fleet.total_accelerators} GPUs",
     Intra_node=f"NVLink 4.0 ({NVLINK_H100_BW.to('GB/s'):.0f~})",
     Inter_node=f"IB NDR ({INFINIBAND_NDR_BW.to('Gbps'):.0f~})",
     Region=Grids.Quebec.name)
```

---

## 2. Node (Walls 1--3): Single-GPU Roofline

First, classify the per-GPU forward-backward pass. Is each GPU compute-bound or
memory-bound during training?

```{python}
from mlsysim import SingleNodeModel

node_solver = SingleNodeModel()
node_result = node_solver.solve(
    model=model, hardware=h100,
    batch_size=4, precision="fp16"
)

banner("Domain: Node (Walls 1-3)")
info(Bottleneck=node_result.bottleneck,
     Per_GPU_latency=node_result.latency.to('ms'),
     Throughput=f"{node_result.throughput:.0f} samples/s")
```

Training at batch size 4 per GPU puts us in the compute-bound regime --- unlike inference,
training has high arithmetic intensity due to the backward pass. Wall 1 (Compute) is the
binding constraint at the node level.

Compute-bound is good news --- it means the GPU is doing useful work, not waiting for data.
But can the data pipeline actually keep up with 512 GPUs demanding training samples?

---

## 3. Data (Walls 8--10): Can the Pipeline Keep Up?

The roofline tells us each GPU can consume data at a certain rate. But can the storage and
preprocessing pipeline actually deliver data that fast? If not, the GPUs stall --- and
"compute-bound" becomes a meaningless label.

```{python}
from mlsysim import DataModel

# Estimate data demand per step: 4 samples/GPU * 512 GPUs * 2048 tokens * 2 bytes ≈ 8 MB/step
# At ~1 step/sec, this is ~8 MB/s — tokenized text is compact
data_demand = Q_("8 MB/s")

data_solver = DataModel()
data_result = data_solver.solve(
    workload_data_rate=data_demand,
    hardware=h100
)

banner("Domain: Data (Walls 8-10)")
info(Data_demand=data_result.demand_bw,
     Data_supply=data_result.supply_bw,
     Utilization=f"{data_result.utilization:.1%}",
     Stalled=data_result.is_stalled,
     Bottleneck=data_result.bottleneck)
```

For text-based training, the data pipeline is rarely the bottleneck --- tokenized text
is compact. But for image or video training, this wall can dominate.

The data pipeline can keep up. The GPUs are compute-bound and well-fed. But are we
spending our compute budget wisely? A 30-day run on 512 GPUs is an enormous investment ---
the scaling laws tell us whether we are allocating it optimally.

---

## 4. Algorithm (Walls 11--13): Compute-Optimal Budget

Is our training budget compute-optimal? The Chinchilla scaling law says
D = 20P (tokens = 20x parameters) for optimal allocation.

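Before invoking the solver, a quick back-of-envelope makes the target concrete. This is
only a sketch using the Chinchilla rule of thumb and the common ~6·P·D FLOP estimate; the
`ScalingModel` below does the real accounting against our actual hardware budget:

```python
# Back-of-envelope Chinchilla check (sketch only, not executed by the tutorial).
# D = 20 * P  =>  a 70B-parameter model wants roughly 1.4T training tokens.
params = 70e9
optimal_tokens = 20 * params                 # ≈ 1.4e12 tokens
flops_needed = 6 * params * optimal_tokens   # ~6*P*D rule of thumb ≈ 5.9e23 FLOPs
print(f"{optimal_tokens:.2e} tokens, {flops_needed:.2e} FLOPs")
```
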
```{python}
from mlsysim import ScalingModel

# MFU (Model FLOP Utilization): the fraction of peak hardware FLOP/s that goes
# to useful model computation (excluding communication, idle time, overhead).
# MFU = 0.4 means 40% of theoretical peak -- typical for large-scale LLM training.
# Published values: 0.30-0.45 (Llama-2/3), up to 0.50 (highly optimized runs).
# Compute budget: 512 GPUs * 989 TFLOP/s * 30 days * 86400 s/day * 0.4 MFU
gpu_flops = h100.compute.peak_flops.to("flop/s").magnitude
total_flops = 512 * gpu_flops * 30 * 86400 * 0.4
compute_budget = Q_(total_flops, "flop")

scaling_solver = ScalingModel()
scaling_result = scaling_solver.solve(
    compute_budget=compute_budget,
    target_model_size=model.parameters
)

banner("Domain: Algorithm (Walls 11-13)")
info(Compute_budget=compute_budget.to('EFLOP'),
     Optimal_tokens=f"{scaling_result.optimal_tokens.magnitude:.2e}",
     Tokens_per_parameter=f"{scaling_result.tokens_per_parameter:.1f}",
     Chinchilla_ratio=f"{'OVER' if scaling_result.tokens_per_parameter > 20 else 'UNDER'}-trained")
```

If the tokens-per-parameter ratio is significantly above or below 20, the training
budget is not optimally allocated. Over-training wastes compute; under-training wastes
model capacity.

So far, everything looks manageable: compute-bound GPUs, adequate data pipeline,
reasonable training budget. If we throw 512 GPUs at this, we should scale linearly,
right? The fleet-level analysis reveals what single-node reasoning misses.

---

## 5. Fleet (Walls 14--16): Communication and Reliability

The distributed solver models AllReduce overhead and pipeline bubbles.
The reliability solver computes cluster MTBF and optimal checkpoint intervals.

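To see where the data-parallel AllReduce cost comes from, here is a back-of-envelope
sketch (this is not how `DistributedModel` computes its answer). It assumes fp16
gradients, TP=8 sharding so each data-parallel rank holds roughly 1/8 of the parameters,
a ring AllReduce over the DP=64 group, and an effective inter-node link speed of about
50 GB/s (IB NDR, 400 Gbps):

```python
# Back-of-envelope AllReduce cost (illustrative sketch under the assumptions above).
params_per_dp_rank = 70e9 / 8             # TP=8 -> ~8.75B parameters per shard
grad_bytes = params_per_dp_rank * 2       # fp16 gradients ≈ 17.5 GB
dp = 64
traffic = 2 * (dp - 1) / dp * grad_bytes  # ring AllReduce moves ~2*(N-1)/N per rank ≈ 34.5 GB
link_bw = 50e9                            # bytes/s, IB NDR 400 Gbps
print(f"~{traffic / link_bw:.1f} s of exposed communication per step if nothing overlaps")
```

Even this crude estimate shows why `overlap_comm=True` matters: an exposed gradient
exchange of this size would be a visible tax on every step, and the solver below reports
how much communication actually remains exposed.
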
```{python}
from mlsysim import DistributedModel, ReliabilityModel

# 3D parallelism: TP=8 (within node), PP=1, DP=64
dist_solver = DistributedModel()
dist_result = dist_solver.solve(
    model=model, fleet=fleet,
    batch_size=2048, precision="fp16",
    tp_size=8, pp_size=1,
    overlap_comm=True, seq_len=2048
)

banner("Domain: Fleet (Walls 14-16)")
info(Scaling_efficiency=f"{dist_result.scaling_efficiency:.2%}",
     Step_latency=dist_result.step_latency_total.to('ms'),
     DP_comm_latency=dist_result.dp_communication_latency.to('ms'),
     TP_comm_latency=dist_result.tp_communication_latency.to('ms'),
     Bubble_fraction=f"{dist_result.bubble_fraction:.2%}")
```

```{python}
# Reliability: 30-day training job
rel_solver = ReliabilityModel()
rel_result = rel_solver.solve(
    fleet=fleet,
    job_duration_hours=30*24,
    checkpoint_time_s=120
)

info(Fleet_MTBF=rel_result.fleet_mtbf.to('hour'),
     Failure_probability=f"{rel_result.failure_probability:.2%}",
     Expected_failures=f"{rel_result.expected_failures:.1f}",
     Optimal_ckpt_interval=rel_result.optimal_checkpoint_interval.to('minute'))
```

At 512 GPUs, the cluster MTBF shrinks significantly. Checkpoint overhead becomes a
non-trivial fraction of wall-clock time --- this is the "hidden cost" that single-node
analysis misses entirely.

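That overhead claim can be sanity-checked directly from the numbers above. A minimal
sketch, assuming the reported optimal interval converts to seconds with the same
pint-style `.to(...)` used elsewhere, and reusing the 120 s checkpoint write time passed
to the solver:

```python
# Rough checkpoint-overhead fraction (sketch; run after the reliability cell above).
# With a checkpoint written every optimal interval tau and a write time delta, roughly
# delta / (tau + delta) of wall-clock time goes to checkpointing (ignoring rework after failures).
delta = 120.0                                                  # checkpoint write time, s
tau = rel_result.optimal_checkpoint_interval.to('s').magnitude  # Young-Daly interval, s
overhead = delta / (tau + delta)
print(f"~{overhead:.1%} of wall-clock time spent writing checkpoints")
```
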
The reliability analysis tells us *how often* the cluster fails. But failures cost money ---
and so does the energy to keep 512 GPUs running for 30 days. The operational domain
quantifies these costs.

---

## 6. Ops (Walls 17--20): TCO, Energy, Carbon, Water

The economics solver rolls CapEx and OpEx into a total cost of ownership; the
sustainability solver adds energy, carbon, and water.

```{python}
from mlsysim import EconomicsModel, SustainabilityModel

# 30-day training run
econ_solver = EconomicsModel()
econ_result = econ_solver.solve(
    fleet=fleet,
    duration_days=30,
    grid=Grids.Quebec,
    mfu=0.4
)

banner("Domain: Ops (Walls 17-20)")
info(CapEx=f"${econ_result.capex_usd:,.0f}",
     OpEx_energy=f"${econ_result.opex_energy_usd:,.0f}",
     OpEx_maintenance=f"${econ_result.opex_maintenance_usd:,.0f}",
     Total_TCO=f"${econ_result.tco_usd:,.0f}")
```

```{python}
sust_solver = SustainabilityModel()
sust_result = sust_solver.solve(
    fleet=fleet,
    duration_days=30,
    datacenter=Grids.Quebec,
    mfu=0.4
)

info(IT_Energy=sust_result.it_energy_kwh.to('MWh'),
     Total_Energy_PUE=sust_result.total_energy_kwh.to('MWh'),
     Carbon_footprint=f"{sust_result.carbon_footprint_kg:.0f} kg CO2",
     Water_usage=f"{sust_result.water_usage_liters:.0f} liters",
     PUE=sust_result.pue,
     Region=sust_result.region_name)
```

Quebec's hydroelectric grid makes this one of the lowest-carbon training locations in the
world. The same run in Poland (coal-heavy grid) would produce dramatically more CO2 ---
infrastructure geography is a first-class engineering variable.

---

## 7. Analysis (Walls 21--22): Sensitivity and Synthesis

Finally, confirm the binding constraint and derive minimum hardware for a 14-day completion target.

```{python}
from mlsysim import SensitivitySolver, SynthesisSolver

# Sensitivity: confirm compute is the binding constraint for training
sens_solver = SensitivitySolver()
sens_result = sens_solver.solve(
    model=model, hardware=h100, precision="fp16"
)

banner("Domain: Analysis (Walls 21-22)")
info(Binding_constraint=sens_result.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in sens_result.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)
```

```{python}
# Synthesis: what per-GPU step latency is needed to finish in 14 days?
# Total training FLOPs / (N_GPUs * MFU * peak_FLOPS) = wall_clock_seconds
target_days = 14
target_seconds = target_days * 86400
# Per-GPU step target: total_steps * step_latency = target_seconds
# Approximate: we need each step to complete within a target latency
synth_solver = SynthesisSolver()
synth_result = synth_solver.solve(
    model=model,
    target_latency=Q_("200 ms"),  # per-GPU training step target
    precision="fp16"
)

info("Synthesis (200ms per-GPU training step target)",
     Required_BW=synth_result.required_bw.to('TB/s'),
     Required_FLOPS=synth_result.required_flops.to('TFLOPs/s'),
     Required_memory=synth_result.required_memory.to('GB'))
```

---

## 8. Summary Table: The Complete Picture

We have now traced a single workload through all six domains. Each solver answered one
question in isolation. But the systems engineer's job is synthesis: seeing the complete
picture at once. The table below is that picture --- and its most important property is
that no single row captures the full story.

```{python}
mtbf_hours = rel_result.fleet_mtbf.to('hour').magnitude
summary_rows = [
    ["Node", "SingleNodeModel", f"Bottleneck: {node_result.bottleneck}", "Wall 1: Compute"],
    ["Data", "DataModel", f"Util: {data_result.utilization:.0%}", "Not binding"],
    ["Algorithm", "ScalingModel", f"Tok/param: {scaling_result.tokens_per_parameter:.0f}", "Wall 11"],
    ["Fleet", "DistributedModel", f"Efficiency: {dist_result.scaling_efficiency:.0%}", "Wall 14: Comm"],
    ["Fleet", "ReliabilityModel", f"MTBF: {mtbf_hours:.0f}h", "Wall 19: Ckpt"],
    ["Ops", "EconomicsModel", f"TCO: ${econ_result.tco_usd:,.0f}", "Wall 17: Cost"],
    ["Ops", "SustainabilityModel", f"CO2: {sust_result.carbon_footprint_kg:.0f} kg", "Wall 18: Energy"],
    ["Analysis", "SensitivitySolver", f"Binding: {sens_result.binding_constraint}", "Wall 21"],
]

table(["Domain", "Solver", "Key Metric", "Binding Wall"], summary_rows, "<<>>")
```

::: {.callout-important}
## Key Insight

**No single solver captures the full picture --- the systems view emerges from composition.**

This end-to-end trace exercises 12 of 22 walls through a single model. The per-GPU binding
constraint is compute (Wall 1), but the **hidden costs** only appear at fleet scale:
checkpoint overhead (Wall 19) consumes wall-clock time proportional to the MTBF-driven
checkpoint frequency, and infrastructure geography (Quebec vs. Poland) can change the
carbon footprint by 40x (as [Tutorial 7](07_geography.qmd) demonstrated). A complete
systems analysis is not one solver run --- it is the composition of all six domains.
:::

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
What if you train in Poland instead of Quebec? Before running code, predict how the
TCO and carbon footprint will change. (Hint: Poland's grid is coal-heavy with ~800 g
CO2/kWh vs. Quebec's ~20 g CO2/kWh, and Poland has a higher PUE.) Then re-run the
economics and sustainability solvers with `Grids.Poland` and compare. How close was
your prediction?

**Exercise 2: Double the cluster.**
Scale the fleet to 1024 GPUs (128 nodes). Re-run the distributed solver and reliability
solver. Does scaling efficiency hold? How does the MTBF change? At what cluster size does
the checkpoint overhead exceed 5% of wall-clock time?

**Exercise 3: Minimum viable cluster.**
What is the minimum cluster size to complete Llama-3 70B training in 14 days? Use the
scaling result to determine the required total FLOPS, then work backward to find the
number of H100 GPUs needed at 40% MFU. Verify with the distributed solver that the
communication overhead is acceptable at that scale.

**Exercise 4: Propose a design change.**
Using the full-stack analysis, identify the single highest-leverage change --- hardware
upgrade, parallelism strategy, region change, or precision change --- that would reduce
TCO by at least 20%. Re-run the relevant solvers with your proposed change and compute
the new TCO. *Write one paragraph justifying why this change has the largest impact,
referencing at least two domains from the summary table.*

**Self-check:** If the fleet MTBF is 4 hours and each checkpoint takes 2 minutes, what
fraction of wall-clock time is spent checkpointing? (Use the Young-Daly formula:
optimal interval = sqrt(2 * delta * MTBF).) A short sketch for checking your arithmetic
follows this callout.
:::

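For the self-check, here is a minimal standalone sketch of the Young-Daly arithmetic
(plain Python, no solver objects; the inputs are the ones stated in the exercise):

```python
import math

# Young-Daly self-check sketch: MTBF = 4 h, checkpoint write time delta = 2 min.
mtbf = 4 * 3600      # seconds
delta = 2 * 60       # seconds
tau_opt = math.sqrt(2 * delta * mtbf)  # optimal checkpoint interval, s
overhead = delta / (tau_opt + delta)   # fraction of wall-clock spent checkpointing
print(f"optimal interval ≈ {tau_opt / 60:.1f} min, overhead ≈ {overhead:.1%}")
```
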
---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Composition is the method**: no single solver spans all six taxonomy domains; the
  systems view emerges only from composing 6+ solvers
- **Compute binds at the node level**, but checkpoint overhead and communication are the
  hidden costs at fleet scale
- **Infrastructure geography matters**: Quebec vs. Poland can change carbon footprint by
  40x and TCO by 20--30%
- **The summary table** is the deliverable: one row per domain, solver, key metric, and
  binding wall
- **12 of 22 walls** are exercised through a single model-fleet pair --- this is what a
  complete analysis looks like
:::

---

## Next Steps

- **[Sensitivity Analysis](09_sensitivity.qmd)** --- Dive deeper into the Analysis domain solvers
- **[GPU vs. Wafer-Scale](10_gpu_vs_wafer.qmd)** --- See how architecture shifts the binding wall
- **[Geography of AI](07_geography.qmd)** --- Explore how datacenter location changes sustainability
- **[The \$9 Million GPU](08_nine_million_dollar.qmd)** --- Deep dive into TCO modeling