---
title: "Geography is a Systems Variable"
subtitle: "Same cluster, same model, same duration — but does location change the cost?"
description: "Compare identical training runs across four grid regions to discover whether geography matters more than hardware choice or training duration for carbon footprint."
categories: ["ops", "intermediate"]
---

## The Question

You have a 256-GPU cluster training a model for 30 days. Does it matter *where* that
cluster is located? Not for latency or throughput — those are fixed by the hardware. But
for carbon emissions, water usage, and total cost of ownership, does geography matter —
and if so, by how much?

::: {.callout-note}
## Prerequisites

Complete [Tutorial 1: The Memory Wall](01_memory_wall.qmd). No other prerequisites
are required — this tutorial can be completed independently.
:::

::: {.callout-note}
## What You Will Learn

- **Calculate** the carbon footprint of identical training runs in different regions
- **Quantify** the gap between the cleanest and dirtiest electricity grids
- **Compare** geography vs. training duration as levers for sustainability
- **Apply** the `EconomicsModel` to show how carbon pricing changes the cheapest option
:::

::: {.callout-tip}
## Background: Grid Carbon Intensity

Every kilowatt-hour of electricity has a carbon cost, measured in grams of CO2 per kWh
(gCO2/kWh). This number depends entirely on how the electricity is generated:

| Region | Primary Source | Carbon Intensity |
|:-------|:---------------|:-----------------|
| Quebec | Hydroelectric | ~20 gCO2/kWh |
| Norway | Hydroelectric | ~29 gCO2/kWh |
| US Average | Mixed (gas, coal, renewables) | ~390 gCO2/kWh |
| Poland | Coal-dominated | ~820 gCO2/kWh |

The range is wide. How wide — and whether it matters more than other levers like
training duration or hardware choice — is what this tutorial quantifies.
:::
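
The conversion behind this table is a one-liner: multiply energy consumed by grid intensity. A minimal sketch in plain Python (independent of the mlsysim solvers; the 100 MWh energy figure is purely illustrative):

```python
def carbon_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Convert facility energy into kilograms of CO2 at a given grid intensity."""
    return energy_kwh * intensity_g_per_kwh / 1000.0  # grams -> kilograms

# Illustrative: the same 100 MWh training run under each grid from the table above
energy_kwh = 100_000
for region, intensity in [("Quebec", 20), ("Norway", 29), ("US Avg", 390), ("Poland", 820)]:
    print(f"{region:8s} {carbon_kg(energy_kwh, intensity) / 1000:6.1f} t CO2")
```

Because the relationship is linear, the emissions ratio between any two regions is simply the ratio of their grid intensities.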

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import Engine
```

---

## 2. Two-Region Comparison

Let's run the same training job in two locations: Quebec (hydroelectric) and Poland
(coal-dominated). Same fleet, same model, same 30-day duration. The only variable
is where the electricity comes from.

```{python}
from mlsysim import SustainabilityModel, Systems
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_
from mlsysim.show import table, info

# 256-GPU cluster: 32 DGX H100 nodes
fleet = Fleet(
    name="256-GPU Training Cluster",
    node=Systems.Nodes.DGX_H100,
    count=32,
    fabric=Systems.Fabrics.InfiniBand_NDR
)

solver = SustainabilityModel()

# Quebec: hydroelectric grid
res_quebec = solver.solve(
    fleet=fleet, duration_days=30,
    datacenter=mlsysim.Infra.Grids.Quebec
)

# Poland: coal-heavy grid
res_poland = solver.solve(
    fleet=fleet, duration_days=30,
    datacenter=mlsysim.Infra.Grids.Poland
)

carbon_q = res_quebec.carbon_footprint_kg / 1000  # tonnes
carbon_p = res_poland.carbon_footprint_kg / 1000
ratio = carbon_p / carbon_q if carbon_q > 0 else 0

table(
    ["Region", "Carbon (tonnes CO2)"],
    [
        ["Quebec (Hydro)", f"{carbon_q:.1f}"],
        ["Poland (Coal)", f"{carbon_p:.1f}"],
    ]
)
info(Ratio=f"{ratio:.0f}x")
```

Same cluster. Same model. Same duration. The carbon footprint differs by roughly
**40x** depending on the electricity grid. This is not an optimization — it is a
location decision.
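
A back-of-envelope check shows where numbers of this size come from. The per-node power draw and PUE below are illustrative assumptions for this sketch, not values taken from the Systems zoo:

```python
# Rough hand calculation, not the SustainabilityModel itself.
nodes = 32
node_power_kw = 10.2   # assumed draw per DGX-class node (illustrative)
pue = 1.2              # assumed facility overhead factor (illustrative)
hours = 30 * 24

energy_kwh = nodes * node_power_kw * pue * hours   # total facility energy
for region, intensity in [("Quebec", 20), ("Poland", 820)]:
    tonnes = energy_kwh * intensity / 1_000_000    # g/kWh -> tonnes
    print(f"{region}: ~{tonnes:.0f} t CO2")

# The ratio is just the intensity ratio: 820 / 20 = 41, i.e. roughly 40x.
```

Every term except grid intensity cancels in the ratio, which is why the gap is a pure property of location.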

---

## 3. All-Region Sweep

Let's expand the comparison to all four grid regions in the Infrastructure Zoo,
adding energy consumption, water usage, and PUE to the picture.

```{python}
grids = [
    mlsysim.Infra.Grids.Quebec,
    mlsysim.Infra.Grids.Norway,
    mlsysim.Infra.Grids.US_Avg,
    mlsysim.Infra.Grids.Poland,
]

region_results = {}
rows = []
for grid in grids:
    r = solver.solve(fleet=fleet, duration_days=30, datacenter=grid)
    energy_mwh = r.total_energy_kwh.magnitude / 1000
    carbon_t = r.carbon_footprint_kg / 1000
    water_kl = r.water_usage_liters / 1000
    region_results[r.region_name] = r
    rows.append([r.region_name, f"{energy_mwh:,.1f}", f"{carbon_t:,.1f}", f"{water_kl:,.1f}", f"{r.pue:.2f}"])

table(["Region", "Energy (MWh)", "Carbon (t)", "Water (kL)", "PUE"], rows)
```

Notice that energy consumption also varies between regions because of different PUE
values. A modern liquid-cooled facility (PUE 1.1) wastes less energy on cooling than
a legacy air-cooled datacenter (PUE 1.6). But the dominant factor is carbon intensity
— it creates the 40x gap.
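
PUE enters the calculation as a simple multiplier on IT energy. A minimal sketch of that relationship (plain Python; the IT-only energy figure is illustrative):

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy: IT load scaled by Power Usage Effectiveness."""
    return it_energy_kwh * pue

it_energy = 235_000  # illustrative IT-only energy for the 30-day run, in kWh
for pue in (1.1, 1.6):
    total = facility_energy_kwh(it_energy, pue)
    overhead = total - it_energy  # energy spent on cooling, power delivery, etc.
    print(f"PUE {pue}: total {total:,.0f} kWh, overhead {overhead:,.0f} kWh")
```

Overhead equals the IT load itself exactly at PUE 2.0, which is the crossover Exercise 3 asks you to find.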

---

## 4. Geography vs. Training Duration

Is it better to train longer in a clean region or shorter in a dirty region? Let's
compare 30 days in Quebec against just 10 days in Poland.

```{python}
# 30 days in Quebec
res_30d_quebec = solver.solve(
    fleet=fleet, duration_days=30,
    datacenter=mlsysim.Infra.Grids.Quebec
)

# 10 days in Poland (1/3 the training time)
res_10d_poland = solver.solve(
    fleet=fleet, duration_days=10,
    datacenter=mlsysim.Infra.Grids.Poland
)

c_q = res_30d_quebec.carbon_footprint_kg / 1000
c_p = res_10d_poland.carbon_footprint_kg / 1000

table(
    ["Scenario", "Carbon (tonnes CO2)"],
    [
        ["30 days in Quebec", f"{c_q:.1f}"],
        ["10 days in Poland", f"{c_p:.1f}"],
    ]
)
info(Ratio=f"{c_p/c_q:.1f}x")
```

::: {.callout-important}
## Key Insight

**Geography is a larger lever than training duration for carbon footprint.** Even
training for one-third the time in Poland produces more carbon than the full 30-day
run in Quebec. The carbon intensity gap between hydro and coal grids is so large that
no reasonable reduction in training time can compensate. For any organization serious
about sustainable AI, datacenter location is not a logistics detail — it is a
first-order systems design decision with 40x impact.
:::
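
The insight falls out of the proportionality carbon ∝ intensity × duration (fleet power held fixed). A quick sketch of the break-even, using the table's intensity values:

```python
# Carbon scales linearly with both grid intensity and run duration,
# so relative carbon = intensity_ratio * duration_ratio.
quebec_g_per_kwh, poland_g_per_kwh = 20, 820
intensity_ratio = poland_g_per_kwh / quebec_g_per_kwh  # 41.0

# 10 days in Poland vs. 30 days in Quebec:
relative = intensity_ratio * (10 / 30)
print(f"10d Poland emits {relative:.1f}x the CO2 of 30d Quebec")

# Break-even: a Poland run only matches 30 days in Quebec below 30/41 days.
break_even_days = 30 / intensity_ratio
print(f"break-even Poland duration: {break_even_days:.2f} days")
```

With a 41x intensity gap, the Poland run would have to finish in under a day to match Quebec's 30-day footprint, which is why no realistic duration cut compensates.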

---

## 5. Economic Angle: When Carbon Has a Price

What happens when carbon emissions carry a financial cost? Carbon pricing (through
taxes or cap-and-trade) changes the economics of datacenter location. Let's compute
TCO with a carbon price of $50/tonne.

```{python}
from mlsysim import EconomicsModel

econ = EconomicsModel()
carbon_price = 50  # USD per tonne CO2

rows = []
for grid in grids:
    tco = econ.solve(fleet=fleet, duration_days=30, grid=grid)
    carbon_cost = (tco.carbon_footprint_kg / 1000) * carbon_price
    total = tco.tco_usd + carbon_cost
    rows.append([tco.region_name, f"${tco.tco_usd:,.0f}", f"${carbon_cost:,.0f}", f"${total:,.0f}"])

table(["Region", "TCO ($)", "Carbon Cost ($)", "Total ($)"], rows)
```

At $50/tonne, carbon pricing adds a visible cost differential between regions. At
higher carbon prices (some jurisdictions already charge $100+/tonne), the difference
becomes even more pronounced, potentially shifting which region offers the lowest TCO.
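
The crossover mechanics can be sketched without the solver. All figures below are made-up placeholders, chosen only to show how a rising carbon price can flip which region is cheapest:

```python
# Hypothetical 30-day TCO (USD) and carbon (tonnes) per region — placeholder
# figures for illustration, not mlsysim output.
regions = {
    "Quebec": {"tco": 1_050_000, "carbon_t": 6},
    "Poland": {"tco": 980_000, "carbon_t": 230},
}

for price in range(0, 501, 100):  # USD per tonne CO2
    totals = {name: d["tco"] + d["carbon_t"] * price for name, d in regions.items()}
    cheapest = min(totals, key=totals.get)
    print(f"${price:>3}/t -> cheapest: {cheapest}")
```

With these placeholder numbers, the low-carbon region overtakes the cheap-but-dirty one once the carbon price crosses the TCO gap divided by the carbon gap.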

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Training for 30 days in Quebec vs. 10 days in Poland — which produces more carbon?
Write your prediction, then run both scenarios. Were you right? What does this tell
you about the relative magnitude of grid carbon intensity vs. training duration?

**Exercise 2: At what carbon price does geography change the cheapest option?**
Sweep carbon price from $0 to $500/tonne in steps of $50. For each price, calculate
the total cost (TCO + carbon cost) for all four regions. At what price does a region
other than the default cheapest become the best option? Print a table showing the
crossover.

**Exercise 3: Sweep PUE from 1.0 to 2.0.**
Create custom grid profiles using `from mlsysim.infra.types import GridProfile` with
US Average carbon intensity but varying PUE. Sweep PUE from 1.0 to 2.0 in steps of
0.1. How much does total energy increase? At what PUE does facility overhead exceed
the IT energy itself?

**Self-check:** If you train for 30 days in Quebec (20 gCO2/kWh) vs. 15 days in
Poland (820 gCO2/kWh), and both use the same fleet and power, which produces more
total carbon? Show the mental calculation: the ratio of carbon intensities is 41x,
and the ratio of durations is 2x, so Poland is still 41/2 = ~20x worse.
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Grid carbon intensity creates a 40x gap** between the cleanest (Quebec, ~20 gCO2/kWh) and dirtiest (Poland, ~820 gCO2/kWh) regions
- **Geography dominates training duration** as a sustainability lever: 10 days in Poland emits more than 30 days in Quebec
- **PUE amplifies energy use**, but carbon intensity is the dominant factor in emissions
- **Carbon pricing changes the economics**: at $50-100/tonne, location becomes a financial variable, not just an environmental one
- **Datacenter location is a systems design decision** with first-order impact on sustainability and, increasingly, on cost
:::

---

## Next Steps

- **[The $9M Question](08_nine_million_dollar.qmd)** -- Quantify the infrastructure cost of chain-of-thought reasoning
- **[Scaling to 1000 GPUs](06_scaling_1000_gpus.qmd)** -- Discover the hidden reliability cost at scale
- **[Sensitivity Analysis](09_sensitivity.qmd)** -- Use sensitivity sweeps to find which parameter matters most
- **[Infrastructure Zoo](../zoo/infra.qmd)** -- Browse all regional grid profiles and datacenter configurations