feat(mlsysim): add documentation site, typed registries, and 6-solver core

Complete MLSYSIM v0.1.0 implementation with:

- Documentation website (Quarto): landing page with animated hero
  and capability carousel, 4 tutorials (hello world, LLM serving,
  distributed training, sustainability), hardware/model/fleet/infra
  catalogs, solver guide, whitepaper, math foundations, glossary,
  and full quartodoc API reference
- Typed registry system: Hardware (18 devices across 5 tiers),
  Models (15 workloads), Systems (fleets, clusters, fabrics),
  Infrastructure (grid profiles, rack configs, datacenters)
- Core types: Pint-backed Quantity, Metadata provenance tracking,
  custom exception hierarchy (OOMError, SLAViolation)
- SimulationConfig with YAML/JSON loading and pre-validation
- Scenario system tying workloads to systems with SLA constraints
- Multi-level evaluation scorecard (feasibility, performance, macro)
- Examples, tests, and Jetson Orin NX spec fix (100 → 25 TFLOP/s)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Vijay Janapa Reddi
Date: 2026-03-07 15:59:51 -05:00
parent 3a6e5c5ef6
commit a78f1bd8b0
116 changed files with 10319 additions and 606 deletions


@@ -0,0 +1,359 @@
---
title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
---
::: {.callout-note}
## Background: Why distributed training?
Some models are too large to fit in a single GPU's memory, and some training jobs would take months on one GPU. **Distributed training** splits the work across many GPUs. This tutorial explores the three main ways to split work and the overhead each one introduces. You should complete the Hello World and LLM Serving tutorials before this one.
:::
Scaling a training job from 1 GPU to 1024 GPUs incurs overhead at every step.
Communication, pipeline stalls, and coordination each chip away at theoretical speedup.
Understanding where that efficiency goes, and how to recover it, is what separates
a well-tuned distributed training job from an expensive waste of cluster time.
By the end of this tutorial you will understand:
- How **Data Parallelism**, **Tensor Parallelism**, and **Pipeline Parallelism** decompose work across GPUs
- Why synchronization (ring all-reduce) overhead depends on model size and network bandwidth
- Why **pipeline bubbles** reduce effective GPU utilization
- How to calculate **scaling efficiency** for a real cluster
::: {.callout-tip}
## 3D Parallelism at a Glance
Modern distributed training uses three orthogonal strategies simultaneously:
| Strategy | What it splits | Main overhead |
|:---|:---|:---|
| **Data Parallelism (DP)** | Batch across GPUs | All-reduce gradients after backward pass |
| **Tensor Parallelism (TP)** | Individual matrix ops within a layer | All-gather within each forward/backward |
| **Pipeline Parallelism (PP)** | Layer groups across nodes | Pipeline bubble at start/end of batch |
The product $\text{DP} \times \text{TP} \times \text{PP} = \text{total GPUs}$.
:::
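The factorization constraint above can be enumerated directly. A minimal sketch in plain Python (independent of `mlsysim`) that lists every valid (DP, TP, PP) layout for a 256-GPU cluster, assuming tensor parallelism stays inside one 8-GPU node:

```python
# Enumerate every (DP, TP, PP) factorization of a 256-GPU cluster,
# keeping tensor parallelism inside one 8-GPU node.
TOTAL_GPUS = 256

layouts = []
for tp in (1, 2, 4, 8):                  # divisors of 8 GPUs per node
    remaining = TOTAL_GPUS // tp
    for pp in range(1, remaining + 1):
        if remaining % pp:
            continue
        dp = remaining // pp
        layouts.append((dp, tp, pp))     # dp * tp * pp == TOTAL_GPUS

print(f"{len(layouts)} valid layouts")   # 30
print(layouts[:4])
```

Thirty candidate layouts for one cluster: this is why the rest of the tutorial builds intuition for which overhead each dimension introduces, rather than brute-forcing the search.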
---
## 1. Setup
```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim_mod = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim_mod
spec.loader.exec_module(mlsysim_mod)
import mlsysim
```
```python
import mlsysim
from mlsysim import DistributedSolver
```
```{python}
from mlsysim import DistributedSolver
# Llama-3.1-70B: the model requires distributed training — too large for a single GPU
model = mlsysim.Models.Llama3_70B
# A research-scale cluster: 32 DGX H100 nodes × 8 GPUs = 256 H100s
# (DGX is NVIDIA's pre-built server containing 8 H100 GPUs connected via NVLink)
cluster = mlsysim.Systems.Clusters.Research_256
print(f"Model: {model.name} ({model.parameters.to('Gparam'):.0f} params)")
print(f"Cluster: {cluster.name}")
print(f" Nodes: {cluster.count} × {cluster.node.accelerators_per_node} GPUs/node")
print(f" Total: {cluster.total_accelerators} accelerators")
print(f" Fabric: {cluster.fabric.name} @ {cluster.fabric.bandwidth.to('GB/s'):.0f} GB/s/link")
```
---
## 2. Visualizing 3D Parallelism
Before working through the numbers, consider how 3D parallelism decomposes a training job across a cluster. Each dimension splits work differently and introduces a different type of overhead:
**Data Parallelism (DP=4)** — each GPU holds a full model copy and processes 1/4 of the batch. After the backward pass, gradients are synchronized via All-Reduce.
```{mermaid}
%%| fig-cap: "Data Parallelism: replicate the model, split the batch, synchronize gradients."
flowchart LR
R1["Replica 1<br/>Batch 1/4"] <-.->|"All-Reduce"| R2["Replica 2<br/>Batch 2/4"]
R2 <-.->|"All-Reduce"| R3["Replica 3<br/>Batch 3/4"]
R3 <-.->|"All-Reduce"| R4["Replica 4<br/>Batch 4/4"]
```
**Tensor Parallelism (TP=2)** — each layer is split across GPUs. Requires fast interconnect (NVLink).
```{mermaid}
%%| fig-cap: "Tensor Parallelism: split each layer across GPUs, communicate via NVLink."
flowchart LR
G1["GPU 0<br/>Left half of each layer"] <-->|"All-Gather<br/>(NVLink)"| G2["GPU 1<br/>Right half of each layer"]
```
**Pipeline Parallelism (PP=4)** — model layers are partitioned across stages. Activations flow forward; gradients flow backward.
```{mermaid}
%%| fig-cap: "Pipeline Parallelism: partition layers across stages, activations flow forward."
flowchart LR
S1["Stage 1<br/>Layers 1–20"] --> S2["Stage 2<br/>Layers 21–40"]
S2 --> S3["Stage 3<br/>Layers 41–60"]
S3 --> S4["Stage 4<br/>Layers 61–80"]
```
The key insight: **DP** uses inter-node bandwidth (network fabric), **TP** uses intra-node bandwidth (NVLink), and **PP** introduces idle time (pipeline bubbles). The optimal configuration balances all three overheads.
---
## 3. Baseline: Pure Data Parallelism
Start with the simplest configuration — no model splitting, just replicate the full model
on every GPU and split the batch. The per-GPU compute time is determined by the same
roofline model you used in the Hello World tutorial. The new element here is **communication
overhead**: after each training step, all GPUs must synchronize their gradients via the
network before the next step can begin.
```{python}
solver = DistributedSolver()
result_dp = solver.solve(
model=model,
fleet=cluster,
batch_size=256,
precision="fp16",
tp_size=1, # no tensor parallelism
pp_size=1, # no pipeline parallelism
)
node_perf = result_dp["node_performance"]
print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
print(f"")
print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
```
::: {.callout-note}
## What does scaling efficiency mean?
If scaling efficiency is 80%, then your 256-GPU cluster is delivering the equivalent of
about 205 fully-utilized GPUs. The remaining ~51 GPUs' worth of compute is spent on
communication overhead. This is the **communication tax** of distributed training.
The tax is paid in **ring all-reduce**: after the backward pass, every GPU must synchronize
gradients with every other GPU. The time to do this grows with model size and shrinks with
network bandwidth.
:::
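In formula form, scaling efficiency is the fraction of each step spent on useful compute. A minimal plain-Python sketch (the timing numbers are illustrative, not solver output):

```python
# Scaling efficiency = useful compute time / total step time.
# The timings below are illustrative placeholders, not mlsysim output.
t_compute_ms = 180.0   # roofline compute time per step, per GPU
t_comm_ms = 45.0       # gradient all-reduce overhead per step

efficiency = t_compute_ms / (t_compute_ms + t_comm_ms)
print(f"Scaling efficiency: {efficiency:.1%}")                    # 80.0%
print(f"Equivalent fully-utilized GPUs: {256 * efficiency:.0f}")  # 205
```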
---
## 4. Ring All-Reduce: The Network Tax
The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
standard method for gradient synchronization. Its time depends on:
$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
Where $M$ is the message size (the full gradient: 2 bytes per parameter in fp16, the same size as the weights), $N$ is the number
of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
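Plugging in numbers gives a feel for the scale. A rough plain-Python evaluation for Llama-70B fp16 gradients over single-link bandwidths (real nodes aggregate several NICs and overlap communication with compute, so treat these as pessimistic upper bounds):

```python
# Ring all-reduce time for Llama-70B gradients in fp16.
# Single-link bandwidths only; real clusters aggregate multiple NICs
# per node and overlap communication with compute.
M = 70e9 * 2            # bytes: 70B params x 2 bytes/param = 140 GB
N = 32                  # data-parallel replicas (one per node)

for name, link_GBps in [("100GbE", 12.5), ("IB NDR", 50.0)]:
    t = 2 * M * (N - 1) / (N * link_GBps * 1e9)   # seconds
    print(f"{name:>7}: {t:5.1f} s per all-reduce")
```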
The following sweep shows how fabric bandwidth affects overhead:
```{python}
from mlsysim import Fleet, NetworkFabric, Systems
fabrics = [
("100GbE", Systems.Fabrics.Ethernet_100G),
("IB HDR", Systems.Fabrics.InfiniBand_HDR),
("IB NDR", Systems.Fabrics.InfiniBand_NDR),
]
print(f"{'Fabric':>10} {'BW (GB/s)':>10} {'Comm overhead':>14} {'Efficiency':>11}")
print("-" * 52)
for fab_name, fabric in fabrics:
custom_cluster = Fleet(
name="Custom",
node=Systems.Nodes.DGX_H100,
count=32,
fabric=fabric
)
r = solver.solve(
model=model,
fleet=custom_cluster,
batch_size=256,
precision="fp16"
)
print(
f"{fab_name:>10} "
f"{fabric.bandwidth.to('GB/s'):>10.0f~} "
f"{r['dp_communication_latency'].to('ms'):>14.2f~} "
f"{r['scaling_efficiency']:>11.1%}"
)
```
::: {.callout-warning}
## Fabric choice determines scaling efficiency
Upgrading from 100GbE to InfiniBand NDR raises per-link bandwidth from 100 Gb/s to 400 Gb/s.
On a model the size of Llama-70B (140 GB of gradients per step in fp16), that difference
is significant. For smaller models, it matters less — compute time dominates.
:::
---
## 5. Pipeline Parallelism and the Bubble
**Pipeline Parallelism** splits the model's layers across multiple nodes. Node 1 runs layers
1–20, node 2 runs layers 21–40, and so on. This allows training a much larger model than would
fit on a single node.
The downside: a **pipeline bubble**. The first microbatch must flow through all stages before
the last stage can start processing the second microbatch. During that startup phase, most
GPUs are idle.
$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
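The bubble formula is easy to tabulate directly. A quick plain-Python sketch:

```python
# Pipeline bubble fraction: (P - 1) / (P - 1 + M)
# P = pipeline stages, M = microbatches per batch.
def bubble_fraction(P: int, M: int) -> float:
    return (P - 1) / (P - 1 + M)

for P in (4, 8):
    for M in (1, 4, 16):
        print(f"P={P} M={M:>2}: bubble = {bubble_fraction(P, M):.1%}")
```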
```{python}
print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")
print("-" * 60)
for pp_size in [1, 2, 4, 8]:
for m in [1, 4, 16]:
# Only show interesting combinations
if pp_size == 1 and m > 1:
continue
r = solver.solve(
model=model,
fleet=cluster,
batch_size=256,
precision="fp16",
tp_size=1,
pp_size=pp_size,
microbatch_count=m
)
bubble_pct = r["bubble_fraction"] * 100
print(
f"{pp_size:>10} "
f"{m:>13} "
f"{bubble_pct:>9.1f}% "
f"{r['pipeline_bubble_latency'].to('ms'):>10.1f~} "
f"{r['scaling_efficiency']:>11.1%}"
)
```
::: {.callout-tip}
## Recovering bubble efficiency
Increasing the number of **microbatches** ($M$) reduces the bubble fraction. With $M = 16$
and $P = 8$, the bubble is only $7/(7+16) \approx 30\%$ of the pipeline, down from $88\%$ with
$M = 1$.
In practice, frameworks like Megatron-LM use **interleaved pipeline schedules** that further
reduce the bubble. But even with the standard 1F1B schedule, choosing $M \gg P$ is essential.
:::
---
## 6. Finding the Optimal Configuration
Now combine all three parallelism strategies and find the configuration that maximizes
scaling efficiency for the `Research_256` cluster. In practice, 70-80% scaling efficiency
on hundreds of GPUs is considered excellent. Below 50% typically signals a suboptimal
parallelism configuration or insufficient network bandwidth.
```{python}
configs = [
# (description, tp, pp, m)
("DP only", 1, 1, 1),
("DP + TP=2", 2, 1, 1),
("DP + PP=4, M=16", 1, 4, 16),
("DP + TP=2 + PP=4, M=16", 2, 4, 16),
("DP + TP=8 + PP=4, M=16", 8, 4, 16),
]
print(f"{'Config':<26} {'DP':>4} {'TP':>4} {'PP':>4} {'Efficiency':>11} {'Throughput':>14}")
print("-" * 72)
for desc, tp, pp, m in configs:
try:
r = solver.solve(
model=model,
fleet=cluster,
batch_size=256,
precision="fp16",
tp_size=tp,
pp_size=pp,
microbatch_count=m
)
print(
f"{desc:<26} "
f"{r['parallelism']['dp']:>4} "
f"{r['parallelism']['tp']:>4} "
f"{r['parallelism']['pp']:>4} "
f"{r['scaling_efficiency']:>11.1%} "
f"{r['effective_throughput'].magnitude:>14.1f}"
)
except ValueError as e:
print(f"{desc:<26} {'INFEASIBLE':>44} ({e})")
```
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Predict before you observe.**
For a 256-GPU cluster training Llama-3.1-70B, predict: will DP=256, TP=1, PP=1 have higher or lower scaling efficiency than DP=32, TP=4, PP=2? Write your prediction and reasoning, then run both configurations. Were you right?
**Exercise 2: Find the optimal configuration.**
Sweep all valid 3D parallelism configurations for 256 GPUs (where DP × TP × PP = 256). Which configuration maximizes scaling efficiency? Is it the same for Ethernet 100G vs. InfiniBand NDR? (Hint: valid TP values are divisors of 8, the GPUs per node: 1, 2, 4, 8. For each TP, valid PP values are divisors of 256/TP.)
**Exercise 3: The microbatch lever.**
With PP=8, sweep microbatch count M from 1 to 64. Plot the pipeline bubble fraction vs. M. At what value of M does the bubble fraction drop below 10%? (Use the formula from Section 5: bubble = (P-1)/(P-1+M). Predict the answer analytically before running the sweep.)
**Self-check:** Why must tensor parallelism (TP) stay within a single node on most clusters? What would happen to communication overhead if TP crossed node boundaries?
:::
---
## What You Learned
- **3D Parallelism** decomposes the training problem across $\text{DP} \times \text{TP} \times \text{PP}$ GPUs,
each with distinct communication costs.
- **Ring all-reduce** is the network tax of data parallelism. It grows with model size and
shrinks with fabric bandwidth. Switching from 100GbE to InfiniBand can recover 10-30%
efficiency on large models.
- **Pipeline bubbles** waste GPU cycles proportional to $\frac{P-1}{P-1+M}$. Use large
microbatch counts ($M \gg P$) to minimize waste.
- **Scaling efficiency below 100%** is normal and unavoidable. A well-tuned job at 70-80%
efficiency on hundreds of GPUs is excellent. Below 50% signals a configuration problem.
---
## Next Steps
- **[LLM Serving Lab](llm_serving.qmd)**: After training, learn how to model the serving cost of the same model
- **[Math Foundations](../math.qmd)**: Full derivations for ring all-reduce, pipeline bubble, and MFU
- **[Fleet Zoo](../zoo/fleets.qmd)**: Browse the available cluster configurations and their network specs


@@ -0,0 +1,196 @@
---
title: "Hello World: Single-Node Roofline"
subtitle: "Predict model performance on hardware before writing a single CUDA kernel."
---
::: {.callout-note}
## Prerequisites
Complete the [Getting Started](../getting-started.qmd) guide before this tutorial. It introduces the `Engine.solve` API and the MLSys Zoo.
:::
In this tutorial, you will model the performance of **ResNet-50** on an **NVIDIA A100** GPU
using the analytical roofline model. By the end, you will understand:
- What it means for a model to be **memory-bound** vs. **compute-bound**
- How changing **batch size** shifts the bottleneck
- Why the A100's memory bandwidth matters as much as its peak TFLOP/s
::: {.callout-note}
## Background: ResNet-50 and the A100
**ResNet-50** is a 50-layer convolutional neural network (CNN) commonly used for image classification. It has roughly 25 million parameters and requires about 8 billion floating-point operations (8 GFLOP) per inference. It is a standard benchmark workload because its size is well-characterized and widely published.
The **NVIDIA A100** is a datacenter GPU designed for ML training and inference. Its key specifications: 312 TFLOP/s peak compute (FP16 Tensor Core), 2.0 TB/s HBM2e (High Bandwidth Memory) bandwidth, and 80 GB of memory. These two numbers (compute speed and memory speed) are what the roofline model uses to predict performance.
See the [Glossary](../glossary.qmd) for definitions of terms like FLOP/s, HBM, and Tensor Core.
:::
::: {.callout-tip}
## What is the roofline model?
Every GPU has two speed limits: how fast it can compute (FLOP/s) and how fast it can load
data from memory (bytes/s). Your model's actual throughput is determined by whichever limit
you hit first. The roofline model tells you exactly which one, and by how much.
:::
---
## 1. Setup
```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim
spec.loader.exec_module(mlsysim)
Engine = mlsysim.Engine
```
After `pip install mlsysim`, the import is simple:
```python
import mlsysim
from mlsysim import Engine
```
---
## 2. Select Workload and Hardware
Pull vetted specifications directly from the **MLSys Zoo**—no need to look up datasheets.
```{python}
# Load ResNet-50 from the Model Zoo
model = mlsysim.Models.ResNet50
# Load NVIDIA A100 from the Silicon Zoo
hardware = mlsysim.Hardware.Cloud.A100
print(f"Model: {model.name} ({model.architecture})")
print(f"Hardware: {hardware.name} ({hardware.release_year})")
print(f"")
print(f"Model FLOPs (inference): {model.inference_flops}")
print(f"Hardware Peak TFLOP/s: {hardware.compute.peak_flops.to('TFLOPs/s'):.0f}")
print(f"Hardware Memory BW: {hardware.memory.bandwidth.to('TB/s'):.1f}")
```
---
## 3. Solve the Performance Profile
The `Engine.solve` method applies the **Iron Law of ML Systems**—it calculates which of the
two hardware speed limits (compute or memory) you hit first, and returns your latency from there.
```{python}
profile = Engine.solve(
model=model,
hardware=hardware,
batch_size=1,
precision="fp16"
)
print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency.to('ms'):.3f} ms per inference")
print(f"Throughput: {profile.throughput:.0f} images/sec")
```
::: {.callout-note}
## Why "Memory Bound"?
At batch size 1, ResNet-50 performs ~8 GFLOP of computation but must move ~50 MB of weights (25.6M parameters at fp16), plus per-image activation traffic, through HBM.
Its **arithmetic intensity** (FLOP/byte) therefore falls below the A100's roofline ridge point, so
the A100's memory bandwidth (2 TB/s) becomes the bottleneck, not its 312 TFLOP/s of compute.
:::
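The same reasoning can be sketched in plain Python, independent of `mlsysim`. The ~30 MB of per-image activation traffic below is a rough placeholder (not a vetted spec), so the exact crossover batch size depends on that assumption; the pattern is what matters:

```python
# Weight bytes are loaded once per batch and amortized across it;
# activation traffic (ACT_BYTES, an assumed placeholder) is per image.
FLOPS_PER_IMAGE = 8e9        # ~8 GFLOP per ResNet-50 inference
WEIGHT_BYTES = 25.6e6 * 2    # 25.6M params at fp16
ACT_BYTES = 30e6             # assumed activation traffic per image
PEAK_FLOPS = 312e12          # A100 fp16 Tensor Core, FLOP/s
PEAK_BW = 2.0e12             # A100 HBM2e, bytes/s

for batch in (1, 16, 256):
    t_compute = FLOPS_PER_IMAGE / PEAK_FLOPS                  # per image
    t_memory = (WEIGHT_BYTES / batch + ACT_BYTES) / PEAK_BW   # per image
    bound = "memory-bound" if t_memory > t_compute else "compute-bound"
    print(f"batch={batch:>3}: {bound}")
```

Latency per image is whichever of the two times is larger; growing the batch shrinks the per-image weight traffic until compute takes over.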
---
## 4. Sweep Batch Sizes
The bottleneck changes as batch size grows. Run the sweep and see when compute takes over:
```{python}
print(f"{'Batch':>6} {'Bottleneck':<16} {'Throughput':>12} {'Latency':>10}")
print("-" * 52)
for batch in [1, 4, 16, 32, 64, 128, 256]:
p = Engine.solve(
model=model,
hardware=hardware,
batch_size=batch,
precision="fp16"
)
print(
f"{batch:>6} {p.bottleneck:<16} "
f"{p.throughput:>10.0f}/s "
f"{p.latency.to('ms'):>8.2f} ms"
)
```
::: {.callout-tip}
## The crossover point
Watch where the output switches from `Memory Bound` to `Compute Bound`. That is the **ridge
point** of the roofline—the batch size at which you've saturated both resources equally.
Beyond that point, adding more compute (or a bigger GPU) pays off. Below it, more memory
bandwidth is what matters.
:::
---
## 5. Visualizing the Roofline
MLSYSIM includes built-in visualization tools. The roofline chart plots the hardware's two
ceilings and shows where your workloads sit relative to them:
```python
import matplotlib.pyplot as plt
fig, ax = mlsysim.plot_roofline(hardware, workloads=[model])
ax.set_title(f"Roofline: {model.name} on {hardware.name}")
plt.show()
```
![Roofline plot for ResNet-50 on the NVIDIA A100. The rising diagonal is the memory bandwidth ceiling; the flat line is the compute ceiling. ResNet-50 at batch=1 lands on the memory-bound slope, well below the ridge point.](images/roofline_hello_world.png){#fig-roofline-hello}
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Predict before you compute.**
Before running the code, predict: Will ResNet-50 at batch_size=64 be memory-bound or compute-bound on the A100? Write down your prediction, then verify with `Engine.solve(...)`. Were you right? Why or why not?
**Exercise 2: Hardware comparison.**
Before running: which GPU do you predict will have the highest ridge point -- the V100, A100, or H100? (Hint: compare their compute-to-bandwidth ratios.) Then run the same ResNet-50 analysis on `mlsysim.Hardware.Cloud.H100` and `mlsysim.Hardware.Cloud.V100`. Which gives the lowest latency at batch_size=1? At batch_size=256? What explains the difference?
**Exercise 3: Precision effect.**
Before running: will switching from `precision="fp16"` to `precision="int8"` change the bottleneck classification for ResNet-50 on the A100 at batch_size=1? Write your prediction and reasoning, then compare both. How does quantization change the arithmetic intensity?
**Self-check:** If a model's arithmetic intensity is 50 FLOP/byte and the hardware's ridge point is 156 FLOP/byte, is the model compute-bound or memory-bound?
:::
---
## What You Learned
- **Roofline model**: Runtime is bounded by $\max\left(\frac{\text{FLOPs}}{\text{Peak}},\ \frac{\text{Bytes}}{\text{BW}}\right)$ (whichever takes longer, computing or loading data, determines your latency)
- **Batch size matters**: Small batches are memory-bound; large batches become compute-bound
- **The ridge point**: The crossover batch size where memory and compute are equally saturated
- **Practical implication**: If you are memory-bound, reducing data movement (quantization, larger batches) helps more than a faster GPU
---
## Next Steps
- **[Sustainability Lab](sustainability.qmd)**: Calculate the carbon footprint of training across different grid regions
- **[LLM Serving Lab](llm_serving.qmd)**: Model the two phases of LLM inference and discover the KV-cache memory wall
- **[Math Foundations](../math.qmd)**: The complete set of equations used by all solvers
- **[Silicon Zoo](../zoo/hardware.qmd)**: Browse all vetted hardware specs and compare alternatives


@@ -0,0 +1,75 @@
---
title: "Tutorials"
subtitle: "Step-by-step guides for modeling ML Systems."
format:
html:
toc: false
---
These tutorials are designed to build intuition for ML systems using the `mlsysim` framework.
They map directly to chapters in the *Machine Learning Systems* textbook—start at the beginning
or jump to any topic.
::: {.tutorial-grid}
::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}
### Hello World: Single-Node Roofline
Learn to lower a model onto hardware and identify the performance bottleneck.
Understand memory-bound vs. compute-bound in 5 minutes.
[Start Tutorial →](hello_world.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### Sustainability Lab: Carbon Footprint
Calculate the energy and CO₂ cost of training a frontier LLM across different
geographical grid regions. Quebec vs. Poland—the numbers will surprise you.
[Start Tutorial →](sustainability.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### LLM Serving: TTFT, ITL & the Memory Wall
Model the two physical regimes of autoregressive generation: the compute-bound
pre-fill phase and the memory-bound decoding phase. Discover how quantization
and hardware choice affect each phase differently.
[Start Tutorial →](llm_serving.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### Distributed Training: 3D Parallelism
Explore Data, Tensor, and Pipeline parallelism overhead. Model the ring all-reduce
communication cost and pipeline bubble fraction on a 256-GPU H100 cluster.
[Start Tutorial →](distributed.qmd){.tutorial-arrow}
:::
:::
---
## Learning Path
If you're new to ML systems modeling, we recommend this sequence:
1. **[Hello World](hello_world.qmd)** — Understand the roofline model and what determines inference speed.
2. **[Sustainability Lab](sustainability.qmd)** — Apply the framework to a real-world carbon analysis.
3. **[LLM Serving Lab](llm_serving.qmd)** — Model TTFT, ITL, and KV-cache pressure for production LLM serving.
4. **[Distributed Training](distributed.qmd)** — Scale to hundreds of GPUs and analyze where efficiency is lost.
5. **[Hardware Zoo](../zoo/hardware.qmd)** — Explore the vetted hardware specifications across deployment tiers.
6. *(Optional)* **[Math Foundations](../math.qmd)** — The first-principles equations behind every solver.
> **Tip:** All tutorials are Jupyter/Quarto compatible. Run them locally after `pip install mlsysim`.


@@ -0,0 +1,338 @@
---
title: "LLM Serving Lab: TTFT, ITL, and the Memory Wall"
subtitle: "Model the two physical regimes of LLM inference before deploying a single server."
---
::: {.callout-note}
## Background: What is an LLM and why is serving different?
A **Large Language Model (LLM)** like Llama-3 generates text one token (roughly one word) at a time. Unlike image models that process a fixed input in one pass, LLMs run the model *repeatedly*, once for each output token. This creates two distinct phases with different performance characteristics, which is why LLM serving requires its own dedicated solver. You should complete the [Hello World tutorial](hello_world.qmd) before this one.
:::
Running a large language model in production is not like running ResNet. An LLM inference
request goes through **two completely different physical regimes**, each bottlenecked by a
different hardware resource. Understanding this is the difference between guessing at your
deployment budget and calculating it precisely.
By the end of this tutorial you will understand:
- Why **TTFT** (Time to First Token) and **ITL** (Inter-Token Latency) have different bottlenecks
- How **KV-cache** memory pressure limits batch concurrency
- Why **quantization** helps decoding more than prefill
- How to pick the right GPU for your serving latency targets
::: {.callout-tip}
## The Two Phases of LLM Inference
Recall from the [Hello World tutorial](hello_world.qmd) that every workload is either memory-bound
or compute-bound. LLM serving is unusual because *both regimes* occur in the same request:
**Pre-fill (TTFT):** All prompt tokens processed in a single forward pass. The model sees the
full context at once — this is compute-intensive and saturates GPU arithmetic units. Optimizing
TTFT means getting more TFLOP/s.
**Decoding (ITL):** One token generated at a time. Each step must reload the *entire model*
from HBM (High Bandwidth Memory) to produce just one output token. This is overwhelmingly **memory-bound**.
Optimizing ITL means getting more GB/s.
The same GPU has two different speed limits for the same model.
:::
---
## 1. Setup
```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim_mod = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim_mod
spec.loader.exec_module(mlsysim_mod)
import mlsysim
```
```python
import mlsysim
from mlsysim import ServingSolver
```
Unlike the general-purpose `Engine.solve` from the Hello World tutorial, `ServingSolver`
separates inference into two phases — pre-fill and decoding — each with its own bottleneck.
Select our workload and hardware from the **MLSys Zoo**:
```{python}
from mlsysim import ServingSolver
# Llama-3.1-8B: 8B parameters, 32 layers, 4096 hidden_dim
# 8 GQA (Grouped Query Attention) heads — fewer KV heads than query heads, saving memory
model = mlsysim.Models.Llama3_8B
# NVIDIA H100 SXM5: 80 GB HBM3, 3.35 TB/s, 989 TFLOP/s (fp16)
hardware = mlsysim.Hardware.Cloud.H100
print(f"Model: {model.name}")
print(f"Parameters: {model.parameters.to('Gparam'):.1f}")
print(f"Layers: {model.layers}, Hidden: {model.hidden_dim}")
print(f"")
print(f"Hardware: {hardware.name}")
print(f"Memory: {hardware.memory.capacity.to('GB'):.0f} GB @ "
f"{hardware.memory.bandwidth.to('TB/s'):.2f} TB/s")
print(f"Compute: {hardware.compute.peak_flops.to('TFLOPs/s'):.0f} TFLOP/s (fp16)")
```
---
## 2. First Serving Prediction
The `ServingSolver` takes a **sequence length** — the total context window that must be
processed during pre-fill and cached during decoding.
```{python}
solver = ServingSolver()
result = solver.solve(
model=model,
hardware=hardware,
seq_len=2048, # tokens in context (prompt + history)
batch_size=1, # concurrent users
precision="fp16"
)
print(f"Feasible: {result['feasible']}")
print(f"")
print(f"── Latency ──────────────────────────────")
print(f"TTFT (prefill): {result['ttft'].to('ms'):~.1f}")
print(f"ITL (per token): {result['itl'].to('ms'):~.2f}")
print(f"")
print(f"── Memory ───────────────────────────────")
print(f"Model weights: {result['model_weights_size']:~.2f}")
print(f"KV-cache (2K ctx): {result['kv_cache_size']:~.3f}")
print(f"Total required: {result['total_memory_required']:~.2f}")
print(f"Memory util: {result['memory_utilization']:.1%}")
```
::: {.callout-note}
## Reading the output
- **TTFT** is tens of milliseconds — bounded by the GPU's 989 TFLOP/s compute ceiling.
- **ITL** is a few milliseconds — bounded by the 3.35 TB/s HBM bandwidth.
At each decode step, ~16 GB of weights must transit from HBM to compute units, yet
only one token of computation happens. The bandwidth is the wall, not the FLOPs.
- **Memory util** tells you how much of the 80 GB HBM is occupied. The remainder is
available for more concurrent users (larger `batch_size`).
- **Typical SLA targets**: For interactive chat applications, aim for TTFT < 200 ms and
ITL < 50 ms/token. The numbers above are well within these targets for a single user.
:::
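Both numbers can be approximated from first principles. A back-of-the-envelope plain-Python sketch (ignoring attention FLOPs, KV-cache reads, kernel overheads, and compute/memory overlap):

```python
# TTFT: prefill is compute-bound -> prompt FLOPs / peak FLOPs.
# ITL:  decode is memory-bound  -> weight bytes / HBM bandwidth.
PARAMS = 8e9            # Llama-3.1-8B
PEAK_FLOPS = 989e12     # H100 fp16 Tensor Core, FLOP/s
HBM_BW = 3.35e12        # bytes/s
seq_len = 2048

ttft = 2 * PARAMS * seq_len / PEAK_FLOPS   # ~2 FLOP per param per token
itl = PARAMS * 2 / HBM_BW                  # fp16: 2 bytes per param

print(f"TTFT ~ {ttft * 1e3:.0f} ms")   # tens of milliseconds
print(f"ITL  ~ {itl * 1e3:.2f} ms")    # a few milliseconds
```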
---
## 3. The KV-Cache Memory Wall
The KV-cache stores the Key and Value matrices from every attention layer for every token
in the active context. Its size grows as:
$$\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}$$
Where $L$ = layers, $H_{kv}$ = KV heads, $d_{head}$ = head dimension, $S$ = sequence length, $B$ = batch size, and
$\text{bpp}$ = bytes per parameter. The leading factor of 2 counts the separate K and V tensors.
This means doubling `batch_size` doubles the KV-cache. At some point, you hit the
**memory wall** — the combined model + KV-cache exceeds the accelerator's HBM capacity.
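The formula is straightforward to evaluate by hand. A plain-Python sketch for Llama-3.1-8B-like settings (the head dimension of 128 is inferred from 4096 hidden / 32 query heads, an assumption rather than a Zoo value):

```python
# KV-cache bytes = 2 (K and V) x layers x KV heads x head_dim
#                  x seq_len x batch x bytes/param.
# HEAD_DIM = 128 is inferred (4096 hidden / 32 query heads), an assumption.
LAYERS, KV_HEADS, HEAD_DIM, BPP = 32, 8, 128, 2   # BPP=2 for fp16

def kv_cache_gb(seq_len: int, batch: int) -> float:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * seq_len * batch * BPP / 1e9

for batch in (1, 8, 64):
    print(f"batch={batch:>2}, ctx=2048: {kv_cache_gb(2048, batch):.2f} GB")
```

Note the linearity: doubling either batch size or context length doubles the cache, which is exactly the wall the sweeps below run into.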
```{python}
print(f"{'Batch':>6} {'Ctx':>6} {'KV-Cache':>10} {'Total':>8} {'Util':>6} {'Feasible':>8}")
print("-" * 56)
for batch in [1, 4, 8, 16, 32, 64]:
r = solver.solve(
model=model,
hardware=hardware,
seq_len=2048,
batch_size=batch,
precision="fp16"
)
print(
f"{batch:>6} "
f"{'2048':>6} "
f"{r['kv_cache_size']:>10.3f~} "
f"{r['total_memory_required']:>8.2f~} "
f"{r['memory_utilization']:>6.1%} "
f"{'✓' if r['feasible'] else '✗ OOM':>8}"
)
```
::: {.callout-warning}
## Finding the memory wall
Watch for `✗ OOM` — this is where `total_memory_required` exceeds the 80 GB HBM capacity.
That batch size is infeasible on a single H100. You would need to either: reduce the
context window, switch to a lower-precision format, or add more GPUs.
:::
```{python}
# Also sweep context length at fixed batch size
print(f"\n{'Ctx':>6} {'KV-Cache':>10} {'Total':>8} {'Util':>6} {'Feasible':>8}")
print("-" * 48)
for ctx in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
r = solver.solve(
model=model,
hardware=hardware,
seq_len=ctx,
batch_size=8,
precision="fp16"
)
print(
f"{ctx:>6} "
f"{r['kv_cache_size']:>10.3f~} "
f"{r['total_memory_required']:>8.2f~} "
f"{r['memory_utilization']:>6.1%} "
f"{'✓' if r['feasible'] else '✗ OOM':>8}"
)
```
---
## 4. Quantization: Precision as a Latency Knob
Reducing numerical precision does two things simultaneously:
1. **Shrinks model weights** → fewer bytes to load per decode step → lower ITL
2. **Shrinks KV-cache** → more headroom for larger batches or longer contexts
But precision affects the **two phases differently**: TTFT (compute-bound) improves only
when going to fp8 or below on hardware with native low-precision tensor cores. ITL
(memory-bound) improves with every step down in precision.
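The decode-side effect can be sketched with a one-line memory-bound model: each decode step streams the full weights from HBM, so ITL is roughly weight bytes divided by bandwidth. The 8B parameter count and 3.35 TB/s H100-class bandwidth below are illustrative assumptions; the real solver accounts for more than this:

```python
# Memory-bound decode model: ITL ~= weight_bytes / memory_bandwidth.
# Illustrative numbers: 8e9 parameters, H100-class 3.35 TB/s HBM.
PARAMS = 8e9
BW = 3.35e12  # bytes/s

def itl_ms(bytes_per_param: float) -> float:
    """Approximate inter-token latency in ms for a given precision."""
    return PARAMS * bytes_per_param / BW * 1e3

for prec, bpp in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{prec}: ~{itl_ms(bpp):.2f} ms/token")
```

Each halving of bytes-per-parameter halves the estimated ITL, which is the pattern to look for in the solver sweep below.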
```{python}
print(f"{'Precision':>10} {'TTFT':>8} {'ITL':>10} {'Weights':>8} {'KV-Cache':>10} {'Util':>7}")
print("-" * 64)
for prec in ["fp16", "int8", "int4"]:
r = solver.solve(
model=model,
hardware=hardware,
seq_len=8192,
batch_size=8,
precision=prec
)
print(
f"{prec:>10} "
f"{r['ttft'].to('ms'):>8.1f~} "
f"{r['itl'].to('ms'):>10.3f~} "
f"{r['model_weights_size']:>8.2f~} "
f"{r['kv_cache_size']:>10.3f~} "
f"{r['memory_utilization']:>7.1%}"
)
```
::: {.callout-tip}
## Why ITL improves more than TTFT
Going from `fp16` → `int8` halves the model size. At **decode time**, each step must load
the full model from HBM — half the bytes means half the time. ITL drops by ~50%.
At **prefill time**, the computation is the bottleneck (not bandwidth), so halving byte
count helps less — you're not memory-bound in the first place. The improvement is
smaller and depends on whether your hardware has native `int8` tensor core support.
**Rule of thumb**: Quantization is a decoding optimization first, a prefill optimization second.
:::
---
## 5. Hardware Comparison
Different GPUs have different ratios of compute-to-memory-bandwidth. For LLM serving:
- **Higher TFLOP/s** → faster TTFT (prefill is compute-bound)
- **Higher HBM bandwidth** → faster ITL (decoding is memory-bound)
```{python}
gpus = [
("A100 (80GB)", mlsysim.Hardware.Cloud.A100),
("H100 SXM5", mlsysim.Hardware.Cloud.H100),
("H200", mlsysim.Hardware.Cloud.H200),
("MI300X", mlsysim.Hardware.Cloud.MI300X),
]
print(f"{'GPU':>14} {'BW (TB/s)':>10} {'TTFT':>8} {'ITL':>10} {'Max Util':>9}")
print("-" * 60)
for name, hw in gpus:
r = solver.solve(
model=model,
hardware=hw,
seq_len=4096,
batch_size=4,
precision="fp16"
)
print(
f"{name:>14} "
f"{hw.memory.bandwidth.to('TB/s'):>10.2f~} "
f"{r['ttft'].to('ms'):>8.1f~} "
f"{r['itl'].to('ms'):>10.3f~} "
f"{r['memory_utilization']:>9.1%}"
)
```
::: {.callout-note}
## Why H200 wins on ITL
The H200 uses HBM3e with **4.8 TB/s** of bandwidth versus the H100's 3.35 TB/s, a 43%
increase. Since decoding is memory-bound, ITL scales inversely with bandwidth: the H200's
ITL is about 30% lower ($3.35 / 4.8 \approx 0.70$), not a full 43%.
The MI300X is even more interesting: its massive 192 GB HBM pool lets you pack far more
concurrent users (batch_size) before hitting the memory wall.
:::
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Predict the memory wall.**
Before running the code, estimate: at what batch size will Llama-3.1-8B hit OOM on an 80 GB H100 with seq_len=4096 at FP16? Write your estimate, then sweep batch sizes to find the actual limit. How close were you?
**Exercise 2: The quantization trade-off.**
Before running: predict which GPU will benefit most from quantization (int8 vs. fp16) in terms of ITL improvement. (Hint: ITL depends on bandwidth, not compute. Think about which GPU has the lowest bandwidth relative to its memory capacity.) Then run the hardware comparison sweep (Section 5) at both precisions and check your prediction.
**Exercise 3: Context length scaling.**
Before running: predict whether TTFT scales linearly or quadratically with seq_len. (Hint: the simplified model in MLSYSIM computes prefill FLOPs as `2 × params × seq_len`, which is linear. But real transformers have attention layers whose cost grows as O(seq_len²). How does this affect your prediction for long contexts?) Sweep seq_len from 512 to 16384 at batch_size=1 and plot TTFT vs. seq_len. Does the result match the simplified model or the quadratic attention model?
**Self-check:** A user asks "Will my chatbot feel responsive on a single A100?" What two metrics would you check, and what thresholds would you target for a good user experience?
:::
---
## What You Learned
- **LLM serving has two regimes**: prefill (TTFT) is **compute-bound**; decoding (ITL) is
  **memory-bound**. They respond to different optimizations.
- **KV-cache memory** scales as $O(L \times S \times B \times \text{bpp})$: longer contexts
and larger batches both consume HBM, eventually causing OOM.
- **Quantization** is primarily a **decoding speedup**: halving precision halves the bytes
loaded per decode step, directly halving ITL.
- **Hardware selection**: For low-latency chat (ITL-critical), maximize HBM bandwidth.
For long-context applications (TTFT-critical), maximize TFLOP/s.
---
## Next Steps
- **[Distributed Training](distributed.qmd)**: Scale a model across hundreds of GPUs using
3D parallelism — and discover why scaling efficiency is rarely 100%
- **[Math Foundations](../math.qmd)**: The exact equations behind TTFT, ITL, and KV-cache sizing
- **[Silicon Zoo](../zoo/hardware.qmd)**: Compare full hardware specs across the entire fleet
---
title: "Sustainability Lab: Modeling Carbon Footprint"
subtitle: "Same model, same hardware — 41x difference in carbon footprint."
---
::: {.callout-note}
## Prerequisites
This tutorial can be completed independently, but completing the [Hello World tutorial](hello_world.qmd) first provides useful context on how hardware performance relates to energy consumption.
:::
This lab explores the environmental impact of machine learning at scale. You will model
the training of a large language model across different geographical regions and discover
how location, efficiency, and precision affect sustainability.
By the end of this tutorial you will understand:
- How **carbon intensity** varies dramatically across electricity grids
- How **PUE** (Power Usage Effectiveness) amplifies energy consumption
- Why choosing *where* to train matters more than *how* to train
- How to use the `SustainabilitySolver` for carbon-aware decisions
::: {.callout-tip}
## The sustainability equation
Carbon footprint = Energy × PUE × Carbon Intensity. The first factor depends on your
hardware and job duration. The second depends on your datacenter's cooling efficiency.
The third depends on your region's electricity mix. MLSYSIM lets you vary all three.
:::
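The chain in the callout can be written out as plain arithmetic. The sketch below uses illustrative assumptions throughout: 8,192 GPUs at an assumed 700 W average draw, a 30-day job, PUE 1.1, and a Quebec-like grid at 20 gCO2e/kWh (the intensity used in this tutorial's exercises). The actual `SustainabilitySolver` draws these values from the registry.

```python
# Carbon = Energy x PUE x Carbon Intensity, with illustrative numbers.
gpus, watts, hours = 8192, 700, 30 * 24
pue, ci_g_per_kwh = 1.1, 20

it_energy_kwh = gpus * watts / 1000 * hours   # IT load only
total_energy_kwh = it_energy_kwh * pue        # + cooling/facility overhead
carbon_tonnes = total_energy_kwh * ci_g_per_kwh / 1e6

print(f"IT energy:    {it_energy_kwh / 1e6:.2f} GWh")
print(f"Total energy: {total_energy_kwh / 1e6:.2f} GWh")
print(f"Carbon:       {carbon_tonnes:.1f} t CO2e")
```

Swapping in a coal-heavy intensity changes only the last factor, which is why region choice dominates the result.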
---
## 1. Setup
```{python}
#| echo: false
#| output: false
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim
spec.loader.exec_module(mlsysim)
SustainabilitySolver = mlsysim.SustainabilitySolver
```
```python
import mlsysim
from mlsysim import SustainabilitySolver
```
---
## 2. Select a Fleet
We'll use a production-scale cluster from the **Fleet Zoo** — 8,192 H100 GPUs
connected via InfiniBand NDR.
```{python}
fleet = mlsysim.Systems.Clusters.Frontier_8K
print(f"Fleet: {fleet.name}")
print(f"Total Accelerators: {fleet.total_accelerators}")
```
With the fleet defined, the remaining variables are *how long* the job runs and *where*.
The `duration_days` parameter represents total training time — in practice, this depends on
the model's compute requirements and the cluster's performance (exactly what the
[Hello World](hello_world.qmd) and [Distributed Training](distributed.qmd) tutorials
teach you to calculate). The carbon cost then depends entirely on how that electricity
is generated.
---
## 3. Compare Two Regions
The `SustainabilitySolver` factors in Power Usage Effectiveness (PUE) and regional
carbon intensity. The following comparison uses the cleanest and dirtiest grids
in the registry.
```{python}
solver = SustainabilitySolver()
# Model training for 30 days in Quebec (Hydro-powered)
res_quebec = solver.solve(
fleet=fleet,
duration_days=30,
datacenter=mlsysim.Infra.Grids.Quebec
)
# Compare with training in a coal-heavy region (Poland)
res_poland = solver.solve(
fleet=fleet,
duration_days=30,
datacenter=mlsysim.Infra.Grids.Poland
)
print(f"Region: {res_quebec['region_name']}")
print(f"Carbon Footprint: {res_quebec['carbon_footprint_kg']:.1f} kg CO2e")
print("-" * 40)
print(f"Region: {res_poland['region_name']}")
print(f"Carbon Footprint: {res_poland['carbon_footprint_kg']:.1f} kg CO2e")
```
::: {.callout-important}
## The ~41x factor
The same model, the same hardware, the same training duration — but the carbon
footprint differs by roughly **41x** depending on the electricity grid. Location
is the single largest lever for sustainable ML.
:::
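Where does the ~41x come from? With equal PUE and identical energy, the footprint ratio reduces to the ratio of grid carbon intensities. Using the illustrative values cited in this tutorial's exercises:

```python
# Footprint ratio for identical energy and equal PUE reduces to the
# ratio of carbon intensities (illustrative values from the exercises).
quebec_ci, poland_ci = 20, 820   # gCO2e/kWh
print(f"Poland / Quebec = {poland_ci / quebec_ci:.0f}x")
```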
---
## 4. All-Region Comparison
The following sweep covers all four grid regions in the Infrastructure Zoo,
comparing energy, carbon, and water usage.
```{python}
grids = [
mlsysim.Infra.Grids.Quebec,
mlsysim.Infra.Grids.Norway,
mlsysim.Infra.Grids.US_Avg,
mlsysim.Infra.Grids.Poland,
]
print(f"{'Region':<20} {'Energy (MWh)':>14} {'Carbon (t CO2e)':>16} {'Water (kL)':>12} {'PUE':>6}")
print("-" * 72)
for grid in grids:
r = solver.solve(fleet=fleet, duration_days=30, datacenter=grid)
energy_mwh = r['total_energy_kwh'].magnitude / 1000
carbon_t = r['carbon_footprint_kg'] / 1000
water_kl = r['water_usage_liters'] / 1000
    print(f"{r['region_name']:<20} {energy_mwh:>14,.1f} {carbon_t:>16,.1f} {water_kl:>12,.1f} {r['pue']:>6.2f}")
```
::: {.callout-note}
## Water matters too
Datacenters use water for evaporative cooling. The Water Usage Effectiveness (WUE)
varies by cooling technology: liquid-cooled facilities use far less water than
evaporative-cooled ones.
:::
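A rough sketch of the water side, assuming WUE is defined as liters per kWh of IT energy: the 1.8 L/kWh figure matches the custom profile shown in Exercise 3, while the 0.2 L/kWh liquid-cooled figure and the 700 W draw are illustrative assumptions.

```python
# Water usage ~= IT energy x WUE (L/kWh). WUE values are illustrative:
# 1.8 L/kWh evaporative (as in Exercise 3's custom profile) vs an
# assumed 0.2 L/kWh for a liquid-cooled facility.
it_energy_kwh = 8192 * 700 / 1000 * 30 * 24   # same fleet, 30 days

for cooling, wue in [("evaporative", 1.8), ("liquid", 0.2)]:
    water_kl = it_energy_kwh * wue / 1000
    print(f"{cooling:>12}: {water_kl:,.0f} kL")
```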
Carbon intensity varies by region, but it is not the only multiplier. The datacenter
itself adds overhead through cooling and facility power, captured by the PUE metric.
---
## 5. The PUE Multiplier
PUE determines how much energy is "wasted" on cooling and facility overhead. A modern
liquid-cooled facility might run at PUE 1.1, while a legacy air-cooled one runs at 1.6
or higher. Each grid profile in the registry carries its own PUE; the snippet below
reports the value baked into the US average profile.
```{python}
# Both in US Average grid, but different PUE
res_modern = solver.solve(fleet=fleet, duration_days=30, datacenter=mlsysim.Infra.Grids.US_Avg)
# The US_Avg grid uses PUE from its profile
print(f"US Average grid:")
print(f" PUE: {res_modern['pue']:.2f}")
print(f" Energy: {res_modern['total_energy_kwh'].magnitude/1000:,.1f} MWh")
print(f" Carbon: {res_modern['carbon_footprint_kg']/1000:,.1f} tonnes CO2e")
```
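To see the multiplier in isolation, here is a plain-arithmetic sketch comparing the same IT load under PUE 1.1 (liquid-cooled) and PUE 1.6 (air-cooled). The fleet size and 700 W average draw are illustrative assumptions:

```python
# Same IT load, two facilities: total energy = IT energy x PUE.
it_energy_kwh = 8192 * 700 / 1000 * 30 * 24   # illustrative 30-day job

modern = it_energy_kwh * 1.1   # liquid-cooled
legacy = it_energy_kwh * 1.6   # air-cooled
print(f"Legacy facility uses {legacy / modern - 1:.0%} more energy")
```

Since carbon scales linearly with total energy, the same ~45% penalty carries straight through to the footprint, regardless of grid region.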
---
## Your Turn
::: {.callout-caution}
## Exercises
**Exercise 1: Duration vs. location.**
Predict: does training for 30 days in Quebec produce more or less carbon than training for 10 days in Poland? Write your prediction, then run both configurations with the `SustainabilitySolver`. Were you right? What does this tell you about the relative importance of training duration vs. grid selection?
**Exercise 2: Why is the solver model-agnostic?**
Try running `solver.solve(fleet=fleet, duration_days=30, datacenter=mlsysim.Infra.Grids.Quebec)` for different fleet sizes. Notice that the `SustainabilitySolver` does not take a `model` parameter. Why? What assumption is the solver making about GPU utilization during training? When would this assumption break down?
**Exercise 3: PUE sensitivity.**
Sweep PUE from 1.0 to 2.0. You can create custom grid profiles: `from mlsysim.infra.types import GridProfile` and then `GridProfile(name="Custom", carbon_intensity_g_kwh=390, pue=1.3, wue=1.8, primary_source="mixed")`. At what PUE value does the facility overhead exceed the IT energy itself? (Hint: PUE = total energy / IT energy, so overhead > IT energy when PUE > 2.0.)
**Self-check:** If you train for 30 days in Quebec (20 gCO2/kWh) vs. 15 days in Poland (820 gCO2/kWh), which produces more total carbon? Show the calculation.
:::
---
## What You Learned
- **Carbon intensity is the biggest lever**: roughly a 41x difference between hydro (Quebec)
  and coal (Poland) grids for identical workloads
- **PUE amplifies everything**: A facility with PUE 1.6 uses 45% more energy than one
with PUE 1.1
- **Water usage varies by cooling technology**: Liquid cooling uses far less water
than evaporative cooling
- **The SustainabilitySolver** chains energy, PUE, and carbon intensity into a single
analytical model
---
## Next Steps
- **[LLM Serving Lab](llm_serving.qmd)** — model the two phases of LLM inference and discover the KV-cache memory wall
- **[Distributed Training](distributed.qmd)** — scale to hundreds of GPUs and analyze where efficiency is lost
- **[Infrastructure Zoo](../zoo/infra.qmd)** — browse all regional grid profiles and datacenter configurations
- **[Solver Guide](../solver-guide.qmd)** — learn how to chain the SustainabilitySolver with other solvers
- **[Math Foundations](../math.qmd)** — see the equations behind energy and carbon calculations