mirror of https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00

feat(mlsysim): add documentation site, typed registries, and 6-solver core

Complete MLSYSIM v0.1.0 implementation with:

- Documentation website (Quarto): landing page with animated hero and capability carousel, 4 tutorials (hello world, LLM serving, distributed training, sustainability), hardware/model/fleet/infra catalogs, solver guide, whitepaper, math foundations, glossary, and full quartodoc API reference
- Typed registry system: Hardware (18 devices across 5 tiers), Models (15 workloads), Systems (fleets, clusters, fabrics), Infrastructure (grid profiles, rack configs, datacenters)
- Core types: Pint-backed Quantity, Metadata provenance tracking, custom exception hierarchy (OOMError, SLAViolation)
- SimulationConfig with YAML/JSON loading and pre-validation
- Scenario system tying workloads to systems with SLA constraints
- Multi-level evaluation scorecard (feasibility, performance, macro)
- Examples, tests, and Jetson Orin NX spec fix (100 → 25 TFLOP/s)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This commit is contained in:

mlsysim/docs/tutorials/distributed.qmd (new file, 359 lines)

@@ -0,0 +1,359 @@
---
title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
---

::: {.callout-note}
## Background: Why distributed training?

Some models are too large to fit in a single GPU's memory, and some training jobs would take months on one GPU. **Distributed training** splits the work across many GPUs. This tutorial explores the three main ways to split work and the overhead each one introduces. You should complete the Hello World and LLM Serving tutorials before this one.
:::

Scaling a training job from 1 GPU to 1024 GPUs incurs overhead at every step.
Communication, pipeline stalls, and coordination each chip away at theoretical speedup.
Understanding where that efficiency goes, and how to recover it, is what separates
a well-tuned distributed training job from an expensive waste of cluster time.

By the end of this tutorial you will understand:

- How **Data Parallelism**, **Tensor Parallelism**, and **Pipeline Parallelism** decompose work across GPUs
- Why synchronization (ring all-reduce) overhead depends on model size and network bandwidth
- Why **pipeline bubbles** reduce effective GPU utilization
- How to calculate **scaling efficiency** for a real cluster

::: {.callout-tip}
## 3D Parallelism at a Glance

Modern distributed training uses three orthogonal strategies simultaneously:

| Strategy | What it splits | Main overhead |
|:---|:---|:---|
| **Data Parallelism (DP)** | Batch across GPUs | All-reduce gradients after backward pass |
| **Tensor Parallelism (TP)** | Individual matrix ops within a layer | All-gather within each forward/backward |
| **Pipeline Parallelism (PP)** | Layer groups across nodes | Pipeline bubble at start/end of batch |

The product $\text{DP} \times \text{TP} \times \text{PP} = \text{total GPUs}$.
:::
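Because every configuration must satisfy this product constraint, the search space for a given cluster can be enumerated up front. A minimal standalone sketch (not part of the `mlsysim` API), assuming tensor parallelism is confined to the 8 GPUs within a node:

```python
# Enumerate valid (DP, TP, PP) factorizations of a fixed GPU count.
# Assumption (illustrative): TP must fit within a node, so TP divides 8.
def valid_configs(total_gpus, gpus_per_node=8):
    configs = []
    for tp in (1, 2, 4, 8):
        if gpus_per_node % tp or total_gpus % tp:
            continue
        rest = total_gpus // tp  # remaining GPUs to split between DP and PP
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                configs.append((rest // pp, tp, pp))  # DP × TP × PP = total
    return configs

configs = valid_configs(256)
print(len(configs), "valid configurations for 256 GPUs")
```

Enumerating the space like this is exactly what Exercise 2 at the end of the tutorial asks you to do with the solver.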

---

## 1. Setup

```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim_mod = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim_mod
spec.loader.exec_module(mlsysim_mod)
import mlsysim
```

After `pip install mlsysim`, the imports are simple:

```python
import mlsysim
from mlsysim import DistributedSolver
```

```{python}
from mlsysim import DistributedSolver

# Llama-3.1-70B: too large for a single GPU, so training must be distributed
model = mlsysim.Models.Llama3_70B

# A research-scale cluster: 32 DGX H100 nodes × 8 GPUs = 256 H100s
# (DGX is NVIDIA's pre-built server containing 8 H100 GPUs connected via NVLink)
cluster = mlsysim.Systems.Clusters.Research_256

print(f"Model: {model.name} ({model.parameters.to('Gparam'):.0f} params)")
print(f"Cluster: {cluster.name}")
print(f"  Nodes: {cluster.count} × {cluster.node.accelerators_per_node} GPUs/node")
print(f"  Total: {cluster.total_accelerators} accelerators")
print(f"  Fabric: {cluster.fabric.name} @ {cluster.fabric.bandwidth.to('GB/s'):.0f} GB/s/link")
```

---

## 2. Visualizing 3D Parallelism

Before working through the numbers, consider how 3D parallelism decomposes a training job across a cluster. Each dimension splits work differently and introduces a different type of overhead:

**Data Parallelism (DP=4)** — each GPU holds a full model copy and processes 1/4 of the batch. After the backward pass, gradients are synchronized via All-Reduce.

```{mermaid}
%%| fig-cap: "Data Parallelism: replicate the model, split the batch, synchronize gradients."
flowchart LR
    R1["Replica 1<br/>Batch 1/4"] <-.->|"All-Reduce"| R2["Replica 2<br/>Batch 2/4"]
    R2 <-.->|"All-Reduce"| R3["Replica 3<br/>Batch 3/4"]
    R3 <-.->|"All-Reduce"| R4["Replica 4<br/>Batch 4/4"]
```

**Tensor Parallelism (TP=2)** — each layer is split across GPUs. Requires a fast interconnect (NVLink).

```{mermaid}
%%| fig-cap: "Tensor Parallelism: split each layer across GPUs, communicate via NVLink."
flowchart LR
    G1["GPU 0<br/>Left half of each layer"] <-->|"All-Gather<br/>(NVLink)"| G2["GPU 1<br/>Right half of each layer"]
```

**Pipeline Parallelism (PP=4)** — model layers are partitioned across stages. Activations flow forward; gradients flow backward.

```{mermaid}
%%| fig-cap: "Pipeline Parallelism: partition layers across stages, activations flow forward."
flowchart LR
    S1["Stage 1<br/>Layers 1–20"] --> S2["Stage 2<br/>Layers 21–40"]
    S2 --> S3["Stage 3<br/>Layers 41–60"]
    S3 --> S4["Stage 4<br/>Layers 61–80"]
```
|
||||
|
||||
The key insight: **DP** uses inter-node bandwidth (network fabric), **TP** uses intra-node bandwidth (NVLink), and **PP** introduces idle time (pipeline bubbles). The optimal configuration balances all three overheads.
|
||||
|
||||
---
|
||||
|
||||
## 3. Baseline: Pure Data Parallelism
|
||||
|
||||
Start with the simplest configuration — no model splitting, just replicate the full model
|
||||
on every GPU and split the batch. The per-GPU compute time is determined by the same
|
||||
roofline model you used in the Hello World tutorial. The new element here is **communication
|
||||
overhead**: after each training step, all GPUs must synchronize their gradients via the
|
||||
network before the next step can begin.
|
||||
|
||||
```{python}
|
||||
solver = DistributedSolver()
|
||||
|
||||
result_dp = solver.solve(
|
||||
model=model,
|
||||
fleet=cluster,
|
||||
batch_size=256,
|
||||
precision="fp16",
|
||||
tp_size=1, # no tensor parallelism
|
||||
pp_size=1, # no pipeline parallelism
|
||||
)
|
||||
|
||||
node_perf = result_dp["node_performance"]
|
||||
print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
|
||||
print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
|
||||
print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
|
||||
print(f"")
|
||||
print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
|
||||
print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
|
||||
print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
|
||||
print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
|
||||
```
|
||||
|
||||
::: {.callout-note}
|
||||
## What does scaling efficiency mean?
|
||||
|
||||
If scaling efficiency is 80%, then your 256-GPU cluster is delivering the equivalent of
|
||||
about 205 fully-utilized GPUs. The other ~51 GPUs worth of compute is being spent on
|
||||
communication overhead. This is the **communication tax** of distributed training.
|
||||
|
||||
The tax is paid in **ring all-reduce**: after the backward pass, every GPU must synchronize
|
||||
gradients with every other GPU. The time to do this grows with model size and shrinks with
|
||||
network bandwidth.
|
||||
:::
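One common way to define this quantity, and the one assumed in the sketch below, is the fraction of each step spent on useful compute rather than on communication and bubbles (the `mlsysim` solver may compute it differently):

```python
# Illustrative definition of scaling efficiency (an assumption, not the
# mlsysim internals): useful compute time as a fraction of total step time.
def scaling_efficiency(t_compute, t_comm, t_bubble=0.0):
    return t_compute / (t_compute + t_comm + t_bubble)

# At 80% efficiency, a 256-GPU cluster delivers ~205 GPUs' worth of compute.
eff = scaling_efficiency(t_compute=80.0, t_comm=20.0)  # times in ms
effective_gpus = 256 * eff
```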

---

## 4. Ring All-Reduce: The Network Tax

The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, the
standard method for gradient synchronization. Its time depends on:

$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$

Where $M$ is the message size (the full gradient: 2 bytes per parameter in fp16), $N$ is the number
of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
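The formula can be evaluated directly. A standalone sketch (illustrative, not the solver's implementation; the bandwidth values are rough illustrative effective rates, not vetted fabric specs):

```python
def ring_allreduce_time(message_bytes, replicas, bandwidth_bytes_per_s):
    """Ring all-reduce time: 2 * M * (N - 1) / (N * B_eff)."""
    return 2 * message_bytes * (replicas - 1) / (replicas * bandwidth_bytes_per_s)

# 70B parameters × 2 bytes (fp16) ≈ 140 GB of gradients per step
grad_bytes = 70e9 * 2

# Illustrative effective bandwidths: a slower and a 2× faster fabric
t_slow = ring_allreduce_time(grad_bytes, replicas=32, bandwidth_bytes_per_s=12.5e9)
t_fast = ring_allreduce_time(grad_bytes, replicas=32, bandwidth_bytes_per_s=25e9)
```

Note that for large $N$ the $(N-1)/N$ factor approaches 1, so the time depends almost entirely on $M$ and $B_{\text{eff}}$, not on the replica count.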

The following sweep shows how fabric bandwidth affects overhead:

```{python}
from mlsysim import Fleet, Systems

fabrics = [
    ("100GbE", Systems.Fabrics.Ethernet_100G),
    ("IB HDR", Systems.Fabrics.InfiniBand_HDR),
    ("IB NDR", Systems.Fabrics.InfiniBand_NDR),
]

print(f"{'Fabric':>10} {'BW (GB/s)':>10} {'Comm overhead':>14} {'Efficiency':>11}")
print("-" * 52)

for fab_name, fabric in fabrics:
    custom_cluster = Fleet(
        name="Custom",
        node=Systems.Nodes.DGX_H100,
        count=32,
        fabric=fabric,
    )
    r = solver.solve(
        model=model,
        fleet=custom_cluster,
        batch_size=256,
        precision="fp16",
    )
    print(
        f"{fab_name:>10} "
        f"{fabric.bandwidth.to('GB/s'):>10.0f~} "
        f"{r['dp_communication_latency'].to('ms'):>14.2f~} "
        f"{r['scaling_efficiency']:>11.1%}"
    )
```

::: {.callout-warning}
## Fabric choice determines scaling efficiency

Upgrading from 100GbE to InfiniBand NDR roughly doubles the effective inter-node bandwidth.
On a model the size of Llama-70B (140 GB of gradients per step in fp16), that difference
is significant. For smaller models, it matters less — compute time dominates.
:::

---

## 5. Pipeline Parallelism and the Bubble

**Pipeline Parallelism** splits the model's layers across multiple nodes. Node 1 runs layers
1–20, node 2 runs layers 21–40, and so on. This allows a much larger model to be trained than
fits on a single node.

The downside: a **pipeline bubble**. The first microbatch must flow through all stages before
the last stage can start processing the second microbatch. During that startup phase, most
GPUs are idle.

$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$

Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
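The formula is simple enough to check by hand. A standalone sketch (illustrative, independent of the solver):

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of the pipeline: (P - 1) / (P - 1 + M)."""
    return (stages - 1) / (stages - 1 + microbatches)

# With P = 8 stages: M = 1 leaves most of the pipeline idle,
# while M = 16 cuts the bubble to roughly 30%.
b1 = bubble_fraction(8, 1)
b16 = bubble_fraction(8, 16)
```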

```{python}
print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")
print("-" * 60)

for pp_size in [1, 2, 4, 8]:
    for m in [1, 4, 16]:
        # Only show interesting combinations
        if pp_size == 1 and m > 1:
            continue
        r = solver.solve(
            model=model,
            fleet=cluster,
            batch_size=256,
            precision="fp16",
            tp_size=1,
            pp_size=pp_size,
            microbatch_count=m,
        )
        bubble_pct = r["bubble_fraction"] * 100
        print(
            f"{pp_size:>10} "
            f"{m:>13} "
            f"{bubble_pct:>9.1f}% "
            f"{r['pipeline_bubble_latency'].to('ms'):>10.1f~} "
            f"{r['scaling_efficiency']:>11.1%}"
        )
```

::: {.callout-tip}
## Recovering bubble efficiency

Increasing the number of **microbatches** ($M$) reduces the bubble fraction. With $M = 16$
and $P = 8$, the bubble is only $7/(7+16) \approx 30\%$ of the pipeline, down from $88\%$ with
$M = 1$.

In practice, frameworks like Megatron-LM use **interleaved pipeline schedules** that further
reduce the bubble. But even with the standard 1F1B schedule, choosing $M \gg P$ is essential.
:::

---

## 6. Finding the Optimal Configuration

Now combine all three parallelism strategies and find the configuration that maximizes
scaling efficiency for the `Research_256` cluster. In practice, 70–80% scaling efficiency
on hundreds of GPUs is considered excellent. Below 50% typically signals a suboptimal
parallelism configuration or insufficient network bandwidth.

```{python}
configs = [
    # (description, tp, pp, m)
    ("DP only", 1, 1, 1),
    ("DP + TP=2", 2, 1, 1),
    ("DP + PP=4, M=16", 1, 4, 16),
    ("DP + TP=2 + PP=4, M=16", 2, 4, 16),
    ("DP + TP=8 + PP=4, M=16", 8, 4, 16),
]

print(f"{'Config':<26} {'DP':>4} {'TP':>4} {'PP':>4} {'Efficiency':>11} {'Throughput':>14}")
print("-" * 72)

for desc, tp, pp, m in configs:
    try:
        r = solver.solve(
            model=model,
            fleet=cluster,
            batch_size=256,
            precision="fp16",
            tp_size=tp,
            pp_size=pp,
            microbatch_count=m,
        )
        print(
            f"{desc:<26} "
            f"{r['parallelism']['dp']:>4} "
            f"{r['parallelism']['tp']:>4} "
            f"{r['parallelism']['pp']:>4} "
            f"{r['scaling_efficiency']:>11.1%} "
            f"{r['effective_throughput'].magnitude:>14.1f}"
        )
    except ValueError as e:
        print(f"{desc:<26} {'INFEASIBLE':>44} ({e})")
```

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you observe.**
For a 256-GPU cluster training Llama-3.1-70B, predict: will DP=256, TP=1, PP=1 have higher or lower scaling efficiency than DP=32, TP=4, PP=2? Write your prediction and reasoning, then run both configurations. Were you right?

**Exercise 2: Find the optimal configuration.**
Sweep all valid 3D parallelism configurations for 256 GPUs (where DP × TP × PP = 256). Which configuration maximizes scaling efficiency? Is it the same for Ethernet 100G vs. InfiniBand NDR? (Hint: valid TP values are divisors of 8, the GPUs per node: 1, 2, 4, 8. For each TP, valid PP values are divisors of 256/TP.)

**Exercise 3: The microbatch lever.**
With PP=8, sweep microbatch count M from 1 to 64. Plot the pipeline bubble fraction vs. M. At what value of M does the bubble fraction drop below 10%? (Use the formula from Section 5: bubble = (P−1)/(P−1+M). Predict the answer analytically before running the sweep.)

**Self-check:** Why must tensor parallelism (TP) stay within a single node on most clusters? What would happen to communication overhead if TP crossed node boundaries?
:::

---

## What You Learned

- **3D Parallelism** decomposes the training problem across $\text{DP} \times \text{TP} \times \text{PP}$ GPUs,
  each with distinct communication costs.
- **Ring all-reduce** is the network tax of data parallelism. It grows with model size and
  shrinks with fabric bandwidth. Switching from 100GbE to InfiniBand can recover 10–30%
  efficiency on large models.
- **Pipeline bubbles** waste GPU cycles proportional to $\frac{P-1}{P-1+M}$. Use large
  microbatch counts ($M \gg P$) to minimize waste.
- **Scaling efficiency below 100%** is normal and unavoidable. A well-tuned job at 70–80%
  efficiency on hundreds of GPUs is excellent. Below 50% signals a configuration problem.

---

## Next Steps

- **[LLM Serving Lab](llm_serving.qmd)**: After training, learn how to model the serving cost of the same model
- **[Math Foundations](../math.qmd)**: Full derivations for ring all-reduce, pipeline bubble, and MFU
- **[Fleet Zoo](../zoo/fleets.qmd)**: Browse the available cluster configurations and their network specs
mlsysim/docs/tutorials/hello_world.qmd (new file, 196 lines)

@@ -0,0 +1,196 @@

---
title: "Hello World: Single-Node Roofline"
subtitle: "Predict model performance on hardware before writing a single CUDA kernel."
---

::: {.callout-note}
## Prerequisites
Complete the [Getting Started](../getting-started.qmd) guide before this tutorial. It introduces the `Engine.solve` API and the MLSys Zoo.
:::

In this tutorial, you will model the performance of **ResNet-50** on an **NVIDIA A100** GPU
using the analytical roofline model. By the end, you will understand:

- What it means for a model to be **memory-bound** vs. **compute-bound**
- How changing **batch size** shifts the bottleneck
- Why the A100's memory bandwidth matters as much as its peak TFLOP/s

::: {.callout-note}
## Background: ResNet-50 and the A100

**ResNet-50** is a 50-layer convolutional neural network (CNN) commonly used for image classification. It has roughly 25 million parameters and requires about 8 billion floating-point operations (8 GFLOP) per inference. It is a standard benchmark workload because its size is well-characterized and widely published.

The **NVIDIA A100** is a datacenter GPU designed for ML training and inference. Its key specifications: 312 TFLOP/s peak compute (FP16 Tensor Core), 2.0 TB/s HBM2e (High Bandwidth Memory) bandwidth, and 80 GB of memory. These two numbers (compute speed and memory speed) are what the roofline model uses to predict performance.

See the [Glossary](../glossary.qmd) for definitions of terms like FLOP/s, HBM, and Tensor Core.
:::

::: {.callout-tip}
## What is the roofline model?
Every GPU has two speed limits: how fast it can compute (FLOP/s) and how fast it can load
data from memory (bytes/s). Your model's actual throughput is determined by whichever limit
you hit first. The roofline model tells you exactly which one, and by how much.
:::
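The comparison the roofline makes can be written in a few lines. A standalone sketch (illustrative; `Engine.solve` encapsulates this logic), using the A100's two speed limits:

```python
# Minimal roofline sketch (illustrative, not the mlsysim implementation).
def roofline(flops, bytes_moved, peak_flops, mem_bw):
    t_compute = flops / peak_flops    # time if compute were the only limit
    t_memory = bytes_moved / mem_bw   # time if memory were the only limit
    bottleneck = "Compute Bound" if t_compute >= t_memory else "Memory Bound"
    return max(t_compute, t_memory), bottleneck

# A hypothetical workload with arithmetic intensity 50 FLOP/byte on hardware
# whose ridge point is 156 FLOP/byte (312 TFLOP/s ÷ 2 TB/s) is memory-bound:
t, b = roofline(flops=50e9, bytes_moved=1e9, peak_flops=312e12, mem_bw=2e12)
```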

---

## 1. Setup

```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim
spec.loader.exec_module(mlsysim)
Engine = mlsysim.Engine
```

After `pip install mlsysim`, the import is simple:

```python
import mlsysim
from mlsysim import Engine
```

---

## 2. Select Workload and Hardware

Pull vetted specifications directly from the **MLSys Zoo** — no need to look up datasheets.

```{python}
# Load ResNet-50 from the Model Zoo
model = mlsysim.Models.ResNet50

# Load NVIDIA A100 from the Silicon Zoo
hardware = mlsysim.Hardware.Cloud.A100

print(f"Model: {model.name} ({model.architecture})")
print(f"Hardware: {hardware.name} ({hardware.release_year})")
print()
print(f"Model FLOPs (inference): {model.inference_flops}")
print(f"Hardware Peak TFLOP/s: {hardware.compute.peak_flops.to('TFLOPs/s'):.0f}")
print(f"Hardware Memory BW: {hardware.memory.bandwidth.to('TB/s'):.1f}")
```

---

## 3. Solve the Performance Profile

The `Engine.solve` method applies the **Iron Law of ML Systems** — it calculates which of the
two hardware speed limits (compute or memory) you hit first, and returns your latency from there.

```{python}
profile = Engine.solve(
    model=model,
    hardware=hardware,
    batch_size=1,
    precision="fp16",
)

print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency.to('ms'):.3f} ms per inference")
print(f"Throughput: {profile.throughput:.0f} images/sec")
```

::: {.callout-note}
## Why "Memory Bound"?
At batch size 1, ResNet-50 performs ~8 GFLOPs of computation but loads ~50 MB of weights (25.6M parameters at fp16).
Its **arithmetic intensity** (FLOPs/byte) is below the A100's roofline ridge point.
The A100's memory bandwidth (2 TB/s) becomes the bottleneck, not its 312 TFLOP/s compute.
:::

---

## 4. Sweep Batch Sizes

The bottleneck changes as batch size grows. Run the sweep and see when compute takes over:

```{python}
print(f"{'Batch':>6} {'Bottleneck':<16} {'Throughput':>12} {'Latency':>10}")
print("-" * 52)

for batch in [1, 4, 16, 32, 64, 128, 256]:
    p = Engine.solve(
        model=model,
        hardware=hardware,
        batch_size=batch,
        precision="fp16",
    )
    print(
        f"{batch:>6} {p.bottleneck:<16} "
        f"{p.throughput:>10.0f}/s "
        f"{p.latency.to('ms'):>8.2f} ms"
    )
```

::: {.callout-tip}
## The crossover point
Watch where the output switches from `Memory Bound` to `Compute Bound`. That is the **ridge
point** of the roofline — the batch size at which you've saturated both resources equally.
Beyond that point, adding more compute (or a bigger GPU) pays off. Below it, more memory
bandwidth is what matters.
:::

---

## 5. Visualizing the Roofline

MLSYSIM includes built-in visualization tools. The roofline chart plots the hardware's two
ceilings and shows where your workloads sit relative to them:

```python
import matplotlib.pyplot as plt

fig, ax = mlsysim.plot_roofline(hardware, workloads=[model])
ax.set_title(f"Roofline: {model.name} on {hardware.name}")
plt.show()
```

![Roofline chart for ResNet-50 on the A100](images/roofline_hello_world.png){#fig-roofline-hello}

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Before running the code, predict: Will ResNet-50 at batch_size=64 be memory-bound or compute-bound on the A100? Write down your prediction, then verify with `Engine.solve(...)`. Were you right? Why or why not?

**Exercise 2: Hardware comparison.**
Before running: which GPU do you predict will have the highest ridge point -- the V100, A100, or H100? (Hint: compare their compute-to-bandwidth ratios.) Then run the same ResNet-50 analysis on `mlsysim.Hardware.Cloud.H100` and `mlsysim.Hardware.Cloud.V100`. Which gives the lowest latency at batch_size=1? At batch_size=256? What explains the difference?

**Exercise 3: Precision effect.**
Before running: will switching from `precision="fp16"` to `precision="int8"` change the bottleneck classification for ResNet-50 on the A100 at batch_size=1? Write your prediction and reasoning, then compare both. How does quantization change the arithmetic intensity?

**Self-check:** If a model's arithmetic intensity is 50 FLOP/byte and the hardware's ridge point is 156 FLOP/byte, is the model compute-bound or memory-bound?
:::

---

## What You Learned

- **Roofline model**: Performance is bounded by $\max\left(\frac{\text{FLOPs}}{\text{Peak}},\ \frac{\text{Bytes}}{\text{BW}}\right)$ (whichever takes longer, computing or loading data, determines your runtime)
- **Batch size matters**: Small batches are memory-bound; large batches become compute-bound
- **The ridge point**: The crossover batch size where memory and compute are equally saturated
- **Practical implication**: If you are memory-bound, reducing data movement (quantization, larger batches) helps more than a faster GPU

---

## Next Steps

- **[Sustainability Lab](sustainability.qmd)**: Calculate the carbon footprint of training across different grid regions
- **[LLM Serving Lab](llm_serving.qmd)**: Model the two phases of LLM inference and discover the KV-cache memory wall
- **[Math Foundations](../math.qmd)**: The complete set of equations used by all solvers
- **[Silicon Zoo](../zoo/hardware.qmd)**: Browse all vetted hardware specs and compare alternatives
mlsysim/docs/tutorials/images/roofline_hello_world.png (new binary file, 51 KiB; not shown)

mlsysim/docs/tutorials/index.qmd (new file, 75 lines)

@@ -0,0 +1,75 @@

---
title: "Tutorials"
subtitle: "Step-by-step guides for modeling ML Systems."
format:
  html:
    toc: false
---

These tutorials are designed to build intuition for ML systems using the `mlsysim` framework.
They map directly to chapters in the *Machine Learning Systems* textbook — start at the beginning
or jump to any topic.

::: {.tutorial-grid}

::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### Hello World: Single-Node Roofline

Learn to lower a model onto hardware and identify the performance bottleneck.
Understand memory-bound vs. compute-bound in 5 minutes.

[Start Tutorial →](hello_world.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### Sustainability Lab: Carbon Footprint

Calculate the energy and CO₂ cost of training a frontier LLM across different
geographical grid regions. Quebec vs. Poland — the numbers will surprise you.

[Start Tutorial →](sustainability.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### LLM Serving: TTFT, ITL & the Memory Wall

Model the two physical regimes of autoregressive generation: the compute-bound
pre-fill phase and the memory-bound decoding phase. Discover how quantization
and hardware choice affect each phase differently.

[Start Tutorial →](llm_serving.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### Distributed Training: 3D Parallelism

Explore Data, Tensor, and Pipeline parallelism overhead. Model the ring all-reduce
communication cost and pipeline bubble fraction on a 256-GPU H100 cluster.

[Start Tutorial →](distributed.qmd){.tutorial-arrow}
:::

:::

---

## Learning Path

If you're new to ML systems modeling, we recommend this sequence:

1. **[Hello World](hello_world.qmd)** — Understand the roofline model and what determines inference speed.
2. **[Sustainability Lab](sustainability.qmd)** — Apply the framework to a real-world carbon analysis.
3. **[LLM Serving Lab](llm_serving.qmd)** — Model TTFT, ITL, and KV-cache pressure for production LLM serving.
4. **[Distributed Training](distributed.qmd)** — Scale to hundreds of GPUs and analyze where efficiency is lost.
5. **[Hardware Zoo](../zoo/hardware.qmd)** — Explore the vetted hardware specifications across deployment tiers.
6. *(Optional)* **[Math Foundations](../math.qmd)** — The first-principles equations behind every solver.

> **Tip:** All tutorials are Jupyter/Quarto compatible. Run them locally after `pip install mlsysim`.
mlsysim/docs/tutorials/llm_serving.qmd (new file, 338 lines)

@@ -0,0 +1,338 @@

---
title: "LLM Serving Lab: TTFT, ITL, and the Memory Wall"
subtitle: "Model the two physical regimes of LLM inference before deploying a single server."
---

::: {.callout-note}
## Background: What is an LLM and why is serving different?

A **Large Language Model (LLM)** like Llama-3 generates text one token (roughly one word) at a time. Unlike image models that process a fixed input in one pass, LLMs run the model *repeatedly*, once for each output token. This creates two distinct phases with different performance characteristics, which is why LLM serving requires its own dedicated solver. You should complete the [Hello World tutorial](hello_world.qmd) before this one.
:::

Running a large language model in production is not like running ResNet. An LLM inference
request goes through **two completely different physical regimes**, each bottlenecked by a
different hardware resource. Understanding this is the difference between guessing at your
deployment budget and calculating it precisely.

By the end of this tutorial you will understand:

- Why **TTFT** (Time to First Token) and **ITL** (Inter-Token Latency) have different bottlenecks
- How **KV-cache** memory pressure limits batch concurrency
- Why **quantization** helps decoding more than prefill
- How to pick the right GPU for your serving latency targets

::: {.callout-tip}
## The Two Phases of LLM Inference

Recall from the [Hello World tutorial](hello_world.qmd) that every workload is either memory-bound
or compute-bound. LLM serving is unusual because *both regimes* occur in the same request:

**Pre-fill (TTFT):** All prompt tokens processed in a single forward pass. The model sees the
full context at once — this is compute-intensive and saturates GPU arithmetic units. Optimizing
TTFT means getting more TFLOP/s.

**Decoding (ITL):** One token generated at a time. Each step must reload the *entire model*
from HBM (High Bandwidth Memory) to produce just one output token. This is overwhelmingly **memory-bound**.
Optimizing ITL means getting more GB/s.

The same GPU has two different speed limits for the same model.
:::
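These two regimes yield simple back-of-envelope lower bounds. A standalone sketch (illustrative, not the `ServingSolver` implementation; it assumes roughly 2 FLOPs per parameter per token and that decoding reloads all weights every step), using the Llama-3.1-8B and H100 figures from this tutorial:

```python
# Back-of-envelope latency bounds for the two phases (illustrative sketch).
def ttft_lower_bound(prompt_tokens, params, peak_flops):
    # Compute-bound prefill: ~2 FLOPs per parameter per prompt token
    return prompt_tokens * 2 * params / peak_flops

def itl_lower_bound(params, bytes_per_param, mem_bw):
    # Memory-bound decode: reload all weights from HBM per generated token
    return params * bytes_per_param / mem_bw

# Llama-3.1-8B on an H100 (989 TFLOP/s fp16, 3.35 TB/s HBM3)
ttft = ttft_lower_bound(prompt_tokens=2048, params=8e9, peak_flops=989e12)
itl = itl_lower_bound(params=8e9, bytes_per_param=2, mem_bw=3.35e12)
```

These bounds ignore KV-cache traffic and batching effects, which is exactly what the solver adds on top.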
|
||||
|
||||
---

## 1. Setup

```{python}
#| echo: false
#| output: false
# Build-system path setup — hidden from students
import sys, os, importlib.util

current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")

package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim_mod = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim_mod
spec.loader.exec_module(mlsysim_mod)
import mlsysim
```

```python
import mlsysim
from mlsysim import ServingSolver
```

Unlike the general-purpose `Engine.solve` from the Hello World tutorial, `ServingSolver`
separates inference into two phases — pre-fill and decoding — each with its own bottleneck.

Select our workload and hardware from the **MLSys Zoo**:

```{python}
from mlsysim import ServingSolver

# Llama-3.1-8B: 8B parameters, 32 layers, 4096 hidden_dim
# 8 GQA (Grouped Query Attention) heads — fewer KV heads than query heads, saving memory
model = mlsysim.Models.Llama3_8B

# NVIDIA H100 SXM5: 80 GB HBM3, 3.35 TB/s, 989 TFLOP/s (fp16)
hardware = mlsysim.Hardware.Cloud.H100

print(f"Model: {model.name}")
print(f"Parameters: {model.parameters.to('Gparam'):.1f}")
print(f"Layers: {model.layers}, Hidden: {model.hidden_dim}")
print()
print(f"Hardware: {hardware.name}")
print(f"Memory: {hardware.memory.capacity.to('GB'):.0f} GB @ "
      f"{hardware.memory.bandwidth.to('TB/s'):.2f} TB/s")
print(f"Compute: {hardware.compute.peak_flops.to('TFLOPs/s'):.0f} TFLOP/s (fp16)")
```

---

## 2. First Serving Prediction

The `ServingSolver` takes a **sequence length** — the total context window that must be
processed during pre-fill and cached during decoding.

```{python}
solver = ServingSolver()

result = solver.solve(
    model=model,
    hardware=hardware,
    seq_len=2048,      # tokens in context (prompt + history)
    batch_size=1,      # concurrent users
    precision="fp16"
)

print(f"Feasible: {result['feasible']}")
print()
print("── Latency ──────────────────────────────")
print(f"TTFT (prefill):    {result['ttft'].to('ms'):~.1f}")
print(f"ITL (per token):   {result['itl'].to('ms'):~.2f}")
print()
print("── Memory ───────────────────────────────")
print(f"Model weights:     {result['model_weights_size']:~.2f}")
print(f"KV-cache (2K ctx): {result['kv_cache_size']:~.3f}")
print(f"Total required:    {result['total_memory_required']:~.2f}")
print(f"Memory util:       {result['memory_utilization']:.1%}")
```

::: {.callout-note}
## Reading the output

- **TTFT** is tens of milliseconds — bounded by the GPU's 989 TFLOP/s compute ceiling.
- **ITL** is a few milliseconds — bounded by the 3.35 TB/s HBM bandwidth. At each decode
  step, ~16 GB of fp16 weights must transit from HBM to the compute units
  (16 GB ÷ 3.35 TB/s ≈ 4.8 ms), yet only one token of computation happens. The bandwidth
  is the wall, not the FLOPs.
- **Memory util** tells you how much of the 80 GB HBM is occupied. The remainder is
  available for more concurrent users (larger `batch_size`).
- **Typical SLA targets**: For interactive chat applications, aim for TTFT < 200 ms and
  ITL < 50 ms/token. The numbers above are well within these targets for a single user.
:::

---

## 3. The KV-Cache Memory Wall

The KV-cache stores the Key and Value matrices from every attention layer for every token
in the active context. Its size grows as:

$$\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}$$

where $L$ = layers, $H_{kv}$ = KV heads, $d_{head}$ = head dimension, $S$ = sequence length,
$B$ = batch size, and $\text{bpp}$ = bytes per parameter.

This means doubling `batch_size` doubles the KV-cache. At some point, you hit the
**memory wall** — the combined model + KV-cache exceeds the accelerator's HBM capacity.

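The formula can be checked by hand. The sketch below is plain Python, independent of the solver; the layer count, KV-head count, and fp16 precision are the Llama-3.1-8B values quoted earlier, while `d_head = 128` is an assumption (hidden_dim 4096 divided by 32 query heads).

```python
# Analytical KV-cache size for Llama-3.1-8B. d_head = 128 is assumed,
# consistent with hidden_dim 4096 / 32 query heads.
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bpp):
    # Factor of 2: one Key and one Value tensor per layer
    return 2 * layers * kv_heads * d_head * seq_len * batch * bpp

size = kv_cache_bytes(layers=32, kv_heads=8, d_head=128,
                      seq_len=2048, batch=1, bpp=2)  # fp16
print(f"{size / 1e9:.3f} GB per user at a 2K context")
```

Note how small this is per user relative to the 16 GB of weights: the cache only becomes the dominant term at large batch sizes or very long contexts, which is exactly what the sweeps below demonstrate.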
```{python}
print(f"{'Batch':>6} {'Ctx':>6} {'KV-Cache':>10} {'Total':>8} {'Util':>6} {'Feasible':>8}")
print("-" * 56)

for batch in [1, 4, 8, 16, 32, 64]:
    r = solver.solve(
        model=model,
        hardware=hardware,
        seq_len=2048,
        batch_size=batch,
        precision="fp16"
    )
    print(
        f"{batch:>6} "
        f"{'2048':>6} "
        f"{r['kv_cache_size']:>10.3f~} "
        f"{r['total_memory_required']:>8.2f~} "
        f"{r['memory_utilization']:>6.1%} "
        f"{'✓' if r['feasible'] else '✗ OOM':>8}"
    )
```

::: {.callout-warning}
## Finding the memory wall

Watch for `✗ OOM` — this is where `total_memory_required` exceeds the 80 GB HBM capacity.
That batch size is infeasible on a single H100. You would need to either reduce the
context window, switch to a lower-precision format, or add more GPUs.
:::

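An analytical estimate of the wall is possible before sweeping. A minimal sketch, assuming the KV-cache formula above with `d_head = 128` and ignoring activation memory and runtime overheads (so the true limit is somewhat lower than this):

```python
# Analytical ceiling on batch size before OOM at a 2K context, fp16.
hbm_gb = 80.0
weights_gb = 8e9 * 2 / 1e9                            # Llama-3.1-8B at fp16
kv_per_user_gb = (2 * 32 * 8 * 128 * 2048 * 2) / 1e9  # KV-cache per user

max_batch = int((hbm_gb - weights_gb) // kv_per_user_gb)
print(f"KV-cache per user: {kv_per_user_gb:.3f} GB")
print(f"Analytical max batch at 2K context: {max_batch}")
```

Comparing this estimate against the solver's feasibility verdict tells you how much headroom the solver reserves beyond the raw arithmetic.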
```{python}
# Also sweep context length at fixed batch size
print(f"\n{'Ctx':>6} {'KV-Cache':>10} {'Total':>8} {'Util':>6} {'Feasible':>8}")
print("-" * 48)

for ctx in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
    r = solver.solve(
        model=model,
        hardware=hardware,
        seq_len=ctx,
        batch_size=8,
        precision="fp16"
    )
    print(
        f"{ctx:>6} "
        f"{r['kv_cache_size']:>10.3f~} "
        f"{r['total_memory_required']:>8.2f~} "
        f"{r['memory_utilization']:>6.1%} "
        f"{'✓' if r['feasible'] else '✗ OOM':>8}"
    )
```

---

## 4. Quantization: Precision as a Latency Knob

Reducing numerical precision does two things simultaneously:

1. **Shrinks model weights** → fewer bytes to load per decode step → lower ITL
2. **Shrinks KV-cache** → more headroom for larger batches or longer contexts

But precision affects the **two phases differently**: TTFT (compute-bound) improves only
on hardware with native low-precision tensor cores for the target format (such as fp8 or
int8). ITL (memory-bound) improves with every step down in precision.

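The decode-side effect can be sketched directly from the bandwidth bound. Plain Python, independent of the solver; it uses the H100 bandwidth quoted earlier and treats ITL as purely weight bytes over bandwidth, which ignores the KV-cache reads a real decode step also performs:

```python
# Decode-time ITL lower bound = model bytes / HBM bandwidth.
# Prefill is compute-bound, so it does not shrink with byte count alone.
bandwidth = 3.35e12  # H100 HBM3, bytes/s
params = 8e9         # Llama-3.1-8B

itl_ms = {}
for prec, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    itl_ms[prec] = params * bpp / bandwidth * 1e3
    print(f"{prec}: {itl_ms[prec]:.2f} ms/token")
```

Each halving of precision halves the bytes moved per step, so the estimated ITL halves with it. The solver's numbers below follow the same pattern.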
```{python}
print(f"{'Precision':>10} {'TTFT':>8} {'ITL':>10} {'Weights':>8} {'KV-Cache':>10} {'Util':>7}")
print("-" * 64)

for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(
        model=model,
        hardware=hardware,
        seq_len=8192,
        batch_size=8,
        precision=prec
    )
    print(
        f"{prec:>10} "
        f"{r['ttft'].to('ms'):>8.1f~} "
        f"{r['itl'].to('ms'):>10.3f~} "
        f"{r['model_weights_size']:>8.2f~} "
        f"{r['kv_cache_size']:>10.3f~} "
        f"{r['memory_utilization']:>7.1%}"
    )
```

::: {.callout-tip}
## Why ITL improves more than TTFT

Going from `fp16` → `int8` halves the model size. At **decode time**, each step must load
the full model from HBM — half the bytes means half the time. ITL drops by ~50%.

At **prefill time**, the computation is the bottleneck (not bandwidth), so halving byte
count helps less — you're not memory-bound in the first place. The improvement is
smaller and depends on whether your hardware has native `int8` tensor core support.

**Rule of thumb**: Quantization is a decoding optimization first, a prefill optimization second.
:::

---

## 5. Hardware Comparison

Different GPUs have different ratios of compute to memory bandwidth. For LLM serving:

- **Higher TFLOP/s** → faster TTFT (prefill is compute-bound)
- **Higher HBM bandwidth** → faster ITL (decoding is memory-bound)

```{python}
gpus = [
    ("A100 (80GB)", mlsysim.Hardware.Cloud.A100),
    ("H100 SXM5", mlsysim.Hardware.Cloud.H100),
    ("H200", mlsysim.Hardware.Cloud.H200),
    ("MI300X", mlsysim.Hardware.Cloud.MI300X),
]

print(f"{'GPU':>14} {'BW (TB/s)':>10} {'TTFT':>8} {'ITL':>10} {'Max Util':>9}")
print("-" * 60)

for name, hw in gpus:
    r = solver.solve(
        model=model,
        hardware=hw,
        seq_len=4096,
        batch_size=4,
        precision="fp16"
    )
    print(
        f"{name:>14} "
        f"{hw.memory.bandwidth.to('TB/s'):>10.2f~} "
        f"{r['ttft'].to('ms'):>8.1f~} "
        f"{r['itl'].to('ms'):>10.3f~} "
        f"{r['memory_utilization']:>9.1%}"
    )
```

::: {.callout-note}
## Why H200 wins on ITL

The H200 uses HBM3e with **4.8 TB/s** bandwidth vs the H100's 3.35 TB/s — a 43% increase.
Because decoding is a pure memory-bound operation, ITL scales as 1/bandwidth, so this
translates to roughly 30% lower ITL (1 − 3.35/4.8 ≈ 0.30).

The MI300X is even more interesting: its massive 192 GB HBM pool lets you pack far more
concurrent users (larger `batch_size`) before hitting the memory wall.
:::

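The H100-to-H200 comparison reduces to a one-line ratio, since a memory-bound decode step scales as 1/bandwidth:

```python
# ITL scales as 1 / bandwidth for a memory-bound decode step.
h100_bw, h200_bw = 3.35, 4.8  # TB/s
reduction = 1 - h100_bw / h200_bw
print(f"ITL reduction H100 -> H200: {reduction:.0%}")
```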
---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict the memory wall.**
Before running the code, estimate: at what batch size will Llama-3.1-8B hit OOM on an 80 GB H100 with `seq_len=4096` at fp16? Write your estimate, then sweep batch sizes to find the actual limit. How close were you?

**Exercise 2: The quantization trade-off.**
Before running: predict which GPU will benefit most from quantization (int8 vs. fp16) in terms of ITL improvement. (Hint: ITL depends on bandwidth, not compute. Think about which GPU has the lowest bandwidth relative to its memory capacity.) Then run the hardware comparison sweep (Section 5) at both precisions and check your prediction.

**Exercise 3: Context length scaling.**
Before running: predict whether TTFT scales linearly or quadratically with `seq_len`. (Hint: the simplified model in MLSYSIM computes prefill FLOPs as `2 × params × seq_len`, which is linear. But real transformers have attention layers whose cost grows as O(seq_len²). How does this affect your prediction for long contexts?) Sweep `seq_len` from 512 to 16384 at `batch_size=1` and plot TTFT vs. `seq_len`. Does the result match the simplified model or the quadratic attention model?

**Self-check:** A user asks "Will my chatbot feel responsive on a single A100?" What two metrics would you check, and what thresholds would you target for a good user experience?
:::

---

## What You Learned

- **LLM serving has two regimes**: Pre-fill (TTFT) is **compute-bound**; decoding (ITL) is
  **memory-bound**. They respond to different optimizations.
- **KV-cache memory** scales as $O(L \times S \times B \times \text{bpp})$: longer contexts
  and larger batches both consume HBM, eventually causing OOM.
- **Quantization** is primarily a **decoding speedup**: halving precision halves the bytes
  loaded per decode step, directly halving ITL.
- **Hardware selection**: For low-latency chat (ITL-critical), maximize HBM bandwidth.
  For long-context applications (TTFT-critical), maximize TFLOP/s.

---

## Next Steps

- **[Distributed Training](distributed.qmd)**: Scale a model across hundreds of GPUs using
  3D parallelism — and discover why scaling efficiency is rarely 100%
- **[Math Foundations](../math.qmd)**: The exact equations behind TTFT, ITL, and KV-cache sizing
- **[Silicon Zoo](../zoo/hardware.qmd)**: Compare full hardware specs across the entire fleet

209
mlsysim/docs/tutorials/sustainability.qmd
Normal file
@@ -0,0 +1,209 @@
---
title: "Sustainability Lab: Modeling Carbon Footprint"
subtitle: "Same model, same hardware — 41x difference in carbon footprint."
---

::: {.callout-note}
## Prerequisites
This tutorial can be completed independently, but completing the [Hello World tutorial](hello_world.qmd) first provides useful context on how hardware performance relates to energy consumption.
:::

This lab explores the environmental impact of machine learning at scale. You will model
the training of a large language model across different geographical regions and discover
how location, efficiency, and precision affect sustainability.

By the end of this tutorial you will understand:

- How **carbon intensity** varies dramatically across electricity grids
- How **PUE** (Power Usage Effectiveness) amplifies energy consumption
- Why choosing *where* to train matters more than *how* to train
- How to use the `SustainabilitySolver` for carbon-aware decisions

::: {.callout-tip}
## The sustainability equation
Carbon footprint = Energy × PUE × Carbon Intensity. The first factor depends on your
hardware and job duration. The second depends on your datacenter's cooling efficiency.
The third depends on your region's electricity mix. MLSYSIM lets you vary all three.
:::

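The equation can be exercised by hand before involving the solver. A minimal sketch with hypothetical placeholder numbers (the 0.39 kg CO2e/kWh intensity roughly matches a mixed grid, but none of these values come from the registry):

```python
# Carbon footprint = IT energy x PUE x carbon intensity.
# All inputs below are hypothetical placeholders for illustration.
it_energy_kwh = 1_000_000   # energy drawn by the accelerators themselves
pue = 1.2                   # facility overhead multiplier
carbon_intensity = 0.39     # kg CO2e per kWh (assumed mixed grid)

total_energy_kwh = it_energy_kwh * pue
carbon_kg = total_energy_kwh * carbon_intensity
print(f"Total energy: {total_energy_kwh / 1000:,.0f} MWh")
print(f"Carbon: {carbon_kg / 1000:,.0f} tonnes CO2e")
```

Each of the three factors is a lever the rest of this tutorial pulls in turn.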
---

## 1. Setup

```{python}
#| echo: false
#| output: false
import sys, os, importlib.util

current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")

package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim
spec.loader.exec_module(mlsysim)
SustainabilitySolver = mlsysim.SustainabilitySolver
```

```python
import mlsysim
from mlsysim import SustainabilitySolver
```

---

## 2. Select a Fleet

We'll use a production-scale cluster from the **Fleet Zoo** — 8,192 H100 GPUs
connected via InfiniBand NDR.

```{python}
fleet = mlsysim.Systems.Clusters.Frontier_8K
print(f"Fleet: {fleet.name}")
print(f"Total Accelerators: {fleet.total_accelerators}")
```

With the fleet defined, the remaining variables are *how long* the job runs and *where*.
The `duration_days` parameter represents total training time — in practice, this depends on
the model's compute requirements and the cluster's performance (exactly what the
[Hello World](hello_world.qmd) and [Distributed Training](distributed.qmd) tutorials
teach you to calculate). The carbon cost then depends entirely on how that electricity
is generated.

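As a sketch of where `duration_days` comes from, the estimate below uses the common ~6 FLOPs-per-parameter-per-training-token heuristic. The model size, token count, and MFU are illustrative assumptions, not registry values, and this is not necessarily how MLSYSIM's own solvers compute it:

```python
# Rough training-duration estimate: total FLOPs / sustained cluster FLOP/s.
# All inputs are illustrative assumptions.
params = 70e9            # hypothetical model size
tokens = 2e12            # hypothetical training-token count
n_gpus = 8192
peak_flops = 989e12      # per GPU, fp16
mfu = 0.4                # assumed model FLOPs utilization

total_flops = 6 * params * tokens
seconds = total_flops / (n_gpus * peak_flops * mfu)
days = seconds / 86400
print(f"duration_days ≈ {days:.1f}")
```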
---

## 3. Compare Two Regions

The `SustainabilitySolver` factors in Power Usage Effectiveness (PUE) and regional
carbon intensity. The following comparison uses the cleanest and dirtiest grids
in the registry.

```{python}
solver = SustainabilitySolver()

# Model training for 30 days in Quebec (hydro-powered)
res_quebec = solver.solve(
    fleet=fleet,
    duration_days=30,
    datacenter=mlsysim.Infra.Grids.Quebec
)

# Compare with training in a coal-heavy region (Poland)
res_poland = solver.solve(
    fleet=fleet,
    duration_days=30,
    datacenter=mlsysim.Infra.Grids.Poland
)

print(f"Region: {res_quebec['region_name']}")
print(f"Carbon Footprint: {res_quebec['carbon_footprint_kg']:.1f} kg CO2e")
print("-" * 40)
print(f"Region: {res_poland['region_name']}")
print(f"Carbon Footprint: {res_poland['carbon_footprint_kg']:.1f} kg CO2e")
```

::: {.callout-important}
## The ~41x factor
The same model, the same hardware, the same training duration — but the carbon
footprint differs by roughly **41x** depending on the electricity grid. Location
is the single largest lever for sustainable ML.
:::

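The headline factor is just the ratio of the two grids' carbon intensities, using the values quoted in the exercises at the end of this tutorial (Poland ~820, Quebec ~20 gCO2e/kWh):

```python
# Carbon intensity ratio between the dirtiest and cleanest grids shown here.
poland_gco2_kwh, quebec_gco2_kwh = 820, 20
print(f"Footprint ratio: {poland_gco2_kwh / quebec_gco2_kwh:.0f}x")
```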
---

## 4. All-Region Comparison

The following sweep covers all four grid regions in the Infrastructure Zoo,
comparing energy, carbon, and water usage.

```{python}
grids = [
    mlsysim.Infra.Grids.Quebec,
    mlsysim.Infra.Grids.Norway,
    mlsysim.Infra.Grids.US_Avg,
    mlsysim.Infra.Grids.Poland,
]

print(f"{'Region':<20} {'Energy (MWh)':>14} {'Carbon (t CO2e)':>16} {'Water (kL)':>12} {'PUE':>6}")
print("-" * 72)

for grid in grids:
    r = solver.solve(fleet=fleet, duration_days=30, datacenter=grid)
    energy_mwh = r['total_energy_kwh'].magnitude / 1000
    carbon_t = r['carbon_footprint_kg'] / 1000
    water_kl = r['water_usage_liters'] / 1000
    print(f"{r['region_name']:<20} {energy_mwh:>12,.1f} {carbon_t:>14,.1f} {water_kl:>10,.1f} {r['pue']:>5.2f}")
```

::: {.callout-note}
## Water matters too
Datacenters use water for evaporative cooling. The Water Usage Effectiveness (WUE)
varies by cooling technology: liquid-cooled facilities use far less water than
evaporative-cooled ones.
:::

Carbon intensity varies by region, but it is not the only multiplier. The datacenter
itself adds overhead through cooling and facility power, captured by the PUE metric.

---

## 5. The PUE Multiplier

PUE determines how much energy is "wasted" on cooling and facility overhead.
Compare a modern liquid-cooled facility (PUE 1.1) against a legacy air-cooled
one (PUE 1.6), both in the same grid region.

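Before running the solver, the multiplier itself is worth seeing in isolation. A minimal sketch with a hypothetical IT load:

```python
# PUE multiplies IT energy into total facility energy.
it_energy_mwh = 1000.0  # hypothetical IT load
totals = {pue: it_energy_mwh * pue for pue in (1.1, 1.6)}
for pue, total in totals.items():
    print(f"PUE {pue}: {total:,.0f} MWh total")
print(f"Legacy vs modern: {totals[1.6] / totals[1.1] - 1:.0%} more energy")
```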
```{python}
from mlsysim.infra.types import GridProfile

# Same US-average carbon intensity, two facility designs. The GridProfile
# constructor mirrors the custom-profile example in Exercise 3 below.
for label, pue in [("Modern liquid-cooled", 1.1), ("Legacy air-cooled", 1.6)]:
    grid = GridProfile(name=label, carbon_intensity_g_kwh=390,
                       pue=pue, wue=1.8, primary_source="mixed")
    r = solver.solve(fleet=fleet, duration_days=30, datacenter=grid)
    print(f"{label} (PUE {r['pue']:.2f}):")
    print(f"  Energy: {r['total_energy_kwh'].magnitude/1000:,.1f} MWh")
    print(f"  Carbon: {r['carbon_footprint_kg']/1000:,.1f} tonnes CO2e")
```

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Duration vs. location.**
Predict: does training for 30 days in Quebec produce more or less carbon than training for 10 days in Poland? Write your prediction, then run both configurations with the `SustainabilitySolver`. Were you right? What does this tell you about the relative importance of training duration vs. grid selection?

**Exercise 2: Why is the solver model-agnostic?**
Try running `solver.solve(fleet=fleet, duration_days=30, datacenter=mlsysim.Infra.Grids.Quebec)` for different fleet sizes. Notice that the `SustainabilitySolver` does not take a `model` parameter. Why? What assumption is the solver making about GPU utilization during training? When would this assumption break down?

**Exercise 3: PUE sensitivity.**
Sweep PUE from 1.0 to 2.0. You can create custom grid profiles: `from mlsysim.infra.types import GridProfile` and then `GridProfile(name="Custom", carbon_intensity_g_kwh=390, pue=1.3, wue=1.8, primary_source="mixed")`. At what PUE value does the facility overhead exceed the IT energy itself? (Hint: PUE = total energy / IT energy, so overhead > IT energy when PUE > 2.0.)

**Self-check:** If you train for 30 days in Quebec (20 gCO2/kWh) vs. 15 days in Poland (820 gCO2/kWh), which produces more total carbon? Show the calculation.
:::

---

## What You Learned

- **Carbon intensity is the biggest lever**: a roughly 41x difference between hydro (Quebec)
  and coal (Poland) grids for identical workloads
- **PUE amplifies everything**: a facility with PUE 1.6 uses ~45% more energy than one
  with PUE 1.1
- **Water usage varies by cooling technology**: liquid cooling uses far less water
  than evaporative cooling
- **The SustainabilitySolver** chains energy, PUE, and carbon intensity into a single
  analytical model

---

## Next Steps

- **[LLM Serving Lab](llm_serving.qmd)** — model the two phases of LLM inference and discover the KV-cache memory wall
- **[Distributed Training](distributed.qmd)** — scale to hundreds of GPUs and analyze where efficiency is lost
- **[Infrastructure Zoo](../zoo/infra.qmd)** — browse all regional grid profiles and datacenter configurations
- **[Solver Guide](../solver-guide.qmd)** — learn how to chain the SustainabilitySolver with other solvers
- **[Math Foundations](../math.qmd)** — see the equations behind energy and carbon calculations