docs: clean up landing page and centralize math foundations

- Elevate 5-Layer Progressive Lowering mental model to architecture.qmd

- Clean up landing page copy to be a punchy one-liner

- Re-render architecture composition diagram as SVG for reliability

- Move math derivations out of tutorials and into math.qmd with citations

- Add DGX Spark to Silicon Zoo
Vijay Janapa Reddi
2026-03-07 18:37:06 -05:00
parent a78f1bd8b0
commit aed43c5b81
36 changed files with 1247 additions and 311 deletions


@@ -2,7 +2,6 @@
title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
---
::: {.callout-note}
## Background: Why distributed training?
@@ -139,11 +138,11 @@ result_dp = solver.solve(
)
node_perf = result_dp["node_performance"]
print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
print(f"Single-GPU compute time: {node_perf.latency.to('ms'):~.1f}/step")
print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):~.2f}")
print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):~.2f}")
print()
print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):~.1f}")
print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
@@ -166,12 +165,12 @@ network bandwidth.
## 4. Ring All-Reduce: The Network Tax
The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
standard method for gradient synchronization. Its time depends on:
standard method for gradient synchronization.
$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
Where $M$ is the message size (model gradient = 2× weights in fp16), $N$ is the number
of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
::: {.callout-note}
## 🧮 See the Math
For the full equation deriving All-Reduce overhead from model size, node count, and fabric bandwidth, see the [Mathematical Foundations: Ring All-Reduce](../math.qmd#ring-all-reduce-data-parallelism).
:::
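As a quick plausibility check on the formula above, the overhead can be computed standalone; the model size, replica count, and bandwidth below are illustrative, not taken from the tutorial's solver:

```python
def allreduce_time_s(msg_bytes: float, replicas: int, bw_bytes_per_s: float) -> float:
    """Ring all-reduce time: each byte traverses the ring 2*(N-1)/N times."""
    return 2 * msg_bytes * (replicas - 1) / (replicas * bw_bytes_per_s)

# Illustrative: 7B-parameter model, fp16 gradients (2 bytes each),
# 8 data-parallel replicas over a 400 GB/s effective fabric.
grad_bytes = 7e9 * 2
t = allreduce_time_s(grad_bytes, 8, 400e9)
print(f"{t * 1e3:.1f} ms")  # roughly 61 ms under these assumptions
```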
The following sweep shows how fabric bandwidth affects overhead:
@@ -228,9 +227,10 @@ The downside: a **pipeline bubble**. The first microbatch must flow through all
the last stage can start processing the second microbatch. During that startup phase, most
GPUs are idle.
$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
::: {.callout-note}
## 🧮 See the Math
For the full equation governing pipeline bubbles and interleaved 1F1B schedules, see the [Mathematical Foundations: Pipeline Parallelism Bubble](../math.qmd#pipeline-parallelism-bubble).
:::
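Before running the full sweep, the bubble fraction itself is easy to sanity-check by hand; a minimal sketch:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Fraction of step time lost to pipeline fill/drain: (P-1)/(P-1+M)."""
    return (stages - 1) / (stages - 1 + microbatches)

# For a fixed depth P=4, more microbatches amortize the startup bubble.
for m in (1, 4, 16, 64):
    print(f"M={m:>3}: bubble = {bubble_fraction(4, m):.1%}")
# 75.0%, 42.9%, 15.8%, 4.5%
```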
```{python}
print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")


@@ -2,7 +2,6 @@
title: "Hello World: Single-Node Roofline"
subtitle: "Predict model performance on hardware before writing a single CUDA kernel."
---
::: {.callout-note}
## Prerequisites
Complete the [Getting Started](../getting-started.qmd) guide before this tutorial. It introduces the `Engine.solve` API and the MLSys Zoo.
@@ -98,7 +97,7 @@ profile = Engine.solve(
)
print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency.to('ms'):.3f} ms per inference")
print(f"Latency: {profile.latency.to('ms'):~.3f} per inference")
print(f"Throughput: {profile.throughput:.0f} images/sec")
```
@@ -129,7 +128,7 @@ for batch in [1, 4, 16, 32, 64, 128, 256]:
print(
f"{batch:>6} {p.bottleneck:<16} "
f"{p.throughput:>10.0f}/s "
f"{p.latency.to('ms'):>8.2f} ms"
f"{p.latency.to('ms').magnitude:>8.2f} ms"
)
```
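The `bottleneck` column in this sweep reflects roofline logic: latency is set by whichever of compute time and memory-traffic time is larger. A standalone sketch of that comparison, with made-up hardware numbers (not mlsysim's API):

```python
def roofline_latency_s(flops: float, bytes_moved: float,
                       peak_flops: float, mem_bw: float):
    """Return (latency, bottleneck) from the slower of compute and memory."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bw
    bottleneck = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bottleneck

# Illustrative accelerator: 100 TFLOP/s peak, 1 TB/s HBM bandwidth.
lat, which = roofline_latency_s(8e12, 2e10, 100e12, 1e12)
print(f"{lat * 1e3:.1f} ms, {which}-bound")  # 80.0 ms, compute-bound
```

Larger batches raise arithmetic intensity (FLOPs per byte moved), which is why the sweep above shifts from memory-bound to compute-bound as batch size grows.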

Binary file changed (not shown): 51 KiB → 90 KiB


@@ -1,11 +1,7 @@
---
title: "Tutorials"
subtitle: "Step-by-step guides for modeling ML Systems."
format:
  html:
    toc: false
---
These tutorials are designed to build intuition for ML systems using the `mlsysim` framework.
They map directly to chapters in the *Machine Learning Systems* textbook—start at the beginning
or jump to any topic.


@@ -2,7 +2,6 @@
title: "LLM Serving Lab: TTFT, ITL, and the Memory Wall"
subtitle: "Model the two physical regimes of LLM inference before deploying a single server."
---
::: {.callout-note}
## Background: What is an LLM and why is serving different?
@@ -139,12 +138,12 @@ print(f"Memory util: {result['memory_utilization']:.1%}")
## 3. The KV-Cache Memory Wall
The KV-cache stores the Key and Value matrices from every attention layer for every token
in the active context. Its size grows as:
in the active context. This statefulness is what makes LLM decoding uniquely memory-bound.
$$\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}$$
Where $L$ = layers, $H_{kv}$ = KV heads, $S$ = sequence length, $B$ = batch size,
$\text{bpp}$ = bytes per parameter.
::: {.callout-note}
## 🧮 See the Math
To see the exact formula for how KV-Cache size scales with sequence length, batch size, and network architecture, see the [Mathematical Foundations: KV-Cache Size](../math.qmd#kv-cache-size).
:::
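A standalone sketch of that scaling; the model shape below (Llama-2-7B-like) is illustrative, not taken from the tutorial's config:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_pp: int) -> int:
    """KV-cache = 2 (K and V) x L x H_kv x d_head x S x B x bytes/param."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_pp

# Illustrative shape: 32 layers, 32 KV heads, d_head=128,
# 4k context, batch 8, fp16 (2 bytes per value).
gib = kv_cache_bytes(32, 32, 128, 4096, 8, 2) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB
```

Note the linear factors `S` and `B`: doubling either sequence length or batch size doubles the cache, which is exactly the memory-wall behavior described above.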
This means doubling `batch_size` doubles the KV-cache. At some point, you hit the
**memory wall** — the combined model + KV-cache exceeds the accelerator's HBM capacity.


@@ -2,7 +2,6 @@
title: "Sustainability Lab: Modeling Carbon Footprint"
subtitle: "Same model, same hardware — 41x difference in carbon footprint."
---
::: {.callout-note}
## Prerequisites
This tutorial can be completed independently, but completing the [Hello World tutorial](hello_world.qmd) first provides useful context on how hardware performance relates to energy consumption.