mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
docs: clean up landing page and centralize math foundations
- Elevate 5-Layer Progressive Lowering mental model to architecture.qmd
- Clean up landing page copy to be a punchy one-liner
- Re-render architecture composition diagram as SVG for reliability
- Move math derivations out of tutorials and into math.qmd with citations
- Add DGX Spark to Silicon Zoo
@@ -2,7 +2,6 @@
 title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
 subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
 ---
-
 ::: {.callout-note}
 ## Background: Why distributed training?

@@ -139,11 +138,11 @@ result_dp = solver.solve(
 )

 node_perf = result_dp["node_performance"]
-print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
-print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
-print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
+print(f"Single-GPU compute time: {node_perf.latency.to('ms'):~.1f}/step")
+print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):~.2f}")
+print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):~.2f}")
 print(f"")
-print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
+print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):~.1f}")
 print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
 print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
 print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
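The relationship between the quantities this hunk prints can be sketched back-of-the-envelope in plain Python. All latency values below are made up for illustration; they are not outputs of the `mlsysim` solver:

```python
# Back-of-the-envelope scaling efficiency (all values illustrative,
# not taken from the mlsysim solver).
compute_ms = 120.0    # single-GPU compute time per step
allreduce_ms = 15.0   # DP gradient synchronization overhead
bubble_ms = 10.0      # pipeline startup idle time

step_ms = compute_ms + allreduce_ms + bubble_ms
efficiency = compute_ms / step_ms  # fraction of the step doing useful work

print(f"Total step latency: {step_ms:.1f} ms")   # 145.0 ms
print(f"Scaling efficiency: {efficiency:.1%}")   # 82.8%
```

The point the numbers make: every millisecond of communication or bubble time is a millisecond the GPUs are not computing, so efficiency is simply compute time divided by total step time.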
@@ -166,12 +165,12 @@ network bandwidth.
 ## 4. Ring All-Reduce: The Network Tax

 The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
-standard method for gradient synchronization. Its time depends on:
+standard method for gradient synchronization.

-$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
-
-Where $M$ is the message size (model gradient = 2× weights in fp16), $N$ is the number
-of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation deriving All-Reduce overhead from model size, node count, and fabric bandwidth, see the [Mathematical Foundations: Ring All-Reduce](../math.qmd#ring-all-reduce-data-parallelism).
+:::

 The following sweep shows how fabric bandwidth affects overhead:

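The formula this commit moves to math.qmd can still be sanity-checked in a few lines of Python. The function name and the concrete numbers (7B-parameter model, fp16 gradients, 8 replicas, 400 GB/s effective fabric) are illustrative assumptions, not values from the tutorial:

```python
def ring_allreduce_time(msg_bytes: float, n_replicas: int, bw_bytes_per_s: float) -> float:
    """t = 2 * M * (N - 1) / (N * B_eff): a reduce-scatter followed by an
    all-gather, each moving M * (N - 1) / N bytes at effective bandwidth B_eff."""
    return 2 * msg_bytes * (n_replicas - 1) / (n_replicas * bw_bytes_per_s)

# Illustrative: 7B parameters, fp16 gradients (2 bytes each),
# 8 data-parallel replicas, 400 GB/s effective inter-node bandwidth.
grad_bytes = 7e9 * 2
t = ring_allreduce_time(grad_bytes, n_replicas=8, bw_bytes_per_s=400e9)
print(f"all-reduce time: {t * 1e3:.1f} ms")
```

Note the `(N - 1) / N` factor: with a single replica the overhead is zero, and as `N` grows the cost approaches a constant set purely by message size and bandwidth, which is why ring all-reduce scales well in replica count but is merciless about fabric bandwidth.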
@@ -228,9 +227,10 @@ The downside: a **pipeline bubble**. The first microbatch must flow through all
 the last stage can start processing the second microbatch. During that startup phase, most
 GPUs are idle.

-$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
-
-Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation governing pipeline bubbles and interleaved 1F1B schedules, see the [Mathematical Foundations: Pipeline Parallelism Bubble](../math.qmd#pipeline-parallelism-bubble).
+:::

 ```{python}
 print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")
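The bubble fraction the hunk above relocates to math.qmd is simple enough to evaluate directly. The helper below is an illustrative sketch (the stage counts and microbatch values in the sweep are made up):

```python
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    """(P - 1) / (P - 1 + M): the idle fraction of a naive pipeline schedule."""
    return (p_stages - 1) / (p_stages - 1 + m_microbatches)

# Illustrative sweep: deeper pipelines need more microbatches to stay busy.
for p in (2, 4, 8):
    for m in (4, 16, 64):
        print(f"P={p} M={m:>3} bubble={bubble_fraction(p, m):.1%}")
```

The takeaway matches the prose: with `P` fixed, pushing `M` up drives the bubble toward zero, which is why practitioners raise the microbatch count until activation memory, not the bubble, becomes the binding constraint.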
@@ -2,7 +2,6 @@
 title: "Hello World: Single-Node Roofline"
 subtitle: "Predict model performance on hardware before writing a single CUDA kernel."
 ---
-
 ::: {.callout-note}
 ## Prerequisites
 Complete the [Getting Started](../getting-started.qmd) guide before this tutorial. It introduces the `Engine.solve` API and the MLSys Zoo.
@@ -98,7 +97,7 @@ profile = Engine.solve(
 )

 print(f"Bottleneck: {profile.bottleneck}")
-print(f"Latency: {profile.latency.to('ms'):.3f} ms per inference")
+print(f"Latency: {profile.latency.to('ms'):~.3f} per inference")
 print(f"Throughput: {profile.throughput:.0f} images/sec")
 ```

@@ -129,7 +128,7 @@ for batch in [1, 4, 16, 32, 64, 128, 256]:
     print(
         f"{batch:>6} {p.bottleneck:<16} "
         f"{p.throughput:>10.0f}/s "
-        f"{p.latency.to('ms'):>8.2f} ms"
+        f"{p.latency.to('ms').magnitude:>8.2f} ms"
     )
 ```

Binary file not shown. (size: 51 KiB before, 90 KiB after)
@@ -1,11 +1,7 @@
 ---
 title: "Tutorials"
 subtitle: "Step-by-step guides for modeling ML Systems."
-format:
-  html:
-    toc: false
 ---
-
 These tutorials are designed to build intuition for ML systems using the `mlsysim` framework.
 They map directly to chapters in the *Machine Learning Systems* textbook—start at the beginning
 or jump to any topic.

@@ -2,7 +2,6 @@
 title: "LLM Serving Lab: TTFT, ITL, and the Memory Wall"
 subtitle: "Model the two physical regimes of LLM inference before deploying a single server."
 ---
-
 ::: {.callout-note}
 ## Background: What is an LLM and why is serving different?

@@ -139,12 +138,12 @@ print(f"Memory util: {result['memory_utilization']:.1%}")
 ## 3. The KV-Cache Memory Wall

 The KV-cache stores the Key and Value matrices from every attention layer for every token
-in the active context. Its size grows as:
+in the active context. This statefulness is what makes LLM decoding uniquely memory-bound.

-$$\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}$$
-
-Where $L$ = layers, $H_{kv}$ = KV heads, $S$ = sequence length, $B$ = batch size,
-$\text{bpp}$ = bytes per parameter.
+::: {.callout-note}
+## 🧮 See the Math
+To see the exact formula for how KV-Cache size scales with sequence length, batch size, and network architecture, see the [Mathematical Foundations: KV-Cache Size](../math.qmd#kv-cache-size).
+:::

 This means doubling `batch_size` doubles the KV-cache. At some point, you hit the
 **memory wall** — the combined model + KV-cache exceeds the accelerator's HBM capacity.

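The KV-cache formula this commit centralizes in math.qmd is a straight product of shape terms, so the memory-wall claim is easy to check numerically. The helper and the Llama-2-7B-like shape below (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not values from the lab:

```python
def kv_cache_bytes(layers: int, kv_heads: int, d_head: int,
                   seq_len: int, batch: int, bytes_per_param: int = 2) -> int:
    """2 * L * H_kv * d_head * S * B * bpp: the leading 2 covers Keys and Values."""
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_param

# Illustrative Llama-2-7B-like shape at 4K context, batch 8, fp16.
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_param=2) / 1e9
print(f"KV-cache: {gb:.1f} GB")  # KV-cache: 17.2 GB
```

Every factor is linear, so doubling `batch` (or `seq_len`) exactly doubles the cache; stacked on top of ~14 GB of fp16 weights, this illustrative configuration would already crowd a 40 GB accelerator.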
@@ -2,7 +2,6 @@
 title: "Sustainability Lab: Modeling Carbon Footprint"
 subtitle: "Same model, same hardware — 41x difference in carbon footprint."
 ---
-
 ::: {.callout-note}
 ## Prerequisites
 This tutorial can be completed independently, but completing the [Hello World tutorial](hello_world.qmd) first provides useful context on how hardware performance relates to energy consumption.