docs: clean up landing page and centralize math foundations

- Elevate 5-Layer Progressive Lowering mental model to architecture.qmd

- Clean up landing page copy to be a punchy one-liner

- Re-render architecture composition diagram as SVG for reliability

- Move math derivations out of tutorials and into math.qmd with citations

- Add DGX Spark to Silicon Zoo
Author: Vijay Janapa Reddi
Date: 2026-03-07 18:37:06 -05:00
Parent: a78f1bd8b0
Commit: aed43c5b81
36 changed files with 1247 additions and 311 deletions


@@ -2,7 +2,6 @@
title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
---
::: {.callout-note}
## Background: Why distributed training?
@@ -139,11 +138,11 @@ result_dp = solver.solve(
)
node_perf = result_dp["node_performance"]
-print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
-print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
-print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
+print(f"Single-GPU compute time: {node_perf.latency.to('ms'):~.1f}/step")
+print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):~.2f}")
+print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):~.2f}")
print(f"")
-print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
+print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):~.1f}")
print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
@@ -166,12 +165,12 @@ network bandwidth.
## 4. Ring All-Reduce: The Network Tax
The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
-standard method for gradient synchronization. Its time depends on:
+standard method for gradient synchronization.
$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
Where $M$ is the message size (model gradient = 2× weights in fp16), $N$ is the number
of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation deriving All-Reduce overhead from model size, node count, and fabric bandwidth, see the [Mathematical Foundations: Ring All-Reduce](../math.qmd#ring-all-reduce-data-parallelism).
+:::
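As a quick sanity check, the all-reduce formula above can be evaluated directly. This is an illustrative sketch, not part of the tutorial's solver API; the model size, gradient precision, and fabric bandwidth below are assumed values:

```{python}
def ring_allreduce_time(n_params: float, bytes_per_grad: int,
                        n_replicas: int, bw_bytes_per_s: float) -> float:
    """Ring all-reduce time in seconds: t = 2 * M * (N - 1) / (N * B_eff)."""
    m = n_params * bytes_per_grad  # message size M in bytes (gradients)
    return 2 * m * (n_replicas - 1) / (n_replicas * bw_bytes_per_s)

# Assumed example: 7B-parameter model, fp16 gradients (2 bytes each),
# 64 data-parallel replicas, 400 GB/s effective fabric bandwidth.
t = ring_allreduce_time(7e9, 2, 64, 400e9)
print(f"{t * 1e3:.1f} ms")  # ≈ 68.9 ms
```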
The following sweep shows how fabric bandwidth affects overhead:
@@ -228,9 +227,10 @@ The downside: a **pipeline bubble**. The first microbatch must flow through all
the last stage can start processing the second microbatch. During that startup phase, most
GPUs are idle.
$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation governing pipeline bubbles and interleaved 1F1B schedules, see the [Mathematical Foundations: Pipeline Parallelism Bubble](../math.qmd#pipeline-parallelism-bubble).
+:::
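The bubble-fraction formula can likewise be evaluated numerically. This is a standalone illustration; the pipeline depth and microbatch counts below are assumed, not taken from the tutorial:

```{python}
def bubble_fraction(p_stages: int, n_microbatches: int) -> float:
    """Pipeline bubble fraction: (P - 1) / (P - 1 + M)."""
    return (p_stages - 1) / (p_stages - 1 + n_microbatches)

# For a fixed pipeline depth, more microbatches amortize the startup bubble.
for m in (4, 16, 64):
    print(f"PP=8, microbatches={m:3d}: bubble = {bubble_fraction(8, m):.1%}")
# bubble shrinks from 63.6% (M=4) to 30.4% (M=16) to 9.9% (M=64)
```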
```{python}
print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")