docs: clean up landing page and centralize math foundations

- Elevate 5-Layer Progressive Lowering mental model to architecture.qmd

- Clean up landing page copy to be a punchy one-liner

- Re-render architecture composition diagram as SVG for reliability

- Move math derivations out of tutorials and into math.qmd with citations

- Add DGX Spark to Silicon Zoo
Author: Vijay Janapa Reddi
Date: 2026-03-07 18:37:06 -05:00
Parent: a78f1bd8b0
Commit: aed43c5b81
36 changed files with 1247 additions and 311 deletions


@@ -2,7 +2,6 @@
title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
---
::: {.callout-note}
## Background: Why distributed training?
@@ -139,11 +138,11 @@ result_dp = solver.solve(
)
node_perf = result_dp["node_performance"]
-print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
-print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
-print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
+print(f"Single-GPU compute time: {node_perf.latency.to('ms'):~.1f}/step")
+print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):~.2f}")
+print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):~.2f}")
print(f"")
-print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
+print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):~.1f}")
print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
@@ -166,12 +165,12 @@ network bandwidth.
## 4. Ring All-Reduce: The Network Tax
The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
-standard method for gradient synchronization. Its time depends on:
+standard method for gradient synchronization.
$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
Where $M$ is the message size (model gradient = 2× weights in fp16), $N$ is the number
of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation deriving All-Reduce overhead from model size, node count, and fabric bandwidth, see the [Mathematical Foundations: Ring All-Reduce](../math.qmd#ring-all-reduce-data-parallelism).
+:::
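As a quick sanity check, the all-reduce formula above can be evaluated directly. This is an illustrative sketch, not part of the tutorial's solver API; the model size, gradient precision, and fabric bandwidth below are assumed values:

```{python}
def ring_allreduce_time(n_params: float, bytes_per_grad: int,
                        n_replicas: int, bw_bytes_per_s: float) -> float:
    """Ring all-reduce time in seconds: t = 2 * M * (N - 1) / (N * B_eff)."""
    m = n_params * bytes_per_grad  # message size M in bytes (gradients)
    return 2 * m * (n_replicas - 1) / (n_replicas * bw_bytes_per_s)

# Assumed example: 7B-parameter model, fp16 gradients (2 bytes each),
# 64 data-parallel replicas, 400 GB/s effective fabric bandwidth.
t = ring_allreduce_time(7e9, 2, 64, 400e9)
print(f"{t * 1e3:.1f} ms")  # ≈ 68.9 ms
```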
The following sweep shows how fabric bandwidth affects overhead:
@@ -228,9 +227,10 @@ The downside: a **pipeline bubble**. The first microbatch must flow through all
the last stage can start processing the second microbatch. During that startup phase, most
GPUs are idle.
$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation governing pipeline bubbles and interleaved 1F1B schedules, see the [Mathematical Foundations: Pipeline Parallelism Bubble](../math.qmd#pipeline-parallelism-bubble).
+:::
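The bubble-fraction formula can likewise be evaluated numerically. This is a standalone illustration; the pipeline depth and microbatch counts below are assumed, not taken from the tutorial:

```{python}
def bubble_fraction(p_stages: int, n_microbatches: int) -> float:
    """Pipeline bubble fraction: (P - 1) / (P - 1 + M)."""
    return (p_stages - 1) / (p_stages - 1 + n_microbatches)

# For a fixed pipeline depth, more microbatches amortize the startup bubble.
for m in (4, 16, 64):
    print(f"PP=8, microbatches={m:3d}: bubble = {bubble_fraction(8, m):.1%}")
# bubble shrinks from 63.6% (M=4) to 30.4% (M=16) to 9.9% (M=64)
```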
```{python}
print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")