mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-03 08:08:51 -05:00
docs: clean up landing page and centralize math foundations
- Elevate 5-Layer Progressive Lowering mental model to architecture.qmd
- Clean up landing page copy to be a punchy one-liner
- Re-render architecture composition diagram as SVG for reliability
- Move math derivations out of tutorials and into math.qmd with citations
- Add DGX Spark to Silicon Zoo
@@ -2,7 +2,6 @@
 title: "Distributed Training: 3D Parallelism and Scaling Efficiency"
 subtitle: "Discover why 1024 GPUs rarely deliver 1024× speedup — and how to minimize the gap."
 ---

 ::: {.callout-note}
 ## Background: Why distributed training?
@@ -139,11 +138,11 @@ result_dp = solver.solve(
 )

 node_perf = result_dp["node_performance"]
-print(f"Single-GPU compute time: {node_perf.latency.to('ms'):.1f} ms/step")
-print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):.2f} ms")
-print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):.2f} ms")
+print(f"Single-GPU compute time: {node_perf.latency.to('ms'):~.1f}/step")
+print(f"DP all-reduce overhead: {result_dp['dp_communication_latency'].to('ms'):~.2f}")
+print(f"Pipeline bubble: {result_dp['pipeline_bubble_latency'].to('ms'):~.2f}")
 print(f"")
-print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):.1f} ms")
+print(f"Total step latency: {result_dp['step_latency_total'].to('ms'):~.1f}")
 print(f"Scaling efficiency: {result_dp['scaling_efficiency']:.1%}")
 print(f"Effective throughput: {result_dp['effective_throughput'].magnitude:.0f} samples/s")
 print(f"Parallelism: DP={result_dp['parallelism']['dp']} TP={result_dp['parallelism']['tp']} PP={result_dp['parallelism']['pp']}")
@@ -166,12 +165,12 @@ network bandwidth.
 ## 4. Ring All-Reduce: The Network Tax

 The `DP all-reduce overhead` comes from the **ring all-reduce algorithm**, which is the
-standard method for gradient synchronization. Its time depends on:
-
-$$t_{\text{allreduce}} = 2 \times \frac{M \times (N-1)}{N \times B_{\text{eff}}}$$
-
-Where $M$ is the message size (model gradient = 2× weights in fp16), $N$ is the number
-of data-parallel replicas, and $B_{\text{eff}}$ is the effective inter-node bandwidth.
+standard method for gradient synchronization.
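The removed equation can be checked with a few lines of standalone Python (independent of the chapter's solver API). The example numbers below — a 7B-parameter model with fp16 gradients, 8 data-parallel replicas, and 400 GB/s effective fabric bandwidth — are illustrative assumptions, not measurements:

```python
# Ring all-reduce cost model: t = 2 * M * (N - 1) / (N * B_eff).
# The factor of 2 covers the reduce-scatter and all-gather phases,
# each of which moves M * (N - 1) / N bytes over the slowest link.

def ring_allreduce_seconds(message_bytes: float, replicas: int,
                           bandwidth_bytes_per_s: float) -> float:
    if replicas < 2:
        return 0.0  # a single replica has nothing to synchronize
    return 2 * message_bytes * (replicas - 1) / (replicas * bandwidth_bytes_per_s)

params = 7e9             # model parameters (assumed)
grad_bytes = 2 * params  # fp16 gradients: 2 bytes per parameter
t = ring_allreduce_seconds(grad_bytes, replicas=8,
                           bandwidth_bytes_per_s=400e9)
print(f"all-reduce time: {t * 1e3:.1f} ms")
```

Note that for large $N$ the $(N-1)/N$ factor approaches 1, so the cost is governed almost entirely by message size over bandwidth — adding replicas barely changes the per-step synchronization time.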
+
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation deriving All-Reduce overhead from model size, node count, and fabric bandwidth, see the [Mathematical Foundations: Ring All-Reduce](../math.qmd#ring-all-reduce-data-parallelism).
+:::

 The following sweep shows how fabric bandwidth affects overhead:
@@ -228,9 +227,10 @@ The downside: a **pipeline bubble**. The first microbatch must flow through all
 the last stage can start processing the second microbatch. During that startup phase, most
 GPUs are idle.

-$$\text{Bubble fraction} = \frac{P - 1}{P - 1 + M}$$
-
-Where $P$ is the pipeline depth (number of stages) and $M$ is the number of microbatches.
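The bubble-fraction formula is easy to sanity-check directly; the stage/microbatch combinations below are illustrative, not taken from the chapter:

```python
# Idle fraction of a naive pipeline schedule: the first microbatch must
# traverse all P stages before the pipeline is full, so (P - 1) "slots"
# out of (P - 1 + M) are wasted on fill/drain.

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (stages - 1 + microbatches)

# Deeper pipelines need more microbatches to amortize the bubble.
for p, m in [(4, 8), (4, 32), (8, 32)]:
    print(f"PP={p} microbatches={m} bubble={bubble_fraction(p, m):.1%}")
```

The trend matches the formula: holding the pipeline depth fixed, quadrupling the microbatch count shrinks the bubble roughly fourfold, while doubling the depth at a fixed microbatch count roughly doubles it.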
+::: {.callout-note}
+## 🧮 See the Math
+For the full equation governing pipeline bubbles and interleaved 1F1B schedules, see the [Mathematical Foundations: Pipeline Parallelism Bubble](../math.qmd#pipeline-parallelism-bubble).
+:::

 ```{python}
 print(f"{'PP stages':>10} {'Microbatches':>13} {'Bubble %':>9} {'Comm (ms)':>10} {'Efficiency':>11}")