import marimo
__generated_with = "0.19.6"
app = marimo.App(width="full")
# ─────────────────────────────────────────────────────────────────────────────
# LAB 05: THE PARALLELISM PARADOX
#
# Chapter: Distributed Training Systems (@sec-distributed-training-systems)
# Core Invariant: The Parallelism Paradox — adding more GPUs to data parallel
# training increases communication overhead, which can decrease
# MFU below single-GPU levels for large models. 3D parallelism
# (Tensor + Pipeline + Data) is required for models that don't
# fit on a single GPU, but each dimension adds overhead.
#
# 2-Act Structure (35-40 minutes):
# Act I — The Data Parallel Wall (12-15 min)
# A 7B model trained with DP across 8→64→512 GPUs shows MFU
# collapsing from 52% to 19%. The central question: why does MFU
# fall as we add more GPUs? Students must confront that communication
# time grows relative to compute time as cluster size grows.
#
# Act II — 3D Parallelism Design Challenge (20-25 min)
# Design the TP×PP×DP configuration for GPT-3 175B on 1024 H100s.
# The failure state: per-GPU memory exceeds 80 GB (model doesn't fit)
# and a bandwidth penalty warning when TP crosses node boundaries.
#
# Deployment Contexts:
# DP: Data Parallel — replicate model, sync gradients via AllReduce
# 3D Parallel: TP×PP×DP — within-node TP (NVLink), cross-node PP (IB), DP
#
# Hardware Constants:
# H100_TFLOPS_FP16 = 1979 # TFLOPS, H100 SXM5 with sparsity; source: NVIDIA spec
# H100_BW_GBS = 3350 # GB/s HBM3e; source: NVIDIA H100 spec sheet
# H100_RAM_GB = 80 # GB HBM3e; source: NVIDIA H100 spec sheet
# NVLINK4_BW_GBS = 900 # GB/s NVLink 4; source: NVIDIA DGX H100 spec
# IB_HDR200_BW_GBS = 400 # GB/s InfiniBand HDR200; source: Mellanox spec
# GPUS_PER_NODE = 8 # Standard DGX H100 node size
#
# Design Ledger: saves chapter="v2_05" with DP vs 3D context, parallelism
# degrees, MFU achieved, prediction accuracy, failure states.
# ─────────────────────────────────────────────────────────────────────────────
# ─── CELL 0: SETUP (hide_code=False — leave visible for instructor inspection) ─
@app.cell
def _():
import marimo as mo
import sys
import math
from pathlib import Path
import plotly.graph_objects as go
import numpy as np
from plotly.subplots import make_subplots
_root = Path(__file__).resolve().parents[2]
if str(_root) not in sys.path:
sys.path.insert(0, str(_root))
from labs.core.state import DesignLedger
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
ledger = DesignLedger()
return COLORS, LAB_CSS, apply_plotly_theme, go, ledger, math, mo, np, make_subplots
# ─── CELL 1: HEADER ────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, LAB_CSS, mo):
_c_dp = COLORS["BlueLine"]
_c_3d = COLORS["Cloud"]
_c_s0 = COLORS["Surface0"]
_c_s1 = COLORS["Surface1"]
_header = mo.Html(f"""
{LAB_CSS}
Vol 2 · Lab 05 · Distributed Training Systems
The Parallelism Paradox
Adding GPUs to a data-parallel job can reduce Model FLOPs Utilization
below single-GPU levels. This lab forces you to confront the
communication-computation ratio and design the 3D parallelism
configuration that keeps 1024 H100s productive.
Parallelism Paradox
AllReduce Bandwidth Model
3D Parallel: TP × PP × DP
35–40 minutes · 2 Acts
Context A — Data Parallel
— 7B model · 8–512 GPUs · AllReduce via NVLink / IB
Context B — 3D Parallel
— 175B model · 1024 H100s · TP×PP×DP design
""")
_header
return
# ─── CELL 2: RECOMMENDED READING ───────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.callout(mo.md("""
**Recommended Reading** — Complete the following before this lab:
- **@sec-distributed-training-systems-systems-multimachine-scaling-fundamentals-ff96** — The Iron Law of Scale: `T_step(N) = T_compute/N + T_comm(N) - T_overlap` and the Communication-Computation Ratio
- **@sec-distributed-training-systems** — Why distribution is necessary: memory exhaustion, training duration, and dataset scale thresholds
- The Data Parallelism section — AllReduce gradient synchronization, Ring-AllReduce bandwidth formula, gradient bucketing
- The 3D Parallelism section — Tensor Parallelism (within-node), Pipeline Parallelism (across nodes), pipeline bubble fraction `B = (PP-1)/(PP * m)`
"""), kind="info")
return
# ─── CELL 3: CONTEXT TOGGLE ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_c_muted = COLORS["TextMuted"]
context_toggle = mo.ui.radio(
options={
"Data Parallel (DP)": "dp",
"3D Parallel (TP+PP+DP)": "3d",
},
value="Data Parallel (DP)",
label="Deployment context for this session:",
inline=True,
)
mo.hstack([
mo.Html(f"""
Active Context:
"""),
context_toggle,
], justify="start", gap=0)
return (context_toggle,)
# ═══════════════════════════════════════════════════════════════════════════════
# ACT I — THE DATA PARALLEL WALL
# ═══════════════════════════════════════════════════════════════════════════════
# ─── ACT I: SECTION HEADER ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Act I — The Data Parallel Wall
*Calibration · 12–15 minutes*
""")
return
# ─── ACT I: STAKEHOLDER MESSAGE ────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_color = COLORS["BlueLine"]
_bg = COLORS["BlueL"]
mo.Html(f"""
Incoming Message · Training Infrastructure Lead
"We are training a 7B parameter model using data parallelism. I measured MFU at
8 GPUs = 52%. At 64 GPUs it dropped to 38%. At 512 GPUs it's 19%. The model
hasn't changed. The batch size per GPU hasn't changed. We just added more GPUs
and it got worse. We're wasting $40,000 per day in idle compute. Can you tell
me exactly why MFU falls as we scale data parallelism?"
""")
return
# ─── ACT I: CONCEPT FRAMING ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
The training lead's observation is not a software bug. It is the **Parallelism Paradox**:
the physical law that governs every data-parallel training job.
Data parallel training replicates the model on every GPU, runs a forward and backward
pass on each device's local batch, then synchronizes gradients across all devices via
**AllReduce** before the optimizer step. The AllReduce is unavoidable — without it,
each replica would diverge. The critical question is how long AllReduce takes relative
to the compute step.
The **Communication-Computation Ratio** from @sec-distributed-training-systems determines
whether a cluster behaves as a supercomputer or as a collection of idling heaters:
- **Compute-Bound (Low Ratio)**: `T_compute >> T_comm`. GPUs spend most time on matrix
multiplications. This is the ideal state.
- **Communication-Bound (High Ratio)**: `T_comm ≈ T_compute`. GPUs spend significant
time waiting for gradients. This is the common state for LLMs at scale.
Before looking at any numbers, commit to a prediction about what causes MFU to fall.
""")
return
# ─── ACT I: PREDICTION LOCK ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("### Your Prediction")
return
@app.cell(hide_code=True)
def _(mo):
act1_pred = mo.ui.radio(
options={
"A) Software overhead in NCCL and collective libraries grows with cluster size": "A",
"B) AllReduce communication time grows with cluster size while compute time stays constant — the comm/compute ratio rises": "B",
"C) Larger clusters cause L2 cache pressure and HBM bandwidth saturation per GPU": "C",
"D) 512 GPUs exceeds optimal batch size — gradient quality degrades and more steps are needed": "D",
},
label="Why does Model FLOPs Utilization (MFU) fall as we scale a data-parallel job from 8 to 512 GPUs?",
)
act1_pred
return (act1_pred,)
@app.cell(hide_code=True)
def _(act1_pred, mo):
mo.stop(
act1_pred.value is None,
mo.callout(
mo.md("Select your prediction above to unlock the Act I instruments."),
kind="warn",
),
)
mo.md("")
return
# ─── ACT I: INSTRUMENT PANEL INTRO ─────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Data Parallel Scaling Explorer
Adjust the parameters below to see how AllReduce communication time compares to
compute time — and how their ratio determines MFU at each scale point.
""")
return
# ─── ACT I: SLIDERS ────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
dp_model_b = mo.ui.slider(
start=1, stop=175, value=7, step=1,
label="Model size (B params)",
)
dp_gpus = mo.ui.slider(
start=8, stop=1024, value=8, step=8,
label="Number of GPUs (DP degree)",
)
dp_batch_per_gpu = mo.ui.slider(
start=8, stop=128, value=32, step=8,
label="Micro-batch size per GPU",
)
dp_interconnect = mo.ui.dropdown(
options={"NVLink 4 (within DGX node, 8 GPUs)": "nvlink", "InfiniBand HDR200 (cross-node)": "ib"},
value="NVLink 4 (within DGX node, 8 GPUs)",
label="Interconnect fabric",
)
mo.hstack([
mo.vstack([dp_model_b, dp_gpus]),
mo.vstack([dp_batch_per_gpu, dp_interconnect]),
], justify="center", gap=2)
return dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b
# ─── ACT I: PHYSICS ENGINE ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, apply_plotly_theme, dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b, go, math, mo, np):
# ── Hardware constants ───────────────────────────────────────────────────────
# Source: NVIDIA H100 SXM5 spec sheet, Mellanox InfiniBand HDR200 spec
H100_TFLOPS_FP16 = 1979.0 # TFLOPS H100 SXM5 with sparsity; NVIDIA spec
H100_RAM_GB = 80.0 # GB HBM3e; NVIDIA H100 spec sheet
NVLINK4_BW_GBS = 900.0 # GB/s NVLink 4 (per-direction); NVIDIA DGX H100 spec
IB_HDR200_BW_GBS = 400.0 # GB/s InfiniBand HDR200 (per-direction); Mellanox spec
GPUS_PER_NODE = 8 # DGX H100 node size; standard industry config
BYTES_PER_PARAM = 2 # FP16: 2 bytes per parameter
BYTES_PER_GRAD = 2 # FP16 gradients: 2 bytes per gradient
# ── Extract widget values ────────────────────────────────────────────────────
_params_b = dp_model_b.value # billions of params
_gpus = dp_gpus.value
_batch_gpu = dp_batch_per_gpu.value
_fabric = dp_interconnect.value
# ── Derived quantities ───────────────────────────────────────────────────────
_params = _params_b * 1e9 # total parameters
_grad_bytes = _params * BYTES_PER_GRAD # gradient tensor size in bytes
_grad_gb = _grad_bytes / 1e9 # gradient size in GB
# ── AllReduce bandwidth model ────────────────────────────────────────────────
# Ring-AllReduce transfers 2*(N-1)/N * data per device
# Source: @sec-distributed-training-systems AllReduce bandwidth analysis
# Effective bytes per GPU = 2*(N-1)/N * grad_gb
_allreduce_factor = 2.0 * (_gpus - 1) / _gpus
_allreduce_data_gb = _grad_gb * _allreduce_factor # GB transferred per GPU
# Interconnect selection and effective bandwidth
# Note: when >8 GPUs, traffic crosses node boundary via InfiniBand even if NVLink selected
_effective_bw = NVLINK4_BW_GBS if (_fabric == "nvlink" and _gpus <= GPUS_PER_NODE) else IB_HDR200_BW_GBS
_fabric_label = "NVLink 4" if _effective_bw == NVLINK4_BW_GBS else "InfiniBand HDR200"
_forced_ib = (_fabric == "nvlink" and _gpus > GPUS_PER_NODE)
_allreduce_time_s = _allreduce_data_gb / _effective_bw # seconds
# ── Compute step time ────────────────────────────────────────────────────────
# Forward + backward pass FLOPs ≈ 6 * params * batch_size
# Source: standard transformer FLOP count estimate (6N approximation)
_seq_len = 2048 # typical sequence length for a 7B-class model
_hidden_dim = 4096 # typical for 7B models
_flops_step = 6.0 * _params * _batch_gpu # forward + backward FLOPs
_mfu_ref = 0.52 # reference MFU at 8 GPUs (from stakeholder message)
# Compute time at reference MFU to anchor the physics
_compute_time_s = _flops_step / (H100_TFLOPS_FP16 * 1e12 * _mfu_ref)
# ── Effective MFU with AllReduce overhead ────────────────────────────────────
# MFU = T_compute / (T_compute + T_allreduce)
# When T_allreduce grows relative to T_compute, MFU falls
_total_time_s = _compute_time_s + _allreduce_time_s
_mfu_effective = (_compute_time_s / _total_time_s) * _mfu_ref # fractional
_mfu_pct = _mfu_effective * 100.0
# ── Comm/Compute ratio ───────────────────────────────────────────────────────
_cc_ratio = _allreduce_time_s / _compute_time_s
# ── Color coding ─────────────────────────────────────────────────────────────
_mfu_color = COLORS["GreenLine"] if _mfu_pct >= 45 else (COLORS["OrangeLine"] if _mfu_pct >= 25 else COLORS["RedLine"])
_cc_color = COLORS["GreenLine"] if _cc_ratio <= 0.3 else (COLORS["OrangeLine"] if _cc_ratio <= 0.8 else COLORS["RedLine"])
# ── Build MFU vs GPU count curve ─────────────────────────────────────────────
_gpu_range = [8, 16, 32, 64, 128, 256, 512, 1024]
_mfu_curve = []
for _g in _gpu_range:
_ar_factor_g = 2.0 * (_g - 1) / _g
_bw_g = NVLINK4_BW_GBS if _g <= GPUS_PER_NODE else IB_HDR200_BW_GBS
_ar_time_g = (_grad_gb * _ar_factor_g) / _bw_g
_total_g = _compute_time_s + _ar_time_g
_mfu_g = (_compute_time_s / _total_g) * _mfu_ref * 100.0
_mfu_curve.append(_mfu_g)
_fig = go.Figure()
_fig.add_trace(go.Scatter(
x=_gpu_range, y=_mfu_curve,
mode="lines+markers",
line=dict(color=COLORS["BlueLine"], width=2.5),
marker=dict(size=8, color=COLORS["BlueLine"]),
name="MFU (model)",
hovertemplate="%{x} GPUs
MFU: %{y:.1f}%",
))
# Mark the current selection
_fig.add_trace(go.Scatter(
x=[_gpus], y=[_mfu_pct],
mode="markers",
marker=dict(size=16, color=COLORS["RedLine"], symbol="diamond",
line=dict(color="white", width=2)),
name="Current config",
hovertemplate="Current: %{x} GPUs
MFU: %{y:.1f}%",
))
# Reference points from stakeholder message
_ref_x = [8, 64, 512]
_ref_y = [52, 38, 19]
_fig.add_trace(go.Scatter(
x=_ref_x, y=_ref_y,
mode="markers",
marker=dict(size=12, color=COLORS["OrangeLine"], symbol="x",
line=dict(color=COLORS["OrangeLine"], width=3)),
name="Measured (stakeholder)",
hovertemplate="Measured: %{x} GPUs
MFU: %{y:.0f}%",
))
_fig.add_hline(y=40.0, line=dict(color=COLORS["GreenLine"], width=1.5, dash="dash"),
annotation_text="40% — practical floor", annotation_position="bottom right")
_fig.update_layout(
height=320,
xaxis=dict(title="GPU Count (DP degree)", type="log",
tickvals=[8, 16, 32, 64, 128, 256, 512, 1024],
ticktext=["8", "16", "32", "64", "128", "256", "512", "1024"]),
yaxis=dict(title="Model FLOPs Utilization (%)", range=[0, 65]),
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
margin=dict(t=40, b=50, l=50, r=20),
)
apply_plotly_theme(_fig)
# ── Forced-IB warning ────────────────────────────────────────────────────────
_ib_warn = ""
if _forced_ib:
_ib_warn = f"""
Interconnect Upgrade Applied:
NVLink 4 operates within a single DGX node (8 GPUs). At {_gpus} GPUs, traffic crosses
node boundaries. Effective bandwidth is InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s)
— {NVLINK4_BW_GBS/IB_HDR200_BW_GBS:.1f}× slower than NVLink 4.
"""
# ── Physics display ──────────────────────────────────────────────────────────
mo.vstack([
mo.Html(f"""
Physics — AllReduce Bandwidth Model
Gradient tensor = {_params_b}B params × {BYTES_PER_GRAD} bytes/param = {_grad_gb:.1f} GB
Ring-AllReduce factor = 2×(N-1)/N = 2×({_gpus}-1)/{_gpus} = {_allreduce_factor:.4f}
Data transferred per GPU = {_grad_gb:.1f} GB × {_allreduce_factor:.4f} = {_allreduce_data_gb:.2f} GB
Interconnect = {_fabric_label} — bandwidth = {_effective_bw:.0f} GB/s
T_allreduce = {_allreduce_data_gb:.2f} GB / {_effective_bw:.0f} GB/s = {_allreduce_time_s*1000:.1f} ms
T_compute (at {_mfu_ref*100:.0f}% MFU ref) = {_compute_time_s*1000:.1f} ms
Comm/Compute ratio = {_allreduce_time_s*1000:.1f} / {_compute_time_s*1000:.1f} = {_cc_ratio:.2f}
MFU_effective = T_compute / (T_compute + T_allreduce) × MFU_ref
= {_compute_time_s*1000:.1f} / {_total_time_s*1000:.1f} × {_mfu_ref*100:.0f}% = {_mfu_pct:.1f}%
{_ib_warn}
"""),
mo.Html(f"""
MFU
{_mfu_pct:.1f}%
model FLOPs utilization
Comm / Compute
{_cc_ratio:.2f}
ratio (lower = better)
AllReduce Time
{_allreduce_time_s*1000:.1f}ms
per step
Compute Time
{_compute_time_s*1000:.1f}ms
per step (ref MFU)
"""),
mo.ui.plotly(_fig),
])
return (
H100_RAM_GB,
H100_TFLOPS_FP16,
IB_HDR200_BW_GBS,
NVLINK4_BW_GBS,
GPUS_PER_NODE,
)
# ─── ACT I: PREDICTION REVEAL ──────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(act1_pred, mo):
_correct = act1_pred.value == "B"
if _correct:
mo.callout(mo.md(
"**Correct.** Option B identifies the root cause: the Ring-AllReduce transfer volume "
"is essentially constant (approximately 2 × gradient size regardless of N for large N), "
"but it must traverse InfiniBand at 400 GB/s when the cluster spans multiple nodes "
"instead of NVLink at 900 GB/s within a node. The compute time per GPU does not change "
"as you add GPUs. Therefore the comm/compute ratio rises with cluster size, "
"directly reducing MFU. The simulator above shows this as the cliff in the MFU curve "
"between 8 and 64 GPUs where traffic transitions from NVLink to InfiniBand."
), kind="success")
elif act1_pred.value == "A":
mo.callout(mo.md(
"**Not the primary cause.** NCCL is highly optimized and adds minimal overhead "
"relative to wire transfer time. The dominant factor is the physical bandwidth "
"of the interconnect, not the software library overhead. At 512 GPUs, "
"the AllReduce transfer itself consumes ~140 ms on InfiniBand while the compute "
"step takes ~60 ms — NCCL overhead is negligible compared to this ratio."
), kind="warn")
elif act1_pred.value == "C":
mo.callout(mo.md(
"**Not the primary cause.** Each GPU's local computation is unchanged — the "
"same model, same batch size per GPU, same forward and backward pass. "
"Cache pressure and HBM bandwidth utilization per GPU are essentially identical "
"regardless of whether you are running with 8 or 512 GPUs. The bottleneck "
"is between nodes, not within them."
), kind="warn")
elif act1_pred.value == "D":
mo.callout(mo.md(
"**A real phenomenon, but not the cause here.** Gradient quality degradation "
"with very large global batch sizes is a real concern (the linear scaling rule "
"breaks above a critical batch size), but the stakeholder explicitly notes that "
"batch size per GPU is unchanged. Total global batch = 512 GPUs × 32 = 16,384. "
"For a 7B model this is well within the stable scaling regime. The MFU drop "
"is communication-bound, not convergence-bound."
), kind="warn")
else:
mo.callout(mo.md("Select a prediction above to see the reveal."), kind="info")
return
# ─── ACT I: ACT I MATHPEEK ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.accordion({
"The governing equations — AllReduce bandwidth model and DP efficiency": mo.md("""
**Ring-AllReduce Transfer Volume (per GPU)**
```
T_allreduce = [2 × (N-1)/N × grad_bytes] / BW_interconnect
```
- `N` — number of data-parallel replicas (GPU count)
- `grad_bytes` — gradient tensor size = params × 2 bytes (FP16)
- `BW_interconnect` — 900 GB/s (NVLink 4, within node) or 400 GB/s (IB HDR200, cross-node)
- For large N: the factor 2×(N-1)/N → 2, so AllReduce volume saturates at ~2× gradient size
- **Key insight**: AllReduce volume does NOT grow linearly with N — it saturates. But the
bandwidth cliff when crossing the node boundary (NVLink → IB) creates a step-change in latency.
**DP Efficiency Formula**
```
MFU_effective = (T_compute / (T_compute + T_allreduce)) × MFU_ref
```
- `T_compute` — forward + backward FLOPs / (peak_TFLOPS × MFU_ref)
- `T_allreduce` — grows as cluster spans more nodes (IB replaces NVLink)
- When T_allreduce ≈ T_compute (ratio ≈ 1), effective MFU ≈ MFU_ref / 2
**Gradient Bucketing Analysis**
```
T_effective = max(T_compute_late_layers, T_allreduce_early_gradients)
```
Gradient bucketing starts AllReduce for early-layer gradients while later layers
are still computing. Ideal overlap: T_effective → T_compute (hiding communication).
In practice, overlapping achieves 60–80% communication hiding for large models.
""")
})
return
# ─── ACT I: REFLECTION ─────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Reflection
Now that you have explored the AllReduce bottleneck, consider the primary technique
practitioners use to reclaim efficiency: overlapping communication with computation.
""")
return
@app.cell(hide_code=True)
def _(mo):
act1_reflect = mo.ui.radio(
options={
"A) Use faster GPUs to reduce compute time — this shrinks the total step time": "A",
"B) Gradient bucketing + async AllReduce — begin communicating early-layer gradients while computing late-layer gradients": "B",
"C) Reduce batch size per GPU to reduce the gradient tensor size and shorten AllReduce": "C",
"D) Quantize gradients to INT8 for AllReduce communication, then dequantize before the optimizer step": "D",
},
label="What is the primary technique to overlap communication with computation in data parallel training?",
)
act1_reflect
return (act1_reflect,)
@app.cell(hide_code=True)
def _(act1_reflect, mo):
mo.stop(
act1_reflect.value is None,
mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"),
)
if act1_reflect.value == "B":
mo.callout(mo.md(
"**Correct.** Gradient bucketing partitions the gradient tensor into chunks. "
"During the backward pass, as soon as the gradients for the last few layers are "
"computed, AllReduce begins on those buckets while the backward pass continues "
"computing gradients for earlier layers. This overlaps the two operations. "
"PyTorch DDP implements this via `bucket_cap_mb` (default: 25 MB). "
"For a 7B model with 14 GB of gradients, effective overlap can hide 60–80% of "
"the AllReduce latency, recovering significant MFU at scale."
), kind="success")
elif act1_reflect.value == "A":
mo.callout(mo.md(
"**This does not reduce the comm/compute ratio.** A faster GPU shortens T_compute, "
"which makes the numerator in the ratio smaller — but it also reduces the time "
"available to overlap communication. The ratio T_allreduce/T_compute can actually "
"worsen as compute gets faster while interconnect bandwidth stays constant. "
"This is a common misconception: hardware upgrades on the compute side do not "
"solve interconnect-bound scaling."
), kind="warn")
elif act1_reflect.value == "C":
mo.callout(mo.md(
"**This reduces the wrong dimension.** Gradient dimensions are determined by model "
"architecture, not batch size. A 7B model has 7B parameters regardless of whether "
"the local batch is 8 or 128 samples. Reducing batch size per GPU does reduce "
"gradient noise (smaller effective batch = higher gradient variance), but it does "
"not reduce AllReduce volume. The gradient tensor size is `params × 2 bytes` in FP16."
), kind="warn")
elif act1_reflect.value == "D":
mo.callout(mo.md(
"**Partially true, but not the primary technique.** INT8 gradient compression "
"can reduce AllReduce volume by 2× compared to FP16, but it introduces gradient "
"quantization error that can harm convergence for sensitive training runs. "
"BF16 gradients are standard in modern training. The more reliable approach is "
"gradient bucketing and async AllReduce, which hides rather than reduces "
"communication — recovering throughput without precision loss."
), kind="warn")
return
# ═══════════════════════════════════════════════════════════════════════════════
# ACT II — 3D PARALLELISM DESIGN CHALLENGE
# ═══════════════════════════════════════════════════════════════════════════════
# ─── ACT II: SECTION HEADER ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Act II — 3D Parallelism Design Challenge
*Design Challenge · 20–25 minutes*
""")
return
# ─── ACT II: STAKEHOLDER MESSAGE ───────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_color = COLORS["Cloud"]
_bg = COLORS["BlueLL"]
mo.Html(f"""
Incoming Message · MLOps Architect
"We need to train GPT-3 (175B parameters). A single H100 holds 80 GB of HBM3e.
With FP16 weights, FP32 optimizer states, and activation buffers, a 175B model
needs roughly 10 bytes per parameter in practice — about 1.75 TB total, which
doesn't fit in any single GPU. We have 1024 H100s available across 128 DGX nodes.
Design the 3D parallel configuration (TP × PP × DP) that maximizes MFU without
exceeding per-GPU memory or creating a pipeline bubble fraction above 10%."
""")
return
# ─── ACT II: CONCEPT FRAMING ───────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
When a model exceeds single-GPU memory capacity, data parallelism alone cannot help.
Three orthogonal strategies exist for distributing a model:
- **Tensor Parallelism (TP)**: Split individual matrix operations across GPUs within a layer.
Every forward pass requires an AllReduce across the TP group. TP must operate at high
bandwidth — otherwise the AllReduce overhead dominates. This constrains TP to
**within a single DGX node** (NVLink, 900 GB/s).
- **Pipeline Parallelism (PP)**: Assign consecutive layers to consecutive GPUs.
Requires microbatching to keep all pipeline stages busy. Introduces **bubble overhead**:
`B = (PP - 1) / (PP × m)` where `m` is the number of microbatches.
- **Data Parallelism (DP)**: Replicate the TP×PP model group and distribute the
global batch. This scales to the remaining GPU budget after TP and PP are fixed.
`DP = total_GPUs / (TP × PP)`.
The 3D configuration space has a hard constraint: `TP × PP × DP = 1024`.
Before using the configurator, predict the optimal configuration.
""")
return
# ─── ACT II: PREDICTION LOCK ───────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("### Your Configuration Prediction")
return
@app.cell(hide_code=True)
def _(mo):
act2_pred = mo.ui.radio(
options={
"A) TP=128, PP=1, DP=8 — maximize tensor parallelism to spread every layer across 128 GPUs": "A",
"B) TP=8, PP=4, DP=32 — within-node TP on NVLink, pipeline across nodes, DP for throughput scale": "B",
"C) TP=1, PP=1024, DP=1 — pure pipeline parallelism to avoid AllReduce entirely": "C",
"D) TP=4, PP=256, DP=1 — deep pipeline to maximize layer-level parallelism": "D",
},
label="Which 3D parallel configuration (TP × PP × DP) best balances memory, compute, and communication for GPT-3 175B on 1024 H100s?",
)
act2_pred
return (act2_pred,)
@app.cell(hide_code=True)
def _(act2_pred, mo):
mo.stop(
act2_pred.value is None,
mo.callout(
mo.md("Select your configuration prediction above to unlock the Act II instruments."),
kind="warn",
),
)
mo.md("")
return
# ─── ACT II: INSTRUMENT PANEL INTRO ────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### 3D Parallelism Configurator
Adjust TP and PP degrees. DP is computed automatically from the constraint
`TP × PP × DP = 1024`. The configurator will enforce per-GPU memory and
pipeline bubble constraints.
""")
return
# ─── ACT II: SLIDERS ───────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
tp_degree = mo.ui.slider(
start=1, stop=64, value=8, step=1,
label="Tensor Parallelism degree (TP)",
)
pp_degree = mo.ui.slider(
start=1, stop=64, value=4, step=1,
label="Pipeline Parallelism degree (PP)",
)
n_microbatches = mo.ui.slider(
start=1, stop=64, value=8, step=1,
label="Microbatches per pipeline flush (m)",
)
mo.hstack([
mo.vstack([tp_degree, pp_degree]),
mo.vstack([n_microbatches]),
], justify="center", gap=2)
return n_microbatches, pp_degree, tp_degree
# ─── ACT II: PHYSICS ENGINE ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(
COLORS,
GPUS_PER_NODE,
H100_RAM_GB,
H100_TFLOPS_FP16,
IB_HDR200_BW_GBS,
NVLINK4_BW_GBS,
apply_plotly_theme,
go,
math,
mo,
n_microbatches,
np,
pp_degree,
tp_degree,
):
# ── Model constants — GPT-3 175B ─────────────────────────────────────────────
# Source: @sec-distributed-training-systems, Brown et al. 2020 (GPT-3 paper)
GPT3_PARAMS_B = 175.0 # 175B parameters; Brown et al. 2020
GPT3_LAYERS = 96 # transformer layers; Brown et al. 2020
BYTES_PER_PARAM_FP16 = 2 # FP16 model weights
OPTIMIZER_OVERHEAD = 8 # FP32 optimizer states (m1+m2+master) ≈ 8 bytes/param
ACTIVATION_BYTES_GB = 8.0 # activation buffers per pipeline stage (estimate)
TOTAL_BYTES_PER_PARAM = 10 # practical: weights + grads + optimizer ≈ 10 bytes/param
TOTAL_GPUS = 1024 # available H100s
MFU_BASE = 0.52 # reference MFU for calibration
TP_ALLREDUCE_LAYERS = GPT3_LAYERS # TP AllReduce happens every layer
# ── Extract widget values ────────────────────────────────────────────────────
_tp = tp_degree.value
_pp = pp_degree.value
_m = n_microbatches.value
# ── Constraint: TP × PP × DP = 1024 ─────────────────────────────────────────
_tp_pp_product = _tp * _pp
_dp = TOTAL_GPUS // _tp_pp_product if _tp_pp_product <= TOTAL_GPUS else 0
_dp_remainder = TOTAL_GPUS % _tp_pp_product if _tp_pp_product > 0 else 1
_config_valid = (_dp > 0) and (_dp_remainder == 0)
# ── Memory analysis ──────────────────────────────────────────────────────────
# Per-GPU memory = model shards + optimizer + activations
# TP shards model parameters: each GPU holds 1/TP of each tensor
# PP assigns GPT3_LAYERS/PP layers to each stage
_params_per_gpu_b = GPT3_PARAMS_B / (_tp * _pp) # billions
_params_per_gpu = _params_per_gpu_b * 1e9
_model_mem_gb = _params_per_gpu * BYTES_PER_PARAM_FP16 / 1e9
_optim_mem_gb = _params_per_gpu * OPTIMIZER_OVERHEAD / 1e9
_activ_mem_gb = ACTIVATION_BYTES_GB * (_tp if _tp > 1 else 1) # activation replication
_total_mem_gb = _model_mem_gb + _optim_mem_gb + _activ_mem_gb
# ── Failure state: OOM ───────────────────────────────────────────────────────
_oom = _total_mem_gb > H100_RAM_GB
# ── Pipeline bubble fraction ─────────────────────────────────────────────────
# Source: @sec-distributed-training-systems pipeline parallelism section
# B = (PP - 1) / (PP × m)
_bubble_frac = (_pp - 1) / (_pp * _m) if _pp > 1 else 0.0
_bubble_pct = _bubble_frac * 100.0
_bubble_warn = _bubble_pct > 10.0
# ── TP communication overhead ────────────────────────────────────────────────
# TP AllReduce volume per layer = 2 × hidden_dim × seq_len × 2 bytes (FP16)
# Simplified: TP communication time relative to compute
# Each TP AllReduce per layer uses NVLink (within node) or IB (cross-node)
_tp_crosses_node = _tp > GPUS_PER_NODE
_tp_fabric_bw = IB_HDR200_BW_GBS if _tp_crosses_node else NVLINK4_BW_GBS
_tp_bw_penalty = NVLINK4_BW_GBS / _tp_fabric_bw # 1.0 if NVLink, 2.25 if IB
_tp_warn = _tp_crosses_node
# ── Effective MFU estimation ─────────────────────────────────────────────────
# TP penalty: communication overhead from intra-layer AllReduce
# PP penalty: pipeline bubble fraction
# DP penalty: AllReduce for gradients (small for large DP with gradient bucketing)
_tp_comm_penalty = 1.0 - (0.05 * math.log2(max(_tp, 1)) * _tp_bw_penalty) # rough empirical model
_pp_efficiency = 1.0 - _bubble_frac
_dp_comm_penalty = 1.0 - (0.02 * math.log2(max(_dp, 1))) # gradient AllReduce overhead
_mfu_effective = MFU_BASE * _tp_comm_penalty * _pp_efficiency * _dp_comm_penalty
_mfu_effective = max(0.0, min(_mfu_effective, MFU_BASE))
_mfu_pct_3d = _mfu_effective * 100.0
# ── Color coding ─────────────────────────────────────────────────────────────
_mem_color = COLORS["RedLine"] if _oom else (COLORS["OrangeLine"] if _total_mem_gb > 60 else COLORS["GreenLine"])
_bubble_color = COLORS["RedLine"] if _bubble_warn else (COLORS["OrangeLine"] if _bubble_pct > 5 else COLORS["GreenLine"])
_mfu_color_3d = COLORS["RedLine"] if _mfu_pct_3d < 25 else (COLORS["OrangeLine"] if _mfu_pct_3d < 40 else COLORS["GreenLine"])
_cfg_color = COLORS["GreenLine"] if _config_valid else COLORS["RedLine"]
# ── FAILURE STATE: OOM ───────────────────────────────────────────────────────
_oom_banner = ""
if _oom:
_oom_banner = f"""
OOM — Configuration Infeasible
Required per GPU: {_total_mem_gb:.1f} GB — exceeds H100 limit: {H100_RAM_GB:.0f} GB.
Model shard: {_model_mem_gb:.1f} GB | Optimizer states: {_optim_mem_gb:.1f} GB | Activations: {_activ_mem_gb:.1f} GB.
Increase TP or PP to reduce the per-GPU model shard below {H100_RAM_GB - _activ_mem_gb:.0f} GB (leaving room for activations).
"""
# ── WARNING STATE: TP crosses node boundary ───────────────────────────────────
_tp_bw_banner = ""
if _tp_warn:
_penalty_x = NVLINK4_BW_GBS / IB_HDR200_BW_GBS
_tp_bw_banner = f"""
Tensor Parallelism Crosses Node Boundary
TP={_tp} exceeds GPUS_PER_NODE={GPUS_PER_NODE}. TP AllReduce uses
InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s) instead of NVLink 4
({NVLINK4_BW_GBS:.0f} GB/s) — a {_penalty_x:.1f}× bandwidth penalty
on every layer's AllReduce. TP should remain ≤ {GPUS_PER_NODE} to exploit
NVLink within a single DGX node.
"""
# ── Config validity warning ───────────────────────────────────────────────────
_cfg_banner = ""
if not _config_valid:
_cfg_banner = f"""
Invalid Configuration: TP × PP = {_tp} × {_pp} = {_tp_pp_product}
does not divide 1024 evenly. Choose TP and PP such that 1024 / (TP × PP)
is a positive integer.
"""
# ── Build bubble fraction vs PP/m chart ──────────────────────────────────────
_pp_range = list(range(1, 33))
_bubble_m1 = [(_p - 1) / (_p * 1) * 100 for _p in _pp_range]
_bubble_m4 = [(_p - 1) / (_p * 4) * 100 for _p in _pp_range]
_bubble_m8 = [(_p - 1) / (_p * 8) * 100 for _p in _pp_range]
_bubble_m16 = [(_p - 1) / (_p * 16) * 100 for _p in _pp_range]
_fig2 = go.Figure()
for _vals, _label, _clr in [
(_bubble_m1, "m=1 microbatch", "#cb202d"),
(_bubble_m4, "m=4 microbatches", "#cc5500"),
(_bubble_m8, "m=8 microbatches", "#006395"),
(_bubble_m16, "m=16 microbatches", "#008f45"),
]:
_fig2.add_trace(go.Scatter(
x=_pp_range, y=_vals, mode="lines", name=_label,
line=dict(color=_clr, width=2),
hovertemplate=f"PP=%{{x}} {_label}
Bubble: %{{y:.1f}}%",
))
_fig2.add_hline(y=10.0, line=dict(color="#1e293b", width=1.5, dash="dash"),
annotation_text="10% bubble ceiling", annotation_position="top right")
# Mark current config
_fig2.add_trace(go.Scatter(
x=[_pp], y=[_bubble_pct],
mode="markers", name="Current config",
marker=dict(size=14, color=COLORS["RedLine"], symbol="diamond",
line=dict(color="white", width=2)),
hovertemplate=f"PP={_pp}, m={_m}
Bubble: {_bubble_pct:.1f}%",
))
_fig2.update_layout(
height=300,
xaxis=dict(title="Pipeline Parallelism (PP stages)", range=[1, 32]),
yaxis=dict(title="Pipeline Bubble Fraction (%)", range=[0, 55]),
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
margin=dict(t=40, b=50, l=50, r=20),
)
apply_plotly_theme(_fig2)
# ── Render all outputs ────────────────────────────────────────────────────────
mo.vstack([
mo.Html(f"""
Physics — 3D Parallel Memory and Bubble Analysis
Configuration: TP={_tp} × PP={_pp} × DP={_dp if _config_valid else "N/A"}
{'= ' + str(TOTAL_GPUS) if _config_valid else '(INVALID: TP×PP=' + str(_tp_pp_product) + ' does not divide 1024)'}
Params per GPU = {GPT3_PARAMS_B}B / (TP={_tp} × PP={_pp}) = {_params_per_gpu_b:.2f}B params
Model memory (FP16) = {_params_per_gpu_b:.2f}B × 2 bytes = {_model_mem_gb:.1f} GB
Optimizer states (FP32) = {_params_per_gpu_b:.2f}B × 8 bytes = {_optim_mem_gb:.1f} GB
Activation buffer = {_activ_mem_gb:.1f} GB (estimated)
Total per-GPU memory = {_total_mem_gb:.1f} GB / {H100_RAM_GB:.0f} GB limit
Pipeline bubble B = (PP-1)/(PP×m) = ({_pp}-1)/({_pp}×{_m}) = {_bubble_pct:.1f}%
TP bandwidth = {'InfiniBand ' + str(IB_HDR200_BW_GBS) + ' GB/s (CROSS-NODE)' if _tp_crosses_node else 'NVLink 4 ' + str(NVLINK4_BW_GBS) + ' GB/s (within node)'}
Effective MFU = {_mfu_pct_3d:.1f}%
{_oom_banner}
{_tp_bw_banner}
{_cfg_banner}
"""),
mo.Html(f"""
Per-GPU Memory
{_total_mem_gb:.0f}GB
/ {H100_RAM_GB:.0f} GB limit
Pipeline Bubble
{_bubble_pct:.1f}%
ceiling: 10%
Effective MFU
{_mfu_pct_3d:.1f}%
3D parallel
DP degree
{_dp if _config_valid else 'N/A'}
= 1024 / (TP×PP)
"""),
mo.ui.plotly(_fig2),
])
return (
_bubble_pct,
_config_valid,
_dp,
_mfu_pct_3d,
_oom,
_total_mem_gb,
_tp_crosses_node,
)
# ─── ACT II: PREDICTION REVEAL ─────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(act2_pred, mo):
_correct = act2_pred.value == "B"
if _correct:
mo.callout(mo.md(
"**Correct.** TP=8, PP=4, DP=32 is the principled baseline for GPT-3 scale training. "
"TP=8 maps exactly to one DGX node (8 GPUs per node), keeping TP AllReduce on "
"NVLink at 900 GB/s. PP=4 assigns 96/4=24 transformer layers per stage, "
"requiring 4 nodes per pipeline. With 8 microbatches, the pipeline bubble "
"B=(4-1)/(4×8)=9.375% stays just under the 10% ceiling. DP=32 then "
"replicates the TP×PP group 32 times across the remaining 1024/(8×4)=32 GPU groups. "
"This matches the configuration used in real GPT-3-scale training runs "
"on DGX clusters (Megatron-LM, 2021)."
), kind="success")
elif act2_pred.value == "A":
mo.callout(mo.md(
"**Infeasible.** TP=128 distributes each layer across 128 GPUs. Each tensor "
"parallel AllReduce must traverse 16 DGX nodes (128/8=16), using InfiniBand "
"instead of NVLink — a 2.25× bandwidth penalty on every single layer forward and "
"backward pass. The AllReduce occurs 96 times per forward pass (once per transformer "
"layer). At IB bandwidth this becomes the dominant bottleneck, crushing MFU. "
"Configure TP in the simulator with TP > 8 to observe the bandwidth penalty warning."
), kind="warn")
elif act2_pred.value == "C":
mo.callout(mo.md(
"**Catastrophic bubble overhead.** PP=1024 with a single microbatch gives "
"B=(1024-1)/(1024×1)≈99.9% bubble fraction — the cluster is 99.9% idle. "
"Even with m=64 microbatches: B=(1024-1)/(1024×64)≈1.5%, but now each "
"gradient accumulation step is enormous, harming optimizer convergence. "
"Pure pipeline parallelism with depth matching GPU count is never used in practice. "
"Use the configurator to set PP=1024 and observe the bubble fraction."
), kind="warn")
elif act2_pred.value == "D":
mo.callout(mo.md(
"**Pipeline bubble too large.** PP=256 with m=8 microbatches gives "
"B=(256-1)/(256×8)=12.4% — already over the 10% ceiling. "
"You would need m=32 microbatches to bring the bubble to 3.1%, "
"but that requires a batch size of 32×256=8,192 sequences through the pipeline "
"before each optimizer step, creating a very large effective batch. "
"With DP=1, there is no data parallelism to amortize the batch size requirement. "
"This is an over-pipelined design."
), kind="warn")
else:
mo.callout(mo.md("Select a configuration prediction above to see the analysis."), kind="info")
return
# ─── ACT II: MATHPEEK ──────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.accordion({
"The governing equations — 3D parallel memory, bubble fraction, and TP communication": mo.md("""
**3D Parallel Per-GPU Memory**
```
mem_per_gpu = (params / (TP × PP)) × bytes_per_param
+ (params / (TP × PP)) × optimizer_bytes_per_param
+ activation_buffer
```
- `params` — total model parameters (e.g. 175B for GPT-3)
- `TP × PP` — reduces the parameter shard on each GPU
- `bytes_per_param` — FP16 = 2 bytes; FP32 master copy = 4 bytes
- `optimizer_bytes_per_param` — Adam states: 2 FP32 moments + master = ~8 bytes/param
- **Key insight**: TP and PP jointly reduce per-GPU memory — TP shards each matrix
horizontally, PP shards the depth (layers). DP does NOT reduce memory: every DP replica
holds the full TP×PP model shard.
**Pipeline Bubble Fraction**
```
B = (PP - 1) / (PP × m)
```
- `PP` — pipeline parallelism degree (stages)
- `m` — number of microbatches per pipeline flush
- At PP=4, m=8: B = 3/32 = 9.375%
- **Key insight**: increasing m (microbatches) reduces bubble but increases pipeline latency
and may harm optimizer convergence at very large effective batch sizes.
- Practical ceiling: B < 10% is standard in production (Megatron-LM guidelines).
**Tensor Parallelism Communication Volume (per layer)**
```
TP AllReduce per layer = 2 × (TP - 1)/TP × hidden_dim × seq_len × 2 bytes (FP16)
```
- Occurs **every layer** in both forward and backward passes
- At 900 GB/s (NVLink): ~0.5 ms per layer for a 175B model configuration
- At 400 GB/s (IB): ~1.1 ms per layer — 2.25× slower, applied 96 times per forward pass
- **Key insight**: TP communication is not a one-time cost — it is a per-layer tax.
This is why TP > 8 (crossing node boundary to IB) destroys MFU.
""")
})
return
# ─── ACT II: REFLECTION ────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Reflection
You observed that TP=8 is the natural constraint boundary. Before finishing, confirm
your understanding of why this boundary is fundamental.
""")
return
@app.cell(hide_code=True)
def _(mo):
act2_reflect = mo.ui.radio(
options={
"A) PyTorch does not support cross-node tensor parallelism in its distributed primitives": "A",
"B) Tensor parallel AllReduce happens every layer — at InfiniBand bandwidth this becomes the dominant bottleneck": "B",
"C) Tensor parallelism requires shared GPU memory, which is unavailable across separate nodes": "C",
"D) Cross-node tensor parallelism causes numerical instability due to floating-point rounding across nodes": "D",
},
label="Why must tensor parallelism be confined within a single DGX node (TP ≤ 8)?",
)
act2_reflect
return (act2_reflect,)
@app.cell(hide_code=True)
def _(act2_reflect, mo):
mo.stop(
act2_reflect.value is None,
mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"),
)
if act2_reflect.value == "B":
mo.callout(mo.md(
"**Correct.** Tensor parallelism introduces an AllReduce after every transformer "
"layer's matrix operations — both in the forward pass and the backward pass. "
"For a 96-layer model like GPT-3, that is 192 AllReduce calls per training step. "
"At NVLink bandwidth (900 GB/s) this adds ~1 ms per step — tolerable. "
"At InfiniBand bandwidth (400 GB/s), the penalty is 2.25× higher and accumulates "
"across all 96 layers, making TP communication the dominant step time. "
"The constraint TP ≤ GPUS_PER_NODE (≤ 8) is not a software limitation; "
"it is a bandwidth physics constraint."
), kind="success")
elif act2_reflect.value == "A":
mo.callout(mo.md(
"**Incorrect.** PyTorch (via Megatron-LM's column/row parallel linear layers) "
"and frameworks like DeepSpeed fully support cross-node tensor parallelism "
"using the standard NCCL AllReduce over InfiniBand. The constraint is physical, "
"not a software limitation. The code works fine; the bandwidth penalty is what "
"makes cross-node TP undesirable."
), kind="warn")
elif act2_reflect.value == "C":
mo.callout(mo.md(
"**Incorrect.** Tensor parallelism does not require shared memory. It is a "
"message-passing strategy: each GPU holds a shard of the weight matrix, "
"computes a partial matrix multiply on its shard, then the partial results are "
"reduced via AllReduce across all TP ranks. This works equally over NVLink "
"or InfiniBand — the difference is only bandwidth and therefore latency."
), kind="warn")
elif act2_reflect.value == "D":
mo.callout(mo.md(
"**Incorrect.** Floating-point arithmetic in distributed training uses deterministic "
"reduction primitives (NCCL's AllReduce). The numerical behavior is identical whether "
"the AllReduce traverses NVLink or InfiniBand — both use the same FP16/BF16 precision "
"operations. Numerical instability in distributed training typically arises from "
"gradient accumulation order (non-associative floating-point operations), not from "
"the physical transport medium."
), kind="warn")
return
# ─── LEDGER SAVE + HUD ─────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(
COLORS,
_bubble_pct,
_config_valid,
_dp,
_mfu_pct_3d,
_oom,
_total_mem_gb,
_tp_crosses_node,
act1_pred,
act2_pred,
act2_reflect,
act1_reflect,
ledger,
mo,
n_microbatches,
pp_degree,
tp_degree,
):
# ── Save to Design Ledger ────────────────────────────────────────────────────
_context = "3d_parallel" if tp_degree.value > 1 or pp_degree.value > 1 else "data_parallel"
ledger.save(
chapter="v2_05",
design={
"context": _context,
"tp_degree": tp_degree.value,
"pp_degree": pp_degree.value,
"dp_degree": _dp,
"total_gpus": 1024,
"mfu_percent": round(_mfu_pct_3d, 2),
"act1_prediction": act1_pred.value if act1_pred.value else "no_selection",
"act1_correct": act1_pred.value == "B",
"act1_reflect": act1_reflect.value if act1_reflect.value else "no_selection",
"act2_result": round(_mfu_pct_3d, 2),
"act2_decision": act2_pred.value if act2_pred.value else "no_selection",
"constraint_hit": _oom or _tp_crosses_node,
"memory_feasible": not _oom,
},
)
# ── Determine overall performance tier ──────────────────────────────────────
_act1_ok = act1_pred.value == "B"
_act2_ok = act2_pred.value == "B"
_mfu_ok = _mfu_pct_3d >= 40.0 and not _oom and _bubble_pct <= 10.0
_tier = "Optimal" if (_act1_ok and _act2_ok and _mfu_ok) else ("Partial" if (_act1_ok or _act2_ok) else "Developing")
_tier_color = COLORS["GreenLine"] if _tier == "Optimal" else (COLORS["OrangeLine"] if _tier == "Partial" else COLORS["TextMuted"])
# ── HUD Footer ───────────────────────────────────────────────────────────────
_hud = mo.Html(f"""
LAB
Vol2 · Lab 05
CHAPTER
v2_05 · Distributed Training
CONTEXT
{_context.upper()}
CONFIG
TP={tp_degree.value} × PP={pp_degree.value} × DP={_dp}
MFU
{_mfu_pct_3d:.1f}%
ACT I
{"CORRECT" if _act1_ok else "REVIEW"}
ACT II
{"CORRECT" if _act2_ok else "REVIEW"}
TIER
{_tier.upper()}
OOM
{"YES" if _oom else "NO"}
""")
_hud
return
# ─── CONNECTIONS ──────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Connections
**Textbook:** This lab explores the core concepts of
@sec-distributed-training-systems — the Iron Law of Scale, the
Communication-Computation Ratio, and the 3D Parallelism Cube
(@fig-3d-parallelism-cube).
**Prior Labs:** Lab 03 (Network Fabrics) established the physical bandwidth
limits — NVLink vs InfiniBand — that constrain TP degree here.
Lab 04 (Data Storage) established the I/O pipeline that feeds each DP replica.
**Next Lab:** Vol2 Lab 06 (Collective Communications) examines the
Ring-AllReduce and Tree-AllReduce algorithms in detail, quantifying
why Ring-AllReduce achieves near-linear scaling efficiency while
parameter-server approaches hit coordination bottlenecks.
""")
return
# ─── KEY TAKEAWAYS ─────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
## Key Takeaways
1. **The Parallelism Paradox is a bandwidth ratio, not a software bug.**
AllReduce volume saturates at approximately 2× gradient size regardless of GPU count,
but the transition from NVLink (900 GB/s, within node) to InfiniBand (400 GB/s, cross-node)
creates a step-change in communication time that drives the MFU cliff observed at
8→64 GPUs. MFU falls because T_allreduce grows while T_compute stays constant.
2. **The 3D parallelism constraint TP ≤ GPUS_PER_NODE is physics, not convention.**
Tensor parallelism performs AllReduce after every transformer layer. At 96 layers,
the per-layer AllReduce penalty accumulates into a step-time budget that InfiniBand
cannot satisfy. Confine TP within a single DGX node to keep every layer's
synchronization on NVLink. Then PP crosses nodes over InfiniBand — but PP
AllReduce happens only once per pipeline stage, not once per layer.
""")
return
if __name__ == "__main__":
app.run()