Files
cs249r_book/labs/vol2/lab_05_dist_train.py
Vijay Janapa Reddi 6f5732558f feat: add complete first-draft labs for both volumes (33 Marimo labs)
Add all Vol1 (labs 01-16) and Vol2 (labs 01-17) interactive Marimo labs
as the first full first-pass implementation of the ML Systems curriculum labs.

Each lab follows the PROTOCOL 2-Act structure (35-40 min):
- Act I: Calibration with prediction lock → instruments → overlay
- Act II: Design challenge with failure states and reflection

Key pedagogical instruments introduced progressively:
- Vol1: D·A·M Triad, Iron Law, Memory Ledger, Roofline, Amdahl's Law,
  Little's Law, P99 Histogram, Compression Frontier, Chouldechova theorem
- Vol2: NVLink vs PCIe cliff, Bisection BW, Young-Daly T*, Parallelism Paradox,
  AllReduce ring vs tree, KV-cache model, Jevons Paradox, DP ε-δ tradeoff,
  SLO composition, Adversarial Pareto, two-volume synthesis capstone

All 35 staged files pass AST syntax verification (36/36 including lab_00).

Also includes:
- labs/LABS_SPEC.md: authoritative sub-agent brief for all lab conventions
- labs/core/style.py: expanded unified design system with semantic color tokens
2026-03-01 19:59:04 -05:00

1364 lines
73 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
import marimo
__generated_with = "0.19.6"
app = marimo.App(width="full")
# ─────────────────────────────────────────────────────────────────────────────
# LAB 05: THE PARALLELISM PARADOX
#
# Chapter: Distributed Training Systems (@sec-distributed-training-systems)
# Core Invariant: The Parallelism Paradox — adding more GPUs to data parallel
# training increases communication overhead, which can decrease
# MFU below single-GPU levels for large models. 3D parallelism
# (Tensor + Pipeline + Data) is required for models that don't
# fit on a single GPU, but each dimension adds overhead.
#
# 2-Act Structure (35-40 minutes):
# Act I — The Data Parallel Wall (12-15 min)
# A 7B model trained with DP across 8→64→512 GPUs shows MFU
# collapsing from 52% to 19%. The central question: why does MFU
# fall as we add more GPUs? Students must confront that communication
# time grows relative to compute time as cluster size grows.
#
# Act II — 3D Parallelism Design Challenge (20-25 min)
# Design the TP×PP×DP configuration for GPT-3 175B on 1024 H100s.
# The failure state: per-GPU memory exceeds 80 GB (model doesn't fit)
# and a bandwidth penalty warning when TP crosses node boundaries.
#
# Deployment Contexts:
# DP: Data Parallel — replicate model, sync gradients via AllReduce
# 3D Parallel: TP×PP×DP — within-node TP (NVLink), cross-node PP (IB), DP
#
# Hardware Constants:
# H100_TFLOPS_FP16 = 1979 # TFLOPS, H100 SXM5 with sparsity; source: NVIDIA spec
# H100_BW_GBS = 3350 # GB/s HBM3e; source: NVIDIA H100 spec sheet
# H100_RAM_GB = 80 # GB HBM3e; source: NVIDIA H100 spec sheet
# NVLINK4_BW_GBS = 900 # GB/s NVLink 4; source: NVIDIA DGX H100 spec
# IB_HDR200_BW_GBS = 400 # GB/s InfiniBand HDR200; source: Mellanox spec
# GPUS_PER_NODE = 8 # Standard DGX H100 node size
#
# Design Ledger: saves chapter="v2_05" with DP vs 3D context, parallelism
# degrees, MFU achieved, prediction accuracy, failure states.
# ─────────────────────────────────────────────────────────────────────────────
# ─── CELL 0: SETUP (hide_code=False — leave visible for instructor inspection) ─
@app.cell
def _():
import marimo as mo
import sys
import math
from pathlib import Path
import plotly.graph_objects as go
import numpy as np
from plotly.subplots import make_subplots
_root = Path(__file__).resolve().parents[2]
if str(_root) not in sys.path:
sys.path.insert(0, str(_root))
from labs.core.state import DesignLedger
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
ledger = DesignLedger()
return COLORS, LAB_CSS, apply_plotly_theme, go, ledger, math, mo, np, make_subplots
# ─── CELL 1: HEADER ────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, LAB_CSS, mo):
_c_dp = COLORS["BlueLine"]
_c_3d = COLORS["Cloud"]
_c_s0 = COLORS["Surface0"]
_c_s1 = COLORS["Surface1"]
_header = mo.Html(f"""
{LAB_CSS}
<div style="background: linear-gradient(135deg, {_c_s0} 0%, {_c_s1} 100%);
border-radius: 16px; padding: 32px 40px; margin-bottom: 8px;
border: 1px solid #2d3748;">
<div style="display: flex; justify-content: space-between; align-items: flex-start; flex-wrap: wrap; gap: 16px;">
<div>
<div style="font-size: 0.72rem; font-weight: 700; color: #94a3b8;
text-transform: uppercase; letter-spacing: 0.14em; margin-bottom: 8px;">
Vol 2 &middot; Lab 05 &middot; Distributed Training Systems
</div>
<div style="font-size: 2.0rem; font-weight: 800; color: #f1f5f9; line-height: 1.15; margin-bottom: 10px;">
The Parallelism Paradox
</div>
<div style="font-size: 0.95rem; color: #94a3b8; max-width: 600px; line-height: 1.6;">
Adding GPUs to a data-parallel job can reduce Model FLOPs Utilization
below single-GPU levels. This lab forces you to confront the
communication-computation ratio and design the 3D parallelism
configuration that keeps 1024 H100s productive.
</div>
</div>
<div style="display: flex; flex-direction: column; gap: 8px; flex-shrink: 0;">
<span class="badge badge-info">Parallelism Paradox</span>
<span class="badge badge-info">AllReduce Bandwidth Model</span>
<span class="badge badge-info">3D Parallel: TP &times; PP &times; DP</span>
<span class="badge badge-warn">35&ndash;40 minutes &middot; 2 Acts</span>
</div>
</div>
<div style="display: flex; gap: 16px; margin-top: 20px; flex-wrap: wrap;">
<div style="background: rgba(99,102,241,0.12); border: 1px solid rgba(99,102,241,0.35);
border-radius: 8px; padding: 10px 16px; font-size: 0.82rem;">
<span style="color: {_c_dp}; font-weight: 700;">Context A — Data Parallel</span>
<span style="color: #94a3b8;"> &mdash; 7B model &middot; 8&ndash;512 GPUs &middot; AllReduce via NVLink / IB</span>
</div>
<div style="background: rgba(99,102,241,0.12); border: 1px solid rgba(99,102,241,0.35);
border-radius: 8px; padding: 10px 16px; font-size: 0.82rem;">
<span style="color: {_c_3d}; font-weight: 700;">Context B — 3D Parallel</span>
<span style="color: #94a3b8;"> &mdash; 175B model &middot; 1024 H100s &middot; TP&times;PP&times;DP design</span>
</div>
</div>
</div>
""")
_header
return
# ─── CELL 2: RECOMMENDED READING ───────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.callout(mo.md("""
**Recommended Reading** — Complete the following before this lab:
- **@sec-distributed-training-systems-systems-multimachine-scaling-fundamentals-ff96** — The Iron Law of Scale: `T_step(N) = T_compute/N + T_comm(N) - T_overlap` and the Communication-Computation Ratio
- **@sec-distributed-training-systems** — Why distribution is necessary: memory exhaustion, training duration, and dataset scale thresholds
- The Data Parallelism section — AllReduce gradient synchronization, Ring-AllReduce bandwidth formula, gradient bucketing
- The 3D Parallelism section — Tensor Parallelism (within-node), Pipeline Parallelism (across nodes), pipeline bubble fraction `B = (PP-1)/(PP * m)`
"""), kind="info")
return
# ─── CELL 3: CONTEXT TOGGLE ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_c_muted = COLORS["TextMuted"]
context_toggle = mo.ui.radio(
options={
"Data Parallel (DP)": "dp",
"3D Parallel (TP+PP+DP)": "3d",
},
value="Data Parallel (DP)",
label="Deployment context for this session:",
inline=True,
)
mo.hstack([
mo.Html(f"""
<div style="font-size:0.78rem; font-weight:700; color:{_c_muted};
text-transform:uppercase; letter-spacing:0.08em;
margin-right:8px; padding-top:2px;">
Active Context:
</div>
"""),
context_toggle,
], justify="start", gap=0)
return (context_toggle,)
# ═══════════════════════════════════════════════════════════════════════════════
# ACT I — THE DATA PARALLEL WALL
# ═══════════════════════════════════════════════════════════════════════════════
# ─── ACT I: SECTION HEADER ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Act I — The Data Parallel Wall
*Calibration &middot; 12&ndash;15 minutes*
""")
return
# ─── ACT I: STAKEHOLDER MESSAGE ────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_color = COLORS["BlueLine"]
_bg = COLORS["BlueL"]
mo.Html(f"""
<div style="border-left: 4px solid {_color}; background: {_bg};
border-radius: 0 10px 10px 0; padding: 16px 22px; margin: 12px 0;">
<div style="font-size: 0.72rem; font-weight: 700; color: {_color};
text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 6px;">
Incoming Message &middot; Training Infrastructure Lead
</div>
<div style="font-style: italic; font-size: 1.0rem; color: #1e293b; line-height: 1.65;">
"We are training a 7B parameter model using data parallelism. I measured MFU at
8 GPUs = 52%. At 64 GPUs it dropped to 38%. At 512 GPUs it's 19%. The model
hasn't changed. The batch size per GPU hasn't changed. We just added more GPUs
and it got worse. We're wasting $40,000 per day in idle compute. Can you tell
me exactly why MFU falls as we scale data parallelism?"
</div>
</div>
""")
return
# ─── ACT I: CONCEPT FRAMING ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
The training lead's observation is not a software bug. It is the **Parallelism Paradox**:
the physical law that governs every data-parallel training job.
Data parallel training replicates the model on every GPU, runs a forward and backward
pass on each device's local batch, then synchronizes gradients across all devices via
**AllReduce** before the optimizer step. The AllReduce is unavoidable — without it,
each replica would diverge. The critical question is how long AllReduce takes relative
to the compute step.
The **Communication-Computation Ratio** from @sec-distributed-training-systems determines
whether a cluster behaves as a supercomputer or as a collection of idling heaters:
- **Compute-Bound (Low Ratio)**: `T_compute >> T_comm`. GPUs spend most time on matrix
multiplications. This is the ideal state.
- **Communication-Bound (High Ratio)**: `T_comm ≈ T_compute`. GPUs spend significant
time waiting for gradients. This is the common state for LLMs at scale.
Before looking at any numbers, commit to a prediction about what causes MFU to fall.
""")
return
# ─── ACT I: PREDICTION LOCK ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("### Your Prediction")
return
@app.cell(hide_code=True)
def _(mo):
act1_pred = mo.ui.radio(
options={
"A) Software overhead in NCCL and collective libraries grows with cluster size": "A",
"B) AllReduce communication time grows with cluster size while compute time stays constant — the comm/compute ratio rises": "B",
"C) Larger clusters cause L2 cache pressure and HBM bandwidth saturation per GPU": "C",
"D) 512 GPUs exceeds optimal batch size — gradient quality degrades and more steps are needed": "D",
},
label="Why does Model FLOPs Utilization (MFU) fall as we scale a data-parallel job from 8 to 512 GPUs?",
)
act1_pred
return (act1_pred,)
@app.cell(hide_code=True)
def _(act1_pred, mo):
mo.stop(
act1_pred.value is None,
mo.callout(
mo.md("Select your prediction above to unlock the Act I instruments."),
kind="warn",
),
)
mo.md("")
return
# ─── ACT I: INSTRUMENT PANEL INTRO ─────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Data Parallel Scaling Explorer
Adjust the parameters below to see how AllReduce communication time compares to
compute time — and how their ratio determines MFU at each scale point.
""")
return
# ─── ACT I: SLIDERS ────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
dp_model_b = mo.ui.slider(
start=1, stop=175, value=7, step=1,
label="Model size (B params)",
)
dp_gpus = mo.ui.slider(
start=8, stop=1024, value=8, step=8,
label="Number of GPUs (DP degree)",
)
dp_batch_per_gpu = mo.ui.slider(
start=8, stop=128, value=32, step=8,
label="Micro-batch size per GPU",
)
dp_interconnect = mo.ui.dropdown(
options={"NVLink 4 (within DGX node, 8 GPUs)": "nvlink", "InfiniBand HDR200 (cross-node)": "ib"},
value="NVLink 4 (within DGX node, 8 GPUs)",
label="Interconnect fabric",
)
mo.hstack([
mo.vstack([dp_model_b, dp_gpus]),
mo.vstack([dp_batch_per_gpu, dp_interconnect]),
], justify="center", gap=2)
return dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b
# ─── ACT I: PHYSICS ENGINE ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, apply_plotly_theme, dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b, go, math, mo, np):
# ── Hardware constants ───────────────────────────────────────────────────────
# Source: NVIDIA H100 SXM5 spec sheet, Mellanox InfiniBand HDR200 spec
H100_TFLOPS_FP16 = 1979.0 # TFLOPS H100 SXM5 with sparsity; NVIDIA spec
H100_RAM_GB = 80.0 # GB HBM3e; NVIDIA H100 spec sheet
NVLINK4_BW_GBS = 900.0 # GB/s NVLink 4 (per-direction); NVIDIA DGX H100 spec
IB_HDR200_BW_GBS = 400.0 # GB/s InfiniBand HDR200 (per-direction); Mellanox spec
GPUS_PER_NODE = 8 # DGX H100 node size; standard industry config
BYTES_PER_PARAM = 2 # FP16: 2 bytes per parameter
BYTES_PER_GRAD = 2 # FP16 gradients: 2 bytes per gradient
# ── Extract widget values ────────────────────────────────────────────────────
_params_b = dp_model_b.value # billions of params
_gpus = dp_gpus.value
_batch_gpu = dp_batch_per_gpu.value
_fabric = dp_interconnect.value
# ── Derived quantities ───────────────────────────────────────────────────────
_params = _params_b * 1e9 # total parameters
_grad_bytes = _params * BYTES_PER_GRAD # gradient tensor size in bytes
_grad_gb = _grad_bytes / 1e9 # gradient size in GB
# ── AllReduce bandwidth model ────────────────────────────────────────────────
# Ring-AllReduce transfers 2*(N-1)/N * data per device
# Source: @sec-distributed-training-systems AllReduce bandwidth analysis
# Effective bytes per GPU = 2*(N-1)/N * grad_gb
_allreduce_factor = 2.0 * (_gpus - 1) / _gpus
_allreduce_data_gb = _grad_gb * _allreduce_factor # GB transferred per GPU
# Interconnect selection and effective bandwidth
# Note: when >8 GPUs, traffic crosses node boundary via InfiniBand even if NVLink selected
_effective_bw = NVLINK4_BW_GBS if (_fabric == "nvlink" and _gpus <= GPUS_PER_NODE) else IB_HDR200_BW_GBS
_fabric_label = "NVLink 4" if _effective_bw == NVLINK4_BW_GBS else "InfiniBand HDR200"
_forced_ib = (_fabric == "nvlink" and _gpus > GPUS_PER_NODE)
_allreduce_time_s = _allreduce_data_gb / _effective_bw # seconds
# ── Compute step time ────────────────────────────────────────────────────────
# Forward + backward pass FLOPs ≈ 6 * params * batch_size
# Source: standard transformer FLOP count estimate (6N approximation)
_seq_len = 2048 # typical sequence length for a 7B-class model
_hidden_dim = 4096 # typical for 7B models
_flops_step = 6.0 * _params * _batch_gpu # forward + backward FLOPs
_mfu_ref = 0.52 # reference MFU at 8 GPUs (from stakeholder message)
# Compute time at reference MFU to anchor the physics
_compute_time_s = _flops_step / (H100_TFLOPS_FP16 * 1e12 * _mfu_ref)
# ── Effective MFU with AllReduce overhead ────────────────────────────────────
# MFU = T_compute / (T_compute + T_allreduce)
# When T_allreduce grows relative to T_compute, MFU falls
_total_time_s = _compute_time_s + _allreduce_time_s
_mfu_effective = (_compute_time_s / _total_time_s) * _mfu_ref # fractional
_mfu_pct = _mfu_effective * 100.0
# ── Comm/Compute ratio ───────────────────────────────────────────────────────
_cc_ratio = _allreduce_time_s / _compute_time_s
# ── Color coding ─────────────────────────────────────────────────────────────
_mfu_color = COLORS["GreenLine"] if _mfu_pct >= 45 else (COLORS["OrangeLine"] if _mfu_pct >= 25 else COLORS["RedLine"])
_cc_color = COLORS["GreenLine"] if _cc_ratio <= 0.3 else (COLORS["OrangeLine"] if _cc_ratio <= 0.8 else COLORS["RedLine"])
# ── Build MFU vs GPU count curve ─────────────────────────────────────────────
_gpu_range = [8, 16, 32, 64, 128, 256, 512, 1024]
_mfu_curve = []
for _g in _gpu_range:
_ar_factor_g = 2.0 * (_g - 1) / _g
_bw_g = NVLINK4_BW_GBS if _g <= GPUS_PER_NODE else IB_HDR200_BW_GBS
_ar_time_g = (_grad_gb * _ar_factor_g) / _bw_g
_total_g = _compute_time_s + _ar_time_g
_mfu_g = (_compute_time_s / _total_g) * _mfu_ref * 100.0
_mfu_curve.append(_mfu_g)
_fig = go.Figure()
_fig.add_trace(go.Scatter(
x=_gpu_range, y=_mfu_curve,
mode="lines+markers",
line=dict(color=COLORS["BlueLine"], width=2.5),
marker=dict(size=8, color=COLORS["BlueLine"]),
name="MFU (model)",
hovertemplate="<b>%{x} GPUs</b><br>MFU: %{y:.1f}%<extra></extra>",
))
# Mark the current selection
_fig.add_trace(go.Scatter(
x=[_gpus], y=[_mfu_pct],
mode="markers",
marker=dict(size=16, color=COLORS["RedLine"], symbol="diamond",
line=dict(color="white", width=2)),
name="Current config",
hovertemplate="<b>Current: %{x} GPUs</b><br>MFU: %{y:.1f}%<extra></extra>",
))
# Reference points from stakeholder message
_ref_x = [8, 64, 512]
_ref_y = [52, 38, 19]
_fig.add_trace(go.Scatter(
x=_ref_x, y=_ref_y,
mode="markers",
marker=dict(size=12, color=COLORS["OrangeLine"], symbol="x",
line=dict(color=COLORS["OrangeLine"], width=3)),
name="Measured (stakeholder)",
hovertemplate="<b>Measured: %{x} GPUs</b><br>MFU: %{y:.0f}%<extra></extra>",
))
_fig.add_hline(y=40.0, line=dict(color=COLORS["GreenLine"], width=1.5, dash="dash"),
annotation_text="40% — practical floor", annotation_position="bottom right")
_fig.update_layout(
height=320,
xaxis=dict(title="GPU Count (DP degree)", type="log",
tickvals=[8, 16, 32, 64, 128, 256, 512, 1024],
ticktext=["8", "16", "32", "64", "128", "256", "512", "1024"]),
yaxis=dict(title="Model FLOPs Utilization (%)", range=[0, 65]),
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
margin=dict(t=40, b=50, l=50, r=20),
)
apply_plotly_theme(_fig)
# ── Forced-IB warning ────────────────────────────────────────────────────────
_ib_warn = ""
if _forced_ib:
_ib_warn = f"""
<div style="background:{COLORS['OrangeLL']}; border:1px solid {COLORS['OrangeLine']};
border-radius:8px; padding:10px 14px; margin:8px 0; font-size:0.85rem;">
<strong style="color:{COLORS['OrangeLine']};">Interconnect Upgrade Applied:</strong>
NVLink 4 operates within a single DGX node (8 GPUs). At {_gpus} GPUs, traffic crosses
node boundaries. Effective bandwidth is InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s)
&mdash; {NVLINK4_BW_GBS/IB_HDR200_BW_GBS:.1f}&times; slower than NVLink 4.
</div>
"""
# ── Physics display ──────────────────────────────────────────────────────────
mo.vstack([
mo.Html(f"""
<div style="background:{COLORS['Surface2']}; border:1px solid {COLORS['Border']};
border-radius:12px; padding:16px 20px; margin:8px 0; font-family:monospace;
font-size:0.83rem; line-height:1.8;">
<div style="font-size:0.72rem; font-weight:700; color:{COLORS['TextMuted']};
text-transform:uppercase; letter-spacing:0.1em; margin-bottom:8px;
font-family:sans-serif;">
Physics — AllReduce Bandwidth Model
</div>
<div>Gradient tensor = {_params_b}B params &times; {BYTES_PER_GRAD} bytes/param = <strong>{_grad_gb:.1f} GB</strong></div>
<div>Ring-AllReduce factor = 2&times;(N-1)/N = 2&times;({_gpus}-1)/{_gpus} = <strong>{_allreduce_factor:.4f}</strong></div>
<div>Data transferred per GPU = {_grad_gb:.1f} GB &times; {_allreduce_factor:.4f} = <strong>{_allreduce_data_gb:.2f} GB</strong></div>
<div>Interconnect = <strong>{_fabric_label}</strong> &mdash; bandwidth = <strong>{_effective_bw:.0f} GB/s</strong></div>
<div>T_allreduce = {_allreduce_data_gb:.2f} GB / {_effective_bw:.0f} GB/s = <strong>{_allreduce_time_s*1000:.1f} ms</strong></div>
<div>T_compute (at {_mfu_ref*100:.0f}% MFU ref) = <strong>{_compute_time_s*1000:.1f} ms</strong></div>
<div>Comm/Compute ratio = {_allreduce_time_s*1000:.1f} / {_compute_time_s*1000:.1f} = <strong>{_cc_ratio:.2f}</strong></div>
<div>MFU_effective = T_compute / (T_compute + T_allreduce) &times; MFU_ref</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = {_compute_time_s*1000:.1f} / {_total_time_s*1000:.1f} &times; {_mfu_ref*100:.0f}% = <strong>{_mfu_pct:.1f}%</strong></div>
</div>
{_ib_warn}
"""),
mo.Html(f"""
<div style="display:flex; gap:16px; justify-content:center; margin:8px 0; flex-wrap:wrap;">
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">MFU</div>
<div style="font-size:2.2rem; font-weight:800; color:{_mfu_color};
font-family:monospace;">{_mfu_pct:.1f}%</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">model FLOPs utilization</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">Comm / Compute</div>
<div style="font-size:2.2rem; font-weight:800; color:{_cc_color};
font-family:monospace;">{_cc_ratio:.2f}</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">ratio (lower = better)</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">AllReduce Time</div>
<div style="font-size:2.2rem; font-weight:800; color:{COLORS['BlueLine']};
font-family:monospace;">{_allreduce_time_s*1000:.1f}ms</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">per step</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">Compute Time</div>
<div style="font-size:2.2rem; font-weight:800; color:{COLORS['BlueLine']};
font-family:monospace;">{_compute_time_s*1000:.1f}ms</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">per step (ref MFU)</div>
</div>
</div>
"""),
mo.ui.plotly(_fig),
])
return (
H100_RAM_GB,
H100_TFLOPS_FP16,
IB_HDR200_BW_GBS,
NVLINK4_BW_GBS,
GPUS_PER_NODE,
)
# ─── ACT I: PREDICTION REVEAL ──────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(act1_pred, mo):
_correct = act1_pred.value == "B"
if _correct:
mo.callout(mo.md(
"**Correct.** Option B identifies the root cause: the Ring-AllReduce transfer volume "
"is essentially constant (approximately 2 × gradient size regardless of N for large N), "
"but it must traverse InfiniBand at 400 GB/s when the cluster spans multiple nodes "
"instead of NVLink at 900 GB/s within a node. The compute time per GPU does not change "
"as you add GPUs. Therefore the comm/compute ratio rises with cluster size, "
"directly reducing MFU. The simulator above shows this as the cliff in the MFU curve "
"between 8 and 64 GPUs where traffic transitions from NVLink to InfiniBand."
), kind="success")
elif act1_pred.value == "A":
mo.callout(mo.md(
"**Not the primary cause.** NCCL is highly optimized and adds minimal overhead "
"relative to wire transfer time. The dominant factor is the physical bandwidth "
"of the interconnect, not the software library overhead. At 512 GPUs, "
"the AllReduce transfer itself consumes ~140 ms on InfiniBand while the compute "
"step takes ~60 ms — NCCL overhead is negligible compared to this ratio."
), kind="warn")
elif act1_pred.value == "C":
mo.callout(mo.md(
"**Not the primary cause.** Each GPU's local computation is unchanged — the "
"same model, same batch size per GPU, same forward and backward pass. "
"Cache pressure and HBM bandwidth utilization per GPU are essentially identical "
"regardless of whether you are running with 8 or 512 GPUs. The bottleneck "
"is between nodes, not within them."
), kind="warn")
elif act1_pred.value == "D":
mo.callout(mo.md(
"**A real phenomenon, but not the cause here.** Gradient quality degradation "
"with very large global batch sizes is a real concern (the linear scaling rule "
"breaks above a critical batch size), but the stakeholder explicitly notes that "
"batch size per GPU is unchanged. Total global batch = 512 GPUs × 32 = 16,384. "
"For a 7B model this is well within the stable scaling regime. The MFU drop "
"is communication-bound, not convergence-bound."
), kind="warn")
else:
mo.callout(mo.md("Select a prediction above to see the reveal."), kind="info")
return
# ─── ACT I: ACT I MATHPEEK ─────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.accordion({
"The governing equations — AllReduce bandwidth model and DP efficiency": mo.md("""
**Ring-AllReduce Transfer Volume (per GPU)**
```
T_allreduce = [2 × (N-1)/N × grad_bytes] / BW_interconnect
```
- `N` — number of data-parallel replicas (GPU count)
- `grad_bytes` — gradient tensor size = params × 2 bytes (FP16)
- `BW_interconnect` — 900 GB/s (NVLink 4, within node) or 400 GB/s (IB HDR200, cross-node)
- For large N: the factor 2×(N-1)/N → 2, so AllReduce volume saturates at ~2× gradient size
- **Key insight**: AllReduce volume does NOT grow linearly with N — it saturates. But the
bandwidth cliff when crossing the node boundary (NVLink → IB) creates a step-change in latency.
**DP Efficiency Formula**
```
MFU_effective = (T_compute / (T_compute + T_allreduce)) × MFU_ref
```
- `T_compute` — forward + backward FLOPs / (peak_TFLOPS × MFU_ref)
- `T_allreduce` — grows as cluster spans more nodes (IB replaces NVLink)
- When T_allreduce ≈ T_compute (ratio ≈ 1), effective MFU ≈ MFU_ref / 2
**Gradient Bucketing Analysis**
```
T_effective = max(T_compute_late_layers, T_allreduce_early_gradients)
```
Gradient bucketing starts AllReduce for early-layer gradients while later layers
are still computing. Ideal overlap: T_effective → T_compute (hiding communication).
In practice, overlapping achieves 6080% communication hiding for large models.
""")
})
return
# ─── ACT I: REFLECTION ─────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Reflection
Now that you have explored the AllReduce bottleneck, consider the primary technique
practitioners use to reclaim efficiency: overlapping communication with computation.
""")
return
@app.cell(hide_code=True)
def _(mo):
act1_reflect = mo.ui.radio(
options={
"A) Use faster GPUs to reduce compute time — this shrinks the total step time": "A",
"B) Gradient bucketing + async AllReduce — begin communicating early-layer gradients while computing late-layer gradients": "B",
"C) Reduce batch size per GPU to reduce the gradient tensor size and shorten AllReduce": "C",
"D) Quantize gradients to INT8 for AllReduce communication, then dequantize before the optimizer step": "D",
},
label="What is the primary technique to overlap communication with computation in data parallel training?",
)
act1_reflect
return (act1_reflect,)
@app.cell(hide_code=True)
def _(act1_reflect, mo):
mo.stop(
act1_reflect.value is None,
mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"),
)
if act1_reflect.value == "B":
mo.callout(mo.md(
"**Correct.** Gradient bucketing partitions the gradient tensor into chunks. "
"During the backward pass, as soon as the gradients for the last few layers are "
"computed, AllReduce begins on those buckets while the backward pass continues "
"computing gradients for earlier layers. This overlaps the two operations. "
"PyTorch DDP implements this via `bucket_cap_mb` (default: 25 MB). "
"For a 7B model with 14 GB of gradients, effective overlap can hide 6080% of "
"the AllReduce latency, recovering significant MFU at scale."
), kind="success")
elif act1_reflect.value == "A":
mo.callout(mo.md(
"**This does not reduce the comm/compute ratio.** A faster GPU shortens T_compute, "
"which makes the numerator in the ratio smaller — but it also reduces the time "
"available to overlap communication. The ratio T_allreduce/T_compute can actually "
"worsen as compute gets faster while interconnect bandwidth stays constant. "
"This is a common misconception: hardware upgrades on the compute side do not "
"solve interconnect-bound scaling."
), kind="warn")
elif act1_reflect.value == "C":
mo.callout(mo.md(
"**This reduces the wrong dimension.** Gradient dimensions are determined by model "
"architecture, not batch size. A 7B model has 7B parameters regardless of whether "
"the local batch is 8 or 128 samples. Reducing batch size per GPU does reduce "
"gradient noise (smaller effective batch = higher gradient variance), but it does "
"not reduce AllReduce volume. The gradient tensor size is `params × 2 bytes` in FP16."
), kind="warn")
elif act1_reflect.value == "D":
mo.callout(mo.md(
"**Partially true, but not the primary technique.** INT8 gradient compression "
"can reduce AllReduce volume by 2× compared to FP16, but it introduces gradient "
"quantization error that can harm convergence for sensitive training runs. "
"BF16 gradients are standard in modern training. The more reliable approach is "
"gradient bucketing and async AllReduce, which hides rather than reduces "
"communication — recovering throughput without precision loss."
), kind="warn")
return
# ═══════════════════════════════════════════════════════════════════════════════
# ACT II — 3D PARALLELISM DESIGN CHALLENGE
# ═══════════════════════════════════════════════════════════════════════════════
# ─── ACT II: SECTION HEADER ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Act II — 3D Parallelism Design Challenge
*Design Challenge &middot; 20&ndash;25 minutes*
""")
return
# ─── ACT II: STAKEHOLDER MESSAGE ───────────────────────────────────────────────
@app.cell(hide_code=True)
def _(COLORS, mo):
_color = COLORS["Cloud"]
_bg = COLORS["BlueLL"]
mo.Html(f"""
<div style="border-left: 4px solid {_color}; background: {_bg};
border-radius: 0 10px 10px 0; padding: 16px 22px; margin: 12px 0;">
<div style="font-size: 0.72rem; font-weight: 700; color: {_color};
text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 6px;">
Incoming Message &middot; MLOps Architect
</div>
<div style="font-style: italic; font-size: 1.0rem; color: #1e293b; line-height: 1.65;">
"We need to train GPT-3 (175B parameters). A single H100 holds 80 GB of HBM3e.
With FP16 weights, FP32 optimizer states, and activation buffers, a 175B model
needs roughly 10 bytes per parameter in practice — about 1.75 TB total, which
doesn't fit in any single GPU. We have 1024 H100s available across 128 DGX nodes.
Design the 3D parallel configuration (TP × PP × DP) that maximizes MFU without
exceeding per-GPU memory or creating a pipeline bubble fraction above 10%."
</div>
</div>
""")
return
# ─── ACT II: CONCEPT FRAMING ───────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
When a model exceeds single-GPU memory capacity, data parallelism alone cannot help.
Three orthogonal strategies exist for distributing a model:
- **Tensor Parallelism (TP)**: Split individual matrix operations across GPUs within a layer.
Every forward pass requires an AllReduce across the TP group. TP must operate at high
bandwidth — otherwise the AllReduce overhead dominates. This constrains TP to
**within a single DGX node** (NVLink, 900 GB/s).
- **Pipeline Parallelism (PP)**: Assign consecutive layers to consecutive GPUs.
Requires microbatching to keep all pipeline stages busy. Introduces **bubble overhead**:
`B = (PP - 1) / (PP × m)` where `m` is the number of microbatches.
- **Data Parallelism (DP)**: Replicate the TP×PP model group and distribute the
global batch. This scales to the remaining GPU budget after TP and PP are fixed.
`DP = total_GPUs / (TP × PP)`.
The 3D configuration space has a hard constraint: `TP × PP × DP = 1024`.
Before using the configurator, predict the optimal configuration.
""")
return
# ─── ACT II: PREDICTION LOCK ───────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("### Your Configuration Prediction")
return
@app.cell(hide_code=True)
def _(mo):
act2_pred = mo.ui.radio(
options={
"A) TP=128, PP=1, DP=8 — maximize tensor parallelism to spread every layer across 128 GPUs": "A",
"B) TP=8, PP=4, DP=32 — within-node TP on NVLink, pipeline across nodes, DP for throughput scale": "B",
"C) TP=1, PP=1024, DP=1 — pure pipeline parallelism to avoid AllReduce entirely": "C",
"D) TP=4, PP=256, DP=1 — deep pipeline to maximize layer-level parallelism": "D",
},
label="Which 3D parallel configuration (TP × PP × DP) best balances memory, compute, and communication for GPT-3 175B on 1024 H100s?",
)
act2_pred
return (act2_pred,)
@app.cell(hide_code=True)
def _(act2_pred, mo):
mo.stop(
act2_pred.value is None,
mo.callout(
mo.md("Select your configuration prediction above to unlock the Act II instruments."),
kind="warn",
),
)
mo.md("")
return
# ─── ACT II: INSTRUMENT PANEL INTRO ────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### 3D Parallelism Configurator
Adjust TP and PP degrees. DP is computed automatically from the constraint
`TP × PP × DP = 1024`. The configurator will enforce per-GPU memory and
pipeline bubble constraints.
""")
return
# ─── ACT II: SLIDERS ───────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
tp_degree = mo.ui.slider(
start=1, stop=64, value=8, step=1,
label="Tensor Parallelism degree (TP)",
)
pp_degree = mo.ui.slider(
start=1, stop=64, value=4, step=1,
label="Pipeline Parallelism degree (PP)",
)
n_microbatches = mo.ui.slider(
start=1, stop=64, value=8, step=1,
label="Microbatches per pipeline flush (m)",
)
mo.hstack([
mo.vstack([tp_degree, pp_degree]),
mo.vstack([n_microbatches]),
], justify="center", gap=2)
return n_microbatches, pp_degree, tp_degree
# ─── ACT II: PHYSICS ENGINE ────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(
COLORS,
GPUS_PER_NODE,
H100_RAM_GB,
H100_TFLOPS_FP16,
IB_HDR200_BW_GBS,
NVLINK4_BW_GBS,
apply_plotly_theme,
go,
math,
mo,
n_microbatches,
np,
pp_degree,
tp_degree,
):
# ── Model constants — GPT-3 175B ─────────────────────────────────────────────
# Source: @sec-distributed-training-systems, Brown et al. 2020 (GPT-3 paper)
GPT3_PARAMS_B = 175.0 # 175B parameters; Brown et al. 2020
GPT3_LAYERS = 96 # transformer layers; Brown et al. 2020
BYTES_PER_PARAM_FP16 = 2 # FP16 model weights
OPTIMIZER_OVERHEAD = 8 # FP32 optimizer states (m1+m2+master) ≈ 8 bytes/param
ACTIVATION_BYTES_GB = 8.0 # activation buffers per pipeline stage (estimate)
TOTAL_BYTES_PER_PARAM = 10 # practical: weights + grads + optimizer ≈ 10 bytes/param
TOTAL_GPUS = 1024 # available H100s
MFU_BASE = 0.52 # reference MFU for calibration
TP_ALLREDUCE_LAYERS = GPT3_LAYERS # TP AllReduce happens every layer
# ── Extract widget values ────────────────────────────────────────────────────
_tp = tp_degree.value
_pp = pp_degree.value
_m = n_microbatches.value
# ── Constraint: TP × PP × DP = 1024 ─────────────────────────────────────────
_tp_pp_product = _tp * _pp
_dp = TOTAL_GPUS // _tp_pp_product if _tp_pp_product <= TOTAL_GPUS else 0
_dp_remainder = TOTAL_GPUS % _tp_pp_product if _tp_pp_product > 0 else 1
_config_valid = (_dp > 0) and (_dp_remainder == 0)
# ── Memory analysis ──────────────────────────────────────────────────────────
# Per-GPU memory = model shards + optimizer + activations
# TP shards model parameters: each GPU holds 1/TP of each tensor
# PP assigns GPT3_LAYERS/PP layers to each stage
_params_per_gpu_b = GPT3_PARAMS_B / (_tp * _pp) # billions
_params_per_gpu = _params_per_gpu_b * 1e9
_model_mem_gb = _params_per_gpu * BYTES_PER_PARAM_FP16 / 1e9
_optim_mem_gb = _params_per_gpu * OPTIMIZER_OVERHEAD / 1e9
_activ_mem_gb = ACTIVATION_BYTES_GB * (_tp if _tp > 1 else 1) # activation replication
_total_mem_gb = _model_mem_gb + _optim_mem_gb + _activ_mem_gb
# ── Failure state: OOM ───────────────────────────────────────────────────────
_oom = _total_mem_gb > H100_RAM_GB
# ── Pipeline bubble fraction ─────────────────────────────────────────────────
# Source: @sec-distributed-training-systems pipeline parallelism section
# B = (PP - 1) / (PP × m)
_bubble_frac = (_pp - 1) / (_pp * _m) if _pp > 1 else 0.0
_bubble_pct = _bubble_frac * 100.0
_bubble_warn = _bubble_pct > 10.0
# ── TP communication overhead ────────────────────────────────────────────────
# TP AllReduce volume per layer = 2 × hidden_dim × seq_len × 2 bytes (FP16)
# Simplified: TP communication time relative to compute
# Each TP AllReduce per layer uses NVLink (within node) or IB (cross-node)
_tp_crosses_node = _tp > GPUS_PER_NODE
_tp_fabric_bw = IB_HDR200_BW_GBS if _tp_crosses_node else NVLINK4_BW_GBS
_tp_bw_penalty = NVLINK4_BW_GBS / _tp_fabric_bw # 1.0 if NVLink, 2.25 if IB
_tp_warn = _tp_crosses_node
# ── Effective MFU estimation ─────────────────────────────────────────────────
# TP penalty: communication overhead from intra-layer AllReduce
# PP penalty: pipeline bubble fraction
# DP penalty: AllReduce for gradients (small for large DP with gradient bucketing)
_tp_comm_penalty = 1.0 - (0.05 * math.log2(max(_tp, 1)) * _tp_bw_penalty) # rough empirical model
_pp_efficiency = 1.0 - _bubble_frac
_dp_comm_penalty = 1.0 - (0.02 * math.log2(max(_dp, 1))) # gradient AllReduce overhead
_mfu_effective = MFU_BASE * _tp_comm_penalty * _pp_efficiency * _dp_comm_penalty
_mfu_effective = max(0.0, min(_mfu_effective, MFU_BASE))
_mfu_pct_3d = _mfu_effective * 100.0
# ── Color coding ─────────────────────────────────────────────────────────────
_mem_color = COLORS["RedLine"] if _oom else (COLORS["OrangeLine"] if _total_mem_gb > 60 else COLORS["GreenLine"])
_bubble_color = COLORS["RedLine"] if _bubble_warn else (COLORS["OrangeLine"] if _bubble_pct > 5 else COLORS["GreenLine"])
_mfu_color_3d = COLORS["RedLine"] if _mfu_pct_3d < 25 else (COLORS["OrangeLine"] if _mfu_pct_3d < 40 else COLORS["GreenLine"])
_cfg_color = COLORS["GreenLine"] if _config_valid else COLORS["RedLine"]
# ── FAILURE STATE: OOM ───────────────────────────────────────────────────────
_oom_banner = ""
if _oom:
_oom_banner = f"""
<div style="background:{COLORS['RedLL']}; border:2px solid {COLORS['RedLine']};
border-radius:10px; padding:14px 18px; margin:10px 0;">
<div style="font-size:0.88rem; font-weight:800; color:{COLORS['RedLine']}; margin-bottom:4px;">
OOM — Configuration Infeasible
</div>
<div style="font-size:0.85rem; color:#7f1d1d; line-height:1.6;">
<strong>Required per GPU: {_total_mem_gb:.1f} GB</strong> &mdash; exceeds H100 limit: {H100_RAM_GB:.0f} GB.<br>
Model shard: {_model_mem_gb:.1f} GB &nbsp;|&nbsp; Optimizer states: {_optim_mem_gb:.1f} GB &nbsp;|&nbsp; Activations: {_activ_mem_gb:.1f} GB.<br>
Increase TP or PP to reduce the per-GPU model shard below {H100_RAM_GB - _activ_mem_gb:.0f} GB (leaving room for activations).
</div>
</div>
"""
# ── WARNING STATE: TP crosses node boundary ───────────────────────────────────
_tp_bw_banner = ""
if _tp_warn:
_penalty_x = NVLINK4_BW_GBS / IB_HDR200_BW_GBS
_tp_bw_banner = f"""
<div style="background:{COLORS['OrangeLL']}; border:1px solid {COLORS['OrangeLine']};
border-radius:8px; padding:12px 16px; margin:8px 0;">
<div style="font-size:0.85rem; font-weight:700; color:{COLORS['OrangeLine']}; margin-bottom:4px;">
Tensor Parallelism Crosses Node Boundary
</div>
<div style="font-size:0.83rem; color:#7c2d12; line-height:1.6;">
TP={_tp} exceeds GPUS_PER_NODE={GPUS_PER_NODE}. TP AllReduce uses
InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s) instead of NVLink 4
({NVLINK4_BW_GBS:.0f} GB/s) &mdash; a <strong>{_penalty_x:.1f}&times; bandwidth penalty</strong>
on every layer's AllReduce. TP should remain &le; {GPUS_PER_NODE} to exploit
NVLink within a single DGX node.
</div>
</div>
"""
# ── Config validity warning ───────────────────────────────────────────────────
_cfg_banner = ""
if not _config_valid:
_cfg_banner = f"""
<div style="background:{COLORS['RedLL']}; border:1px solid {COLORS['RedLine']};
border-radius:8px; padding:12px 16px; margin:8px 0;">
<div style="font-size:0.85rem; font-weight:700; color:{COLORS['RedLine']};">
Invalid Configuration: TP &times; PP = {_tp} &times; {_pp} = {_tp_pp_product}
does not divide 1024 evenly. Choose TP and PP such that 1024 / (TP &times; PP)
is a positive integer.
</div>
</div>
"""
# ── Build bubble fraction vs PP/m chart ──────────────────────────────────────
_pp_range = list(range(1, 33))
_bubble_m1 = [(_p - 1) / (_p * 1) * 100 for _p in _pp_range]
_bubble_m4 = [(_p - 1) / (_p * 4) * 100 for _p in _pp_range]
_bubble_m8 = [(_p - 1) / (_p * 8) * 100 for _p in _pp_range]
_bubble_m16 = [(_p - 1) / (_p * 16) * 100 for _p in _pp_range]
_fig2 = go.Figure()
for _vals, _label, _clr in [
(_bubble_m1, "m=1 microbatch", "#cb202d"),
(_bubble_m4, "m=4 microbatches", "#cc5500"),
(_bubble_m8, "m=8 microbatches", "#006395"),
(_bubble_m16, "m=16 microbatches", "#008f45"),
]:
_fig2.add_trace(go.Scatter(
x=_pp_range, y=_vals, mode="lines", name=_label,
line=dict(color=_clr, width=2),
hovertemplate=f"PP=%{{x}} {_label}<br>Bubble: %{{y:.1f}}%<extra></extra>",
))
_fig2.add_hline(y=10.0, line=dict(color="#1e293b", width=1.5, dash="dash"),
annotation_text="10% bubble ceiling", annotation_position="top right")
# Mark current config
_fig2.add_trace(go.Scatter(
x=[_pp], y=[_bubble_pct],
mode="markers", name="Current config",
marker=dict(size=14, color=COLORS["RedLine"], symbol="diamond",
line=dict(color="white", width=2)),
hovertemplate=f"PP={_pp}, m={_m}<br>Bubble: {_bubble_pct:.1f}%<extra></extra>",
))
_fig2.update_layout(
height=300,
xaxis=dict(title="Pipeline Parallelism (PP stages)", range=[1, 32]),
yaxis=dict(title="Pipeline Bubble Fraction (%)", range=[0, 55]),
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
margin=dict(t=40, b=50, l=50, r=20),
)
apply_plotly_theme(_fig2)
# ── Render all outputs ────────────────────────────────────────────────────────
mo.vstack([
mo.Html(f"""
<div style="background:{COLORS['Surface2']}; border:1px solid {COLORS['Border']};
border-radius:12px; padding:16px 20px; margin:8px 0; font-family:monospace;
font-size:0.83rem; line-height:1.8;">
<div style="font-size:0.72rem; font-weight:700; color:{COLORS['TextMuted']};
text-transform:uppercase; letter-spacing:0.1em; margin-bottom:8px;
font-family:sans-serif;">
Physics — 3D Parallel Memory and Bubble Analysis
</div>
<div>Configuration: TP={_tp} &times; PP={_pp} &times; DP={_dp if _config_valid else "N/A"}
{'= ' + str(TOTAL_GPUS) if _config_valid else '(INVALID: TP&times;PP=' + str(_tp_pp_product) + ' does not divide 1024)'}</div>
<div>Params per GPU = {GPT3_PARAMS_B}B / (TP={_tp} &times; PP={_pp}) = <strong>{_params_per_gpu_b:.2f}B params</strong></div>
<div>Model memory (FP16) = {_params_per_gpu_b:.2f}B &times; 2 bytes = <strong>{_model_mem_gb:.1f} GB</strong></div>
<div>Optimizer states (FP32) = {_params_per_gpu_b:.2f}B &times; 8 bytes = <strong>{_optim_mem_gb:.1f} GB</strong></div>
<div>Activation buffer = <strong>{_activ_mem_gb:.1f} GB</strong> (estimated)</div>
<div>Total per-GPU memory = <strong style="color:{_mem_color};">{_total_mem_gb:.1f} GB</strong> / {H100_RAM_GB:.0f} GB limit</div>
<div>Pipeline bubble B = (PP-1)/(PP&times;m) = ({_pp}-1)/({_pp}&times;{_m}) = <strong style="color:{_bubble_color};">{_bubble_pct:.1f}%</strong></div>
<div>TP bandwidth = <strong>{'InfiniBand ' + str(IB_HDR200_BW_GBS) + ' GB/s (CROSS-NODE)' if _tp_crosses_node else 'NVLink 4 ' + str(NVLINK4_BW_GBS) + ' GB/s (within node)'}</strong></div>
<div>Effective MFU = <strong style="color:{_mfu_color_3d};">{_mfu_pct_3d:.1f}%</strong></div>
</div>
{_oom_banner}
{_tp_bw_banner}
{_cfg_banner}
"""),
mo.Html(f"""
<div style="display:flex; gap:16px; justify-content:center; margin:8px 0; flex-wrap:wrap;">
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">Per-GPU Memory</div>
<div style="font-size:2.2rem; font-weight:800; color:{_mem_color};
font-family:monospace;">{_total_mem_gb:.0f}GB</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">/ {H100_RAM_GB:.0f} GB limit</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">Pipeline Bubble</div>
<div style="font-size:2.2rem; font-weight:800; color:{_bubble_color};
font-family:monospace;">{_bubble_pct:.1f}%</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">ceiling: 10%</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">Effective MFU</div>
<div style="font-size:2.2rem; font-weight:800; color:{_mfu_color_3d};
font-family:monospace;">{_mfu_pct_3d:.1f}%</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">3D parallel</div>
</div>
<div style="padding:18px 24px; border:1px solid {COLORS['Border']}; border-radius:10px;
width:160px; text-align:center; background:white;">
<div style="color:{COLORS['TextMuted']}; font-size:0.82rem; font-weight:600;
text-transform:uppercase; letter-spacing:0.06em;">DP degree</div>
<div style="font-size:2.2rem; font-weight:800; color:{_cfg_color};
font-family:monospace;">{_dp if _config_valid else 'N/A'}</div>
<div style="font-size:0.72rem; color:{COLORS['TextMuted']}; margin-top:2px;">= 1024 / (TP&times;PP)</div>
</div>
</div>
"""),
mo.ui.plotly(_fig2),
])
return (
_bubble_pct,
_config_valid,
_dp,
_mfu_pct_3d,
_oom,
_total_mem_gb,
_tp_crosses_node,
)
# ─── ACT II: PREDICTION REVEAL ─────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(act2_pred, mo):
_correct = act2_pred.value == "B"
if _correct:
mo.callout(mo.md(
"**Correct.** TP=8, PP=4, DP=32 is the principled baseline for GPT-3 scale training. "
"TP=8 maps exactly to one DGX node (8 GPUs per node), keeping TP AllReduce on "
"NVLink at 900 GB/s. PP=4 assigns 96/4=24 transformer layers per stage, "
"requiring 4 nodes per pipeline. With 8 microbatches, the pipeline bubble "
"B=(4-1)/(4×8)=9.375% stays just under the 10% ceiling. DP=32 then "
"replicates the TP×PP group 32 times across the remaining 1024/(8×4)=32 GPU groups. "
"This matches the configuration used in real GPT-3-scale training runs "
"on DGX clusters (Megatron-LM, 2021)."
), kind="success")
elif act2_pred.value == "A":
mo.callout(mo.md(
"**Infeasible.** TP=128 distributes each layer across 128 GPUs. Each tensor "
"parallel AllReduce must traverse 16 DGX nodes (128/8=16), using InfiniBand "
"instead of NVLink — a 2.25× bandwidth penalty on every single layer forward and "
"backward pass. The AllReduce occurs 96 times per forward pass (once per transformer "
"layer). At IB bandwidth this becomes the dominant bottleneck, crushing MFU. "
"Configure TP in the simulator with TP > 8 to observe the bandwidth penalty warning."
), kind="warn")
elif act2_pred.value == "C":
mo.callout(mo.md(
"**Catastrophic bubble overhead.** PP=1024 with a single microbatch gives "
"B=(1024-1)/(1024×1)≈99.9% bubble fraction — the cluster is 99.9% idle. "
"Even with m=64 microbatches: B=(1024-1)/(1024×64)≈1.5%, but now each "
"gradient accumulation step is enormous, harming optimizer convergence. "
"Pure pipeline parallelism with depth matching GPU count is never used in practice. "
"Use the configurator to set PP=1024 and observe the bubble fraction."
), kind="warn")
elif act2_pred.value == "D":
mo.callout(mo.md(
"**Pipeline bubble too large.** PP=256 with m=8 microbatches gives "
"B=(256-1)/(256×8)=12.4% — already over the 10% ceiling. "
"You would need m=32 microbatches to bring the bubble to 3.1%, "
"but that requires a batch size of 32×256=8,192 sequences through the pipeline "
"before each optimizer step, creating a very large effective batch. "
"With DP=1, there is no data parallelism to amortize the batch size requirement. "
"This is an over-pipelined design."
), kind="warn")
else:
mo.callout(mo.md("Select a configuration prediction above to see the analysis."), kind="info")
return
# ─── ACT II: MATHPEEK ──────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.accordion({
"The governing equations — 3D parallel memory, bubble fraction, and TP communication": mo.md("""
**3D Parallel Per-GPU Memory**
```
mem_per_gpu = (params / (TP × PP)) × bytes_per_param
+ (params / (TP × PP)) × optimizer_bytes_per_param
+ activation_buffer
```
- `params` — total model parameters (e.g. 175B for GPT-3)
- `TP × PP` — reduces the parameter shard on each GPU
- `bytes_per_param` — FP16 = 2 bytes; FP32 master copy = 4 bytes
- `optimizer_bytes_per_param` — Adam states: 2 FP32 moments + master = ~8 bytes/param
- **Key insight**: TP and PP jointly reduce per-GPU memory — TP shards each matrix
horizontally, PP shards the depth (layers). DP does NOT reduce memory: every DP replica
holds the full TP×PP model shard.
**Pipeline Bubble Fraction**
```
B = (PP - 1) / (PP × m)
```
- `PP` — pipeline parallelism degree (stages)
- `m` — number of microbatches per pipeline flush
- At PP=4, m=8: B = 3/32 = 9.375%
- **Key insight**: increasing m (microbatches) reduces bubble but increases pipeline latency
and may harm optimizer convergence at very large effective batch sizes.
- Practical ceiling: B < 10% is standard in production (Megatron-LM guidelines).
**Tensor Parallelism Communication Volume (per layer)**
```
TP AllReduce per layer = 2 × (TP - 1)/TP × hidden_dim × seq_len × 2 bytes (FP16)
```
- Occurs **every layer** in both forward and backward passes
- At 900 GB/s (NVLink): ~0.5 ms per layer for a 175B model configuration
- At 400 GB/s (IB): ~1.1 ms per layer — 2.25× slower, applied 96 times per forward pass
- **Key insight**: TP communication is not a one-time cost — it is a per-layer tax.
This is why TP > 8 (crossing node boundary to IB) destroys MFU.
""")
})
return
# ─── ACT II: REFLECTION ────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### Reflection
You observed that TP=8 is the natural constraint boundary. Before finishing, confirm
your understanding of why this boundary is fundamental.
""")
return
@app.cell(hide_code=True)
def _(mo):
act2_reflect = mo.ui.radio(
options={
"A) PyTorch does not support cross-node tensor parallelism in its distributed primitives": "A",
"B) Tensor parallel AllReduce happens every layer — at InfiniBand bandwidth this becomes the dominant bottleneck": "B",
"C) Tensor parallelism requires shared GPU memory, which is unavailable across separate nodes": "C",
"D) Cross-node tensor parallelism causes numerical instability due to floating-point rounding across nodes": "D",
},
label="Why must tensor parallelism be confined within a single DGX node (TP ≤ 8)?",
)
act2_reflect
return (act2_reflect,)
@app.cell(hide_code=True)
def _(act2_reflect, mo):
mo.stop(
act2_reflect.value is None,
mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"),
)
if act2_reflect.value == "B":
mo.callout(mo.md(
"**Correct.** Tensor parallelism introduces an AllReduce after every transformer "
"layer's matrix operations — both in the forward pass and the backward pass. "
"For a 96-layer model like GPT-3, that is 192 AllReduce calls per training step. "
"At NVLink bandwidth (900 GB/s) this adds ~1 ms per step — tolerable. "
"At InfiniBand bandwidth (400 GB/s), the penalty is 2.25× higher and accumulates "
"across all 96 layers, making TP communication the dominant step time. "
"The constraint TP ≤ GPUS_PER_NODE (≤ 8) is not a software limitation; "
"it is a bandwidth physics constraint."
), kind="success")
elif act2_reflect.value == "A":
mo.callout(mo.md(
"**Incorrect.** PyTorch (via Megatron-LM's column/row parallel linear layers) "
"and frameworks like DeepSpeed fully support cross-node tensor parallelism "
"using the standard NCCL AllReduce over InfiniBand. The constraint is physical, "
"not a software limitation. The code works fine; the bandwidth penalty is what "
"makes cross-node TP undesirable."
), kind="warn")
elif act2_reflect.value == "C":
mo.callout(mo.md(
"**Incorrect.** Tensor parallelism does not require shared memory. It is a "
"message-passing strategy: each GPU holds a shard of the weight matrix, "
"computes a partial matrix multiply on its shard, then the partial results are "
"reduced via AllReduce across all TP ranks. This works equally over NVLink "
"or InfiniBand — the difference is only bandwidth and therefore latency."
), kind="warn")
elif act2_reflect.value == "D":
mo.callout(mo.md(
"**Incorrect.** Floating-point arithmetic in distributed training uses deterministic "
"reduction primitives (NCCL's AllReduce). The numerical behavior is identical whether "
"the AllReduce traverses NVLink or InfiniBand — both use the same FP16/BF16 precision "
"operations. Numerical instability in distributed training typically arises from "
"gradient accumulation order (non-associative floating-point operations), not from "
"the physical transport medium."
), kind="warn")
return
# ─── LEDGER SAVE + HUD ─────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(
COLORS,
_bubble_pct,
_config_valid,
_dp,
_mfu_pct_3d,
_oom,
_total_mem_gb,
_tp_crosses_node,
act1_pred,
act2_pred,
act2_reflect,
act1_reflect,
ledger,
mo,
n_microbatches,
pp_degree,
tp_degree,
):
# ── Save to Design Ledger ────────────────────────────────────────────────────
_context = "3d_parallel" if tp_degree.value > 1 or pp_degree.value > 1 else "data_parallel"
ledger.save(
chapter="v2_05",
design={
"context": _context,
"tp_degree": tp_degree.value,
"pp_degree": pp_degree.value,
"dp_degree": _dp,
"total_gpus": 1024,
"mfu_percent": round(_mfu_pct_3d, 2),
"act1_prediction": act1_pred.value if act1_pred.value else "no_selection",
"act1_correct": act1_pred.value == "B",
"act1_reflect": act1_reflect.value if act1_reflect.value else "no_selection",
"act2_result": round(_mfu_pct_3d, 2),
"act2_decision": act2_pred.value if act2_pred.value else "no_selection",
"constraint_hit": _oom or _tp_crosses_node,
"memory_feasible": not _oom,
},
)
# ── Determine overall performance tier ──────────────────────────────────────
_act1_ok = act1_pred.value == "B"
_act2_ok = act2_pred.value == "B"
_mfu_ok = _mfu_pct_3d >= 40.0 and not _oom and _bubble_pct <= 10.0
_tier = "Optimal" if (_act1_ok and _act2_ok and _mfu_ok) else ("Partial" if (_act1_ok or _act2_ok) else "Developing")
_tier_color = COLORS["GreenLine"] if _tier == "Optimal" else (COLORS["OrangeLine"] if _tier == "Partial" else COLORS["TextMuted"])
# ── HUD Footer ───────────────────────────────────────────────────────────────
_hud = mo.Html(f"""
<div class="lab-hud">
<div>
<span class="hud-label">LAB</span>&nbsp;
<span class="hud-value">Vol2 · Lab 05</span>
</div>
<div>
<span class="hud-label">CHAPTER</span>&nbsp;
<span class="hud-value">v2_05 · Distributed Training</span>
</div>
<div>
<span class="hud-label">CONTEXT</span>&nbsp;
<span class="hud-value">{_context.upper()}</span>
</div>
<div>
<span class="hud-label">CONFIG</span>&nbsp;
<span class="hud-value">TP={tp_degree.value} &times; PP={pp_degree.value} &times; DP={_dp}</span>
</div>
<div>
<span class="hud-label">MFU</span>&nbsp;
<span style="color:{COLORS['GreenLine'] if _mfu_pct_3d >= 40 else COLORS['OrangeLine']}; font-family:var(--font-mono); font-size:0.8rem;">
{_mfu_pct_3d:.1f}%
</span>
</div>
<div>
<span class="hud-label">ACT I</span>&nbsp;
<span class="{'hud-active' if _act1_ok else 'hud-none'}">&nbsp;{"CORRECT" if _act1_ok else "REVIEW"}</span>
</div>
<div>
<span class="hud-label">ACT II</span>&nbsp;
<span class="{'hud-active' if _act2_ok else 'hud-none'}">&nbsp;{"CORRECT" if _act2_ok else "REVIEW"}</span>
</div>
<div>
<span class="hud-label">TIER</span>&nbsp;
<span style="color:{_tier_color}; font-family:var(--font-mono); font-size:0.8rem;">{_tier.upper()}</span>
</div>
<div>
<span class="hud-label">OOM</span>&nbsp;
<span class="{'hud-none' if _oom else 'hud-active'}">&nbsp;{"YES" if _oom else "NO"}</span>
</div>
</div>
""")
_hud
return
# ─── CONNECTIONS ──────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
---
## Connections
**Textbook:** This lab explores the core concepts of
@sec-distributed-training-systems — the Iron Law of Scale, the
Communication-Computation Ratio, and the 3D Parallelism Cube
(@fig-3d-parallelism-cube).
**Prior Labs:** Lab 03 (Network Fabrics) established the physical bandwidth
limits — NVLink vs InfiniBand — that constrain TP degree here.
Lab 04 (Data Storage) established the I/O pipeline that feeds each DP replica.
**Next Lab:** Vol2 Lab 06 (Collective Communications) examines the
Ring-AllReduce and Tree-AllReduce algorithms in detail, quantifying
why Ring-AllReduce achieves near-linear scaling efficiency while
parameter-server approaches hit coordination bottlenecks.
""")
return
# ─── KEY TAKEAWAYS ─────────────────────────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo):
mo.md("""
## Key Takeaways
1. **The Parallelism Paradox is a bandwidth ratio, not a software bug.**
AllReduce volume saturates at approximately 2× gradient size regardless of GPU count,
but the transition from NVLink (900 GB/s, within node) to InfiniBand (400 GB/s, cross-node)
creates a step-change in communication time that drives the MFU cliff observed at
8→64 GPUs. MFU falls because T_allreduce grows while T_compute stays constant.
2. **The 3D parallelism constraint TP ≤ GPUS_PER_NODE is physics, not convention.**
Tensor parallelism performs AllReduce after every transformer layer. At 96 layers,
the per-layer AllReduce penalty accumulates into a step-time budget that InfiniBand
cannot satisfy. Confine TP within a single DGX node to keep every layer's
synchronization on NVLink. Then PP crosses nodes over InfiniBand — but PP
AllReduce happens only once per pipeline stage, not once per layer.
""")
return
if __name__ == "__main__":
app.run()