import marimo __generated_with = "0.19.6" app = marimo.App(width="full") # ───────────────────────────────────────────────────────────────────────────── # LAB 05: THE PARALLELISM PARADOX # # Chapter: Distributed Training Systems (@sec-distributed-training-systems) # Core Invariant: The Parallelism Paradox — adding more GPUs to data parallel # training increases communication overhead, which can decrease # MFU below single-GPU levels for large models. 3D parallelism # (Tensor + Pipeline + Data) is required for models that don't # fit on a single GPU, but each dimension adds overhead. # # 2-Act Structure (35-40 minutes): # Act I — The Data Parallel Wall (12-15 min) # A 7B model trained with DP across 8→64→512 GPUs shows MFU # collapsing from 52% to 19%. The central question: why does MFU # fall as we add more GPUs? Students must confront that communication # time grows relative to compute time as cluster size grows. # # Act II — 3D Parallelism Design Challenge (20-25 min) # Design the TP×PP×DP configuration for GPT-3 175B on 1024 H100s. # The failure state: per-GPU memory exceeds 80 GB (model doesn't fit) # and a bandwidth penalty warning when TP crosses node boundaries. # # Deployment Contexts: # DP: Data Parallel — replicate model, sync gradients via AllReduce # 3D Parallel: TP×PP×DP — within-node TP (NVLink), cross-node PP (IB), DP # # Hardware Constants: # H100_TFLOPS_FP16 = 1979 # TFLOPS, H100 SXM5 with sparsity; source: NVIDIA spec # H100_BW_GBS = 3350 # GB/s HBM3e; source: NVIDIA H100 spec sheet # H100_RAM_GB = 80 # GB HBM3e; source: NVIDIA H100 spec sheet # NVLINK4_BW_GBS = 900 # GB/s NVLink 4; source: NVIDIA DGX H100 spec # IB_HDR200_BW_GBS = 400 # GB/s InfiniBand HDR200; source: Mellanox spec # GPUS_PER_NODE = 8 # Standard DGX H100 node size # # Design Ledger: saves chapter="v2_05" with DP vs 3D context, parallelism # degrees, MFU achieved, prediction accuracy, failure states. # ───────────────────────────────────────────────────────────────────────────── # ─── CELL 0: SETUP (hide_code=False — leave visible for instructor inspection) ─ @app.cell def _(): import marimo as mo import sys import math from pathlib import Path import plotly.graph_objects as go import numpy as np from plotly.subplots import make_subplots _root = Path(__file__).resolve().parents[2] if str(_root) not in sys.path: sys.path.insert(0, str(_root)) from labs.core.state import DesignLedger from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme ledger = DesignLedger() return COLORS, LAB_CSS, apply_plotly_theme, go, ledger, math, mo, np, make_subplots # ─── CELL 1: HEADER ──────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(COLORS, LAB_CSS, mo): _c_dp = COLORS["BlueLine"] _c_3d = COLORS["Cloud"] _c_s0 = COLORS["Surface0"] _c_s1 = COLORS["Surface1"] _header = mo.Html(f""" {LAB_CSS}

Vol 2 · Lab 05 · Distributed Training Systems

The Parallelism Paradox

Adding GPUs to a data-parallel job can reduce Model FLOPs Utilization below single-GPU levels. This lab forces you to confront the communication-computation ratio and design the 3D parallelism configuration that keeps 1024 H100s productive.

Parallelism Paradox AllReduce Bandwidth Model 3D Parallel: TP × PP × DP 35–40 minutes · 2 Acts

Context A — Data Parallel — 7B model · 8–512 GPUs · AllReduce via NVLink / IB

Context B — 3D Parallel — 175B model · 1024 H100s · TP×PP×DP design

""") _header return # ─── CELL 2: RECOMMENDED READING ─────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.callout(mo.md(""" **Recommended Reading** — Complete the following before this lab: - **@sec-distributed-training-systems-systems-multimachine-scaling-fundamentals-ff96** — The Iron Law of Scale: `T_step(N) = T_compute/N + T_comm(N) - T_overlap` and the Communication-Computation Ratio - **@sec-distributed-training-systems** — Why distribution is necessary: memory exhaustion, training duration, and dataset scale thresholds - The Data Parallelism section — AllReduce gradient synchronization, Ring-AllReduce bandwidth formula, gradient bucketing - The 3D Parallelism section — Tensor Parallelism (within-node), Pipeline Parallelism (across nodes), pipeline bubble fraction `B = (PP-1)/(PP * m)` """), kind="info") return # ─── CELL 3: CONTEXT TOGGLE ───────────────────────────────────────────────────── @app.cell(hide_code=True) def _(COLORS, mo): _c_muted = COLORS["TextMuted"] context_toggle = mo.ui.radio( options={ "Data Parallel (DP)": "dp", "3D Parallel (TP+PP+DP)": "3d", }, value="Data Parallel (DP)", label="Deployment context for this session:", inline=True, ) mo.hstack([ mo.Html(f"""

Active Context:

"""), context_toggle, ], justify="start", gap=0) return (context_toggle,) # ═══════════════════════════════════════════════════════════════════════════════ # ACT I — THE DATA PARALLEL WALL # ═══════════════════════════════════════════════════════════════════════════════ # ─── ACT I: SECTION HEADER ───────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" --- ## Act I — The Data Parallel Wall *Calibration · 12–15 minutes* """) return # ─── ACT I: STAKEHOLDER MESSAGE ──────────────────────────────────────────────── @app.cell(hide_code=True) def _(COLORS, mo): _color = COLORS["BlueLine"] _bg = COLORS["BlueL"] mo.Html(f"""

Incoming Message · Training Infrastructure Lead

"We are training a 7B parameter model using data parallelism. I measured MFU at 8 GPUs = 52%. At 64 GPUs it dropped to 38%. At 512 GPUs it's 19%. The model hasn't changed. The batch size per GPU hasn't changed. We just added more GPUs and it got worse. We're wasting $40,000 per day in idle compute. Can you tell me exactly why MFU falls as we scale data parallelism?"

""") return # ─── ACT I: CONCEPT FRAMING ──────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" The training lead's observation is not a software bug. It is the **Parallelism Paradox**: the physical law that governs every data-parallel training job. Data parallel training replicates the model on every GPU, runs a forward and backward pass on each device's local batch, then synchronizes gradients across all devices via **AllReduce** before the optimizer step. The AllReduce is unavoidable — without it, each replica would diverge. The critical question is how long AllReduce takes relative to the compute step. The **Communication-Computation Ratio** from @sec-distributed-training-systems determines whether a cluster behaves as a supercomputer or as a collection of idling heaters: - **Compute-Bound (Low Ratio)**: `T_compute >> T_comm`. GPUs spend most time on matrix multiplications. This is the ideal state. - **Communication-Bound (High Ratio)**: `T_comm ≈ T_compute`. GPUs spend significant time waiting for gradients. This is the common state for LLMs at scale. Before looking at any numbers, commit to a prediction about what causes MFU to fall. """) return # ─── ACT I: PREDICTION LOCK ──────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md("### Your Prediction") return @app.cell(hide_code=True) def _(mo): act1_pred = mo.ui.radio( options={ "A) Software overhead in NCCL and collective libraries grows with cluster size": "A", "B) AllReduce communication time grows with cluster size while compute time stays constant — the comm/compute ratio rises": "B", "C) Larger clusters cause L2 cache pressure and HBM bandwidth saturation per GPU": "C", "D) 512 GPUs exceeds optimal batch size — gradient quality degrades and more steps are needed": "D", }, label="Why does Model FLOPs Utilization (MFU) fall as we scale a data-parallel job from 8 to 512 GPUs?", ) act1_pred return (act1_pred,) @app.cell(hide_code=True) def _(act1_pred, mo): mo.stop( act1_pred.value is None, mo.callout( mo.md("Select your prediction above to unlock the Act I instruments."), kind="warn", ), ) mo.md("") return # ─── ACT I: INSTRUMENT PANEL INTRO ───────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" ### Data Parallel Scaling Explorer Adjust the parameters below to see how AllReduce communication time compares to compute time — and how their ratio determines MFU at each scale point. """) return # ─── ACT I: SLIDERS ──────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): dp_model_b = mo.ui.slider( start=1, stop=175, value=7, step=1, label="Model size (B params)", ) dp_gpus = mo.ui.slider( start=8, stop=1024, value=8, step=8, label="Number of GPUs (DP degree)", ) dp_batch_per_gpu = mo.ui.slider( start=8, stop=128, value=32, step=8, label="Micro-batch size per GPU", ) dp_interconnect = mo.ui.dropdown( options={"NVLink 4 (within DGX node, 8 GPUs)": "nvlink", "InfiniBand HDR200 (cross-node)": "ib"}, value="NVLink 4 (within DGX node, 8 GPUs)", label="Interconnect fabric", ) mo.hstack([ mo.vstack([dp_model_b, dp_gpus]), mo.vstack([dp_batch_per_gpu, dp_interconnect]), ], justify="center", gap=2) return dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b # ─── ACT I: PHYSICS ENGINE ───────────────────────────────────────────────────── @app.cell(hide_code=True) def _(COLORS, apply_plotly_theme, dp_batch_per_gpu, dp_gpus, dp_interconnect, dp_model_b, go, math, mo, np): # ── Hardware constants ─────────────────────────────────────────────────────── # Source: NVIDIA H100 SXM5 spec sheet, Mellanox InfiniBand HDR200 spec H100_TFLOPS_FP16 = 1979.0 # TFLOPS H100 SXM5 with sparsity; NVIDIA spec H100_RAM_GB = 80.0 # GB HBM3e; NVIDIA H100 spec sheet NVLINK4_BW_GBS = 900.0 # GB/s NVLink 4 (per-direction); NVIDIA DGX H100 spec IB_HDR200_BW_GBS = 400.0 # GB/s InfiniBand HDR200 (per-direction); Mellanox spec GPUS_PER_NODE = 8 # DGX H100 node size; standard industry config BYTES_PER_PARAM = 2 # FP16: 2 bytes per parameter BYTES_PER_GRAD = 2 # FP16 gradients: 2 bytes per gradient # ── Extract widget values ──────────────────────────────────────────────────── _params_b = dp_model_b.value # billions of params _gpus = dp_gpus.value _batch_gpu = dp_batch_per_gpu.value _fabric = dp_interconnect.value # ── Derived quantities ─────────────────────────────────────────────────────── _params = _params_b * 1e9 # total parameters _grad_bytes = _params * BYTES_PER_GRAD # gradient tensor size in bytes _grad_gb = _grad_bytes / 1e9 # gradient size in GB # ── AllReduce bandwidth model ──────────────────────────────────────────────── # Ring-AllReduce transfers 2*(N-1)/N * data per device # Source: @sec-distributed-training-systems AllReduce bandwidth analysis # Effective bytes per GPU = 2*(N-1)/N * grad_gb _allreduce_factor = 2.0 * (_gpus - 1) / _gpus _allreduce_data_gb = _grad_gb * _allreduce_factor # GB transferred per GPU # Interconnect selection and effective bandwidth # Note: when >8 GPUs, traffic crosses node boundary via InfiniBand even if NVLink selected _effective_bw = NVLINK4_BW_GBS if (_fabric == "nvlink" and _gpus <= GPUS_PER_NODE) else IB_HDR200_BW_GBS _fabric_label = "NVLink 4" if _effective_bw == NVLINK4_BW_GBS else "InfiniBand HDR200" _forced_ib = (_fabric == "nvlink" and _gpus > GPUS_PER_NODE) _allreduce_time_s = _allreduce_data_gb / _effective_bw # seconds # ── Compute step time ──────────────────────────────────────────────────────── # Forward + backward pass FLOPs ≈ 6 * params * batch_size # Source: standard transformer FLOP count estimate (6N approximation) _seq_len = 2048 # typical sequence length for a 7B-class model _hidden_dim = 4096 # typical for 7B models _flops_step = 6.0 * _params * _batch_gpu # forward + backward FLOPs _mfu_ref = 0.52 # reference MFU at 8 GPUs (from stakeholder message) # Compute time at reference MFU to anchor the physics _compute_time_s = _flops_step / (H100_TFLOPS_FP16 * 1e12 * _mfu_ref) # ── Effective MFU with AllReduce overhead ──────────────────────────────────── # MFU = T_compute / (T_compute + T_allreduce) # When T_allreduce grows relative to T_compute, MFU falls _total_time_s = _compute_time_s + _allreduce_time_s _mfu_effective = (_compute_time_s / _total_time_s) * _mfu_ref # fractional _mfu_pct = _mfu_effective * 100.0 # ── Comm/Compute ratio ─────────────────────────────────────────────────────── _cc_ratio = _allreduce_time_s / _compute_time_s # ── Color coding ───────────────────────────────────────────────────────────── _mfu_color = COLORS["GreenLine"] if _mfu_pct >= 45 else (COLORS["OrangeLine"] if _mfu_pct >= 25 else COLORS["RedLine"]) _cc_color = COLORS["GreenLine"] if _cc_ratio <= 0.3 else (COLORS["OrangeLine"] if _cc_ratio <= 0.8 else COLORS["RedLine"]) # ── Build MFU vs GPU count curve ───────────────────────────────────────────── _gpu_range = [8, 16, 32, 64, 128, 256, 512, 1024] _mfu_curve = [] for _g in _gpu_range: _ar_factor_g = 2.0 * (_g - 1) / _g _bw_g = NVLINK4_BW_GBS if _g <= GPUS_PER_NODE else IB_HDR200_BW_GBS _ar_time_g = (_grad_gb * _ar_factor_g) / _bw_g _total_g = _compute_time_s + _ar_time_g _mfu_g = (_compute_time_s / _total_g) * _mfu_ref * 100.0 _mfu_curve.append(_mfu_g) _fig = go.Figure() _fig.add_trace(go.Scatter( x=_gpu_range, y=_mfu_curve, mode="lines+markers", line=dict(color=COLORS["BlueLine"], width=2.5), marker=dict(size=8, color=COLORS["BlueLine"]), name="MFU (model)", hovertemplate="%{x} GPUs
MFU: %{y:.1f}%", )) # Mark the current selection _fig.add_trace(go.Scatter( x=[_gpus], y=[_mfu_pct], mode="markers", marker=dict(size=16, color=COLORS["RedLine"], symbol="diamond", line=dict(color="white", width=2)), name="Current config", hovertemplate="Current: %{x} GPUs
MFU: %{y:.1f}%", )) # Reference points from stakeholder message _ref_x = [8, 64, 512] _ref_y = [52, 38, 19] _fig.add_trace(go.Scatter( x=_ref_x, y=_ref_y, mode="markers", marker=dict(size=12, color=COLORS["OrangeLine"], symbol="x", line=dict(color=COLORS["OrangeLine"], width=3)), name="Measured (stakeholder)", hovertemplate="Measured: %{x} GPUs
MFU: %{y:.0f}%", )) _fig.add_hline(y=40.0, line=dict(color=COLORS["GreenLine"], width=1.5, dash="dash"), annotation_text="40% — practical floor", annotation_position="bottom right") _fig.update_layout( height=320, xaxis=dict(title="GPU Count (DP degree)", type="log", tickvals=[8, 16, 32, 64, 128, 256, 512, 1024], ticktext=["8", "16", "32", "64", "128", "256", "512", "1024"]), yaxis=dict(title="Model FLOPs Utilization (%)", range=[0, 65]), legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1), margin=dict(t=40, b=50, l=50, r=20), ) apply_plotly_theme(_fig) # ── Forced-IB warning ──────────────────────────────────────────────────────── _ib_warn = "" if _forced_ib: _ib_warn = f"""

Interconnect Upgrade Applied: NVLink 4 operates within a single DGX node (8 GPUs). At {_gpus} GPUs, traffic crosses node boundaries. Effective bandwidth is InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s) — {NVLINK4_BW_GBS/IB_HDR200_BW_GBS:.1f}× slower than NVLink 4.

""" # ── Physics display ────────────────────────────────────────────────────────── mo.vstack([ mo.Html(f"""

                Physics — AllReduce Bandwidth Model
            
Gradient tensor = {_params_b}B params × {BYTES_PER_GRAD} bytes/param = {_grad_gb:.1f} GB
Ring-AllReduce factor = 2×(N-1)/N = 2×({_gpus}-1)/{_gpus} = {_allreduce_factor:.4f}
Data transferred per GPU = {_grad_gb:.1f} GB × {_allreduce_factor:.4f} = {_allreduce_data_gb:.2f} GB
Interconnect = {_fabric_label} — bandwidth = {_effective_bw:.0f} GB/s
T_allreduce = {_allreduce_data_gb:.2f} GB / {_effective_bw:.0f} GB/s = {_allreduce_time_s*1000:.1f} ms
T_compute (at {_mfu_ref*100:.0f}% MFU ref) = {_compute_time_s*1000:.1f} ms
Comm/Compute ratio = {_allreduce_time_s*1000:.1f} / {_compute_time_s*1000:.1f} = {_cc_ratio:.2f}
MFU_effective = T_compute / (T_compute + T_allreduce) × MFU_ref
               = {_compute_time_s*1000:.1f} / {_total_time_s*1000:.1f} × {_mfu_ref*100:.0f}% = {_mfu_pct:.1f}%

{_ib_warn} """), mo.Html(f"""

MFU

{_mfu_pct:.1f}%

model FLOPs utilization

Comm / Compute

{_cc_ratio:.2f}

ratio (lower = better)

AllReduce Time

{_allreduce_time_s*1000:.1f}ms

per step

Compute Time

{_compute_time_s*1000:.1f}ms

per step (ref MFU)

"""), mo.ui.plotly(_fig), ]) return ( H100_RAM_GB, H100_TFLOPS_FP16, IB_HDR200_BW_GBS, NVLINK4_BW_GBS, GPUS_PER_NODE, ) # ─── ACT I: PREDICTION REVEAL ────────────────────────────────────────────────── @app.cell(hide_code=True) def _(act1_pred, mo): _correct = act1_pred.value == "B" if _correct: mo.callout(mo.md( "**Correct.** Option B identifies the root cause: the Ring-AllReduce transfer volume " "is essentially constant (approximately 2 × gradient size regardless of N for large N), " "but it must traverse InfiniBand at 400 GB/s when the cluster spans multiple nodes " "instead of NVLink at 900 GB/s within a node. The compute time per GPU does not change " "as you add GPUs. Therefore the comm/compute ratio rises with cluster size, " "directly reducing MFU. The simulator above shows this as the cliff in the MFU curve " "between 8 and 64 GPUs where traffic transitions from NVLink to InfiniBand." ), kind="success") elif act1_pred.value == "A": mo.callout(mo.md( "**Not the primary cause.** NCCL is highly optimized and adds minimal overhead " "relative to wire transfer time. The dominant factor is the physical bandwidth " "of the interconnect, not the software library overhead. At 512 GPUs, " "the AllReduce transfer itself consumes ~140 ms on InfiniBand while the compute " "step takes ~60 ms — NCCL overhead is negligible compared to this ratio." ), kind="warn") elif act1_pred.value == "C": mo.callout(mo.md( "**Not the primary cause.** Each GPU's local computation is unchanged — the " "same model, same batch size per GPU, same forward and backward pass. " "Cache pressure and HBM bandwidth utilization per GPU are essentially identical " "regardless of whether you are running with 8 or 512 GPUs. The bottleneck " "is between nodes, not within them." ), kind="warn") elif act1_pred.value == "D": mo.callout(mo.md( "**A real phenomenon, but not the cause here.** Gradient quality degradation " "with very large global batch sizes is a real concern (the linear scaling rule " "breaks above a critical batch size), but the stakeholder explicitly notes that " "batch size per GPU is unchanged. Total global batch = 512 GPUs × 32 = 16,384. " "For a 7B model this is well within the stable scaling regime. The MFU drop " "is communication-bound, not convergence-bound." ), kind="warn") else: mo.callout(mo.md("Select a prediction above to see the reveal."), kind="info") return # ─── ACT I: ACT I MATHPEEK ───────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.accordion({ "The governing equations — AllReduce bandwidth model and DP efficiency": mo.md(""" **Ring-AllReduce Transfer Volume (per GPU)** ``` T_allreduce = [2 × (N-1)/N × grad_bytes] / BW_interconnect ``` - `N` — number of data-parallel replicas (GPU count) - `grad_bytes` — gradient tensor size = params × 2 bytes (FP16) - `BW_interconnect` — 900 GB/s (NVLink 4, within node) or 400 GB/s (IB HDR200, cross-node) - For large N: the factor 2×(N-1)/N → 2, so AllReduce volume saturates at ~2× gradient size - **Key insight**: AllReduce volume does NOT grow linearly with N — it saturates. But the bandwidth cliff when crossing the node boundary (NVLink → IB) creates a step-change in latency. **DP Efficiency Formula** ``` MFU_effective = (T_compute / (T_compute + T_allreduce)) × MFU_ref ``` - `T_compute` — forward + backward FLOPs / (peak_TFLOPS × MFU_ref) - `T_allreduce` — grows as cluster spans more nodes (IB replaces NVLink) - When T_allreduce ≈ T_compute (ratio ≈ 1), effective MFU ≈ MFU_ref / 2 **Gradient Bucketing Analysis** ``` T_effective = max(T_compute_late_layers, T_allreduce_early_gradients) ``` Gradient bucketing starts AllReduce for early-layer gradients while later layers are still computing. Ideal overlap: T_effective → T_compute (hiding communication). In practice, overlapping achieves 60–80% communication hiding for large models. """) }) return # ─── ACT I: REFLECTION ───────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" ### Reflection Now that you have explored the AllReduce bottleneck, consider the primary technique practitioners use to reclaim efficiency: overlapping communication with computation. """) return @app.cell(hide_code=True) def _(mo): act1_reflect = mo.ui.radio( options={ "A) Use faster GPUs to reduce compute time — this shrinks the total step time": "A", "B) Gradient bucketing + async AllReduce — begin communicating early-layer gradients while computing late-layer gradients": "B", "C) Reduce batch size per GPU to reduce the gradient tensor size and shorten AllReduce": "C", "D) Quantize gradients to INT8 for AllReduce communication, then dequantize before the optimizer step": "D", }, label="What is the primary technique to overlap communication with computation in data parallel training?", ) act1_reflect return (act1_reflect,) @app.cell(hide_code=True) def _(act1_reflect, mo): mo.stop( act1_reflect.value is None, mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"), ) if act1_reflect.value == "B": mo.callout(mo.md( "**Correct.** Gradient bucketing partitions the gradient tensor into chunks. " "During the backward pass, as soon as the gradients for the last few layers are " "computed, AllReduce begins on those buckets while the backward pass continues " "computing gradients for earlier layers. This overlaps the two operations. " "PyTorch DDP implements this via `bucket_cap_mb` (default: 25 MB). " "For a 7B model with 14 GB of gradients, effective overlap can hide 60–80% of " "the AllReduce latency, recovering significant MFU at scale." ), kind="success") elif act1_reflect.value == "A": mo.callout(mo.md( "**This does not reduce the comm/compute ratio.** A faster GPU shortens T_compute, " "which makes the numerator in the ratio smaller — but it also reduces the time " "available to overlap communication. The ratio T_allreduce/T_compute can actually " "worsen as compute gets faster while interconnect bandwidth stays constant. " "This is a common misconception: hardware upgrades on the compute side do not " "solve interconnect-bound scaling." ), kind="warn") elif act1_reflect.value == "C": mo.callout(mo.md( "**This reduces the wrong dimension.** Gradient dimensions are determined by model " "architecture, not batch size. A 7B model has 7B parameters regardless of whether " "the local batch is 8 or 128 samples. Reducing batch size per GPU does reduce " "gradient noise (smaller effective batch = higher gradient variance), but it does " "not reduce AllReduce volume. The gradient tensor size is `params × 2 bytes` in FP16." ), kind="warn") elif act1_reflect.value == "D": mo.callout(mo.md( "**Partially true, but not the primary technique.** INT8 gradient compression " "can reduce AllReduce volume by 2× compared to FP16, but it introduces gradient " "quantization error that can harm convergence for sensitive training runs. " "BF16 gradients are standard in modern training. The more reliable approach is " "gradient bucketing and async AllReduce, which hides rather than reduces " "communication — recovering throughput without precision loss." ), kind="warn") return # ═══════════════════════════════════════════════════════════════════════════════ # ACT II — 3D PARALLELISM DESIGN CHALLENGE # ═══════════════════════════════════════════════════════════════════════════════ # ─── ACT II: SECTION HEADER ──────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" --- ## Act II — 3D Parallelism Design Challenge *Design Challenge · 20–25 minutes* """) return # ─── ACT II: STAKEHOLDER MESSAGE ─────────────────────────────────────────────── @app.cell(hide_code=True) def _(COLORS, mo): _color = COLORS["Cloud"] _bg = COLORS["BlueLL"] mo.Html(f"""

Incoming Message · MLOps Architect

"We need to train GPT-3 (175B parameters). A single H100 holds 80 GB of HBM3e. With FP16 weights, FP32 optimizer states, and activation buffers, a 175B model needs roughly 10 bytes per parameter in practice — about 1.75 TB total, which doesn't fit in any single GPU. We have 1024 H100s available across 128 DGX nodes. Design the 3D parallel configuration (TP × PP × DP) that maximizes MFU without exceeding per-GPU memory or creating a pipeline bubble fraction above 10%."

""") return # ─── ACT II: CONCEPT FRAMING ─────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" When a model exceeds single-GPU memory capacity, data parallelism alone cannot help. Three orthogonal strategies exist for distributing a model: - **Tensor Parallelism (TP)**: Split individual matrix operations across GPUs within a layer. Every forward pass requires an AllReduce across the TP group. TP must operate at high bandwidth — otherwise the AllReduce overhead dominates. This constrains TP to **within a single DGX node** (NVLink, 900 GB/s). - **Pipeline Parallelism (PP)**: Assign consecutive layers to consecutive GPUs. Requires microbatching to keep all pipeline stages busy. Introduces **bubble overhead**: `B = (PP - 1) / (PP × m)` where `m` is the number of microbatches. - **Data Parallelism (DP)**: Replicate the TP×PP model group and distribute the global batch. This scales to the remaining GPU budget after TP and PP are fixed. `DP = total_GPUs / (TP × PP)`. The 3D configuration space has a hard constraint: `TP × PP × DP = 1024`. Before using the configurator, predict the optimal configuration. """) return # ─── ACT II: PREDICTION LOCK ─────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md("### Your Configuration Prediction") return @app.cell(hide_code=True) def _(mo): act2_pred = mo.ui.radio( options={ "A) TP=128, PP=1, DP=8 — maximize tensor parallelism to spread every layer across 128 GPUs": "A", "B) TP=8, PP=4, DP=32 — within-node TP on NVLink, pipeline across nodes, DP for throughput scale": "B", "C) TP=1, PP=1024, DP=1 — pure pipeline parallelism to avoid AllReduce entirely": "C", "D) TP=4, PP=256, DP=1 — deep pipeline to maximize layer-level parallelism": "D", }, label="Which 3D parallel configuration (TP × PP × DP) best balances memory, compute, and communication for GPT-3 175B on 1024 H100s?", ) act2_pred return (act2_pred,) @app.cell(hide_code=True) def _(act2_pred, mo): mo.stop( act2_pred.value is None, mo.callout( mo.md("Select your configuration prediction above to unlock the Act II instruments."), kind="warn", ), ) mo.md("") return # ─── ACT II: INSTRUMENT PANEL INTRO ──────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" ### 3D Parallelism Configurator Adjust TP and PP degrees. DP is computed automatically from the constraint `TP × PP × DP = 1024`. The configurator will enforce per-GPU memory and pipeline bubble constraints. """) return # ─── ACT II: SLIDERS ─────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): tp_degree = mo.ui.slider( start=1, stop=64, value=8, step=1, label="Tensor Parallelism degree (TP)", ) pp_degree = mo.ui.slider( start=1, stop=64, value=4, step=1, label="Pipeline Parallelism degree (PP)", ) n_microbatches = mo.ui.slider( start=1, stop=64, value=8, step=1, label="Microbatches per pipeline flush (m)", ) mo.hstack([ mo.vstack([tp_degree, pp_degree]), mo.vstack([n_microbatches]), ], justify="center", gap=2) return n_microbatches, pp_degree, tp_degree # ─── ACT II: PHYSICS ENGINE ──────────────────────────────────────────────────── @app.cell(hide_code=True) def _( COLORS, GPUS_PER_NODE, H100_RAM_GB, H100_TFLOPS_FP16, IB_HDR200_BW_GBS, NVLINK4_BW_GBS, apply_plotly_theme, go, math, mo, n_microbatches, np, pp_degree, tp_degree, ): # ── Model constants — GPT-3 175B ───────────────────────────────────────────── # Source: @sec-distributed-training-systems, Brown et al. 2020 (GPT-3 paper) GPT3_PARAMS_B = 175.0 # 175B parameters; Brown et al. 2020 GPT3_LAYERS = 96 # transformer layers; Brown et al. 2020 BYTES_PER_PARAM_FP16 = 2 # FP16 model weights OPTIMIZER_OVERHEAD = 8 # FP32 optimizer states (m1+m2+master) ≈ 8 bytes/param ACTIVATION_BYTES_GB = 8.0 # activation buffers per pipeline stage (estimate) TOTAL_BYTES_PER_PARAM = 10 # practical: weights + grads + optimizer ≈ 10 bytes/param TOTAL_GPUS = 1024 # available H100s MFU_BASE = 0.52 # reference MFU for calibration TP_ALLREDUCE_LAYERS = GPT3_LAYERS # TP AllReduce happens every layer # ── Extract widget values ──────────────────────────────────────────────────── _tp = tp_degree.value _pp = pp_degree.value _m = n_microbatches.value # ── Constraint: TP × PP × DP = 1024 ───────────────────────────────────────── _tp_pp_product = _tp * _pp _dp = TOTAL_GPUS // _tp_pp_product if _tp_pp_product <= TOTAL_GPUS else 0 _dp_remainder = TOTAL_GPUS % _tp_pp_product if _tp_pp_product > 0 else 1 _config_valid = (_dp > 0) and (_dp_remainder == 0) # ── Memory analysis ────────────────────────────────────────────────────────── # Per-GPU memory = model shards + optimizer + activations # TP shards model parameters: each GPU holds 1/TP of each tensor # PP assigns GPT3_LAYERS/PP layers to each stage _params_per_gpu_b = GPT3_PARAMS_B / (_tp * _pp) # billions _params_per_gpu = _params_per_gpu_b * 1e9 _model_mem_gb = _params_per_gpu * BYTES_PER_PARAM_FP16 / 1e9 _optim_mem_gb = _params_per_gpu * OPTIMIZER_OVERHEAD / 1e9 _activ_mem_gb = ACTIVATION_BYTES_GB * (_tp if _tp > 1 else 1) # activation replication _total_mem_gb = _model_mem_gb + _optim_mem_gb + _activ_mem_gb # ── Failure state: OOM ─────────────────────────────────────────────────────── _oom = _total_mem_gb > H100_RAM_GB # ── Pipeline bubble fraction ───────────────────────────────────────────────── # Source: @sec-distributed-training-systems pipeline parallelism section # B = (PP - 1) / (PP × m) _bubble_frac = (_pp - 1) / (_pp * _m) if _pp > 1 else 0.0 _bubble_pct = _bubble_frac * 100.0 _bubble_warn = _bubble_pct > 10.0 # ── TP communication overhead ──────────────────────────────────────────────── # TP AllReduce volume per layer = 2 × hidden_dim × seq_len × 2 bytes (FP16) # Simplified: TP communication time relative to compute # Each TP AllReduce per layer uses NVLink (within node) or IB (cross-node) _tp_crosses_node = _tp > GPUS_PER_NODE _tp_fabric_bw = IB_HDR200_BW_GBS if _tp_crosses_node else NVLINK4_BW_GBS _tp_bw_penalty = NVLINK4_BW_GBS / _tp_fabric_bw # 1.0 if NVLink, 2.25 if IB _tp_warn = _tp_crosses_node # ── Effective MFU estimation ───────────────────────────────────────────────── # TP penalty: communication overhead from intra-layer AllReduce # PP penalty: pipeline bubble fraction # DP penalty: AllReduce for gradients (small for large DP with gradient bucketing) _tp_comm_penalty = 1.0 - (0.05 * math.log2(max(_tp, 1)) * _tp_bw_penalty) # rough empirical model _pp_efficiency = 1.0 - _bubble_frac _dp_comm_penalty = 1.0 - (0.02 * math.log2(max(_dp, 1))) # gradient AllReduce overhead _mfu_effective = MFU_BASE * _tp_comm_penalty * _pp_efficiency * _dp_comm_penalty _mfu_effective = max(0.0, min(_mfu_effective, MFU_BASE)) _mfu_pct_3d = _mfu_effective * 100.0 # ── Color coding ───────────────────────────────────────────────────────────── _mem_color = COLORS["RedLine"] if _oom else (COLORS["OrangeLine"] if _total_mem_gb > 60 else COLORS["GreenLine"]) _bubble_color = COLORS["RedLine"] if _bubble_warn else (COLORS["OrangeLine"] if _bubble_pct > 5 else COLORS["GreenLine"]) _mfu_color_3d = COLORS["RedLine"] if _mfu_pct_3d < 25 else (COLORS["OrangeLine"] if _mfu_pct_3d < 40 else COLORS["GreenLine"]) _cfg_color = COLORS["GreenLine"] if _config_valid else COLORS["RedLine"] # ── FAILURE STATE: OOM ─────────────────────────────────────────────────────── _oom_banner = "" if _oom: _oom_banner = f"""

OOM — Configuration Infeasible

Required per GPU: {_total_mem_gb:.1f} GB — exceeds H100 limit: {H100_RAM_GB:.0f} GB.
Model shard: {_model_mem_gb:.1f} GB | Optimizer states: {_optim_mem_gb:.1f} GB | Activations: {_activ_mem_gb:.1f} GB.
Increase TP or PP to reduce the per-GPU model shard below {H100_RAM_GB - _activ_mem_gb:.0f} GB (leaving room for activations).

""" # ── WARNING STATE: TP crosses node boundary ─────────────────────────────────── _tp_bw_banner = "" if _tp_warn: _penalty_x = NVLINK4_BW_GBS / IB_HDR200_BW_GBS _tp_bw_banner = f"""

Tensor Parallelism Crosses Node Boundary

TP={_tp} exceeds GPUS_PER_NODE={GPUS_PER_NODE}. TP AllReduce uses InfiniBand HDR200 ({IB_HDR200_BW_GBS:.0f} GB/s) instead of NVLink 4 ({NVLINK4_BW_GBS:.0f} GB/s) — a {_penalty_x:.1f}× bandwidth penalty on every layer's AllReduce. TP should remain ≤ {GPUS_PER_NODE} to exploit NVLink within a single DGX node.

""" # ── Config validity warning ─────────────────────────────────────────────────── _cfg_banner = "" if not _config_valid: _cfg_banner = f"""

Invalid Configuration: TP × PP = {_tp} × {_pp} = {_tp_pp_product} does not divide 1024 evenly. Choose TP and PP such that 1024 / (TP × PP) is a positive integer.

""" # ── Build bubble fraction vs PP/m chart ────────────────────────────────────── _pp_range = list(range(1, 33)) _bubble_m1 = [(_p - 1) / (_p * 1) * 100 for _p in _pp_range] _bubble_m4 = [(_p - 1) / (_p * 4) * 100 for _p in _pp_range] _bubble_m8 = [(_p - 1) / (_p * 8) * 100 for _p in _pp_range] _bubble_m16 = [(_p - 1) / (_p * 16) * 100 for _p in _pp_range] _fig2 = go.Figure() for _vals, _label, _clr in [ (_bubble_m1, "m=1 microbatch", "#cb202d"), (_bubble_m4, "m=4 microbatches", "#cc5500"), (_bubble_m8, "m=8 microbatches", "#006395"), (_bubble_m16, "m=16 microbatches", "#008f45"), ]: _fig2.add_trace(go.Scatter( x=_pp_range, y=_vals, mode="lines", name=_label, line=dict(color=_clr, width=2), hovertemplate=f"PP=%{{x}} {_label}
Bubble: %{{y:.1f}}%", )) _fig2.add_hline(y=10.0, line=dict(color="#1e293b", width=1.5, dash="dash"), annotation_text="10% bubble ceiling", annotation_position="top right") # Mark current config _fig2.add_trace(go.Scatter( x=[_pp], y=[_bubble_pct], mode="markers", name="Current config", marker=dict(size=14, color=COLORS["RedLine"], symbol="diamond", line=dict(color="white", width=2)), hovertemplate=f"PP={_pp}, m={_m}
Bubble: {_bubble_pct:.1f}%", )) _fig2.update_layout( height=300, xaxis=dict(title="Pipeline Parallelism (PP stages)", range=[1, 32]), yaxis=dict(title="Pipeline Bubble Fraction (%)", range=[0, 55]), legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1), margin=dict(t=40, b=50, l=50, r=20), ) apply_plotly_theme(_fig2) # ── Render all outputs ──────────────────────────────────────────────────────── mo.vstack([ mo.Html(f"""

                Physics — 3D Parallel Memory and Bubble Analysis
            
Configuration: TP={_tp} × PP={_pp} × DP={_dp if _config_valid else "N/A"}
                 {'= ' + str(TOTAL_GPUS) if _config_valid else '(INVALID: TP×PP=' + str(_tp_pp_product) + ' does not divide 1024)'}
Params per GPU = {GPT3_PARAMS_B}B / (TP={_tp} × PP={_pp}) = {_params_per_gpu_b:.2f}B params
Model memory (FP16) = {_params_per_gpu_b:.2f}B × 2 bytes = {_model_mem_gb:.1f} GB
Optimizer states (FP32) = {_params_per_gpu_b:.2f}B × 8 bytes = {_optim_mem_gb:.1f} GB
Activation buffer = {_activ_mem_gb:.1f} GB (estimated)
Total per-GPU memory = {_total_mem_gb:.1f} GB / {H100_RAM_GB:.0f} GB limit
Pipeline bubble B = (PP-1)/(PP×m) = ({_pp}-1)/({_pp}×{_m}) = {_bubble_pct:.1f}%
TP bandwidth = {'InfiniBand ' + str(IB_HDR200_BW_GBS) + ' GB/s (CROSS-NODE)' if _tp_crosses_node else 'NVLink 4 ' + str(NVLINK4_BW_GBS) + ' GB/s (within node)'}
Effective MFU = {_mfu_pct_3d:.1f}%

{_oom_banner} {_tp_bw_banner} {_cfg_banner} """), mo.Html(f"""

Per-GPU Memory

{_total_mem_gb:.0f}GB

/ {H100_RAM_GB:.0f} GB limit

Pipeline Bubble

{_bubble_pct:.1f}%

ceiling: 10%

Effective MFU

{_mfu_pct_3d:.1f}%

3D parallel

DP degree

{_dp if _config_valid else 'N/A'}

= 1024 / (TP×PP)

"""), mo.ui.plotly(_fig2), ]) return ( _bubble_pct, _config_valid, _dp, _mfu_pct_3d, _oom, _total_mem_gb, _tp_crosses_node, ) # ─── ACT II: PREDICTION REVEAL ───────────────────────────────────────────────── @app.cell(hide_code=True) def _(act2_pred, mo): _correct = act2_pred.value == "B" if _correct: mo.callout(mo.md( "**Correct.** TP=8, PP=4, DP=32 is the principled baseline for GPT-3 scale training. " "TP=8 maps exactly to one DGX node (8 GPUs per node), keeping TP AllReduce on " "NVLink at 900 GB/s. PP=4 assigns 96/4=24 transformer layers per stage, " "requiring 4 nodes per pipeline. With 8 microbatches, the pipeline bubble " "B=(4-1)/(4×8)=9.375% stays just under the 10% ceiling. DP=32 then " "replicates the TP×PP group 32 times across the remaining 1024/(8×4)=32 GPU groups. " "This matches the configuration used in real GPT-3-scale training runs " "on DGX clusters (Megatron-LM, 2021)." ), kind="success") elif act2_pred.value == "A": mo.callout(mo.md( "**Infeasible.** TP=128 distributes each layer across 128 GPUs. Each tensor " "parallel AllReduce must traverse 16 DGX nodes (128/8=16), using InfiniBand " "instead of NVLink — a 2.25× bandwidth penalty on every single layer forward and " "backward pass. The AllReduce occurs 96 times per forward pass (once per transformer " "layer). At IB bandwidth this becomes the dominant bottleneck, crushing MFU. " "Configure TP in the simulator with TP > 8 to observe the bandwidth penalty warning." ), kind="warn") elif act2_pred.value == "C": mo.callout(mo.md( "**Catastrophic bubble overhead.** PP=1024 with a single microbatch gives " "B=(1024-1)/(1024×1)≈99.9% bubble fraction — the cluster is 99.9% idle. " "Even with m=64 microbatches: B=(1024-1)/(1024×64)≈1.5%, but now each " "gradient accumulation step is enormous, harming optimizer convergence. " "Pure pipeline parallelism with depth matching GPU count is never used in practice. " "Use the configurator to set PP=1024 and observe the bubble fraction." ), kind="warn") elif act2_pred.value == "D": mo.callout(mo.md( "**Pipeline bubble too large.** PP=256 with m=8 microbatches gives " "B=(256-1)/(256×8)=12.4% — already over the 10% ceiling. " "You would need m=32 microbatches to bring the bubble to 3.1%, " "but that requires a batch size of 32×256=8,192 sequences through the pipeline " "before each optimizer step, creating a very large effective batch. " "With DP=1, there is no data parallelism to amortize the batch size requirement. " "This is an over-pipelined design." ), kind="warn") else: mo.callout(mo.md("Select a configuration prediction above to see the analysis."), kind="info") return # ─── ACT II: MATHPEEK ────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.accordion({ "The governing equations — 3D parallel memory, bubble fraction, and TP communication": mo.md(""" **3D Parallel Per-GPU Memory** ``` mem_per_gpu = (params / (TP × PP)) × bytes_per_param + (params / (TP × PP)) × optimizer_bytes_per_param + activation_buffer ``` - `params` — total model parameters (e.g. 175B for GPT-3) - `TP × PP` — reduces the parameter shard on each GPU - `bytes_per_param` — FP16 = 2 bytes; FP32 master copy = 4 bytes - `optimizer_bytes_per_param` — Adam states: 2 FP32 moments + master = ~8 bytes/param - **Key insight**: TP and PP jointly reduce per-GPU memory — TP shards each matrix horizontally, PP shards the depth (layers). DP does NOT reduce memory: every DP replica holds the full TP×PP model shard. **Pipeline Bubble Fraction** ``` B = (PP - 1) / (PP × m) ``` - `PP` — pipeline parallelism degree (stages) - `m` — number of microbatches per pipeline flush - At PP=4, m=8: B = 3/32 = 9.375% - **Key insight**: increasing m (microbatches) reduces bubble but increases pipeline latency and may harm optimizer convergence at very large effective batch sizes. - Practical ceiling: B < 10% is standard in production (Megatron-LM guidelines). **Tensor Parallelism Communication Volume (per layer)** ``` TP AllReduce per layer = 2 × (TP - 1)/TP × hidden_dim × seq_len × 2 bytes (FP16) ``` - Occurs **every layer** in both forward and backward passes - At 900 GB/s (NVLink): ~0.5 ms per layer for a 175B model configuration - At 400 GB/s (IB): ~1.1 ms per layer — 2.25× slower, applied 96 times per forward pass - **Key insight**: TP communication is not a one-time cost — it is a per-layer tax. This is why TP > 8 (crossing node boundary to IB) destroys MFU. """) }) return # ─── ACT II: REFLECTION ──────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" ### Reflection You observed that TP=8 is the natural constraint boundary. Before finishing, confirm your understanding of why this boundary is fundamental. """) return @app.cell(hide_code=True) def _(mo): act2_reflect = mo.ui.radio( options={ "A) PyTorch does not support cross-node tensor parallelism in its distributed primitives": "A", "B) Tensor parallel AllReduce happens every layer — at InfiniBand bandwidth this becomes the dominant bottleneck": "B", "C) Tensor parallelism requires shared GPU memory, which is unavailable across separate nodes": "C", "D) Cross-node tensor parallelism causes numerical instability due to floating-point rounding across nodes": "D", }, label="Why must tensor parallelism be confined within a single DGX node (TP ≤ 8)?", ) act2_reflect return (act2_reflect,) @app.cell(hide_code=True) def _(act2_reflect, mo): mo.stop( act2_reflect.value is None, mo.callout(mo.md("Select an answer to see the explanation."), kind="warn"), ) if act2_reflect.value == "B": mo.callout(mo.md( "**Correct.** Tensor parallelism introduces an AllReduce after every transformer " "layer's matrix operations — both in the forward pass and the backward pass. " "For a 96-layer model like GPT-3, that is 192 AllReduce calls per training step. " "At NVLink bandwidth (900 GB/s) this adds ~1 ms per step — tolerable. " "At InfiniBand bandwidth (400 GB/s), the penalty is 2.25× higher and accumulates " "across all 96 layers, making TP communication the dominant step time. " "The constraint TP ≤ GPUS_PER_NODE (≤ 8) is not a software limitation; " "it is a bandwidth physics constraint." ), kind="success") elif act2_reflect.value == "A": mo.callout(mo.md( "**Incorrect.** PyTorch (via Megatron-LM's column/row parallel linear layers) " "and frameworks like DeepSpeed fully support cross-node tensor parallelism " "using the standard NCCL AllReduce over InfiniBand. The constraint is physical, " "not a software limitation. The code works fine; the bandwidth penalty is what " "makes cross-node TP undesirable." ), kind="warn") elif act2_reflect.value == "C": mo.callout(mo.md( "**Incorrect.** Tensor parallelism does not require shared memory. It is a " "message-passing strategy: each GPU holds a shard of the weight matrix, " "computes a partial matrix multiply on its shard, then the partial results are " "reduced via AllReduce across all TP ranks. This works equally over NVLink " "or InfiniBand — the difference is only bandwidth and therefore latency." ), kind="warn") elif act2_reflect.value == "D": mo.callout(mo.md( "**Incorrect.** Floating-point arithmetic in distributed training uses deterministic " "reduction primitives (NCCL's AllReduce). The numerical behavior is identical whether " "the AllReduce traverses NVLink or InfiniBand — both use the same FP16/BF16 precision " "operations. Numerical instability in distributed training typically arises from " "gradient accumulation order (non-associative floating-point operations), not from " "the physical transport medium." ), kind="warn") return # ─── LEDGER SAVE + HUD ───────────────────────────────────────────────────────── @app.cell(hide_code=True) def _( COLORS, _bubble_pct, _config_valid, _dp, _mfu_pct_3d, _oom, _total_mem_gb, _tp_crosses_node, act1_pred, act2_pred, act2_reflect, act1_reflect, ledger, mo, n_microbatches, pp_degree, tp_degree, ): # ── Save to Design Ledger ──────────────────────────────────────────────────── _context = "3d_parallel" if tp_degree.value > 1 or pp_degree.value > 1 else "data_parallel" ledger.save( chapter="v2_05", design={ "context": _context, "tp_degree": tp_degree.value, "pp_degree": pp_degree.value, "dp_degree": _dp, "total_gpus": 1024, "mfu_percent": round(_mfu_pct_3d, 2), "act1_prediction": act1_pred.value if act1_pred.value else "no_selection", "act1_correct": act1_pred.value == "B", "act1_reflect": act1_reflect.value if act1_reflect.value else "no_selection", "act2_result": round(_mfu_pct_3d, 2), "act2_decision": act2_pred.value if act2_pred.value else "no_selection", "constraint_hit": _oom or _tp_crosses_node, "memory_feasible": not _oom, }, ) # ── Determine overall performance tier ────────────────────────────────────── _act1_ok = act1_pred.value == "B" _act2_ok = act2_pred.value == "B" _mfu_ok = _mfu_pct_3d >= 40.0 and not _oom and _bubble_pct <= 10.0 _tier = "Optimal" if (_act1_ok and _act2_ok and _mfu_ok) else ("Partial" if (_act1_ok or _act2_ok) else "Developing") _tier_color = COLORS["GreenLine"] if _tier == "Optimal" else (COLORS["OrangeLine"] if _tier == "Partial" else COLORS["TextMuted"]) # ── HUD Footer ─────────────────────────────────────────────────────────────── _hud = mo.Html(f"""

LAB Vol2 · Lab 05

CHAPTER v2_05 · Distributed Training

CONTEXT {_context.upper()}

CONFIG TP={tp_degree.value} × PP={pp_degree.value} × DP={_dp}

MFU {_mfu_pct_3d:.1f}%

ACT I {"CORRECT" if _act1_ok else "REVIEW"}

ACT II {"CORRECT" if _act2_ok else "REVIEW"}

TIER {_tier.upper()}

OOM {"YES" if _oom else "NO"}

""") _hud return # ─── CONNECTIONS ────────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" --- ## Connections **Textbook:** This lab explores the core concepts of @sec-distributed-training-systems — the Iron Law of Scale, the Communication-Computation Ratio, and the 3D Parallelism Cube (@fig-3d-parallelism-cube). **Prior Labs:** Lab 03 (Network Fabrics) established the physical bandwidth limits — NVLink vs InfiniBand — that constrain TP degree here. Lab 04 (Data Storage) established the I/O pipeline that feeds each DP replica. **Next Lab:** Vol2 Lab 06 (Collective Communications) examines the Ring-AllReduce and Tree-AllReduce algorithms in detail, quantifying why Ring-AllReduce achieves near-linear scaling efficiency while parameter-server approaches hit coordination bottlenecks. """) return # ─── KEY TAKEAWAYS ───────────────────────────────────────────────────────────── @app.cell(hide_code=True) def _(mo): mo.md(""" ## Key Takeaways 1. **The Parallelism Paradox is a bandwidth ratio, not a software bug.** AllReduce volume saturates at approximately 2× gradient size regardless of GPU count, but the transition from NVLink (900 GB/s, within node) to InfiniBand (400 GB/s, cross-node) creates a step-change in communication time that drives the MFU cliff observed at 8→64 GPUs. MFU falls because T_allreduce grows while T_compute stays constant. 2. **The 3D parallelism constraint TP ≤ GPUS_PER_NODE is physics, not convention.** Tensor parallelism performs AllReduce after every transformer layer. At 96 layers, the per-layer AllReduce penalty accumulates into a step-time budget that InfiniBand cannot satisfy. Confine TP within a single DGX node to keep every layer's synchronization on NVLink. Then PP crosses nodes over InfiniBand — but PP AllReduce happens only once per pipeline stage, not once per layer. """) return if __name__ == "__main__": app.run()