Refactors Volume II content and structure

Restructures Volume II to improve narrative flow and address scale impediments, including reordering of sections and addition of introductory material.

Introduces "Master Map" to guide readers through the volume's layered progression.

Adds callout notes to bridge concepts between sections.

Moves references.qmd to backmatter and adjusts chapter organization for clarity.

Updates hardware parameterization and network performance modeling within code blocks.
Vijay Janapa Reddi
2026-02-19 14:39:54 -05:00
parent 45f46ad70d
commit b6b2c94988
8 changed files with 596 additions and 122 deletions

View File

@@ -85,7 +85,6 @@ book:
- contents/vol2/sustainable_ai/sustainable_ai.qmd
- contents/vol2/responsible_ai/responsible_ai.qmd
- contents/vol2/conclusion/conclusion.qmd
- contents/vol2/backmatter/references.qmd
appendices:
- contents/vol2/backmatter/glossary/glossary.qmd

View File

@@ -59,9 +59,15 @@ This transition reveals a fundamental asymmetry in how computation and communica
***Gradient Synchronization*** is the collective communication phase in distributed training where workers exchange and aggregate their locally computed gradients. This step ensures that all workers update their model parameters with the same global gradient information, maintaining mathematical equivalence to single-device training.
:::
The requirement for gradient synchronization is not a design choice; it is a mathematical necessity for convergence. If different GPUs apply different gradient updates to their local copies of the model, the copies diverge. After enough steps, the models on different GPUs represent entirely different functions, and the training process no longer approximates stochastic gradient descent on the global loss. Synchronization ensures that all copies remain identical (within floating-point precision) at every step, preserving the theoretical convergence guarantees of the optimization algorithm.[^fn-history-ps]
[^fn-history-ps]: **The Parameter Server Era**: Early large-scale systems like Google's **DistBelief** [@dean2012distbelief] used a "Star" topology where workers sent gradients to a central *Parameter Server* (PS). This created a massive bottleneck: the PS's bandwidth had to match the aggregate bandwidth of all workers. As clusters scaled from 10 to 1,000 nodes, the PS model collapsed, leading to the adoption of peer-to-peer *Collective Communication*.
The volume of data that must be synchronized is proportional to the model size. A model with $P$ parameters stored in BF16 (2 bytes per parameter) generates $2P$ bytes of gradient data per training step per GPU. For a 70 billion parameter model, this is 140 GB of gradients that every GPU must send and receive.[^fn-history-ring]
[^fn-history-ring]: **The Baidu Breakthrough**: In 2017, researchers at Baidu [@gibiansky2017baidu] introduced the *Ring AllReduce* algorithm to the ML community. By arranging GPUs in a logical ring, they proved that gradient synchronization time could be made nearly independent of the number of GPUs ($N$), with each GPU sending only $2 \times ModelSize \times (N-1)/N \approx 2 \times ModelSize$ bytes in total, regardless of cluster size. This breakthrough enabled the "Linear Scaling" era of deep learning.
At frontier scale (hundreds of billions of parameters across thousands of GPUs), gradient synchronization dominates the training step time, consuming 30--70% of wall-clock time unless aggressive optimization techniques are applied. The remainder of this chapter develops those techniques systematically.
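To keep these magnitudes straight, here is a minimal back-of-envelope sketch (plain Python; the 70B-parameter and 1,024-GPU values are the ones used above, and the ring formula is the standard bandwidth-optimal cost model):

```python
# Back-of-envelope: gradient volume and ring AllReduce traffic per GPU.
def gradient_bytes(params: float, bytes_per_param: int = 2) -> float:
    """Gradient payload per step (BF16 = 2 bytes per parameter)."""
    return params * bytes_per_param

def ring_allreduce_bytes_per_gpu(msg_bytes: float, n_gpus: int) -> float:
    """Each GPU sends 2*(N-1)/N * M bytes (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes

m = gradient_bytes(70e9)  # 70B parameters -> 140 GB of gradients
print(f"Gradients/step: {m / 1e9:.0f} GB; "
      f"ring AllReduce: {ring_allreduce_bytes_per_gpu(m, 1024) / 1e9:.0f} GB sent per GPU")
```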
### The Physics of Data Movement {#sec-communication-collective-operations-physics-data-movement}

View File

@@ -95,6 +95,61 @@ The system failure rate becomes $N\lambda$, and @eq-system-mtbf expresses how th
$$ MTBF_{system} = \frac{1}{N\lambda} = \frac{MTBF_{component}}{N} $$ {#eq-system-mtbf}
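To see how quickly this bites, consider a quick numeric sketch (the 5-year component MTBF is a hypothetical, illustrative figure):

```python
# MTBF_system = MTBF_component / N: reliability erodes linearly with scale.
component_mtbf_h = 5 * 8760  # hypothetical 5-year per-component MTBF, in hours
for n in (100, 1_000, 10_000):
    print(f"N = {n:>6,}: system MTBF = {component_mtbf_h / n:8.1f} hours")
# N = 10,000 -> ~4.4 hours: at fleet scale, failure is routine, not exceptional.
```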
### The Young-Daly Law: Optimal Checkpointing {#sec-fault-tolerance-young-daly}
*When* failure is inevitable, the key engineering decision is how often to save progress. Checkpointing too frequently wastes time on I/O; checkpointing too rarely wastes time re-computing work after a failure.
The **Young-Daly formula**[^fn-young-daly] identifies the "sweet spot" that minimizes total wasted work.
[^fn-young-daly]: **The Young-Daly Law**: Independently derived by Young (1974) and Daly (2006). It provides the first-order approximation for the optimal checkpoint interval in a system with a constant failure rate, valid when the checkpoint write time is small relative to the MTBF.
::: {#fig-young-daly fig-env="figure" fig-pos="htb" fig-cap="**The Young-Daly Optimal Checkpoint**. Total wasted work is the sum of *checkpointing overhead* (which decreases with interval $\tau$) and *rework cost* (which increases with $\tau$). The minimum point defines the optimal interval $\tau_{opt} = \sqrt{2 \cdot T_{write} \cdot \text{MTBF}}$. For a cluster with 5-hour MTBF and 15-minute write time, the optimal interval is ~1.6 hours." fig-alt="Plot of Overhead vs Checkpoint Interval. A red curve for Rework Cost increases linearly. A blue curve for Checkpoint Overhead decreases hyperbolically. Their sum (green curve) shows a clear minimum point labeled Optimal Interval."}
``` {python}
#| label: fig-young-daly
#| echo: false
import numpy as np
import matplotlib.pyplot as plt
from mlsys import viz
viz.set_book_style()
COLORS = viz.COLORS
# Parameters for visualization
mtbf = 5.0 # hours
t_write = 0.25 # hours (15 mins)
tau = np.linspace(0.1, 5.0, 100)
over_ckpt = t_write / tau
over_rework = tau / (2 * mtbf)
total_waste = over_ckpt + over_rework
tau_opt = np.sqrt(2 * t_write * mtbf)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(tau, over_ckpt, label='Checkpoint Overhead ($T_{write}/\\tau$)', color=COLORS['BlueLine'], linestyle='--')
ax.plot(tau, over_rework, label='Expected Rework ($\\tau/2\\text{MTBF}$)', color=COLORS['RedLine'], linestyle='--')
ax.plot(tau, total_waste, label='Total Wasted Work', color=COLORS['GreenLine'], linewidth=2.5)
# Optimal point
ax.scatter([tau_opt], [np.sqrt(2*t_write/mtbf)], color='black', zorder=5)
ax.annotate(f'$\\tau_{{opt}} \\approx {tau_opt:.1f}$h',
xy=(tau_opt, np.sqrt(2*t_write/mtbf)),
xytext=(tau_opt + 0.2, 0.6),
arrowprops=dict(facecolor='black', shrink=0.05, width=1, headwidth=5))
ax.set_xlabel('Checkpoint Interval $\\tau$ (Hours)')
ax.set_ylabel('Fraction of Total Time Wasted')
ax.set_ylim(0, 1.0)
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
:::
The formula $ \tau_{opt} = \sqrt{2 \cdot T_{write} \cdot \text{MTBF}} $ reveals a critical scaling property: as clusters grow larger ($MTBF \downarrow$), we must checkpoint more frequently. This, in turn, demands higher-bandwidth storage systems (@sec-data-systems) to keep $T_{write}$ small, otherwise the "Checkpoint Tax" will consume most of the cluster's compute capacity.
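A short sketch makes the scaling property concrete (same 15-minute write time as the figure; the per-GPU MTBF is an illustrative assumption, and the first-order formula only holds while $T_{write} \ll \text{MTBF}$):

```python
import math

# Young-Daly: tau_opt = sqrt(2 * T_write * MTBF). At the optimum, the wasted
# fraction of wall-clock time is approximately sqrt(2 * T_write / MTBF).
t_write_h = 0.25             # 15-minute checkpoint write
component_mtbf_h = 5 * 8760  # hypothetical 5-year per-GPU MTBF
for n_gpus in (1_000, 10_000, 20_000):
    mtbf_sys = component_mtbf_h / n_gpus
    tau_opt = math.sqrt(2 * t_write_h * mtbf_sys)
    waste = math.sqrt(2 * t_write_h / mtbf_sys)
    print(f"N={n_gpus:>6,}: MTBF={mtbf_sys:5.2f} h, "
          f"tau_opt={tau_opt:4.2f} h, waste ~ {waste:5.1%}")
```

Doubling the fleet raises the minimum wasted fraction by a factor of $\sqrt{2}$; only faster storage (smaller $T_{write}$) claws it back.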
::: {.callout-notebook title="The 9s of Reliability"}
**Problem**: You have a cluster of **10,000 GPUs**. Each GPU is incredibly reliable, with **99.99%** availability (only 52 minutes of downtime per year). What is the probability that the **entire cluster** is up for **1 hour**?
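Assuming independent failures, the arithmetic is a one-liner:

```python
# P(all 10,000 GPUs up simultaneously), assuming independent failures.
p_cluster_up = 0.9999 ** 10_000
print(f"P(entire cluster up) = {p_cluster_up:.1%}")  # ~36.8%, since 0.9999^10000 ~ e^-1
```

Ten thousand four-nines components yield a cluster that is fully up barely a third of the time, which is why fleet design treats failure as the steady state.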

View File

@@ -163,6 +163,13 @@ To make the scheduling problem concrete, consider a research organization operat
The economic stakes make these decisions consequential at every scale. A `{python} cluster_size_str`-GPU cluster at $`{python} f"{gpu_hour_cost_usd:.0f}"` per GPU-hour costs $`{python} daily_cost_str` per day to operate, whether the GPUs are computing useful work or sitting idle in a queue. If scheduling inefficiencies leave `{python} f"{idle_fraction*100:.0f}"` percent of GPUs idle, that translates to $`{python} daily_waste_str` per day in wasted capacity, or over $`{python} annual_waste_str` million annually. Conversely, improving utilization from `{python} f"{util_low*100:.0f}"` percent to `{python} f"{util_high*100:.0f}"` percent effectively adds `{python} equivalent_gpus_str` GPUs worth of productive capacity without purchasing additional hardware. At this scale, a one-percentage-point improvement in utilization is worth more annually than the salary of the engineer who achieves it. Scheduling is not operational overhead; it is one of the highest-leverage engineering investments in ML infrastructure.
::: {.callout-note title="Bridge: Resource Orchestration"}
Fleet Orchestration is the "Resource Negotiator" of the fleet. It solves two critical problems that traditional schedulers aren't designed for:
* **Gang Scheduling**: The "all-or-nothing" requirement. A distributed training job needing 1,024 GPUs cannot make progress with 1,023. The scheduler must allocate the entire "gang" atomically.
* **Topology Awareness**: The "locality" requirement. To support **Tensor Parallelism** (covered in @sec-compute-design), the scheduler must ensure GPUs are placed within the same high-bandwidth rack, rather than scattered across the datacenter.
**Slurm** (from the HPC world) and **Kubernetes** (from the web world) are the two primary vehicles for implementing these policies.
:::
ML workloads present scheduling challenges that distinguish them fundamentally from traditional high-performance computing and cloud computing. **Gang scheduling**[^fn-gang] represents the most critical difference: a distributed training job requiring 1,024 GPUs cannot make progress with only 512. Unlike traditional HPC simulations that can often scale to whatever resources are available, synchronous data parallelism demands all-or-nothing allocation. Every worker must participate in every AllReduce operation, and a missing worker blocks all others. A scheduler that partially allocates resources creates deadlocks where multiple jobs each hold some GPUs while waiting for more, with none able to proceed. This all-or-nothing requirement means the scheduler cannot simply "pack jobs tightly" as a traditional bin packer would; it must reason about atomic, multi-resource allocations across the entire cluster.
[^fn-gang]: **Gang scheduling**: A scheduling policy that allocates resources for multi-component jobs atomically, ensuring all components start simultaneously or none do. First developed for parallel computing in the 1980s by Ousterhout [@ousterhout1982scheduling], gang scheduling prevents deadlock scenarios where jobs partially acquire resources and block each other. For ML training, gang scheduling ensures all workers are ready before training begins, avoiding wasted GPU cycles on partially started jobs.
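A toy admission loop (hypothetical, far simpler than anything in Slurm or Kubernetes) captures the all-or-nothing invariant:

```python
# Toy gang scheduler: admit a job only if its ENTIRE GPU request fits.
# Jobs that do not fit hold nothing, so no partial-allocation deadlock can form.
def gang_schedule(free_gpus: int, requests: list[int]) -> list[int]:
    admitted = []
    for need in requests:
        if need <= free_gpus:   # atomic, all-or-nothing admission
            admitted.append(need)
            free_gpus -= need
        # else: the job waits in queue without reserving any GPUs
    return admitted

print(gang_schedule(2048, [1024, 1024, 512]))  # [1024, 1024]; the 512-GPU job waits
```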

View File

@@ -1893,6 +1893,60 @@ The cache size grows with context as calculated by @eq-kv-cache-size:
$$\text{KV cache size} = 2 \times L \times H \times S \times B \times P$$ {#eq-kv-cache-size}
### The KV Cache Wall: Memory-Bound Capacity {#sec-inference-scale-kv-cache-wall}
*Why* can't we simply increase the batch size to maximize throughput in LLM serving? We hit the **KV Cache Wall**. Model weights represent a fixed "Static Tax" on GPU memory, while the KV cache grows linearly with both batch size and sequence length.
::: {#fig-kv-cache-wall fig-env="figure" fig-pos="htb" fig-cap="**The KV Cache Wall**. Total GPU memory usage for a 4-bit quantized 70B model (35GB weights) on an 80GB H100. While the model fits easily at small context lengths, the KV cache (sloped lines) eventually consumes all remaining HBM. At 128K context, a batch size of 2 is physically impossible on a single GPU ($35 \text{GB} + 84 \text{GB} > 80 \text{GB}$), forcing the system to either reduce batch size (killing throughput) or shard the model." fig-alt="Memory usage plot. Horizontal dashed line at 80GB shows capacity. Solid blue region at bottom is 35GB weights. Sloped lines for Batch 1, 4, 8 show memory rising with sequence length. Batch 8 hits the 80GB line very early (~16K tokens)."}
``` {python}
#| label: fig-kv-cache-wall
#| echo: false
import numpy as np
import matplotlib.pyplot as plt
from mlsys import viz
viz.set_book_style()
COLORS = viz.COLORS
# Parameters for Llama-70B GQA
weights_gb = 35.0 # 4-bit
hbm_cap_gb = 80.0
# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * precision
# = 2 * 80 * 8 * 128 * 2 bytes = 327,680 bytes/token ~= 0.328 GB per 1,000 tokens
gb_per_1k_tokens = 0.32768
seq_len_k = np.linspace(0, 128, 100) # up to 128k tokens
fig, ax = plt.subplots(figsize=(8, 5))
# Plot base weights
ax.fill_between(seq_len_k, 0, weights_gb, color=COLORS['BlueLine'], alpha=0.3, label='Weights (4-bit 70B)')
ax.axhline(y=weights_gb, color=COLORS['BlueLine'], linestyle='-', linewidth=1)
# Plot Batch Sizes
for b in [1, 2, 4, 8]:
    total_mem = weights_gb + (b * seq_len_k * gb_per_1k_tokens)
    ax.plot(seq_len_k, total_mem, label=f'Batch {b}')
# HBM Limit
ax.axhline(y=hbm_cap_gb, color=COLORS['RedLine'], linestyle='--', linewidth=2)
ax.text(5, hbm_cap_gb + 2, 'HBM Capacity (80GB)', color=COLORS['RedLine'], fontweight='bold')
ax.set_xlabel('Sequence Length (1,000s of Tokens)')
ax.set_ylabel('Total GPU Memory (GB)')
ax.set_ylim(0, 100)
ax.set_xlim(0, 128)
ax.legend(loc='upper left', fontsize=9)
ax.grid(True, alpha=0.2)
plt.show()
```
:::
This visualization reveals why the **Distribution Layer** must sometimes shard models that would otherwise fit on a single GPU. Sharding doesn't just reduce compute time; it provides the **Memory Headroom** needed to maintain high batch sizes for long-context requests. Without sharding, a 128K context request effectively "evicts" all other users from the GPU.
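A small helper (using the same illustrative numbers as the figure: 35 GB of weights, 80 GB of HBM, ~327,680 bytes of cache per token) shows how headroom translates into batch size:

```python
# Largest batch whose KV cache still fits beside the weights in HBM.
def max_batch(hbm_gb: float, weights_gb: float, seq_len: int,
              bytes_per_token: int = 327_680) -> int:
    headroom_gb = hbm_gb - weights_gb
    per_seq_gb = seq_len * bytes_per_token / 1e9
    return max(int(headroom_gb // per_seq_gb), 0)

for ctx in (8_000, 32_000, 128_000):
    print(f"{ctx:>7,}-token context: max batch = {max_batch(80.0, 35.0, ctx)}")
# 128K context -> batch 1: one long request monopolizes the accelerator.
```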
where:
- $L$ = number of layers

View File

@@ -302,6 +302,63 @@ These scale-induced challenges drive infrastructure investment by the largest AI
[^fn-exaflops]: **ExaFLOPS**: One quintillion (10^18) floating-point operations per second. For context, a single NVIDIA H100 GPU delivers approximately `{python} h100_fp8_tflops` TFLOPS (teraFLOPS, or 10^12 FLOPS) of FP8 compute; achieving 1 exaFLOPS thus requires roughly 500 such GPUs operating in parallel with perfect efficiency. Google's TPU v4 pods reaching 1.1 exaFLOPS represent one of the first single-system installations to cross this threshold. The scale progression from megaFLOPS (10^6, 1980s workstations) through gigaFLOPS (10^9, 1990s servers), teraFLOPS (10^12, 2000s GPUs), petaFLOPS (10^15, 2010s supercomputers), to exaFLOPS (10^18, 2020s AI clusters) illustrates the exponential growth enabling modern ML capabilities.
## The Master Map: Navigating Volume II {#sec-vol2-introduction-master-map}
Volume II follows a layered progression from the physical substrate to societal governance. Each Part addresses a fundamental "Scale Impediment" that prevents a single-machine solution from working at production scale.
### Part I: The Fleet (Core Infrastructure)
**The Impediment**: *Physical Limits*. No single server has enough memory, power, or cooling to train a frontier model.
* **Compute Infrastructure (@sec-compute-infrastructure)**: Building the engine. Mastering the physics of high-density silicon, liquid cooling, and megawatt-scale power ramp rates.
* **Network Fabrics (@sec-network-fabrics)**: The transmission. Connecting thousands of accelerators through a high-bandwidth "Gradient Bus" that acts as the cluster-scale system bus.
* **Scalable Data Systems (@sec-data-systems)**: The fuel line. Architecting storage hierarchies that can feed terabytes of data to hungry accelerators without stalling the math.
### Part II: Distributed ML (The Logic of Scale)
**The Impediment**: *The Coordination Tax*. Splitting math across machines creates synchronization bottlenecks and frequent failures.
* **Distributed Training (@sec-distributed-training-systems)**: Splitting the math. Strategies for partitioning trillion-parameter-scale models across thousands of GPUs.
* **Collective Communication (@sec-collective-communication)**: The traffic control. Implementing the coordination algorithms (AllReduce, AllGather) that bind independent nodes into a coherent computer.
* **Fault Tolerance (@sec-fault-tolerance-reliability)**: The immune system. Engineering for a regime where hardware fails every few hours, making recovery speed more important than uptime.
* **Fleet Orchestration (@sec-fleet-orchestration)**: The resource negotiator. Managing multi-tenant clusters where "Gang Scheduling" is required to prevent deadlocks.
### Part III: Deployment at Scale (The Serving Pipeline)
**The Impediment**: *Operational Economics*. Inference costs eventually dwarf training costs, requiring a fundamental shift from throughput to latency optimization.
* **Inference at Scale (@sec-inference-scale)**: The interface. Serving models to millions of users simultaneously while managing the "KV Cache Wall."
* **Performance Engineering (@sec-performance-engineering)**: The efficiency frontier. Closing the gap between hardware peak and actual throughput through kernel fusion and compilation.
* **Edge Intelligence (@sec-edge-intelligence)**: The frontier. Moving intelligence from the datacenter to the user's device, constrained by milliwatt power budgets.
* **Operations at Scale (@sec-ops-scale)**: The control plane. Monitoring the fleet's health, drift, and performance across global deployments.
### Part IV: The Responsible Fleet (The Governance Layer)
**The Impediment**: *Societal Impact*. At global scale, technical bugs become societal hazards, requiring governance as a first-class engineering invariant.
* **Security & Privacy (@sec-security-privacy)**: The armor. Defending the fleet against adversaries who seek to poison data or extract proprietary weights.
* **Robustness (@sec-robust-ai)**: The resilience. Ensuring models survive the chaotic, non-I.I.D. reality of the open world.
* **Sustainable AI (@sec-sustainable-ai)**: The endurance. Managing the "Energy Wall" and the lifecycle carbon footprint of industrial-scale AI.
* **Responsible Engineering (@sec-responsible-ai)**: The conscience. Aligning technical marvels with human values like fairness, transparency, and accountability.
## A Breed Apart: The ML Workload Character {#sec-vol2-introduction-breed-apart}
*Why* can't we simply use existing distributed systems like Apache Spark or standard web microservices to run the Machine Learning Fleet? While the underlying hardware—network, compute, storage—is identical, the **workload characteristics** of ML systems are fundamentally different from traditional distributed systems.
### Traditional vs. ML Fleet Dynamics
Traditional systems (e.g., a search engine or a banking database) optimize for **independent, asynchronous tasks**. A web server handles millions of requests, each isolated from the other. *When* one request fails, the others continue.
The Machine Learning Fleet, by contrast, operates under **Synchronous Tight Coupling**.
1. **Iterative Statefulness**: Traditional data processing is often "one-and-done." ML training repeats the same math millions of times, updating a massive shared state (the model weights).
2. **Barrier Synchronization**: In a synchronous training step, 10,000 GPUs must wait for the slowest worker to finish before any can proceed. This makes the fleet hypersensitive to "Stragglers": a 10% performance drop on one node can reduce the entire cluster's throughput by 10%, as the sketch after this list shows.
3. **Bisection Bandwidth Dominance**: A web service is often "Latency-Bound" (waiting for the user). An ML training job is "Bandwidth-Bound": it needs to move gigabytes of gradient data across the *entire* network every second. This requires non-blocking network topologies that traditional datacenters rarely implement.
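The straggler arithmetic referenced above is stark (illustrative numbers):

```python
# Barrier synchronization: the step time of the whole fleet is the max, not the
# mean, of per-worker times. One slow node taxes all 10,000 workers.
workers = [1.00] * 9_999 + [1.10]  # one worker 10% slower than nominal
step_time = max(workers)
print(f"Step time: {step_time:.2f}s vs 1.00s nominal "
      f"-> {step_time - 1.0:.0%} fleet-wide throughput loss")
```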
### The Shift to the Warehouse-Scale Computer
This textbook adopts the **Warehouse-Scale Computer (WSC)**[^fn-warehouse-scale] perspective. In traditional computing, the datacenter is a building that *houses* many computers. In the ML Fleet, the datacenter *is* the computer.
* The **Network Fabric** is the System Bus.
* The **Distributed Storage** is the Local Disk.
* The **Fleet Orchestrator** is the Operating System.
Mastering Volume II requires making this mental shift: you are no longer writing code for a CPU; you are writing logic for a 100-Megawatt computer spanning thousands of racks.
[^fn-warehouse-scale]: **The Warehouse-Scale Computer (WSC)**: First formalized by Barroso, Clidaras, and Hölzle at Google [@barroso2019datacenter]. They argued that at scale, the computer is no longer the server, but the entire datacenter facility. This perspective is essential for ML Systems, as it explains why cooling, power ramp rates, and optical network topologies are as important as the neural network architecture itself.
## AI Scaling Laws {#sec-vol2-intro-ai-scaling-laws-a043}
The infrastructure investments described in the preceding section did not arise from arbitrary organizational ambitions. They emerged from an empirical discovery: machine learning performance follows predictable mathematical relationships with scale. Understanding these scaling laws explains why distributed systems have become essential and reveals the constraints that shape their design.

View File

@@ -38,79 +38,134 @@ A single GPU can perform trillions of operations per second, but distributed tra
```{python}
#| label: network-fabrics-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ NETWORK FABRICS: CHAPTER-WIDE CONSTANTS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Chapter-wide setup for Network Fabrics.
# │
# │ Goal: Provide hardware specs and performance parameters for network analysis.
# │ Show: Bandwidth, latency, and topology scaling for modern fabrics.
# │ How: Centralize NDR/HDR InfiniBand, NVLink, and PCIe specs.
# │
# │ Imports: mlsys.constants, mlsys.formatting
# │ Exports: ib_ndr_*, nvlink_*, pcie_*, h100_*, fat_tree_hosts
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import (
INFINIBAND_NDR_BW, INFINIBAND_HDR_BW,
NVLINK_H100_BW, NVLINK_A100_BW,
PCIE_GEN5_BW, PCIE_GEN4_BW,
H100_FLOPS_FP16_TENSOR, H100_TDP, H100_MEM_CAPACITY, H100_MEM_BW,
A100_FLOPS_FP16_TENSOR,
B200_FLOPS_FP16_TENSOR, B200_MEM_BW,
Gbps, GB, TB, second, watt, GiB, TFLOPs, flop, byte, NS
)
from mlsys.formatting import fmt, sci, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class NetworkFabricsSetup:
"""
Namespace for Network Fabrics reference parameters.
Scenario: Mapping interconnect performance for distributed clusters.
"""
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# Interconnects
ib_ndr_raw = INFINIBAND_NDR_BW
ib_hdr_raw = INFINIBAND_HDR_BW
nvlink_h100_raw = NVLINK_H100_BW
nvlink_a100_raw = NVLINK_A100_BW
pcie5_raw = PCIE_GEN5_BW
# GPU specs
h100_flops_raw = H100_FLOPS_FP16_TENSOR
h100_mem_raw = H100_MEM_CAPACITY
h100_mem_bw_raw = H100_MEM_BW
h100_tdp_raw = H100_TDP
a100_flops_raw = A100_FLOPS_FP16_TENSOR
b200_flops_raw = B200_FLOPS_FP16_TENSOR
b200_mem_bw_raw = B200_MEM_BW
# Topology parameters
fat_tree_k = 64
ib_ndr_hop_ns = 500
alpha_ib_us_val = 1.5
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
ib_ndr_gbps_val = ib_ndr_raw.to(Gbps).magnitude
ib_ndr_gbs_val = (ib_ndr_raw / 8).to(GB/second).magnitude
nvlink_to_ib_ratio_val = nvlink_h100_raw.to(GB/second).magnitude / ib_ndr_gbs_val
# Fat-tree hosts for radix k: (k^3)/4
fat_tree_hosts_val = (fat_tree_k ** 3) // 4
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
check(ib_ndr_gbps_val == 400, f"NDR InfiniBand should be 400 Gbps, got {ib_ndr_gbps_val}")
check(nvlink_to_ib_ratio_val > 15, f"NVLink/IB ratio should be > 15x, got {nvlink_to_ib_ratio_val:.1f}")
check(fat_tree_hosts_val == 65536, f"Fat-tree hosts for k=64 should be 65536, got {fat_tree_hosts_val}")
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
ib_ndr_gbps = f"{ib_ndr_gbps_val:.0f}"
ib_hdr_gbps = f"{ib_hdr_raw.to(Gbps).magnitude:.0f}"
ib_ndr_gbs = f"{ib_ndr_gbs_val:.0f}"
ib_hdr_gbs = f"{(ib_hdr_raw / 8).to(GB/second).magnitude:.0f}"
nvlink_h100_gbs = f"{nvlink_h100_raw.to(GB/second).magnitude:.0f}"
nvlink_a100_gbs = f"{nvlink_a100_raw.to(GB/second).magnitude:.0f}"
pcie5_gbs = f"{pcie5_raw.to(GB/second).magnitude:.0f}"
h100_tflops = f"{h100_flops_raw.to(TFLOPs/second).magnitude:.0f}"
h100_tdp = f"{h100_tdp_raw.to(watt).magnitude:.0f}"
h100_mem = f"{h100_mem_raw.to(GiB).magnitude:.0f}"
h100_mem_bw = f"{h100_mem_bw_raw.to(TB/second).magnitude:.2f}"
a100_tflops = f"{a100_flops_raw.to(TFLOPs/second).magnitude:.0f}"
b200_tflops = f"{b200_flops_raw.to(TFLOPs/second).magnitude:,.0f}"
b200_mem_bw = f"{b200_mem_bw_raw.to(TB/second).magnitude:.0f}"
ib_ndr_latency_us = f"{ib_ndr_hop_ns / 1000:.1f}"
nvlink_to_ib_ratio = f"{nvlink_to_ib_ratio_val:.0f}"
fat_tree_hosts = fmt(fat_tree_hosts_val, precision=0)
# Performance model defaults
alpha_ib_us = f"{alpha_ib_us_val:.1f}"
beta_ib_gbs = f"{ib_ndr_gbs_val:.0f}"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
ib_ndr_gbps = NetworkFabricsSetup.ib_ndr_gbps
ib_hdr_gbps = NetworkFabricsSetup.ib_hdr_gbps
ib_ndr_gbs = NetworkFabricsSetup.ib_ndr_gbs
ib_hdr_gbs = NetworkFabricsSetup.ib_hdr_gbs
nvlink_h100_gbs = NetworkFabricsSetup.nvlink_h100_gbs
nvlink_a100_gbs = NetworkFabricsSetup.nvlink_a100_gbs
pcie5_gbs = NetworkFabricsSetup.pcie5_gbs
h100_tflops = NetworkFabricsSetup.h100_tflops
h100_tdp = NetworkFabricsSetup.h100_tdp
h100_mem = NetworkFabricsSetup.h100_mem
h100_mem_bw = NetworkFabricsSetup.h100_mem_bw
a100_tflops = NetworkFabricsSetup.a100_tflops
b200_tflops = NetworkFabricsSetup.b200_tflops
b200_mem_bw = NetworkFabricsSetup.b200_mem_bw
ib_ndr_latency_us = NetworkFabricsSetup.ib_ndr_latency_us
nvlink_to_ib_ratio = NetworkFabricsSetup.nvlink_to_ib_ratio
fat_tree_hosts = NetworkFabricsSetup.fat_tree_hosts
alpha_ib_us = NetworkFabricsSetup.alpha_ib_us
beta_ib_gbs = NetworkFabricsSetup.beta_ib_gbs
# Static FEC latency references
fec_latency_ns_low = 100
fec_latency_ns_high = 200
```
In the **Fleet Stack** (@sec-vol2-introduction), network fabrics form the connective tissue binding the Infrastructure Layer into a coherent whole. @sec-compute-infrastructure established the building blocks: accelerators, power delivery, and cooling. Those components define what each node can compute in isolation. This chapter examines how those nodes communicate, because at scale, communication cost dominates computation cost. The **Law of Distributed Efficiency** (@eq-distributed-efficiency) makes this explicit: the $T_{\text{sync}} / T_{\text{compute}}$ ratio in the Scaling Factor is determined almost entirely by the network fabric.
::: {.callout-note title="Bridge: Collective Communication"}
The physical network fabric exists to support **Collective Communication**—mathematical operations that involve every GPU in the fleet.
* **AllReduce**: Summing gradients from 10,000 GPUs so every GPU has the identical average. This is the "Heartbeat" of synchronous training.
* **AllGather**: Collecting different model portions so every GPU can see the full model state.
* **AllToAll**: The most demanding pattern, where every GPU sends unique data to every other GPU (critical for **Expert Parallelism**).
While @sec-collective-communication covers the *algorithms* for these patterns, this chapter covers the *physics* of the wires and switches that make them possible.
:::
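As a rough guide to the traffic these patterns generate, the standard bandwidth-optimal cost models are easy to write down (a sketch, independent of any particular NCCL implementation):

```python
# Approximate bytes each GPU must send for an M-byte payload on N GPUs.
def allreduce_bytes(m: float, n: int) -> float:
    return 2 * (n - 1) / n * m   # reduce-scatter + all-gather passes

def allgather_bytes(m: float, n: int) -> float:
    return (n - 1) / n * m       # each GPU broadcasts its 1/N shard

def alltoall_bytes(m: float, n: int) -> float:
    return (n - 1) / n * m       # same volume as all-gather, but every chunk is unique

m, n = 140e9, 10_000  # 140 GB of gradients across 10,000 GPUs
print(f"AllReduce: {allreduce_bytes(m, n) / 1e9:.0f} GB/GPU, "
      f"AllGather: {allgather_bytes(m, n) / 1e9:.0f} GB/GPU")
```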
::: {.callout-note title="Fleet Stack Connection"}
This chapter builds the **Network Layer** of the Fleet Stack. @sec-compute-infrastructure defined the compute nodes. Here we wire them together. The network fabric constrains every layer above it: @sec-distributed-training-systems cannot overlap communication with computation unless the fabric provides sufficient bandwidth, @sec-collective-communication cannot choose optimal algorithms without knowing the topology, and @sec-fault-tolerance-reliability must account for network partitions alongside node failures. The fabric's bandwidth and latency appear directly in the Scaling Factor of the Law of Distributed Efficiency.
@@ -119,11 +174,11 @@ This chapter builds the **Network Layer** of the Fleet Stack. @sec-compute-infra
@sec-compute-infrastructure established that a single H100 delivers `{python} h100_tflops` TFLOPS of FP16 throughput with `{python} h100_mem_bw` TB/s of memory bandwidth. Within a node, eight such accelerators communicate through NVLink at `{python} nvlink_h100_gbs` GB/s. But the moment computation crosses a node boundary, the available bandwidth drops by a factor of `{python} nvlink_to_ib_ratio` $\times$, from `{python} nvlink_h100_gbs` GB/s (NVLink) to `{python} ib_ndr_gbs` GB/s (NDR InfiniBand per port). This cliff, the transition from intra-node to inter-node communication, is the central challenge of network fabric design.
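The cliff is easiest to feel as a transfer time (a sketch using the chapter's own bandwidth figures):

```python
# Time to move a 1 GB tensor across each boundary, at the bandwidths above.
tensor_gb = 1.0
nvlink_gbs, ib_gbs = 900.0, 50.0          # NVLink vs NDR InfiniBand (per port)
t_intra_ms = tensor_gb / nvlink_gbs * 1e3
t_inter_ms = tensor_gb / ib_gbs * 1e3
print(f"NVLink: {t_intra_ms:.1f} ms; IB: {t_inter_ms:.1f} ms "
      f"({t_inter_ms / t_intra_ms:.0f}x slower off-node)")
```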
The remainder of this chapter proceeds from physics to systems. We begin with the physical medium itself, examining *why* signal integrity at `{python} ib_ndr_gbps` Gbps demands PAM4 encoding and optical interconnects. We then analyze the two dominant RDMA transport protocols, InfiniBand and RoCE, before studying *how* topology design shapes communication efficiency for ML workloads. The $\alpha$-$\beta$ performance model provides the quantitative framework for reasoning about these design choices. We conclude with congestion control, multi-tenancy, monitoring, and a case study of production-scale network architectures.
## Physics of the Wire {#sec-network-fabrics-physics}
Before analyzing protocols and topologies, we must understand the physical medium. Every network design decision is ultimately constrained by *what* the wire can carry. At `{python} ib_ndr_gbps` Gbps and beyond, the physics of signal transmission imposes hard limits on cable length, power consumption, and error rates. These are not engineering inconveniences to be solved with better technology; they are fundamental constraints that shape cluster geometry.
### Signal Integrity and PAM4 {#sec-network-fabrics-pam4}
@@ -164,9 +219,9 @@ This physics dictates **cluster geometry**. We pack accelerators as densely as p
Every high-speed port depends on a **Serializer/Deserializer (SerDes)** circuit that converts parallel data from the switch ASIC into a serial stream for transmission over the wire. The SerDes must compensate for signal degradation through equalization, a process that consumes significant power. A modern `{python} ib_ndr_gbps` Gbps port operates four lanes at 100 Gbps each (using PAM4 at approximately 53 GBaud per lane). Each lane's SerDes consumes 5 to 8 W, so a single `{python} ib_ndr_gbps` Gbps port requires 20 to 32 W just for signal processing.
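The power arithmetic compounds quickly at the switch level (a sketch; the 64-port radix is an assumption, matching the fat-tree example later in this chapter):

```python
# SerDes power: 4 lanes/port at 5-8 W/lane, scaled to an assumed 64-port switch.
lanes_per_port = 4
for w_per_lane in (5, 8):
    port_w = lanes_per_port * w_per_lane
    print(f"{w_per_lane} W/lane -> {port_w} W/port, "
          f"{64 * port_w / 1000:.1f} kW per 64-port switch, just for signal processing")
```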
The **link budget** quantifies how much signal loss a link can tolerate while maintaining acceptable error rates. It accounts for cable attenuation (which increases with frequency and length), connector losses, crosstalk from adjacent lanes, and the equalization capability of the SerDes. When the link budget is exceeded, the FEC cannot recover enough errors, and the effective throughput drops or the link fails entirely. This is *why* ML cluster designers carefully track cable lengths and connector counts: an extra connector or a slightly longer cable can push a link beyond its budget.
The physical layer constraints established in this section (PAM4 encoding, FEC latency, and the economics of copper versus optics) set hard boundaries on *what* the protocol and topology layers above can achieve. We now turn to those protocol choices.
## RDMA and Transport Protocols {#sec-network-fabrics-protocols}
@@ -182,7 +237,7 @@ Standard TCP/IP networking is CPU-intensive. Processing a `{python} ib_ndr_gbps`
[^fn-rdma]: **RDMA**: Remote Direct Memory Access. First developed for InfiniBand in the early 2000s by the InfiniBand Trade Association. The term "remote" distinguishes it from local DMA (used by PCIe devices), emphasizing that the target memory resides on a different machine.
For ML workloads, RDMA enables a critical optimization: **GPUDirect RDMA**[^fn-gpudirect], *where* the NIC transfers data directly between GPU memory on different nodes without staging through CPU memory. This eliminates two additional memory copies per transfer (GPU to CPU on the sender, CPU to GPU on the receiver), roughly halving the effective latency for gradient exchanges.
[^fn-gpudirect]: **GPUDirect RDMA**: An NVIDIA technology that allows third-party PCIe devices (such as network adapters) to directly access GPU memory. Requires compatible hardware (ConnectX NICs) and driver support.
@@ -224,39 +279,39 @@ Making Ethernet behave as a lossless fabric requires two mechanisms working in c
::: {#fig-ib-roce-stack fig-env="figure" fig-pos="htb" fig-cap="**High-Performance Networking Stacks**. Comparison of InfiniBand and RoCE protocol stacks. InfiniBand uses a native lossless fabric, while RoCE encapsulates RDMA traffic within UDP/IP packets. Both expose the same Verbs API to applications, but RoCE relies on Priority Flow Control (PFC) in the Ethernet layer to approximate InfiniBand's lossless guarantees." fig-alt="Two protocol stacks side by side. Left: InfiniBand with 5 native layers. Right: RoCEv2 with IB transport over UDP/IP and Ethernet. Orange arrows show kernel bypass path on both. Dashed lines connect shared Verbs API and IB Transport layers."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{OrchidL}{HTML}{F4E7F8}
\definecolor{SlateL}{HTML}{E5E7E9}
\definecolor{OrangeLine}{HTML}{CC5500}
\tikzset{
layer/.style={draw=black!70, fill=white, minimum width=3.2cm, minimum height=0.7cm, font=\sffamily\footnotesize},
header/.style={font=\bfseries\sffamily, align=center, crimson}
}
% InfiniBand Stack
\begin{scope}[local bounding box=IB]
\node[header] at (0, 5) {InfiniBand};
\node[layer, fill=OrchidL] (ib_verbs) at (0, 4) {Verbs API};
\node[layer, fill=OrchidL!50] (ib_trans) at (0, 3) {IB Transport};
\node[layer, fill=OrchidL!50] (ib_net) at (0, 2) {IB Network};
\node[layer, fill=OrchidL!50] (ib_link) at (0, 1) {IB Link};
\node[layer, fill=OrchidL!50] (ib_phy) at (0, 0) {IB Physical};
% RDMA bypass arrow
\draw[->, thick, OrangeLine] (-1.9, 4.3) -- (-1.9, 0.5) node[midway, left, align=center, font=\scriptsize\bfseries] {Kernel\\Bypass};
\end{scope}
% RoCE Stack
\begin{scope}[shift={(5.5,0)}, local bounding box=RoCE]
\node[header] at (0, 5) {RoCEv2};
\node[layer, fill=OrchidL] (roce_verbs) at (0, 4) {Verbs API};
\node[layer, fill=OrchidL!50] (roce_trans) at (0, 3) {IB Transport};
\node[layer, fill=SlateL] (roce_udp) at (0, 2) {UDP / IP};
\node[layer, fill=SlateL] (roce_eth) at (0, 1) {Ethernet Link (PFC)};
\node[layer, fill=SlateL] (roce_phy) at (0, 0) {Ethernet Physical};
% RDMA bypass arrow
\draw[->, thick, OrangeLine] (1.9, 4.3) -- (1.9, 0.5) node[midway, right, align=center, font=\scriptsize\bfseries] {Kernel\\Bypass};
\end{scope}
% Connectors
@@ -334,17 +389,19 @@ For ML training, dragonfly topologies are most effective when job placement can
::: {#fig-network-topologies fig-env="figure" fig-pos="htb" fig-cap="**Network Topologies for ML**. (A) Fat-Tree provides full bisection bandwidth through hierarchical switch layers. (B) Torus connects neighbors, optimizing for local communication patterns such as those in TPU pods. (C) Rail-Optimized designs dedicate switch infrastructure to corresponding accelerator positions across nodes, minimizing hop count for tensor parallelism." fig-alt="Three network topology diagrams. A shows hierarchical fat-tree with switch layers. B shows 2D torus grid with wraparound connections. C shows rail-optimized with direct GPU-to-GPU paths across nodes."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.85, transform shape]
\definecolor{NodeColor}{HTML}{D1E6F3}
\definecolor{SwitchColor}{HTML}{006395}
\definecolor{RailColor}{HTML}{CC5500}
\definecolor{GPUColor}{HTML}{F5D2D5}
\tikzset{
switch/.style={circle, fill=SwitchColor, draw=black!70, inner sep=2pt, minimum size=0.45cm, text=white, font=\tiny\bfseries},
node/.style={rectangle, fill=NodeColor, draw=black!70, inner sep=2pt, minimum size=0.4cm}
}
% Fat Tree
\begin{scope}
\node[anchor=south, crimson] at (2, 3.5) {\textbf{A. Leaf-Spine (Fat-Tree)}};
% Spine
\foreach \x in {0.5, 1.5, 2.5, 3.5} \node[switch] (s\x) at (\x, 3) {};
% Leaf
@@ -355,36 +412,36 @@ For ML training, dragonfly topologies are most effective when job placement can
% Connections
\foreach \s in {0.5, 1.5, 2.5, 3.5} {
\draw[black!40, thin] (s\s) -- (l0);
\draw[black!40, thin] (s\s) -- (l1);
\draw[black!40, thin] (s\s) -- (l3);
\draw[black!40, thin] (s\s) -- (l4);
}
\draw[black!60] (l0) -- (n0); \draw[black!60] (l0) -- (n0.5);
\draw[black!60] (l1) -- (n1); \draw[black!60] (l1) -- (n1.5);
\end{scope}
% Rail Optimized
\begin{scope}[shift={(6,0)}]
\node[anchor=south, crimson] at (1.5, 3.5) {\textbf{C. Rail-Optimized}};
% Rail Switches
\foreach \x in {0, 1, 2, 3} \node[switch, fill=RailColor] (rs\x) at (\x, 3) {R\x};
% Nodes
\node[draw=black!50, rounded corners=2pt, fit={(0,-0.5) (3, 0.5)}, inner sep=4pt, fill=black!5] (host1) {};
\node[anchor=west, font=\tiny\bfseries] at (host1.west) {Node 1};
\node[draw=black!50, rounded corners=2pt, fit={(0,-2.0) (3, -1.0)}, inner sep=4pt, fill=black!5] (host2) {};
\node[anchor=west, font=\tiny\bfseries] at (host2.west) {Node 2};
% GPUs
\foreach \x in {0, 1, 2, 3} {
\node[node, fill=GPUColor] (g1\x) at (\x, 0) {};
\node[node, fill=GPUColor] (g2\x) at (\x, -1.5) {};
\draw[thick, RailColor] (g1\x) -- (rs\x);
\draw[thick, RailColor] (g2\x) -- (rs\x);
}
\end{scope}
\end{tikzpicture}
@@ -395,6 +452,70 @@ The topology choice has quantitative consequences. To see why oversubscription i
::: {.callout-notebook title="Napkin Math: The Bisection Bottleneck"}
```{python}
#| echo: false
#| label: bisection-bottleneck
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BISECTION BOTTLENECK ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Network Topology - Bisection Bandwidth impact.
# │
# │ Goal: Quantify the slowdown from network oversubscription.
# │ Show: That cost-optimized (oversubscribed) networks throttle training.
# │ How: Calculate AllReduce time for 1:1 vs 4:1 oversubscription.
# │
# │ Imports: mlsys.constants, NetworkFabricsSetup.ib_ndr_gbs_val
# │ Exports: bisec_time_a_ms, bisec_time_b_ms, waste_millions
# └─────────────────────────────────────────────────────────────────────────────
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class BisectionBottleneck:
"""
Scenario: 1024 accelerators (128 nodes) running AllReduce.
Comparison of non-blocking (1:1) vs oversubscribed (4:1) fabric.
"""
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
num_nodes = 128
injection_bw_gbs = NetworkFabricsSetup.ib_ndr_gbs_val
gradient_size_gb = 100
oversub_ratio_b = 4
cluster_cost_m = 300
comm_fraction = 0.30
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# Bisection bandwidth (aggregate across the cut)
total_bisec_bw_a = num_nodes * injection_bw_gbs
total_bisec_bw_b = total_bisec_bw_a / oversub_ratio_b
# Time = Data / Bandwidth
time_a_s = gradient_size_gb / total_bisec_bw_a
time_b_s = gradient_size_gb / total_bisec_bw_b
# Economic impact
    # If communication is 30% of the step and slows by 4x, the step time grows to
    # (1 - comm_frac) + comm_frac * 4 = 0.7 + 1.2 = 1.9x the baseline, so relative
    # throughput = 1 / 1.9 ~= 0.526 and waste = (1 - 0.526) * cluster cost.
rel_throughput = 1 / ((1 - comm_fraction) + (comm_fraction * oversub_ratio_b))
waste_val = (1 - rel_throughput) * cluster_cost_m
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
check(time_b_s == time_a_s * 4, "Scenario B must be 4x slower than Scenario A")
check(waste_val > 10, f"Waste should be significant, got ${waste_val:.1f}M")
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
time_a_ms = f"{time_a_s * 1000:.0f}"
time_b_ms = f"{time_b_s * 1000:.0f}"
waste_m = f"{waste_val:.0f}"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
bisec_time_a_ms = BisectionBottleneck.time_a_ms
bisec_time_b_ms = BisectionBottleneck.time_b_ms
waste_millions = BisectionBottleneck.waste_m
```
**Problem**: You have 1,024 accelerators across 128 nodes. Each node has `{python} ib_ndr_gbps` Gb/s (`{python} ib_ndr_gbs` GB/s) injection bandwidth. You run an AllReduce job that requires full bisection bandwidth.
**Scenario A (non-blocking fat-tree)**: 1:1 oversubscription ratio. Bisection bandwidth = $128 \times 50 = 6{,}400$ GB/s = 6.4 TB/s.
@@ -405,10 +526,10 @@ The topology choice has quantitative consequences. To see why oversubscription i
Suppose the AllReduce must exchange 100 GB of gradient data (approximately 25 billion FP32 parameters).
- Time on Scenario A: `{python} bisec_time_a_ms` ms
- Time on Scenario B: `{python} bisec_time_b_ms` ms
**The systems conclusion**: Saving money on spine switches (Scenario B) slows every synchronization step of the entire cluster by approximately 4 $\times$. If AllReduce accounts for 30% of each training iteration, the cluster's effective throughput drops by nearly half. For a \$300 million supercomputer, this amounts to wasting approximately **\$`{python} waste_millions` million** worth of accelerator time waiting for the network. Network oversubscription is false economy for training workloads.
:::
@@ -460,31 +581,96 @@ This equation makes the critical insight visible: the bandwidth term is nearly i
::: {.callout-notebook title="Napkin Math: When Does AllReduce Become the Bottleneck?"}
```{python}
#| echo: false
#| label: allreduce-bottleneck
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ALLREDUCE BOTTLENECK ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Network Performance Modeling - Alpha-Beta Model.
# │
# │ Goal: Identify when communication intensity overwhelms computation.
# │ Show: Growth of comm overhead from 1B to 70B parameter models.
# │ How: Calculate T_compute vs T_ring_allreduce using Alpha-Beta parameters.
# │
# │ Imports: mlsys.constants, NetworkFabricsSetup.*
# │ Exports: t_comp_ms, t_comm_1b_ms, t_comm_70b_ms, comm_frac_1b
# └─────────────────────────────────────────────────────────────────────────────
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class AllReduceBottleneck:
"""
Scenario: Cluster of 1024 H100 GPUs training models of varying scale.
Alpha-Beta parameters for NDR InfiniBand.
"""
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
num_gpus = 1024
peak_tflops = float(NetworkFabricsSetup.h100_tflops)
gpu_util = 0.50
    flops_per_sample_base = 2e15  # Rough per-GPU FLOPs for one 1B-model iteration
alpha_s = float(NetworkFabricsSetup.alpha_ib_us) * 1e-6
beta_gbs = float(NetworkFabricsSetup.beta_ib_gbs)
params_small = 1e9
params_large = 70e9
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# Compute time
t_comp_s = flops_per_sample_base / (peak_tflops * 1e12 * gpu_util)
    # AllReduce time: 2(p-1)*alpha + 2(p-1)/p * m/beta.
    # Class-level names are bound as default arguments because a class body
    # does not form an enclosing scope for functions defined inside it.
    def calc_allreduce(m_bytes, p=num_gpus, a=alpha_s, b=beta_gbs):
        latency_term = 2 * (p - 1) * a
        bandwidth_term = (2 * (p - 1) / p) * (m_bytes / (b * 1e9))
        return latency_term + bandwidth_term
    t_comm_small_s = calc_allreduce(params_small * 4)  # FP32 gradients
    t_comm_large_s = calc_allreduce(params_large * 4)
    comm_frac_small = t_comm_small_s / (t_comp_s + t_comm_small_s)
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
check(t_comp_s > 1.0, f"Compute time should be significant (>1s), got {t_comp_s:.1f}s")
check(t_comm_large_s > t_comm_small_s * 50, "70B comm should be >50x 1B comm")
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
t_comp_ms = f"{t_comp_s * 1000:.0f}"
t_comm_1b_ms = f"{t_comm_small_s * 1000:.0f}"
t_comm_70b_ms = f"{t_comm_large_s * 1000:.0f}"
comm_frac_1b_pct = f"{comm_frac_small * 100:.1f}"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
t_comp_ms = AllReduceBottleneck.t_comp_ms
t_comm_1b_ms = AllReduceBottleneck.t_comm_1b_ms
t_comm_70b_ms = AllReduceBottleneck.t_comm_70b_ms
comm_frac_1b = AllReduceBottleneck.comm_frac_1b_pct
```
**Setup**: A cluster of 1,024 H100 GPUs training a model with 1 billion parameters (4 GB of FP32 gradients). Each GPU computes at `{python} h100_tflops` TFLOPS. The network uses NDR InfiniBand ($\alpha = `{python} alpha_ib_us` \;\mu\text{s}$, $\beta = `{python} beta_ib_gbs` \;\text{GB/s}$ per link).
**Step 1: Compute time per iteration.**
Assume each GPU processes a microbatch requiring $2 \times 10^{15}$ FLOPs (a rough estimate for a forward plus backward pass on a large batch). At `{python} h100_tflops` TFLOPS with 50% utilization:
$$ T_{\text{compute}} = `{python} t_comp_ms` \text{ ms} $$
**Step 2: Communication time (Ring AllReduce).**
With $p = 1{,}024$ and $m = 4 \times 10^9$ bytes:
- Latency term: $2 \times 1{,}023 \times 1.5 \times 10^{-6} \approx 3.1 \text{ ms}$
- Bandwidth term: $\frac{2 \times 1{,}023}{1{,}024} \times \frac{4 \times 10^9}{50 \times 10^9} \approx 160 \text{ ms}$
- Total Communication: $T_{\text{ring}} \approx `{python} t_comm_1b_ms` \text{ ms}$
**Step 3: Communication fraction.**
$$ \text{Comm. fraction} = `{python} comm_frac_1b` \% $$
With overlap between communication and computation (possible because the backward pass produces gradients layer by layer), the effective overhead can be reduced further. The network is not the bottleneck for this configuration.
**When does it become the bottleneck?** If we scale to 70 billion parameters (280 GB of gradients) and the per-GPU computation stays similar (through batch size scaling), the bandwidth term alone becomes:
$$ T_{\text{ring}} \approx `{python} t_comm_70b_ms` \text{ ms} $$
Now communication dominates computation by nearly 3 $\times$. This is precisely why models beyond a few billion parameters require tensor and pipeline parallelism to partition the model, rather than relying solely on data parallelism, which must AllReduce the full gradient vector.

View File

@@ -40,17 +40,74 @@ Power is not merely an operational expense but a hard physical constraint that l
```{python}
#| label: sustainable-ai-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SUSTAINABLE AI: CHAPTER-WIDE CONSTANTS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Chapter-wide setup for Sustainable AI.
# │
# │ Goal: Provide energy, carbon, and hardware specs for sustainability analysis.
# │ Show: The gap between compute growth and energy infrastructure.
# │ How: Centralize GPU TDP, grid intensities, and scaling statistics.
# │
# │ Imports: mlsys.constants, mlsys.formatting
# │ Exports: h100_*, gpt3_*, grid_*, energy_wall_*
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import (
H100_TDP, A100_TDP, GPT3_PARAMS, GPT3_TRAINING_OPS,
watt, BILLION, param, second, hour, TFLOPs, MILLION
)
from mlsys.formatting import fmt, sci, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class SustainableAISetup:
"""
Namespace for Sustainability reference parameters.
Scenario: Mapping the AI Energy Wall vs. Grid/Battery scaling.
"""
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# Hardware
h100_tdp_raw = H100_TDP
a100_tdp_raw = A100_TDP
# Grid & Carbon
grid_us_avg_g_kwh = 429
grid_quebec_g_kwh = 20
grid_coal_g_kwh = 1000
# Scaling (2012 -> 2024)
compute_growth_factor = 350000
battery_density_growth_annual = 0.05 # 5% per year
grid_efficiency_growth_annual = 0.02 # 2% per year
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# Compound growth over 12 years: (1 + r)^12
battery_12yr_gain = (1 + battery_density_growth_annual) ** 12
grid_12yr_gain = (1 + grid_efficiency_growth_annual) ** 12
# The Gap: Compute Growth / Energy Infrastructure Growth
energy_wall_gap = compute_growth_factor / battery_12yr_gain
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
check(energy_wall_gap > 100000, f"Energy wall gap should be massive, got {energy_wall_gap:.0f}")
check(battery_12yr_gain < 2.0, "Battery density doubles every ~15-20 years, not 12.")
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
h100_tdp_w = f"{h100_tdp_raw.to(watt).magnitude:.0f}"
a100_tdp_w = f"{a100_tdp_raw.to(watt).magnitude:.0f}"
gpt3_params_b = f"{GPT3_PARAMS.to(param).magnitude / BILLION:.0f}"
energy_wall_gap_str = fmt(energy_wall_gap, precision=0)
battery_gain_pct = f"{(battery_12yr_gain - 1) * 100:.0f}"
grid_gain_pct = f"{(grid_12yr_gain - 1) * 100:.0f}"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
h100_tdp_w = SustainableAISetup.h100_tdp_w
a100_tdp_w = SustainableAISetup.a100_tdp_w
gpt3_params_b = SustainableAISetup.gpt3_params_b
energy_wall_gap_str = SustainableAISetup.energy_wall_gap_str
battery_gain_pct = SustainableAISetup.battery_gain_pct
grid_gain_pct = SustainableAISetup.grid_gain_pct
```
This chapter's position in the book's organizing framework, *the Fleet Stack*, clarifies why energy and environmental constraints are not external concerns but physical limits that bound what the entire system can achieve.
@@ -246,6 +303,59 @@ Y,Date
: **Data Center Energy Projections**: Between 2010 and 2030, data center electricity usage is projected to increase sharply, particularly under worst-case scenarios where consumption could exceed 8,000 TWh by 2030 [@jones2018much]. The gap between best and worst scenarios—over 10 $\times$ difference—demonstrates the critical importance of efficiency optimization at every layer of the systems stack.
:::
### The Energy Wall: Divergent Scaling {#sec-sustainable-ai-energy-wall}
*Why* is AI sustainability a unique engineering challenge? It is a race between two fundamentally different physics: the **exponential scaling of logic** and the **linear scaling of energy infrastructure**.
::: {#fig-energy-wall fig-env="figure" fig-pos="htb" fig-cap="**The Energy Wall**. AI compute requirements (FLOPS) have grown ~350,000$\\times$ since 2012. In contrast, the physical substrate of energy (battery density and grid efficiency) improves by only ~2-5% annually. This `{python} energy_wall_gap_str`$\\times$ divergence creates the 'Energy Wall' where algorithmic ambition exceeds physical possibility." fig-alt="Log-scale plot showing AI compute growing exponentially while battery density and grid efficiency show nearly flat linear growth. Shaded region between curves marks the 'Sustainability Gap'."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{AIColor}{HTML}{CB202D}
\definecolor{GridColor}{HTML}{008F45}
\definecolor{OrangeLine}{HTML}{E87722} % assumed hex; defined locally so the gap arrow compiles standalone
% Axes
\draw[->, thick] (0,0) -- (8,0) node[right] {Year};
\draw[->, thick] (0,0) -- (0,5.5) node[above] {Growth (Log Scale)};
% Compute Growth (Steep)
\draw[ultra thick, AIColor] (0,0.5) .. controls (2,1) and (5,4) .. (7,5) node[right] {AI Compute Demand};
\node[AIColor, font=\tiny\bfseries] at (6, 4.5) {~350,000$\times$};
% Grid/Battery Growth (Shallow)
\draw[ultra thick, GridColor] (0,0.5) -- (7,1.5) node[right] {Grid/Battery Capacity};
\node[GridColor, font=\tiny\bfseries] at (6, 1.0) {~20-80\%};
% The Gap
\draw[<->, ultra thick, OrangeLine] (6.5, 1.4) -- (6.5, 4.7);
\node[OrangeLine, rotate=90, font=\bfseries] at (6.8, 3.0) {Sustainability Gap};
% Ticks
\node[below] at (0,0) {2012};
\node[below] at (7,0) {2024};
\end{tikzpicture}
```
:::
While AI logic follows the "Iron Law" of software optimization, energy follows the laws of chemistry and thermodynamics. Over the last 12 years, battery energy density has improved by only ~`{python} battery_gain_pct`%, and grid efficiency by ~`{python} grid_gain_pct`%. The `{python} energy_wall_gap_str`$\times$ gap between these two curves is the **Energy Wall**: the point where we can no longer "buy our way out" of the efficiency problem with more power.
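Concretely, compounding the chapter's assumed annual rates (5% for battery density, 2% for grid efficiency) over the 12-year window gives:

$$ 1.05^{12} \approx 1.80, \qquad 1.02^{12} \approx 1.27, \qquad \frac{350{,}000}{1.80} \approx 1.9 \times 10^{5} $$

an aggregate gain of roughly 80% and 27% respectively, set against a ~350,000$\times$ growth in compute demand. This ratio is exactly the `energy_wall_gap` computed in the chapter setup above.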
### Datacenter Grid Dynamics {#sec-sustainable-ai-grid-dynamics}
Sustainable AI requires looking beyond the server rack to the **Electrical Grid Interface**. Traditional datacenters are "Steady-State" loads; they pull constant power 24/7. ML training clusters, however, are **Transient Loads**.
#### The Power Ramp and Grid Stability
As discussed in @sec-compute-power-delivery, a 10,000-GPU cluster can swing its load by 5–10 megawatts in milliseconds during an AllReduce synchronization step. For an electrical utility, this is a "Noise Event." *When* thousands of GPUs suddenly stop computing to wait for the network, they cause a "Voltage Spike" on the grid; *when* they resume, they cause a "Voltage Sag."
Managing these transients requires **Energy Buffering**: using on-site battery arrays or massive capacitors to smooth the training iterations, ensuring the ML Fleet doesn't destabilize the local municipal power grid.
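To get a feel for the buffering requirement, the minimal sketch below sizes the energy needed to ride through one load swing. The swing magnitude, duration, and cadence are illustrative assumptions, not measurements from any specific facility.

```python
# Hedged sketch: sizing the energy buffer for one AllReduce load swing.
# All numbers are illustrative assumptions for a ~10,000-GPU cluster.
delta_p_w = 8e6    # assumed swing: 8 MW (within the 5-10 MW range above)
swing_s = 0.050    # assumed duration: 50 ms

energy_j = delta_p_w * swing_s    # energy the buffer must absorb or supply
energy_kwh = energy_j / 3.6e6     # joules -> kWh
print(f"Per-event buffer: {energy_j / 1e3:.0f} kJ ({energy_kwh:.2f} kWh)")
```

The striking result is that the *energy* involved is tiny (~400 kJ per event); the challenge is *power*. Delivering megawatts for tens of milliseconds favors capacitors or high-C-rate batteries over bulk storage.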
#### Heat Reuse: Turning Waste into Fuel
A datacenter is physically a system that converts high-quality energy (electricity) into low-quality energy (waste heat). In a sustainable fleet, this heat is not "exhausted" into the atmosphere but "harvested."
* **District Heating**: Modern facilities in Nordic regions (e.g., Meta's Odense facility) pipe waste heat into local municipal heating systems, providing enough thermal energy to warm thousands of homes.
* **Industrial Coupling**: Using low-grade waste heat (~45°C) for greenhouse climate control or water desalination.
By treating heat as a **Byproduct** rather than a **Pollutant**, the fleet moves toward a "Circular Energy Economy."[^fn-history-pue]
[^fn-history-pue]: **The Evolution of PUE**: In the early 2000s, PUE values of 2.02.5 were common, meaning more power was used for cooling than for computing. Google's 2009 disclosure of PUE 1.21 was a watershed moment, proving that "free-air cooling" and efficient power conversion could halve datacenter footprints. Today, the focus has shifted from PUE (rate efficiency) to **CUE (Carbon Usage Effectiveness)** and **WUE (Water Usage Effectiveness)**, reflecting a holistic view of planetary boundaries.
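The footnote's PUE values translate directly into overhead shares. A quick sketch, using the PUE figures cited above:

```python
# Overhead share implied by PUE: PUE = facility power / IT power,
# so the non-IT share is 1 - 1/PUE. Values match the footnote above.
for pue in (2.5, 2.0, 1.21, 1.1):
    print(f"PUE {pue:>4}: {1 - 1 / pue:.0%} of power is cooling/conversion overhead")
```

At PUE 2.0, half the electricity never reaches a chip; at 1.21, the overhead falls to about 17%.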
Training complex AI systems demands high levels of computing power, resulting in significant energy consumption. OpenAI's GPT-3 exemplifies this scale: training required 1,287 megawatt-hours of electricity, equivalent to powering 130 U.S. homes for an entire year [@maslej2023artificial].[^fn-sustainable-gpt3] This scale of consumption is inherent to how modern large language models are produced: iterative optimization over billions of parameters and massive datasets.[^fn-training-process]
[^fn-training-process]: **Training Process**: Iterative optimization of model parameters through forward passes (computing predictions), loss calculation, and backward passes (gradient computation via backpropagation). Modern training runs millions of iterations across distributed hardware. GPT-4 training consumed an estimated 50+ GWh over several months, with gradient synchronization and checkpointing adding 15-30% communication overhead.
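Energy alone understates the variance in impact. Multiplying GPT-3's 1,287 MWh by the grid intensities introduced in the chapter setup shows that the same training run spans a ~50$\times$ range in emissions depending on where it executes. The figures below are illustrative; real emissions depend on the facility's actual power mix.

```python
# Same training energy, very different carbon: GPT-3's 1,287 MWh under the
# grid intensities from the chapter setup (gCO2/kWh). Figures are illustrative.
TRAINING_KWH = 1_287 * 1_000  # 1,287 MWh expressed in kWh
for grid, g_per_kwh in [("Quebec hydro", 20), ("US average", 429), ("Coal-heavy", 1000)]:
    tonnes_co2 = TRAINING_KWH * g_per_kwh / 1e6  # grams -> tonnes
    print(f"{grid:12s}: {tonnes_co2:8,.0f} tCO2e")
```

The spread, roughly 26 tonnes on hydro power versus nearly 1,300 tonnes on a coal-heavy grid, is why siting and scheduling decisions matter as much as hardware efficiency.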