Migrates diagrams to SVG and adds dynamic visualizations

Converts the numerous TikZ diagrams embedded in the Quarto markdown files into external SVG images. This improves rendering performance, streamlines content management, and enhances cross-platform consistency.

Introduces interactive "Napkin Math" `callout-notebook` blocks, featuring Python code to generate dynamic visualizations for key system trade-offs and scenarios. Expands the `mlsysim` library with new constants and plotting utilities to support these interactive calculations and comparisons.
Vijay Janapa Reddi
2026-03-07 16:15:40 -05:00
parent a2038a0121
commit c24ab1ccb9
20 changed files with 688 additions and 2533 deletions


@@ -9,59 +9,7 @@ Machine learning systems at scale do not operate in a vacuum; they are an extens
Machine learning systems at scale do not operate in a vacuum; they are an extension of the physical and algorithmic constraints that govern single-machine computation. This appendix provides a rigorous review of those foundations—spanning the memory hierarchy, the mechanics of backpropagation, and the "Iron Law" of performance. We treat these not as introductory topics, but as the first-order constraints that determine which distributed architectures are viable and which are physically impossible. @fig-fleet-vs-single-stack illustrates the shift from single-machine to fleet-scale optimization.
::: {#fig-fleet-vs-single-stack fig-env="figure" fig-pos="htb" fig-cap="**The Evolution of the Stack**. Moving from a single machine to the fleet shifts the systems focus from local hierarchies to global networks. (A) The **Single-Machine Stack** optimizes for the local Memory Wall (SRAM/HBM/PCIe). (B) The **Fleet Stack** optimizes for the global Bisection Bandwidth Wall (NVLink/InfiniBand/Ethernet). This volume assumes mastery of (A) and focuses on the engineering of (B)." fig-alt="Side-by-side stack diagram. Single-machine stack shows HW, OS, Framework, Application. Fleet stack shows Infrastructure, Distribution, Serving, Governance."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\tikzset{
layer/.style={
draw=black!60,
line width=0.75pt,
rounded corners=1pt,
minimum width=3.5cm,
minimum height=0.6cm,
align=center
},
stacklabel/.style={
font=\bfseries,
anchor=south
},
bottleneck/.style={
font=\tiny\itshape,
text=RedLine,
anchor=north
},
connection/.style={
->,
line width=1.0pt,
color=GreenLine
}
}
% Single Machine
\begin{scope}[local bounding box=single]
\node[layer, fill=GreenL] (S1) {\textbf{Hardware (PCIe/HBM)}};
\node[layer, fill=BlueL, above=of S1] (S2) {Operating System};
\node[layer, fill=OrangeL, above=of S2] (S3) {ML Framework};
\node[layer, fill=RedL, above=of S3] (S4) {Application};
\node[stacklabel, above=0.2cm of S4] (S_title) {A. Single-Machine Stack};
\node[bottleneck, below=0.1cm of S1] (S_bot) {Bottleneck: Memory Wall};
\end{scope}
% The Fleet
\begin{scope}[local bounding box=fleet]
\node[layer, fill=GreenL, right=2cm of S1] (F1) {\textbf{Infra (Fabric/RDMA)}};
\node[layer, fill=BlueL, above=of F1] (F2) {Distribution};
\node[layer, fill=OrangeL, above=of F2] (F3) {Serving};
\node[layer, fill=RedL, above=of F3] (F4) {Governance};
\node[stacklabel, above=0.2cm of F4] (F_title) {B. Fleet Stack};
\node[bottleneck, below=0.1cm of F1] (F_bot) {Bottleneck: Bisection Wall};
\end{scope}
\draw[connection] (single) -- (fleet) node[midway, above, font=\tiny, text=black] {Scaling};
\end{tikzpicture}
```
![](images/svg/_system-map.svg){width=100%}
:::
## A.1 The Three Computing Paradigms {#sec-appendix-systems-paradigms}


@@ -79,7 +79,13 @@ The speed of light sets the latency floor for all communication. In a vacuum, li
Higher bandwidth demands more signaling energy per bit, creating a bandwidth-distance product constraint. Modern InfiniBand NDR operates at 100 Gbps per lane using PAM4 signaling (4 voltage levels per symbol), which requires more precise analog circuits than the simpler NRZ (2 levels) used by earlier generations. This precision comes at a cost: the maximum reach of a copper cable at NDR rates is approximately 2 meters before the signal degrades below recoverable levels. Longer distances require active optical cables (AOCs) or fiber transceivers that convert electrical signals to light and back, adding both cost (hundreds of dollars per link) and latency (nanoseconds per conversion). The bandwidth-distance product is a fundamental constraint: a link can have high bandwidth or long distance, but not both cheaply.
Moving data also costs energy that scales with distance. Accessing data from local SRAM costs roughly 0.5 pJ/bit. Moving it across a PCB (NVLink) costs 5--10 pJ/bit. Moving it across a datacenter via InfiniBand optical links costs 20--50 pJ/bit. At the exascale (tens of thousands of GPUs), the power budget for communication rivals the power budget for computation itself. A 10,000-GPU cluster exchanging 1 GB of gradients per step at 30 pJ/bit consumes approximately 2.4 kJ per AllReduce, a non-trivial fraction of the total per-step energy budget.
Moving data also costs energy that scales with distance. Accessing data from local SRAM costs roughly 0.5 pJ/bit. Moving it across a PCB (NVLink) costs 5--10 pJ/bit. Moving it across a datacenter via InfiniBand optical links costs 20--50 pJ/bit.
::: {#fig-data-movement-hierarchy fig-env="figure" fig-pos="htb" fig-cap="**The Data Movement Energy Hierarchy**. The energy cost of moving a single bit increases by orders of magnitude as it traverses the system hierarchy, from local SRAM through HBM and intra-node NVLink to inter-node InfiniBand fabrics. This energy gradient makes data locality the primary driver of sustainable and efficient distributed ML system design." fig-alt="Pyramid or tiered diagram showing increasing energy cost (pJ per bit) from local SRAM (base) up to remote datacenter nodes (top)."}
![](images/svg/data-movement-hierarchy.svg){width=100%}
:::
At the exascale (tens of thousands of GPUs), the power budget for communication rivals the power budget for computation itself. A 10,000-GPU cluster exchanging 1 GB of gradients per step at 30 pJ/bit consumes approximately 2.4 kJ per AllReduce, a non-trivial fraction of the total per-step energy budget.
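As a quick check on that figure, the following sketch reproduces the estimate from the quantities quoted above (1 GB of gradients per GPU, 30 pJ/bit, 10,000 GPUs); only the unit conversions are added.

```python
# Napkin check on the communication energy figure quoted above.
# All inputs come from the surrounding text; only unit conversions are added.

GRADIENT_BYTES_PER_GPU = 1e9      # 1 GB of gradients exchanged per step
ENERGY_PER_BIT_J = 30e-12         # 30 pJ/bit for datacenter-scale optical links
NUM_GPUS = 10_000                 # fleet size from the example

bits_moved_per_gpu = GRADIENT_BYTES_PER_GPU * 8
energy_per_gpu_j = bits_moved_per_gpu * ENERGY_PER_BIT_J          # ~0.24 J
energy_per_allreduce_kj = energy_per_gpu_j * NUM_GPUS / 1e3       # ~2.4 kJ

print(f"Energy per GPU per step: {energy_per_gpu_j:.2f} J")
print(f"Energy per fleet-wide AllReduce: {energy_per_allreduce_kj:.1f} kJ")
```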
Beyond these physical limits, protocol overhead adds a per-message software tax. Traditional TCP/IP stacks incur microseconds of OS kernel overhead per packet: the CPU must process socket calls, traverse the kernel networking stack, copy data between user and kernel buffers, and interact with the NIC through device drivers. High-performance ML networks bypass the kernel entirely using **RDMA (Remote Direct Memory Access)**[^fn-zerocopy-rdma], allowing the network card (NIC) to read directly from GPU memory via the PCIe bus. RDMA eliminates the kernel traversal, reducing per-message overhead from 10--20 $\mu$s (TCP) to 1--3 $\mu$s (RDMA). The most advanced configuration, **GPUDirect RDMA**, further eliminates the CPU from the data path: the NIC reads from GPU HBM through a direct PCIe peer-to-peer transfer, without the data ever touching CPU DRAM.
@@ -478,6 +484,10 @@ class LogPOverlap:
**Effective time**: max(`{python} LogPOverlap.compute_str`, `{python} LogPOverlap.L_str` + `{python} LogPOverlap.two_o_str`) = `{python} LogPOverlap.effective_str` $\mu$s (communication hidden!).
::: {#fig-comm-compute-overlap fig-env="figure" fig-pos="htb" fig-cap="**Communication-Computation Overlap**. By pipelining collective operations with the backward pass, systems can hide network latency behind arithmetic execution. Overlap is successful only when the non-overlappable overhead ($o$) is minimized, allowing the bulk of the transfer ($L$) to proceed while the GPU computes the next layer's gradients." fig-alt="Timeline diagram showing overlapping Compute and Communication bars. Computation for layer N overlaps with Communication for layer N-1."}
![](images/svg/comm-compute-overlap.svg){width=100%}
:::
**The Systems Insight**: The α-β model captures the total communication time. The LogP model reveals **how much of it can be hidden**. When designing pipelined training, optimize for low $o$ (kernel bypass, GPUDirect) rather than high $\beta$ alone.
:::
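The rule in the callout, effective time = max(compute, $L + 2o$), can be sanity-checked in a few lines. The numbers below are illustrative placeholders, not the values defined by the `LogPOverlap` class above.

```python
# Communication-computation overlap under a simplified LogP model.
# Illustrative values; the chapter's LogPOverlap class defines the real ones.

compute_us = 180.0   # backward-pass compute for the next layer (assumed)
L_us = 120.0         # network transit time for the gradient message (assumed)
o_us = 5.0           # non-overlappable send/receive overhead per end (assumed)

exposed_comm_us = L_us + 2 * o_us                 # what the GPU would otherwise wait for
effective_us = max(compute_us, exposed_comm_us)   # overlap hides the smaller term

hidden = exposed_comm_us <= compute_us
print(f"Exposed communication: {exposed_comm_us:.0f} us")
print(f"Effective step time:   {effective_us:.0f} us "
      f"({'communication hidden' if hidden else 'communication exposed'})")
```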
@@ -609,6 +619,10 @@ If a GPU simply opens a socket and sends a massive gradient to another GPU, the
:::
::: {#fig-collective-primitives-overview fig-env="figure" fig-pos="htb" fig-cap="**The Six Core Collective Primitives**. Standardized patterns for group communication in distributed systems. (1) Broadcast: rank 0 sends to all. (2) Reduce: all aggregate to rank 0. (3) AllReduce: all aggregate and all receive result. (4) AllGather: all send to all and concatenate. (5) ReduceScatter: all aggregate and results are distributed in shards. (6) AllToAll: every process sends unique data to every other process." fig-alt="Grid of 6 diagrams illustrating Broadcast, Reduce, AllReduce, AllGather, ReduceScatter, and AllToAll patterns across a process group."}
![](images/svg/collective-primitives-overview.svg){width=100%}
:::
Different model architectures stress different operations, and selecting the wrong primitive for a workload creates unnecessary bottlenecks.
### The Six Core Primitives
@@ -1306,22 +1320,8 @@ class HierarchicalAllreduceCalc:
These three phases confine most traffic within each node before crossing the slower inter-node fabric.
::: {.callout-note title="Hierarchical AllReduce Phases"}
```{.tikz fig-alt="Three-phase hierarchical AllReduce showing two nodes with 4 GPUs each. Phase 1 shows intra-node ReduceScatter via NVLink. Phase 2 shows inter-node AllReduce via InfiniBand between corresponding GPUs. Phase 3 shows intra-node AllGather via NVLink."}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{NodeColor}{RGB}{245,245,250}
\definecolor{GpuColor}{RGB}{180,200,240}
\definecolor{NVColor}{RGB}{0,100,200}
\definecolor{IBColor}{RGB}{220,100,50}
\tikzset{
nodebox/.style={draw=black!40, fill=NodeColor, rounded corners=6pt, minimum width=4.2cm, minimum height=2.8cm},
gpu/.style={draw=black!60, fill=GpuColor, thick, rounded corners=2pt, minimum width=0.7cm, minimum height=0.5cm, font=\tiny\bfseries},
nv_link/.style={draw=NVColor, line width=1.5pt},
ib_link/.style={draw=IBColor, line width=2pt, dashed},
phase_label/.style={font=\scriptsize\bfseries, rounded corners=2pt, inner sep=3pt}
}
% Node 0
\node[nodebox] (n0) at (0,0) {};
\node[font=\footnotesize\bfseries] at (0, -1.8) {Node 0};
```
![](images/svg/hierarchical-allreduce.svg){width=100%}
:::
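A rough bandwidth-only model makes the benefit of the three phases concrete. The sketch below compares a flat AllReduce, whose pace is set by the slow inter-node links, against the hierarchical schedule, using the NVLink (~900 GB/s) and InfiniBand (~50 GB/s) figures quoted elsewhere in this appendix; the 1 GB gradient size, 8 GPUs per node, and 16 nodes are illustrative assumptions.

```python
# Bandwidth-only (beta-term) comparison of flat vs. hierarchical AllReduce.
# Assumed: 1 GB gradients, 8 GPUs/node, 16 nodes, 900 GB/s NVLink, 50 GB/s IB.

S = 1.0          # gradient size per GPU in GB
G = 8            # GPUs per node
N = 16           # nodes
BW_NVLINK = 900  # GB/s, intra-node
BW_IB = 50       # GB/s, inter-node

# Flat ring AllReduce: every link, including the slow inter-node hops,
# carries ~2S bytes, so the InfiniBand link sets the pace.
flat_s = 2 * S * (G * N - 1) / (G * N) / BW_IB

# Hierarchical: ReduceScatter and AllGather stay on NVLink; only a 1/G shard
# of the gradient crosses the inter-node fabric.
phase1 = S * (G - 1) / G / BW_NVLINK          # intra-node ReduceScatter
phase2 = 2 * (S / G) * (N - 1) / N / BW_IB    # inter-node AllReduce on the shard
phase3 = S * (G - 1) / G / BW_NVLINK          # intra-node AllGather
hier_s = phase1 + phase2 + phase3

print(f"Flat over IB: {flat_s*1e3:.1f} ms")
print(f"Hierarchical: {hier_s*1e3:.1f} ms ({flat_s/hier_s:.1f}x faster)")
```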
@@ -1461,6 +1461,10 @@ class ThreeLevelNapkin:
### In-Network Reduction: SHARP and Beyond {#sec-communication-sharp}
::: {#fig-sharp-innetwork fig-env="figure" fig-pos="htb" fig-cap="**In-Network Reduction (SHARP)**. Traditional Tree AllReduce performs reduction at intermediate GPUs, incurring multiple store-and-forward delays. SHARP offloads the reduction to the network switch ASIC, allowing partial sums to be aggregated at line rate as packets traverse the switch. This eliminates GPU memory traffic and store-and-forward latency, significantly accelerating small-to-medium message collectives." fig-alt="Two-panel comparison. Top: Software Tree AllReduce with data flowing through GPUs. Bottom: In-Network SHARP with data aggregated at the switch."}
![](images/svg/sharp-innetwork.svg){width=100%}
:::
Hierarchical AllReduce reduces the *volume* of cross-node traffic, but the aggregation still requires multiple network round-trips. An alternative approach eliminates round-trips entirely by performing the reduction *inside the network switch itself*. NVIDIA's **Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)**[^fn-sharp-innetwork] implements this idea: instead of gradients traveling to a destination GPU for summation, the InfiniBand switch aggregates partial sums as data packets pass through it.
[^fn-sharp-innetwork]: **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)**: In-network computing on Quantum InfiniBand switches that performs reduction (sum, min, max) in the switch ASIC with sub-microsecond latency, eliminating the store-and-forward overhead at each intermediate GPU. Production deployments report 2--3$\times$ AllReduce speedups for 1--64 MB gradients, though the number of concurrent aggregation trees is limited by switch resources, creating contention in multi-tenant clusters. \index{SHARP!in-network reduction}
@@ -1481,6 +1485,10 @@ The topology detection process begins with hardware enumeration. NCCL queries th
#### Torus Topology (TPU Pods)
::: {#fig-torus-reduction fig-env="figure" fig-pos="htb" fig-cap="**Dimension-Ordered Reduction on Torus Topology**. In a 3D torus mesh (typical of TPU pods), AllReduce is decomposed into three sequential steps along the X, Y, and Z dimensions. By reducing along one dimension at a time, the system minimizes network diameter and avoids link contention, ensuring that each direct neighbor link operates at full bandwidth." fig-alt="3D grid diagram of TPUs. Arrows show sequential reduction along X-axis rings, then Y-axis rings, then Z-axis rings."}
![](images/svg/torus-reduction.svg){width=100%}
:::
Google's TPU pods use a 3D torus topology where each TPU connects directly to 6 neighbors ($\pm$X, $\pm$Y, $\pm$Z). Unlike the hierarchical fat-tree topology of InfiniBand clusters, the torus provides uniform, direct connectivity: every TPU chip has the same number of links (6) and the same per-link bandwidth, regardless of its position in the mesh. The optimal AllReduce strategy for this topology is **dimension-ordered reduction**:
1. Reduce along the X dimension (all TPUs in the same YZ plane)


@@ -74,7 +74,8 @@ from mlsysim.core.constants import (
WSE3_CORES, WSE3_MEM_CAPACITY, WSE3_MEM_BW, WSE3_TDP,
TPUV5P_MEM_BW, NVLINK_H100_BW, INFINIBAND_NDR_BW, PCIE_GEN5_BW, SYSTEM_MEMORY_BW,
CLOUD_ELECTRICITY_PER_KWH, USD, GPT3_PARAMS,
Mparam, Bparam, TFLOPs, flop, param, MFLOPs, GFLOPs
Mparam, Bparam, TFLOPs, flop, param, MFLOPs, GFLOPs,
LEAD_TIME_GPU_MONTHS, LEAD_TIME_SUBSTATION_MONTHS, GRID_INTERCONNECTION_QUEUE_US_GW
)
from mlsysim.fmt import fmt, sci, check, md, md_math
@@ -245,6 +246,10 @@ Wafer-scale engines sit at a unique point on the spectrum: they are highly speci
The key insight from @fig-wafer-scale-engine is the trade-off that wafer-scale integration accepts: manufacturing complexity and defect-aware routing in exchange for eliminating the inter-chip communication bottleneck entirely, keeping all 900,000 cores within nanoseconds of each other on a single silicon fabric.
::: {#fig-accelerator-spectrum fig-env="figure" fig-pos="htb" fig-cap="**The Accelerator Spectrum**. The landscape of ML hardware involves a fundamental trade-off between programmability and efficiency. General-purpose CPUs provide maximum flexibility but low compute density. Moving toward custom ASICs and systolic arrays increases throughput per watt by hardwiring common dataflows, at the cost of supporting a narrower range of model architectures." fig-alt="Continuum diagram from General Purpose (left) to Domain Specific (right). Nodes: CPU, GPU, TPU, Wafer-Scale Engine, Custom ASIC. Arrows show increasing Efficiency and decreasing Programmability."}
![](images/svg/accelerator-spectrum.svg){width=100%}
:::
| **Feature** | **CPU** | **GPU** (H100) | **TPU** (v5p) | **Wafer-Scale** | **Custom ASIC** |
|:--------------------|:-------------:|:--------------:|:--------------:|:---------------:|:---------------:|
| **Arithmetic Core** | Scalar/Vector | Tensor Core | Systolic Array | RISC-style Core | Fixed Dataflow |
@@ -470,7 +475,13 @@ While **Archetype A (GPT-4)** is primarily throughput-bound (demanding more TFLO
:::
The bandwidth gap between registers and HBM is approximately 1,000$\times$. If an operand must be fetched from HBM for a single operation, the arithmetic unit spends 99.9% of its time stalling. High Model FLOPS Utilization (MFU) is only possible through aggressive **tiling**: breaking the massive weight matrices into small tiles that fit entirely within shared memory and registers, then performing as many multiply-accumulate operations on each tile as possible before evicting it. Despite the `{python} InfraSetup.frontier_params_b`B model's massive total memory footprint, the active working set at any given microsecond must be meticulously managed to reside in that top 30 MB of register space, or the chip's theoretical performance becomes a mirage.
The bandwidth gap between registers and HBM is approximately 1,000$\times$. If an operand must be fetched from HBM for a single operation, the arithmetic unit spends 99.9% of its time stalling. High Model FLOPS Utilization (MFU) is only possible through aggressive **tiling**: breaking the massive weight matrices into small tiles that fit entirely within shared memory and registers, then performing as many multiply-accumulate operations on each tile as possible before evicting it.
::: {#fig-tensor-core-tiling fig-env="figure" fig-pos="htb" fig-cap="**Tensor Core Tiling Strategy**. To overcome the HBM bandwidth bottleneck, GPUs decompose large matrix operations into small tiles that fit within the Streaming Multiprocessor's (SM) fast SRAM. Data is loaded once into SRAM and reused across multiple Tensor Core operations, effectively multiplying the arithmetic value of every byte fetched from HBM." fig-alt="Diagram showing a large matrix divided into a grid of tiles. One tile is highlighted and shown being loaded into a smaller SM-local memory block, then feeding into a dense Tensor Core unit."}
![](images/svg/tensor-core-tiling.svg){width=100%}
:::
Despite the `{python} InfraSetup.frontier_params_b`B model's massive total memory footprint, the active working set at any given microsecond must be meticulously managed to reside in that top 30 MB of register space, or the chip's theoretical performance becomes a mirage.
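The leverage of tiling can be expressed as arithmetic intensity: a $T \times T \times T$ tile performs $2T^3$ multiply-accumulates while fetching only $2T^2$ operand tiles. The sketch below uses a roughly H100-class ratio of peak FP16 throughput to HBM bandwidth; the hardware numbers are ballpark assumptions rather than chapter constants, and real kernels gain additional reuse from registers, L2, and persistence across the K loop, so this single-level model is conservative.

```python
# Why tiling matters: arithmetic intensity vs. the accelerator's balance point.
# Hardware numbers are rough H100-class assumptions (FP16), not chapter constants.

PEAK_FLOPS = 990e12        # ~990 TFLOPS FP16 tensor-core peak (assumed)
HBM_BW = 3.35e12           # ~3.35 TB/s HBM bandwidth (assumed)
BYTES_PER_ELEM = 2         # FP16

balance = PEAK_FLOPS / HBM_BW   # FLOPs needed per HBM byte to stay compute-bound

# No reuse: every multiply-accumulate fetches two fresh operands from HBM.
naive_intensity = 2 / (2 * BYTES_PER_ELEM)            # 0.5 FLOP/byte

# T x T x T tile kept in SRAM: 2*T^3 FLOPs per 2*T^2 operand tiles fetched.
def tile_intensity(T):
    return (2 * T**3) / (2 * T**2 * BYTES_PER_ELEM)   # = T / 2 FLOP/byte

print(f"Balance point: {balance:.0f} FLOP/byte")
print(f"No reuse:      {naive_intensity:.1f} FLOP/byte (memory-bound)")
for T in (64, 128, 256, 1024):
    i = tile_intensity(T)
    bound = "compute-bound" if i >= balance else "memory-bound"
    print(f"Tile T={T:4d}:   {i:.0f} FLOP/byte ({bound})")
```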
[^fn-hbm-origin]: **HBM (High Bandwidth Memory)**: Standardized by JEDEC in 2013 as a joint development between AMD and SK Hynix, originally for graphics cards. ML accelerators adopted HBM because neural networks exhibit the same bandwidth-hungry, capacity-moderate access pattern as high-end rendering. Each HBM generation has roughly doubled bandwidth (128 GB/s in HBM1 to 1.2 TB/s per stack in HBM3e), yet the gap between memory bandwidth and arithmetic throughput continues to widen -- making HBM a necessary but never sufficient response to the Memory Wall. \index{HBM!origin}
@@ -488,7 +499,13 @@ Traditional DDR memory connects to the processor through pins on the edge of a p
The physical distance from the DIMM slot to the processor die is measured in centimeters, and every centimeter of copper trace introduces capacitance, signal attenuation, and energy loss. At DDR5 data rates (4,800--6,400 MT/s per pin), the signal conditioning circuits must compensate for significant channel impairment, consuming substantial power per bit transferred. Increasing the data rate on these long traces requires progressively more power for signal conditioning, creating a diminishing-returns curve that DDR5 is already approaching.
HBM solves this problem by changing the physical topology entirely. Instead of routing signals horizontally across a PCB, HBM stacks multiple DRAM dies *vertically*, one on top of another, and connects them with **Through-Silicon Vias (TSVs)**[^fn-tsv-stacking]: microscopic copper pillars etched through the silicon substrate itself. The vertical stacking represents a fundamental change in memory architecture: rather than increasing bandwidth by pushing signals faster through long copper traces (the DDR approach, which has diminishing returns), HBM increases bandwidth by multiplying the number of parallel signal paths through extremely short vertical connections.
HBM solves this problem by changing the physical topology entirely. Instead of routing signals horizontally across a PCB, HBM stacks multiple DRAM dies *vertically*, one on top of another, and connects them with **Through-Silicon Vias (TSVs)**[^fn-tsv-stacking]: microscopic copper pillars etched through the silicon substrate itself.
::: {#fig-hbm-architecture fig-env="figure" fig-pos="htb" fig-cap="**HBM 3D-Stacked Architecture**. High Bandwidth Memory achieves its performance by vertically stacking DRAM dies and connecting them via Through-Silicon Vias (TSVs). The entire stack sits on a silicon interposer alongside the GPU die, enabling a 1024-bit-wide bus and reducing signal travel distance from centimeters to micrometers. This physical proximity is the key to breaking the Memory Wall." fig-alt="Cross-section diagram of an HBM stack. Multiple DRAM dies are stacked on a logic die, connected by vertical TSVs. The assembly sits on a silicon interposer with microbumps connecting to the GPU die."}
![](images/svg/hbm-architecture.svg){width=100%}
:::
The vertical stacking represents a fundamental change in memory architecture: rather than increasing bandwidth by pushing signals faster through long copper traces (the DDR approach, which has diminishing returns), HBM increases bandwidth by multiplying the number of parallel signal paths through extremely short vertical connections.
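The gain from multiplying parallel signal paths is easy to quantify: per-stack bandwidth is simply interface width times per-pin data rate. The per-pin rates below are approximate generational figures and should be treated as assumptions, though the HBM1 and HBM3e results match the 128 GB/s and ~1.2 TB/s per-stack values cited in the footnote.

```python
# Bandwidth = interface width x per-pin data rate.
# Per-pin rates are approximate generational figures (assumptions).

def stack_bw_gbs(width_bits, pin_rate_gbps):
    return width_bits * pin_rate_gbps / 8   # bits/s -> bytes/s, in GB/s

configs = {
    "DDR5 DIMM (64-bit @ 6.4 Gbps/pin)": (64, 6.4),
    "HBM1 stack (1024-bit @ 1 Gbps/pin)": (1024, 1.0),
    "HBM3e stack (1024-bit @ 9.6 Gbps/pin)": (1024, 9.6),
}
for name, (width, rate) in configs.items():
    print(f"{name:42s} {stack_bw_gbs(width, rate):7.1f} GB/s")
```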
A single HBM stack contains 8--12 DRAM dies. Each die is thinned to approximately 30--50 micrometers (roughly the thickness of a human hair) using a chemical-mechanical polishing process. The thinning is necessary because the TSVs must pass through the full thickness of each die, and thinner dies allow shorter, lower-resistance vias. The thinned dies are then vertically aligned with sub-micrometer precision and bonded to the die below using thermocompression bonding at temperatures of 300--400 degrees Celsius.
@@ -1345,6 +1362,10 @@ With the accelerator's physics established, we face a concrete problem. Our `{py
## The Node {#sec-compute-node}
::: {#fig-node-topology-comparison fig-env="figure" fig-pos="htb" fig-cap="**Accelerator Node Topologies**. Comparison of internal node wiring. (Left) Standard PCIe topology where all GPUs share a single root complex, creating a bottleneck. (Right) NVLink/NVSwitch topology providing all-to-all connectivity at full bandwidth. The NVSwitch fabric transforms 8 independent GPUs into a single virtual accelerator with unified memory access characteristics." fig-alt="Two-panel diagram. Left: GPUs connected via a central PCIe switch. Right: GPUs connected via multiple NVSwitch chips in a non-blocking mesh."}
![](images/svg/node-topology-comparison.svg){width=100%}
:::
\index{Node}
\index{NVLink}
\index{NVSwitch}
@@ -1422,6 +1443,10 @@ class BandwidthHierarchyScenario:
: **The Bandwidth Hierarchy**. Each physical boundary introduces an order-of-magnitude bandwidth cliff. These cliffs are not engineering failures to be optimized away; they reflect fundamental differences in the physics of each interconnect medium. The cliffs dictate model partitioning: Tensor Parallelism, which requires AllReduce after every layer, is strictly confined to the intra-node domain. {#tbl-bandwidth-hierarchy-compute}
::: {#fig-bandwidth-hierarchy fig-env="figure" fig-pos="htb" fig-cap="**The Infrastructure Bandwidth Hierarchy**. Data movement speed drops by orders of magnitude as it moves further from the compute units. The pyramid shows the bandwidth tiers from on-chip SRAM down to the global network. Exploiting this hierarchy through tiered algorithms is the primary task of distributed systems engineering." fig-alt="Pyramid showing bandwidth tiers from SRAM (TB/s) at the tip down to Inter-Pod network (GB/s) at the base."}
![](images/svg/bandwidth-hierarchy.svg){width=100%}
:::
As @tbl-bandwidth-hierarchy-compute shows, parallelism strategies must respect these boundaries. The following notebook quantifies the bandwidth gaps.
```{python}
@@ -1836,6 +1861,50 @@ Electricity flows through a chain of transformations before it reaches a GPU's v
Stage 1: Grid Connection. Utility power arrives at the datacenter as high-voltage AC, typically 13.8--69 kV depending on the country and the size of the facility. A dedicated substation or transformer yard steps this down to medium voltage (480 V AC in North America, 400 V AC in Europe). The largest ML facilities require their own substation, which takes 18--24 months to build and requires coordination with the local utility. The grid connection is the ultimate bottleneck: no amount of engineering inside the building can deliver more power than the grid provides.
```{python}
#| label: grid-queue-calc
#| echo: false
from mlsysim.fmt import check
class GridQueue:
"""The lead-time bottleneck of the ML fleet."""
# ┌── 1. LOAD ──────────────────────────────────────────
gpu_lead_time_months = LEAD_TIME_GPU_MONTHS
substation_lead_time_months = LEAD_TIME_SUBSTATION_MONTHS
grid_queue_size_gw = GRID_INTERCONNECTION_QUEUE_US_GW
# ┌── 2. EXECUTE ───────────────────────────────────────
lead_time_ratio = substation_lead_time_months / gpu_lead_time_months
# ┌── 3. GUARD ─────────────────────────────────────────
check(lead_time_ratio == 4, f"Ratio {lead_time_ratio} unexpected")
# ┌── 4. OUTPUT ────────────────────────────────────────
ratio_str = f"{lead_time_ratio:.0f}"
queue_str = f"{grid_queue_size_gw}"
@classmethod
def plot(cls):
"""Visualizes the Interconnection Queue."""
from mlsysim import viz
return viz.bar_compare(
labels=["GPU Silicon", "Grid Substation"],
values=[cls.gpu_lead_time_months, cls.substation_lead_time_months],
title="Fleet Deployment Lead Times",
ylabel="Lead Time (Months)",
goal_line=6 # Target turnaround for logic layer
)
```
::: {.callout-notebook title="Napkin Math: The Interconnection Queue"}
**Problem**: Compare the deployment timeline for 10,000 GPUs versus the electrical substation required to power them (7 MW).
1. **Silicon Path**: GPU supply chains are volatile, but typical enterprise lead times are **`{python} GridQueue.gpu_lead_time_months` months**.
2. **Infrastructure Path**: Permitting, EPC (Engineering, Procurement, Construction), and grid connection for a new 10+ MW substation averages **`{python} GridQueue.substation_lead_time_months` months**.
3. **The Lag**: Infrastructure takes **`{python} GridQueue.ratio_str`$\times$ longer** to deploy than the accelerators themselves.
**The Systems Insight**: In the era of the ML Fleet, the primary bottleneck is not the **Supply Chain** of silicon, but the **Interconnection Queue** of the grid. As of 2024, there are over **`{python} GridQueue.queue_str` GW** of capacity waiting for grid connection in the US alone. An engineer who optimizes for GPU utilization without a 2-year power roadmap will find their fleet "electrically stranded"—expensive silicon sitting in a dark building waiting for a transformer.
:::
Stage 2: UPS and Power Conditioning. An Uninterruptible Power Supply (UPS) sits between the utility feed and the IT equipment. The UPS serves two functions: it conditions the incoming power (removing voltage fluctuations and frequency variations) and provides battery backup during brief outages. Modern online (double-conversion) UPS systems convert AC to DC, charge a battery bank, and then convert back to AC, ensuring clean power but losing 3--5% efficiency. Newer high-efficiency "eco-mode" UPS designs bypass the double conversion during normal operation, achieving 98--99% efficiency but providing slightly less protection against input power anomalies.
Stage 3: Power Distribution Unit (PDU). The PDU distributes power from the UPS to individual racks. In traditional datacenters, the PDU provides the final AC-to-AC step-down (from 480 V to 208/240 V for servers).
@@ -1975,6 +2044,10 @@ The operational procedures for immersion-cooled facilities differ sharply from a
: **The Shift to Liquid Cooling**. At rack power densities above 30 kW, air cooling requires fan power that approaches the power consumed by the GPUs themselves. Liquid cooling is not a premium option; it is a thermodynamic requirement for modern ML racks. {#tbl-cooling-limits}
::: {#fig-cooling-comparison fig-env="figure" fig-pos="htb" fig-cap="**Cooling Architectures: Air vs. Liquid**. Comparison of thermal management strategies for high-density ML racks. Air cooling (left) requires massive airflow and large fan overhead, reaching a physical limit near 30 kW per rack. Liquid cooling (right) uses direct-to-chip cold plates and high-thermal-capacity coolant, enabling densities exceeding 100 kW per rack with significantly lower PUE." fig-alt="Two-panel diagram. Left: Server rack with large fans and blue/red air arrows. Right: Server rack with thin pipes and blue/red liquid arrows leading to a heat exchanger."}
![](images/svg/cooling-comparison.svg){width=100%}
:::
As @tbl-cooling-limits illustrates, the capital cost of these cooling technologies spans an order of magnitude. Standard air cooling infrastructure costs \$2,000--5,000 per rack (fans, CRAC units, raised floor tiles). Direct-to-chip liquid cooling costs \$15,000--25,000 per rack (cold plates, manifolds, CDUs, piping). Full immersion cooling costs \$30,000--50,000 per tank (dielectric fluid, sealed tanks, specialized heat exchangers). The break-even analysis between air and liquid cooling depends on rack power density: at 20 kW per rack, air cooling's lower CapEx wins over a 3-year lifecycle. At 40 kW per rack, the electricity savings from liquid cooling's lower PUE (1.08 vs. 1.5) offset the higher CapEx within 18--24 months. At 60+ kW per rack -- the regime of modern ML infrastructure -- air cooling is physically impossible, making the comparison moot. For our `{python} InfraSetup.frontier_params_b`B model's 128-rack training cluster at `{python} RackPowerScenario.rack_power_str` kW per rack, direct-to-chip liquid cooling is the standard choice, balancing density, serviceability, and cost. Immersion cooling offers marginal PUE improvement (1.03 vs. 1.08) but introduces operational complexity that most organizations find unjustified at current rack densities.
::: {.callout-notebook title="The Cooling Tax"}
@@ -2062,6 +2135,10 @@ The rack concentrates power and heat into a physical volume where thermodynamics
## The Pod {#sec-compute-pod}
::: {#fig-pod-network-topology fig-env="figure" fig-pos="htb" fig-cap="**Datacenter Pod Topology**. A pod aggregates multiple racks into a single non-blocking network domain. By using a fat-tree or Clos topology with high-radix switches, the pod ensures that any GPU can communicate with any other GPU at full line rate, providing the structural bisection bandwidth required for large-scale gradient synchronization." fig-alt="Diagram showing multiple racks connected to a central spine switch layer. Racks are grouped into pods, with redundant links providing multiple paths between any two racks."}
![](images/svg/pod-network-topology.svg){width=100%}
:::
\index{Pod}
\index{Warehouse-Scale Computer}
@@ -2547,6 +2624,10 @@ For our `{python} InfraSetup.frontier_params_b`B model, a minimum viable trainin
The most consequential infrastructure decision is whether to build an on-premises cluster or rent capacity from a cloud provider. This decision depends primarily on one variable: sustained utilization.
::: {#fig-tco-build-vs-buy fig-env="figure" fig-pos="htb" fig-cap="**TCO: Build vs. Buy**. Comparison of cumulative cost over time for owned infrastructure (on-premises) versus rented infrastructure (cloud) at different utilization rates. Owned infrastructure requires high upfront CapEx but has lower marginal OpEx, leading to a break-even point typically between 18 and 24 months at high utilization (>70%). At low utilization (<30%), cloud instances are almost always more cost-effective." fig-alt="Plot showing cumulative cost over 36 months. Cloud lines are straight with slopes varying by utilization. On-prem line starts with high CapEx and has a shallower slope. Cross-over points are highlighted."}
![](images/svg/tco-build-vs-buy.svg){width=100%}
:::
```{python}
#| label: tco-scenario
#| echo: false


@@ -189,64 +189,7 @@ No single accelerator has enough memory or compute to train a frontier model. To
The defining characteristic of a node is its **Bandwidth Hierarchy**. As data moves from the silicon die to the board and finally to the network, it incurs a "Bandwidth Tax." As @fig-bandwidth-hierarchy-expanded illustrates, bandwidth drops by orders of magnitude at each physical boundary.
::: {#fig-bandwidth-hierarchy-expanded fig-env="figure" fig-pos="htb" fig-cap="**The Bandwidth Hierarchy**. In a modern ML node, bandwidth drops by orders of magnitude as data crosses physical boundaries. Tensor Parallelism is confined to the NVLink domain to avoid the 'InfiniBand Cliff'." fig-alt="Vertical bar chart showing bandwidth levels: HBM (3.3 TB/s), NVLink (900 GB/s), PCIe/InfiniBand (50-64 GB/s). Each step represents a physical boundary."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
bar/.style={
draw=BlueLine,
line width=0.75pt,
text=black,
font=\bfseries,
align=center,
anchor=south
},
hbm/.style={bar, fill=BlueL, minimum width=1cm, minimum height=4.5cm},
nvlink/.style={bar, fill=BlueLine!40, minimum width=1cm, minimum height=3.5cm},
ib/.style={bar, fill=BlueLine!60, minimum width=1cm, minimum height=1.5cm},
label/.style={
anchor=north,
font=\small,
align=center
},
drop arrow/.style={
->,
draw=RedLine,
line width=1.2pt,
shorten >=2pt,
shorten <=2pt
},
drop text/.style={
above,
sloped,
font=\small,
text=black
}
}
% Axes
\draw[->, line width=1.0pt] (0,0) -- (0,5) node[above] {Log Bandwidth (GB/s)};
\draw[line width=1.0pt] (0,0) -- (6,0);
% Bars using positioning
\begin{scope}[node distance=1cm]
\node[hbm] (hbm) at (1,0) {\rotatebox{90}{HBM (3350)}};
\node[label, below=2pt of hbm.south] {On-Package};
\node[nvlink, right=of hbm] (nvlink) {\rotatebox{90}{NVLink (900)}};
\node[label, below=2pt of nvlink.south] {Intra-Node};
\node[ib, right=of nvlink] (ib) {\rotatebox{90}{IB/PCIe (50-64)}};
\node[label, below=2pt of ib.south] {Inter-Node};
\end{scope}
% Arrows showing the drop
\draw[drop arrow] (hbm.east |- 0, 4.0) -- (nvlink.west |- 0, 3.5) node[midway, drop text] {~4$\times$ Drop};
\draw[drop arrow] (nvlink.east |- 0, 3.0) -- (ib.west |- 0, 1.5) node[midway, drop text] {~15$\times$ Drop};
\end{tikzpicture}
```
![](images/svg/bandwidth-hierarchy.svg){width=100%}
:::
As @fig-bandwidth-hierarchy-expanded illustrates, this tax determines the **Distributed Strategy**. **Tensor Parallelism**, which requires synchronizing partial sums after every matrix multiplication, is physically confined to the high-bandwidth domain (NVLink). If we attempted to run Tensor Parallelism across nodes using standard networking, the GPUs would spend 90% of their time waiting for the "InfiniBand Cliff," rendering the extra compute useless.
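The "InfiniBand Cliff" can be made concrete with a per-layer estimate. Tensor Parallelism issues an AllReduce on the layer's activations after each matrix multiply; the sketch below uses illustrative transformer dimensions together with the NVLink and InfiniBand bandwidths from the chart above. Every numeric input is an assumption chosen for illustration, not a chapter constant.

```python
# Per-layer Tensor Parallel AllReduce: NVLink vs. InfiniBand.
# Transformer dimensions below are illustrative assumptions, not chapter values.

BATCH, SEQ, HIDDEN = 8, 2048, 8192
BYTES = 2                      # FP16 activations
TP = 8                         # tensor-parallel group size

payload_gb = BATCH * SEQ * HIDDEN * BYTES / 1e9      # ~0.27 GB per AllReduce
ring_factor = 2 * (TP - 1) / TP                      # ring AllReduce traffic factor

for name, bw_gbs in [("NVLink (900 GB/s)", 900), ("InfiniBand (50 GB/s)", 50)]:
    t_ms = ring_factor * payload_gb / bw_gbs * 1e3
    print(f"{name:22s} {t_ms:6.2f} ms per layer AllReduce")
```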


@@ -68,7 +68,7 @@ from mlsysim.core.constants import (
STORAGE_COST_S3_STD, STORAGE_COST_GLACIER,
STORAGE_COST_NVME_LOW, STORAGE_COST_NVME_HIGH,
Mparam, Bparam, TFLOPs, GFLOPs,
watt
watt, SYNTHETIC_PROVENANCE_OVERHEAD, SYNTHETIC_VERIFICATION_PASSES
)
from mlsysim.fmt import fmt, sci, check, md
@@ -1542,6 +1542,55 @@ The complete storage picture for our running example, a 30-day training run of a
:::
## The Synthetic Fuel Line {#sec-storage-synthetic-fuel}
As the fleet consumes the sum of human-generated text, it encounters the **Data Wall**: the point where new high-quality training samples can only be generated by the fleet itself. This shift from *collecting* data to *synthesizing* data fundamentally changes the storage architecture. A "Synthetic Fuel Line" does not just stream data; it must store the **Provenance Chain** of every sample to prevent "Model Collapse" (where a model degrades by training on its own unverified errors).
```{python}
#| label: synthetic-amplification-calc
#| echo: false
from mlsysim.fmt import check
class SyntheticStorage:
"""The systems tax of the Synthetic Fuel Line."""
# ┌── 1. LOAD ──────────────────────────────────────────
data_size_tb = 1
provenance_metadata_overhead = SYNTHETIC_PROVENANCE_OVERHEAD
verification_passes = SYNTHETIC_VERIFICATION_PASSES
# ┌── 2. EXECUTE ───────────────────────────────────────
total_footprint_tb = data_size_tb * (1 + provenance_metadata_overhead) * verification_passes
amplification = total_footprint_tb / data_size_tb
# ┌── 3. GUARD ─────────────────────────────────────────
check(amplification == 4.2, f"Amp {amplification} unexpected")
# ┌── 4. OUTPUT ────────────────────────────────────────
amp_str = f"{amplification:.1f}"
total_tb_str = f"{total_footprint_tb:.1f}"
@classmethod
def plot(cls):
"""Visualizes the Synthetic Tax."""
from mlsysim import viz
return viz.bar_compare(
labels=["Human (Raw)", "Synthetic (Verified)"],
values=[cls.data_size_tb, cls.total_footprint_tb],
title="Data Wall Storage Amplification",
ylabel="Dataset Footprint (TB)"
)
```
::: {.callout-notebook title="Napkin Math: The Synthetic Tax"}
**Problem**: Calculate the storage amplification of a 1 TB synthetic dataset that requires cryptographic lineage and multi-model verification.
1. **Raw Payload**: 1 TB.
2. **Provenance Overhead**: `{python} int(SyntheticStorage.provenance_metadata_overhead * 100)`% extra for lineage hashes, generation logs, and reward-model scores.
3. **Verification Factor**: To avoid "Self-Poisoning," each sample is verified by **`{python} SyntheticStorage.verification_passes` independent "Judge" models**.
4. **The Amplification**: Total footprint = 1 TB $\times$ `{python} 1 + SyntheticStorage.provenance_metadata_overhead` $\times$ `{python} SyntheticStorage.verification_passes` = **`{python} SyntheticStorage.total_tb_str` TB**.
**The Systems Insight**: Synthetic data is **Verified Data**, and verification has a **`{python} SyntheticStorage.amp_str`$\times$ Storage Tax**. In the Machine Learning Fleet, storage moves from being a simple bit-bucket to a **Provenance Engine**. If you cannot store the *why* and *who* behind a synthetic token, you risk poisoning the future of the fleet with its own past mistakes.
:::
## Summary {#sec-storage-summary}
Storage in ML systems is not a passive repository; it is an active, multi-tiered pipeline whose sole purpose is to keep accelerator HBM populated with data. The hierarchy spanning HBM, host DRAM, local NVMe, parallel file systems, object storage, and cold archive exists because no single technology can simultaneously deliver the bandwidth, capacity, and cost profile that large-scale training demands. A 300,000$\times$ bandwidth gap separates the fastest tier from the slowest, and each intermediate tier serves as a staging buffer that absorbs the mismatch between the rate at which accelerators consume data and the rate at which persistent storage can supply it. The data pipeline throughput equation, $BW_{\text{required}} = N_{GPUs} \times U_{target} \times S_{batch} / T_{iteration}$, provides the quantitative foundation for sizing every tier: miss the required bandwidth at any level and expensive accelerators idle; over-provision and capital is wasted on storage capacity that sits underutilized.
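Plugging representative numbers into the throughput equation shows why every tier must be sized deliberately. All inputs in the sketch below are illustrative assumptions, not values from the running example.

```python
# Sizing the data pipeline with the throughput equation from the summary:
#   BW_required = N_GPUs * U_target * S_batch / T_iteration
# All inputs are illustrative assumptions.

N_GPUS = 1024
U_TARGET = 0.95            # target accelerator utilization
SAMPLES_PER_GPU = 32
BYTES_PER_SAMPLE = 1e6     # ~1 MB per preprocessed sample
T_ITERATION = 0.5          # seconds per training step

s_batch = SAMPLES_PER_GPU * BYTES_PER_SAMPLE
bw_required = N_GPUS * U_TARGET * s_batch / T_ITERATION

print(f"Required aggregate read bandwidth: {bw_required/1e9:.1f} GB/s")
```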


@@ -221,6 +221,27 @@ A useful mental model frames these distributed strategies as *loop transformatio
If we view the training process as a massive loop over data and layers, distributed strategies are simply **Loop Transformations** applied by the cluster-level compiler:
```{mermaid}
%%| label: fig-loop-transforms
%%| fig-cap: "**Distributed Strategies as Loop Transformations**. Visualizing how DP, TP, and PP map to standard compiler optimizations applied at cluster scale."
flowchart LR
subgraph Logical[Logical Training Loop]
direction TB
L1["for epoch in epochs:"] --> L2["for batch in data:"]
L2 --> L3["for layer in layers:"]
L3 --> L4["Compute(layer, batch)"]
end
subgraph Physical[Physical Implementation]
direction TB
DP[Data Parallelism] -.->|Unroll Batch Loop| GPU_Batch
TP[Tensor Parallelism] -.->|Vectorize Layer Ops| GPU_Ops
PP[Pipeline Parallelism] -.->|Pipeline Layer Sequence| GPU_Stages
end
Logical ==> Physical
```
* **Data Parallelism = Parallel For Loop.** We unroll the outer loop (batch dimension) across devices. Each device runs the same code body on different data indices.
* **Tensor Parallelism = Vectorization (SIMD).** We split the inner loops (matrix multiplication) across devices. This is "Cluster-Scale SIMD," where NVLink acts as the vector register file.
* **Pipeline Parallelism = Instruction Pipelining.** We split the sequential operations (layers) across devices. Just as a CPU pipeline stages fetch/decode/execute, the cluster stages Layer 1/Layer 2/Layer 3 to keep all ALUs busy.
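The loop-transformation analogy above can be written down directly. The sketch below is a toy, single-process illustration showing that unrolling the batch loop (data parallelism) and staging the layer sequence (pipeline parallelism) are pure re-orderings of the same logical loop; tensor parallelism, which would additionally split the arithmetic inside each layer, is omitted for brevity. It uses plain Python, not any framework API.

```python
# Toy check that DP and PP are re-orderings of the same logical loop.
# Pure Python, no framework; "layers" are element-wise adds for simplicity.

def layer(x, shift):                 # stand-in for a model layer
    return [v + shift for v in x]

SHIFTS = [1, 2, 3]                   # three "layers"
BATCH = [[0, 0], [10, 10], [20, 20], [30, 30]]

# Logical loop: for each sample, apply every layer in order.
def logical(batch):
    out = []
    for x in batch:
        for s in SHIFTS:
            x = layer(x, s)
        out.append(x)
    return out

# Data Parallelism: unroll the batch loop across two "devices", then concatenate.
def data_parallel(batch):
    shard0, shard1 = batch[:2], batch[2:]
    return logical(shard0) + logical(shard1)

# Pipeline Parallelism: stage 0 owns layers 0-1, stage 1 owns layer 2;
# activations flow stage-to-stage instead of staying on one device.
def pipeline(batch):
    stage0 = [layer(layer(x, SHIFTS[0]), SHIFTS[1]) for x in batch]
    return [layer(x, SHIFTS[2]) for x in stage0]

assert logical(BATCH) == data_parallel(BATCH) == pipeline(BATCH)
print("DP and PP reproduce the logical loop exactly; only the schedule changes.")
```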
@@ -250,27 +271,8 @@ Although modern frameworks abstract away much of the complexity through sharded
The distributed training process itself involves splitting the dataset into non-overlapping subsets, assigning each subset to a different GPU, and performing forward and backward passes independently on each device. Once gradients are computed on each GPU, they are synchronized and aggregated before updating the model parameters, ensuring that all devices learn in a consistent manner. The coordinated flow of data splitting, computation, and gradient synchronization (@fig-distributed-training) forms the foundation of distributed training, with each GPU processing its batch independently before synchronization brings all gradients together.
::: {#fig-distributed-training fig-env="figure" fig-pos="htb" fig-cap="**Data Parallel Training Flow**. Distributed training partitions datasets across GPUs, computes gradients concurrently on each device's data subset, then aggregates gradients through AllReduce to update shared model parameters. Each GPU maintains an identical model copy and processes its portion of the batch independently, with synchronization occurring only during gradient aggregation. This approach achieves near-linear speedup when communication overhead remains below 30--40% of training time." fig-alt="Two parallel GPU workflows showing data parallel training. Each GPU processes a data chunk through forward pass, error computation, loss function, backward pass, then gradients merge at Calculate Global Gradients for parameter updates."}
```{.tikz}
\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}\small]
\tikzset{%
mycylinder/.style={cylinder, shape border rotate=90, aspect=1.3, draw, fill=white,
minimum width=20mm,minimum height=9mm,line width=1pt},
mycycle/.style={circle, draw=none, fill=red, minimum width=5mm},
myline/.style={line width=1.15pt,draw=cyan},
%
Box/.style={align= flush center,
inner xsep=2pt,
draw=RedLine,
line width=0.75pt,
fill=RedL!20,
text width=22mm,
minimum width=22mm, minimum height=8mm
},
%
Line/.style={line width=1.0pt,black!50}
}
\begin{scope}[node distance=-1.7,local bounding box = SC1]]
\node[mycylinder,fill=red!30] (A) {};
\scoped[on background layer]
```
![](images/svg/_data-parallel-pipeline.svg){width=100%}
:::
@@ -695,6 +697,10 @@ The key trade-offs across synchronization models are summarized here.
:::
::: {#fig-sync-model-timeline fig-env="figure" fig-pos="htb" fig-cap="**Distributed Synchronization Models**. Timeline comparison of three synchronization strategies. (A) Bulk Synchronous Parallel (BSP) forces all workers to wait at a global barrier every step. (B) Stale Synchronous Parallel (SSP) allows workers to proceed up to $s$ steps ahead of the slowest worker. (C) Asynchronous SGD eliminates barriers entirely, allowing maximum throughput but introducing gradient staleness." fig-alt="Timeline diagram with three panels. Top: BSP with global barriers. Middle: SSP with bounded staleness slack. Bottom: Asynchronous with independent workers and overlapping steps."}
![](images/svg/sync-model-timeline.svg){width=100%}
:::
The choice of synchronization model directly affects both system throughput and model convergence. Production systems typically use BSP for final training runs to ensure reproducibility, while exploring SSP or async approaches during hyperparameter search where exact reproducibility is less critical.
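A small simulation makes the cost of the BSP barrier visible: every step runs at the speed of the slowest worker, so the gap between the mean and the maximum worker time is pure idle time. The worker-time distribution below is an arbitrary illustrative choice.

```python
# Straggler cost under Bulk Synchronous Parallel (BSP): each step runs at the
# speed of the slowest worker. Worker-time distribution is illustrative.
import random

random.seed(0)
WORKERS, STEPS = 64, 1000
MEAN_MS, JITTER_MS = 100.0, 10.0     # per-worker step time: mean and spread

bsp_total = 0.0
ideal_total = 0.0
for _ in range(STEPS):
    times = [random.gauss(MEAN_MS, JITTER_MS) for _ in range(WORKERS)]
    bsp_total += max(times)               # barrier: wait for the slowest worker
    ideal_total += sum(times) / WORKERS   # barrier-free average for comparison

print(f"BSP step time:    {bsp_total/STEPS:6.1f} ms")
print(f"Mean worker time: {ideal_total/STEPS:6.1f} ms")
print(f"Straggler tax:    {100*(bsp_total/ideal_total - 1):4.1f}%")
```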
#### Barrier Semantics and Failure Modes {#sec-distributed-training-systems-systems-barrier-semantics-failure-modes-5c94}
@@ -1015,57 +1021,8 @@ $$ M_{\text{ZeRO3}} = \frac{112 \text{ GB}}{64} \approx \mathbf{1.75 \text{ GB}}
ZeRO addresses this redundancy through progressive sharding:
::: {.callout-note title="Figure: ZeRO Memory Partitioning"}
![](images/svg/zero-memory-partitioning.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, xscale=1.5]
\definecolor{ParamColor}{RGB}{200,220,255}
\definecolor{GradColor}{RGB}{255,220,200}
\definecolor{OptColor}{RGB}{220,255,200}
\usetikzlibrary{positioning}
\tikzset{
bar/.style={draw=black!70, thick, minimum width=1.2cm, minimum height=0.5cm, font=\scriptsize, text=black!80, align=center},
label/.style={font=\scriptsize, text=black!80},
desc/.style={font=\tiny, text=gray, anchor=north west},
}
% DDP (Replicated)
\node[label=DDP] (ddp) {};
\node[bar, fill=ParamColor, anchor=north] (ddp-p) at (ddp.south) {P};
\node[bar, fill=GradColor, anchor=north] (ddp-g) at (ddp-p.south) {G};
\node[bar, fill=OptColor, anchor=north, minimum height=2cm] (ddp-os) at (ddp-g.south) {OS};
\node[below, font=\tiny] at (ddp-os.south) {Replicated};
% ZeRO-1
\node[label=ZeRO-1, right=of ddp] (zero1) {};
\node[bar, fill=ParamColor, anchor=north] (zero1-p) at (zero1.south) {P};
\node[bar, fill=GradColor, anchor=north] (zero1-g) at (zero1-p.south) {G};
\node[bar, fill=OptColor, anchor=north, minimum height=0.25cm] (zero1-os) at (zero1-g.south) {OS/N};
\node[below, font=\tiny] at (zero1-os.south) {Shard OS};
% ZeRO-2
\node[label=ZeRO-2, right=of zero1] (zero2) {};
\node[bar, fill=ParamColor, anchor=north] (zero2-p) at (zero2.south) {P};
\node[bar, fill=GradColor, anchor=north, minimum height=0.1cm] (zero2-g) at (zero2-p.south) {G/N};
\node[bar, fill=OptColor, anchor=north, minimum height=0.25cm] (zero2-os) at (zero2-g.south) {OS/N};
\node[below, font=\tiny] at (zero2-os.south) {+Shard G};
% ZeRO-3
\node[label=ZeRO-3, right=of zero2] (zero3) {};
\node[bar, fill=ParamColor, anchor=north, minimum height=0.1cm] (zero3-p) at (zero3.south) {};
\node[bar, fill=GradColor, anchor=north, minimum height=0.1cm] (zero3-g) at (zero3-p.south) {};
\node[bar, fill=OptColor, anchor=north, minimum height=0.25cm] (zero3-os) at (zero3-g.south) {};
\node[anchor=west, font=\tiny] at (zero3-os.south west) {All/N};
\node[below, font=\tiny] at (zero3-os.south) {+Shard P};
% Descriptions
\node[desc] (desc1) at (4, 3) {P: Parameters};
\node[desc, below=3pt of desc1] {G: Gradients};
\node[desc, below=3pt of desc1.south] {OS: Optimizer States};
\end{tikzpicture}
```
**ZeRO Memory Reduction**. Standard Data Parallelism (DDP) replicates all model states across every GPU. ZeRO progressively partitions these states: ZeRO-1 shards optimizer states, ZeRO-2 adds gradient sharding, and ZeRO-3 shards the parameters themselves. ZeRO-3 achieves linear memory scaling, enabling models with 100B+ parameters to fit on commodity hardware.
:::
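The progressive sharding in the figure can be reproduced with the standard mixed-precision accounting of 16 bytes per parameter (2 B FP16 weights, 2 B FP16 gradients, 12 B FP32 optimizer states). The 7B-parameter model size below is an assumption chosen because it reproduces the 112 GB and ~1.75 GB endpoints used earlier; it is not stated explicitly in the text.

```python
# ZeRO memory accounting per GPU, using the standard 2+2+12 bytes/parameter
# mixed-precision breakdown. 7B parameters and 64 GPUs are assumed to match
# the 112 GB -> 1.75 GB example above.

PARAMS = 7e9
N = 64
P_BYTES, G_BYTES, OS_BYTES = 2, 2, 12       # weights, gradients, optimizer states

def gb(x):
    return x / 1e9

ddp   = (P_BYTES + G_BYTES + OS_BYTES) * PARAMS
zero1 = (P_BYTES + G_BYTES) * PARAMS + OS_BYTES * PARAMS / N
zero2 = P_BYTES * PARAMS + (G_BYTES + OS_BYTES) * PARAMS / N
zero3 = (P_BYTES + G_BYTES + OS_BYTES) * PARAMS / N

for name, mem in [("DDP (replicated)", ddp), ("ZeRO-1", zero1),
                  ("ZeRO-2", zero2), ("ZeRO-3", zero3)]:
    print(f"{name:18s} {gb(mem):7.2f} GB per GPU")
```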
@@ -1379,54 +1336,7 @@ The scaling law regime exhibits three distinct behaviors:
@fig-critical-batch-size illustrates this relationship between batch size and training efficiency.
::: {#fig-critical-batch-size fig-env="figure" fig-pos="htb" fig-cap="**Critical Batch Size and Scaling Regimes**. Below the critical batch size $B*$, larger batches reduce noise and improve sample efficiency (linear scaling regime). Above $B*$, larger batches provide diminishing returns: while throughput increases, total samples required also increases, reducing sample efficiency. The optimal operating point balances hardware utilization against convergence efficiency." fig-alt="Graph showing sample efficiency versus batch size. Efficiency is flat in the linear regime below B-star, then decreases in the diminishing returns regime above B-star. Vertical dashed line marks critical batch size."}
```{.tikz}
\begin{tikzpicture}[
font=\small\usefont{T1}{phv}{m}{n},
node distance=1cm and 1cm,
BlueLine/.style={blue, ultra thick},
RedLine/.style={red, dashed, thick},
Annotation/.style={font=\footnotesize},
every node/.style={anchor=center},
]
\begin{axis}[
width=12cm,
height=7cm,
xlabel={Batch Size $B$ (log scale)},
ylabel={Sample Efficiency (samples to target loss)},
xmode=log,
ymode=log,
xmin=100, xmax=1000000,
ymin=0.1, ymax=2,
ytick={0.2, 0.5, 1.0, 2.0},
yticklabels={0.2$\times$, 0.5$\times$, 1.0$\times$, 2.0$\times$},
grid=major,
legend pos=north east,
legend style={font=\footnotesize},
]
% Linear scaling regime (flat)
\addplot[BlueLine, domain=100:8000] {1.0};
% Transition region
\addplot[BlueLine, domain=8000:16000, samples=50] {1.0 * (1 + 0.5*((x-8000)/8000)^2)};
% Diminishing returns regime
\addplot[BlueLine, domain=16000:1000000, samples=50] {1.5 * (x/16000)^0.3};
% Critical batch size line
\addplot[RedLine] coordinates {(12000, 0.1) (12000, 2)};
% Annotations
\node[Annotation, anchor=south] (linear) at (axis cs:1000, 1.1) {Linear Scaling};
\node[Annotation, anchor=south] (diminishing) at (axis cs:100000, 1.8) {Diminishing Returns};
\node[Annotation, anchor=west, RedLine] (critical) at (axis cs:14000, 0.15) {$B*$};
\end{axis}
\end{tikzpicture}
```
![](images/svg/_critical-batch-size.svg){width=100%}
:::
The critical batch size has important implications for distributed training system design:
@@ -1579,58 +1489,7 @@ Despite higher parallelism, 64-GPU training costs more per run due to communicat
The fundamental trade-off in distributed training is between communication efficiency and convergence quality. @fig-comm-convergence-tradeoff visualizes this trade-off space.
::: {#fig-comm-convergence-tradeoff fig-env="figure" fig-pos="htb" fig-cap="**Communication-Convergence Trade-off Space**. Each point represents a different distributed training configuration. The Pareto frontier (dashed line) shows optimal configurations where improving one metric requires sacrificing the other. BSP sits at high convergence quality but lower throughput; ASP provides maximum throughput at convergence cost. Gradient compression and SSP occupy intermediate positions." fig-alt="Scatter plot with Communication Efficiency on x-axis and Convergence Quality on y-axis. Points for BSP, SSP, ASP, and gradient compression methods form a Pareto frontier from upper-left to lower-right."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
point/.style={only marks, mark=*, mark size=4pt},
label/.style={anchor=west, font=\footnotesize},
grid/.style={grid=major},
line/.style={solid, line width=1.0pt},
pareto/.style={black, dashed, thick},
labelText/.style={font=\footnotesize},
nodeStyle/.style={anchor=west, font=\footnotesize},
pointStyle/.style={only marks, mark=*, mark size=4pt},
}
% Define colors consistently
\colorlet{BlueL}{blue}
\colorlet{OrangeL}{orange}
\colorlet{RedL}{red}
\colorlet{GreenL}{green!60!black}
\colorlet{PurpleL}{purple}
\begin{axis}[
width=11cm,
height=8cm,
xlabel={Communication Efficiency (throughput)},
ylabel={Convergence Quality (final loss)},
xmin=0, xmax=100,
ymin=0, ymax=100,
xtick={0, 25, 50, 75, 100},
xticklabels={Low, , Medium, , High},
ytick={0, 25, 50, 75, 100},
yticklabels={Poor, , Medium, , Optimal},
grid=major,
]
% Pareto frontier
\addplot[pareto, domain=15:95, samples=50] {100 - 0.8*(x-15) - 0.005*(x-15)^2};
% Methods as points
\node[pointStyle, BlueL, label=right:BSP] at (axis cs:20, 98) {};
\node[pointStyle, OrangeL, label=right:SSP ($s$=4)] at (axis cs:50, 85) {};
\node[pointStyle, RedL, label=left:ASP] at (axis cs:90, 65) {};
\node[pointStyle, GreenL, label=above:Gradient Compression] at (axis cs:60, 92) {};
\node[pointStyle, PurpleL, label=below:Local SGD] at (axis cs:75, 78) {};
% Annotation
\node[labelText, anchor=north east] at (axis cs:95, 20) {Pareto Frontier};
\end{axis}
\end{tikzpicture}
```
![](images/svg/_comm-convergence-tradeoff.svg){width=100%}
:::
Several techniques occupy different positions on this trade-off curve:
@@ -1647,7 +1506,13 @@ The choice among these methods depends on the specific bottleneck. When network
When a model is so massive that even a single layer's weights exceed the memory capacity of a GPU, data parallelism entirely collapses. The memory optimization techniques examined in the previous section extend data parallelism's reach, but eventually, we must partition the model itself.
Even with ZeRO-3 fully deployed, sharding optimizer states, gradients, and parameters across workers, some architectures remain intractable. A `{python} FrontierTrainingContext.frontier_params_b_str`B parameter model using FSDP across 64 GPUs still requires 700 GB / 64 = 11 GB of parameters per GPU before accounting for activations. For long-context transformers where activation memory dominates, a 2048-token sequence through `{python} FrontierTrainingContext.frontier_params_b_str`B parameters generates 200+ GB of intermediate activations, and no amount of optimizer sharding addresses this constraint. Model parallelism addresses these limitations by splitting the model architecture itself across devices, rather than replicating it with sharded state.
Even with ZeRO-3 fully deployed, sharding optimizer states, gradients, and parameters across workers, some architectures remain intractable. A `{python} FrontierTrainingContext.frontier_params_b_str`B parameter model using FSDP across 64 GPUs still requires 700 GB / 64 = 11 GB of parameters per GPU before accounting for activations.
::: {#fig-model-parallel-flow fig-env="figure" fig-pos="htb" fig-cap="**Model Parallelism Data Flow**. High-level comparison of the two primary model partitioning strategies. (Top) Pipeline Parallelism partitions layers vertically across GPUs, passing activations sequentially. (Bottom) Tensor Parallelism partitions individual operations horizontally within layers, requiring frequent synchronization. These strategies are often combined in hybrid 3D parallelism to handle the largest frontier models." fig-alt="Two-panel comparison. Top: Vertical partitions labeled Stage 1 through N. Bottom: Horizontal partitioning of an operation into two GPU workers."}
![](images/svg/_model-parallel-flow.svg){width=100%}
:::
For long-context transformers where activation memory dominates, a 2048-token sequence through `{python} FrontierTrainingContext.frontier_params_b_str`B parameters generates 200+ GB of intermediate activations, and no amount of optimizer sharding addresses this constraint. Model parallelism addresses these limitations by splitting the model architecture itself across devices, rather than replicating it with sharded state.
```{python}
#| label: a100-capacity-context
@@ -1832,75 +1697,7 @@ Pipeline parallelism extends layer-wise partitioning by introducing microbatchin
As @fig-pipline-parallelism shows, each device, as represented by the rows in the drawing, processes its assigned model layers for different microbatches simultaneously. The forward pass involves devices passing activations to the next stage, such as $F_{0,0}$ to $F_{1,0}$. The backward pass transfers gradients back through the pipeline, such as $B_{3,3}$ to $B_{2,3}$. This overlapping computation reduces idle time and increases throughput while maintaining the logical sequence of operations across devices.
::: {#fig-pipline-parallelism fig-cap="**Pipeline Parallelism Schedule**. A 4-stage pipeline processing 4 microbatches, showing forward passes ($F_{i,j}$) and backward passes ($B_{i,j}$) across time. Rows represent pipeline stages (GPUs), columns represent time steps. The staggered execution keeps all devices active: while stage 0 computes $F_{0,1}$, stage 1 processes $F_{1,0}$ from the previous microbatch. After all forward passes complete, backward passes propagate in reverse order. The \"Update\" column shows synchronized parameter updates after gradient accumulation across all microbatches." fig-alt="Pipeline schedule grid showing 4 stages processing 4 microbatches. Forward passes F stagger diagonally across time, backward passes B follow in reverse order, ending with synchronized Update column."}
```{.tikz}
\begin{tikzpicture}[
every node/.style={font=\sffamily, draw, minimum width=1cm, minimum height=0.7cm, align=center, outer sep=0},
fill0/.style={fill=red!20}, % Complementary to lightgray
fill1/.style={fill=blue!20}, % Complementary to orange
fill2/.style={fill=orange!20}, % Complementary to blue
fill3/.style={fill=yellow!20}, % Complementary to purple
back3/.style={fill=yellow!20} % Same as fill3
]
% Row 0
\node[fill0] (F0_0) {$F_{0,0}$};
\node[fill0, right=0cm of F0_0] (F0_1) {$F_{0,1}$};
\node[fill0, right=0cm of F0_1] (F0_2) {$F_{0,2}$};
\node[fill0, right=0cm of F0_2] (F0_3) {$F_{0,3}$};
% Row 1
\node[fill1, above right=0cm and 0cm of F0_0] (F1_0) {$F_{1,0}$};
\node[fill1, right=0cm of F1_0] (F1_1) {$F_{1,1}$};
\node[fill1, right=0cm of F1_1] (F1_2) {$F_{1,2}$};
\node[fill1, right=0cm of F1_2] (F1_3) {$F_{1,3}$};
% Row 2 (stacked above F1)
\node[fill2, above right=0cm and 0cm of F1_0] (F2_0) {$F_{2,0}$};
\node[fill2, right=0cm of F2_0] (F2_1) {$F_{2,1}$};
\node[fill2, right=0cm of F2_1] (F2_2) {$F_{2,2}$};
\node[fill2, right=0cm of F2_2] (F2_3) {$F_{2,3}$};
% Row 3 (stacked above F2)
\node[fill3, above right=0cm and 0cm of F2_0] (F3_0) {$F_{3,0}$};
\node[fill3, right=0cm of F3_0] (F3_1) {$F_{3,1}$};
\node[fill3, right=0cm of F3_1] (F3_2) {$F_{3,2}$};
\node[fill3, right=0cm of F3_2] (F3_3) {$F_{3,3}$};
% Row 3 (backward pass)
\node[back3, right=0cm of F3_3] (B3_3) {$B_{3,3}$};
\node[back3, right=0cm of B3_3] (B3_2) {$B_{3,2}$};
\node[back3, right=0cm of B3_2] (B3_1) {$B_{3,1}$};
\node[back3, right=0cm of B3_1] (B3_0) {$B_{3,0}$};
% Row 2 (backward pass)
\node[fill2, below=0cm and 0cm of B3_2] (B2_3) {$B_{2,3}$};
\node[fill2, right=0cm of B2_3] (B2_2) {$B_{2,2}$};
\node[fill2, right=0cm of B2_2] (B2_1) {$B_{2,1}$};
\node[fill2, right=0cm of B2_1] (B2_0) {$B_{2,0}$};
% Row 1 (backward pass)
\node[fill1, below=0cm of B2_2] (B1_3) {$B_{1,3}$};
\node[fill1, right=0cm of B1_3] (B1_2) {$B_{1,2}$};
\node[fill1, right=0cm of B1_2] (B1_1) {$B_{1,1}$};
\node[fill1, right=0cm of B1_1] (B1_0) {$B_{1,0}$};
% Row 0 (backward pass)
\node[fill0, below=0cm of B1_2] (B0_3) {$B_{0,3}$};
\node[fill0, right=0cm of B0_3] (B0_2) {$B_{0,2}$};
\node[fill0, right=0cm of B0_2] (B0_1) {$B_{0,1}$};
\node[fill0, right=0cm of B0_1] (B0_0) {$B_{0,0}$};
% Update nodes
\node[fill0, right=0cm of B0_0] (U0_0) {Update};
\node[fill1, above=0cm of U0_0] (U0_1) {Update};
\node[fill2, above=0cm of U0_1] (U0_2) {Update};
\node[fill3, above=0cm of U0_2] (U0_3) {Update};
%\node[draw=none, minimum width=4cm, minimum height=1cm, align=center, right=1cm of F0_3] (Bubble) {Bubble};
\end{tikzpicture}
```
![](images/svg/_pipeline-parallelism.svg){width=100%}
:::
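The cost of the schedule's ramp-up and drain can be estimated with the standard GPipe-style bubble formula: with $p$ pipeline stages and $m$ microbatches, the idle fraction is approximately $(p-1)/(m+p-1)$. A minimal sketch for the 4-stage, 4-microbatch schedule in the figure, plus a larger microbatch count for comparison:

```python
# Pipeline bubble fraction for a GPipe-style schedule: (p - 1) / (m + p - 1).
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    return (p_stages - 1) / (m_microbatches + p_stages - 1)

print(f"p=4, m=4  -> bubble = {bubble_fraction(4, 4):.0%}")    # 43% of device-time idle
print(f"p=4, m=32 -> bubble = {bubble_fraction(4, 32):.0%}")   # ~9% idle
```

Raising the microbatch count shrinks the bubble, which is why pipeline parallelism is typically paired with aggressive gradient accumulation.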
::: {.callout-definition title="Pipeline Parallelism"}
@@ -2126,6 +1923,10 @@ In a distributed setting, experts are partitioned across workers. If we have 8 G
3. **Computation**: Experts process their assigned tokens.
4. **All-to-All Combine**: Processed tokens are routed back to their original device to resume the sequence.
::: {#fig-moe-all-to-all-routing fig-env="figure" fig-pos="htb" fig-cap="**Mixture of Experts (MoE) All-to-All Routing**. Expert parallelism requires a unique 'All-to-All' communication pattern. Tokens from a local batch are routed by a gating network to experts sharded across the cluster. This involves a global shuffle where every GPU sends tokens to and receives tokens from potentially every other GPU, stressing the cluster's bisection bandwidth." fig-alt="Diagram showing tokens being routed from a gating network to four different experts sharded across multiple GPUs, then returning to the original sequence order."}
![](images/svg/moe-all-to-all-routing.svg){width=100%}
:::
The primary advantage is decoupling model size from compute budget. A trillion-parameter MoE model might use only 10B parameters per token, enabling training on feasible hardware budgets. The constraint is the **All-to-All** communication, which is bandwidth-intensive and sensitive to load imbalance.
At the heart of expert parallelism lies the All-to-All communication primitive, which shuffles tokens across the cluster based on dynamic routing decisions. Consider a configuration with $E=64$ experts distributed across 64 GPUs, processing a batch of $B=4$ sequences at length $S=2048$ with hidden dimension $H=4096$. For every MoE layer, the system must dispatch $B \times S$ tokens to their assigned experts. In FP16, this moves $B \cdot S \cdot H \cdot 2$ bytes—approximately 67 MB—in a single direction. Since the processed embeddings must return to their original device for the residual connection, the total network overhead is roughly 134 MB per transformer block. While manageable in isolation, this latency accumulates rapidly in deep, sparse architectures like the Switch Transformer [@fedus2022switch] (up to 2,048 experts) or GShard [@lepikhin2021gshard].
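The 67 MB figure follows directly from the tensor shapes. The sketch below computes the per-layer dispatch and combine traffic for the configuration above; the 32-layer depth used for the cumulative estimate is an illustrative assumption:

```python
# All-to-All traffic per MoE layer (dispatch + combine) in FP16.
B, S, H = 4, 2048, 4096          # batch, sequence length, hidden dim (from the text)
fp16_bytes = 2
n_moe_layers = 32                # hypothetical depth for the cumulative estimate

dispatch_mb = B * S * H * fp16_bytes / 1e6
per_layer_mb = 2 * dispatch_mb   # tokens must also return for the residual connection
print(f"Dispatch: {dispatch_mb:.0f} MB, per-layer total: {per_layer_mb:.0f} MB")
print(f"Per training step across {n_moe_layers} MoE layers: "
      f"{per_layer_mb * n_moe_layers / 1e3:.1f} GB of All-to-All traffic")
```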
@@ -2558,76 +2359,7 @@ To systematically weigh the architectural and hardware trade-offs for a new 50-b
@fig-parallelism-flowchart provides a decision tree for selecting parallelism strategies based on model size, dataset size, and scaling constraints. While intentionally simplified, real-world scenarios often involve additional complexities such as hardware heterogeneity, communication bandwidth, and workload imbalance that may influence the choice of parallelism techniques. Practitioners should view this as a foundational tool for understanding trade-offs and decision points, then adapt it to the specific requirements and constraints of their systems.
::: {#fig-parallelism-flowchart fig-env="figure" fig-pos="htb" fig-cap="**Parallelism Strategy Decision Tree**. A systematic selection guide based on two key questions: Does the model fit in single-device memory? Does the dataset fit on a single device? Models exceeding device memory require model parallelism; large datasets benefit from data parallelism; significant constraints in both dimensions demand hybrid approaches. While simplified, this framework captures the primary decision points before practitioners must consider secondary factors like hardware heterogeneity and workload imbalance." fig-alt="Decision tree flowchart starting from Start. Diamond nodes ask about model and dataset fit on single device. Paths lead to four outcomes: Single Device Optimization, Data Parallelism, Model Parallelism, or Hybrid Parallelism."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{Line/.style={line width=1.0pt,black!50,text=black
},
Box/.style={inner xsep=2pt,
node distance=11mm,
draw=GreenLine, line width=0.75pt,
fill=GreenL,
text width=27mm,align=flush center,
minimum width=27mm, minimum height=9mm
},
Box1/.style={Box,
draw=RedLine, fill=RedL,
text width=31mm,
minimum width=32mm,
minimum height=10mm
},
Text/.style={inner xsep=2pt,
draw=none, line width=0.75pt,
fill=TextColor,
font=\footnotesize\usefont{T1}{phv}{m}{n},
align=flush center,
minimum width=7mm,
minimum height=5mm
},
decision/.style = {align=flush center,text width=42mm,diamond, aspect=2.2, node distance=6mm,
inner xsep=-3pt, inner ysep=-2.95ex,fill=VioletL2, draw=VioletLine},
}
\node[Box](B1){Hybrid\\ Parallelism};
\node[Box,node distance=16mm,right=of B1](B2){Model\\Parallelism};
\node[Box,node distance=16 mm,right=of B2](B3){Data\\ Parallelism};
\node[Box,right=of B3,fill=RedL, draw=RedLine](B4){Single Device Optimization};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,
yshift=-1mm,
fill=BackColor,fit=(B1)(B3),line width=0.75pt](BB){};
\node[decision,node distance=18mm,
above=of B4](G1B4){Is\\ the dataset\\ very large?};
\node[Box1,node distance=15mm,
above=of $(B2.north)!0.5!(B3.north)$](G1B3){Is scaling the model\\ or data more critical?};
\node[decision,above=of G1B3](G2B3){Are\\ both constraints\\ significant?};
\node[decision,above=of G2B3](G3B3){Does\\ the dataset fit in a\\ single device?};
\node[decision,above=of G3B3](G4B3){Does\\ the model fit in a\\ single device?};
\node[Box,node distance=5mm,above=of G4B3,fill=BlueL, draw=BlueLine](G5B3){Start};
%
\node[Box,below=1 of B2,fill=BlueL, draw=BlueLine](DB2){End};
%
\draw[Line,-latex](G5B3)--(G4B3);
\draw[Line,-latex](G4B3)--node[right,pos=0.35]{No}(G3B3);
\draw[Line,-latex](G4B3)-|node[above,pos=0.05]{Yes}(G1B4);
\draw[Line,-latex](G3B3)--node[right,pos=0.35]{No}(G2B3);
\draw[Line,-latex](G2B3)--node[right,pos=0.35]{No}(G1B3);
\draw[Line,-latex](G1B4)--node[right,pos=0.15]{No}(B4);
%
\draw[Line,-latex](G3B3.west)--node[above,pos=0.25]{Yes}++(180:2.3)|-(B2.west);
\draw[Line,-latex](G2B3)-|node[above,pos=0.05]{Yes}(B1);
\draw[Line,-latex](G1B3.south)--node[left,align=center,pos=0.45]{Scaling Model}++(270:8mm)-|(B2);
\draw[Line,-latex](G1B3.south)--++(270:8mm)-|(B3);
\draw[Line,-latex](G1B4)-|node[above,pos=0.22,text=black]{Yes}(B3.40);
%
\draw[Line,-latex](B1)|-(DB2);
\draw[Line,-latex](B3)|-(DB2);
\draw[Line,-latex](B2)--(DB2);
\node[above=2pt of BB.204,inner sep=0pt,anchor=south,fill=BackColor]{Parallelism Opportunities};
\end{tikzpicture}
```
![](images/svg/_parallelism-flowchart.svg){width=100%}
:::
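The flowchart can also be read as a small decision procedure. The sketch below encodes its four outcomes; it is a deliberate simplification that ignores the secondary factors (hardware heterogeneity, bandwidth, workload imbalance) noted above:

```python
def choose_parallelism(model_fits: bool, dataset_fits: bool,
                       dataset_very_large: bool = False,
                       both_constraints_significant: bool = False,
                       scaling_model_more_critical: bool = True) -> str:
    """Simplified encoding of the decision tree in @fig-parallelism-flowchart."""
    if model_fits:
        return "Data Parallelism" if dataset_very_large else "Single Device Optimization"
    if dataset_fits:
        return "Model Parallelism"
    if both_constraints_significant:
        return "Hybrid Parallelism"
    return "Model Parallelism" if scaling_model_more_critical else "Data Parallelism"

# A model that exceeds device memory, trained on a web-scale corpus:
print(choose_parallelism(model_fits=False, dataset_fits=False,
                         both_constraints_significant=True))   # -> Hybrid Parallelism
```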
## From Principles to Systems {#sec-distributed-training-systems-systems-framework-integration-cf71}
@@ -2775,64 +2507,7 @@ We explored the **3D Parallelism Cube** (@fig-3d-parallelism-cube-summary), the
Ultimately, the choice of parallelism is a **loop transformation** applied by the cluster-level compiler. By matching logical communication patterns to physical hardware hierarchies, we move from the "linear scaling regime" of small clusters to the "communication-bound" reality of the exascale supercomputer. @fig-3d-parallelism-sliced visualizes how a single Transformer layer is partitioned across the fleet.
::: {#fig-3d-parallelism-sliced fig-env="figure" fig-pos="htb" fig-cap="**Anatomy of a Hybrid 3D Parallel Step**. Visualization of how a single Transformer layer is partitioned across the fleet. **Pipeline Parallelism ($p$)** slices the model vertically by depth (assigning layers to different stages). **Tensor Parallelism ($t$)** slices the weight matrices horizontally within each layer (assigning shards to different GPUs). **Data Parallelism ($d$)** replicates this sliced block $d$ times to process independent data samples. A single training step coordinates $d \times p \times t$ GPUs, where $t$ synchronizes at kilohertz frequency, $p$ at megahertz, and $d$ at hertz." fig-alt="Volumetric diagram showing a model block sliced by three orthogonal planes: depth (green), width (orange), and replicas (blue). Labels show p, t, and d dimensions."}
```{.tikz}
\begin{tikzpicture}[scale=1.5, font=\small\usefont{T1}{phv}{m}{n}, line join=round]
% Define colors
\definecolor{BlueL}{RGB}{0,99,149}
\definecolor{OrangeL}{RGB}{204,85,0}
\definecolor{GreenL}{RGB}{0,143,69}
% Set node styles
\tikzset{
face/.style={fill=white, opacity=0.8, draw=black!60, thick},
label/.style={font=\bfseries, align=center},
gpu/.style={face, minimum width=1cm, minimum height=1cm, fill=#1!10},
brace/.style={decorate, decoration={brace, amplitude=5pt}, thick},
brace_mirror/.style={decorate, decoration={brace, mirror, amplitude=10pt}, thick}
}
% Explicit node placement: d=replica, p=pipeline stage, t=tensor shard
% Replica 0, stage 0: anchor at (0,0)
\node[gpu=BlueL] (data000) at (0, 0) {\tiny GPU};
\node[gpu=OrangeL] (tensor000) [right=1.2cm of data000] {};
\node[gpu=GreenL] (pipe000) [above=1.2cm of data000] {};
% Replica 0, stage 1: below stage 0
\node[gpu=BlueL] (data010) [below=1.2cm of data000] {\tiny GPU};
\node[gpu=OrangeL] (tensor010) [right=1.2cm of data010] {};
\node[gpu=GreenL] (pipe010) [above=1.2cm of data010] {};
% Replica 1, stage 0: to the right of replica 0 (offset by 2.5cm past tensor column)
\node[gpu=BlueL] (data100) [right=2.5cm of tensor000] {\tiny GPU};
\node[gpu=OrangeL] (tensor100) [right=1.2cm of data100] {};
\node[gpu=GreenL] (pipe100) [above=1.2cm of data100] {};
% Replica 1, stage 1: below replica 1 stage 0
\node[gpu=BlueL] (data110) [below=1.2cm of data100] {\tiny GPU};
\node[gpu=OrangeL] (tensor110) [right=1.2cm of data110] {};
\node[gpu=GreenL] (pipe110) [above=1.2cm of data110] {};
% Brackets for Replica 0
\draw[brace, GreenL]
([yshift=-0.1cm]data000.south west) --
([yshift=0.1cm]data010.north west)
node[midway, left=8pt, label] {$p$ stages};
\draw[brace, OrangeL]
([xshift=-0.1cm]data000.south west) --
([xshift=0.1cm]tensor000.south)
node[midway, below=8pt, label] {$t$ shards};
% Bracket for Data (across both replicas)
\draw[brace_mirror, BlueL]
([xshift=-0.1cm]data000.south west) --
([xshift=0.1cm]tensor100.south east)
node[midway, below=12pt, label] {$d$ replicas};
\end{tikzpicture}
```
![](images/svg/3d-parallelism.svg){width=100%}
:::
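A quick sanity check on the geometry: the three degrees multiply to the world size, and each axis maps to a different slice of the physical network. The degrees below are hypothetical, chosen so that tensor parallelism stays inside an 8-GPU NVLink domain:

```python
# 3D parallelism geometry: total GPUs = d * p * t.
t, p, d = 8, 12, 32              # tensor, pipeline, data degrees (hypothetical)
print(f"GPUs per model replica: {t * p}, replicas: {d}, total GPUs: {t * p * d}")
# Sync cadence per axis (per the caption): t communicates inside every layer,
# p at every microbatch boundary, d once per optimizer step.
```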
::: {.callout-takeaways title="Parallelism Is a Loop Transformation"}

View File

@@ -1,126 +1,69 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 680 460"
font-family="Helvetica Neue, Helvetica, Arial, sans-serif">
<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 680 460" font-family="Helvetica Neue, Helvetica, Arial, sans-serif">
<rect width="680" height="460" fill="#fff" rx="4"/>
<!-- ===== Title ===== -->
<text x="340" y="28" text-anchor="middle" font-size="13" font-weight="700" fill="#333">ZeRO Progressive Memory Partitioning</text>
<text x="340" y="46" text-anchor="middle" font-size="9" fill="#555">Per-GPU memory footprint as sharding progresses across ZeRO stages (N = 64 GPUs, 7B parameter model)</text>
<!-- ===== Legend ===== -->
<text x="340" y="28" text-anchor="middle" font-size="13" font-weight="700" fill="#333" style="white-space: pre;">ZeRO Progressive Memory Partitioning</text>
<text x="340" y="46" text-anchor="middle" font-size="9" fill="#555" style="white-space: pre;">Per-GPU memory footprint as sharding progresses across ZeRO stages (N = 64 GPUs, 7B parameter model)</text>
<rect x="120" y="58" width="12" height="10" fill="#cfe2f3" stroke="#4a90c4" stroke-width="1.2" rx="2"/>
<text x="136" y="67" font-size="9" fill="#555">Weights (FP16, 2B/param)</text>
<text x="136" y="67" font-size="9" fill="#555" style="white-space: pre;">Weights (FP16, 2B/param)</text>
<rect x="280" y="58" width="12" height="10" fill="#d4edda" stroke="#3d9e5a" stroke-width="1.2" rx="2"/>
<text x="296" y="67" font-size="9" fill="#555">Gradients (FP16, 2B/param)</text>
<text x="296" y="67" font-size="9" fill="#555" style="white-space: pre;">Gradients (FP16, 2B/param)</text>
<rect x="450" y="58" width="12" height="10" fill="#fdebd0" stroke="#c87b2a" stroke-width="1.2" rx="2"/>
<text x="466" y="67" font-size="9" fill="#555">Optimizer States (FP32, 12B/param)</text>
<!-- ===== Column headers ===== -->
<text x="125" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333">DDP</text>
<text x="125" y="113" text-anchor="middle" font-size="8.5" fill="#555">(Baseline)</text>
<text x="280" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333">ZeRO-1</text>
<text x="280" y="113" text-anchor="middle" font-size="8.5" fill="#555">Shard Optimizer States</text>
<text x="435" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333">ZeRO-2</text>
<text x="435" y="113" text-anchor="middle" font-size="8.5" fill="#555">+ Shard Gradients</text>
<text x="590" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333">ZeRO-3</text>
<text x="590" y="113" text-anchor="middle" font-size="8.5" fill="#555">+ Shard Parameters</text>
<!-- ===== Y-axis label ===== -->
<text x="28" y="290" text-anchor="middle" font-size="9" fill="#555" transform="rotate(-90,28,290)">Memory per GPU (GB)</text>
<!-- ===== Y-axis grid lines ===== -->
<!-- Scale: 0..115 GB maps to y=400..120. That's 280 px for 115 GB => 2.43 px/GB -->
<!-- Ticks at 0, 20, 40, 60, 80, 100, 112 -->
<text x="466" y="67" font-size="9" fill="#555" style="white-space: pre;">Optimizer States (FP32, 12B/param)</text>
<text x="125" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333" style="white-space: pre;">DDP</text>
<text x="125" y="113" text-anchor="middle" font-size="8.5" fill="#555" style="white-space: pre;">(Baseline)</text>
<text x="280" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333" style="white-space: pre;">ZeRO-1</text>
<text x="280" y="113" text-anchor="middle" font-size="8.5" fill="#555" style="white-space: pre;">Shard Optimizer States</text>
<text x="435" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333" style="white-space: pre;">ZeRO-2</text>
<text x="435" y="113" text-anchor="middle" font-size="8.5" fill="#555" style="white-space: pre;">+ Shard Gradients</text>
<text x="590" y="100" text-anchor="middle" font-size="10" font-weight="700" fill="#333" style="white-space: pre;">ZeRO-3</text>
<text x="590" y="113" text-anchor="middle" font-size="8.5" fill="#555" style="white-space: pre;">+ Shard Parameters</text>
<text x="28" y="290" text-anchor="middle" font-size="9" fill="#555" transform="rotate(-90,28,290)" style="white-space: pre;">Memory per GPU (GB)</text>
<line x1="50" y1="400" x2="650" y2="400" stroke="#e0e0e0" stroke-width="0.8"/>
<text x="46" y="403" text-anchor="end" font-size="8.5" fill="#999">0</text>
<text x="46" y="403" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">0</text>
<line x1="50" y1="351" x2="650" y2="351" stroke="#e0e0e0" stroke-width="0.8" stroke-dasharray="3,3"/>
<text x="46" y="354" text-anchor="end" font-size="8.5" fill="#999">20</text>
<text x="46" y="354" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">20</text>
<line x1="50" y1="303" x2="650" y2="303" stroke="#e0e0e0" stroke-width="0.8" stroke-dasharray="3,3"/>
<text x="46" y="306" text-anchor="end" font-size="8.5" fill="#999">40</text>
<text x="46" y="306" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">40</text>
<line x1="50" y1="254" x2="650" y2="254" stroke="#e0e0e0" stroke-width="0.8" stroke-dasharray="3,3"/>
<text x="46" y="257" text-anchor="end" font-size="8.5" fill="#999">60</text>
<text x="46" y="257" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">60</text>
<line x1="50" y1="206" x2="650" y2="206" stroke="#e0e0e0" stroke-width="0.8" stroke-dasharray="3,3"/>
<text x="46" y="209" text-anchor="end" font-size="8.5" fill="#999">80</text>
<text x="46" y="209" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">80</text>
<line x1="50" y1="157" x2="650" y2="157" stroke="#e0e0e0" stroke-width="0.8" stroke-dasharray="3,3"/>
<text x="46" y="160" text-anchor="end" font-size="8.5" fill="#999">100</text>
<!-- ===== DDP Column (col x=85..165, bar width=80) =====
Weights: 14 GB = 34 px
Gradients: 14 GB = 34 px
Optimizer: 84 GB = 204 px
Total: 112 GB = 272 px
Bottom y=400, top y=128
-->
<!-- Optimizer states (bottom) -->
<text x="46" y="160" text-anchor="end" font-size="8.5" fill="#999" style="white-space: pre;">100</text>
<rect x="85" y="196" width="80" height="204" fill="#fdebd0" stroke="#c87b2a" stroke-width="1.5" rx="2"/>
<text x="125" y="310" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a">OS</text>
<text x="125" y="322" text-anchor="middle" font-size="8.5" fill="#c87b2a">84 GB</text>
<!-- Gradients (middle) -->
<text x="125" y="310" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a" style="white-space: pre;">OS</text>
<text x="125" y="322" text-anchor="middle" font-size="8.5" fill="#c87b2a" style="white-space: pre;">84 GB</text>
<rect x="85" y="162" width="80" height="34" fill="#d4edda" stroke="#3d9e5a" stroke-width="1.5" rx="2"/>
<text x="125" y="183" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a">G 14 GB</text>
<!-- Weights (top) -->
<text x="125" y="183" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a" style="white-space: pre;">G 14 GB</text>
<rect x="85" y="128" width="80" height="34" fill="#cfe2f3" stroke="#4a90c4" stroke-width="1.5" rx="2"/>
<text x="125" y="149" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4">W 14 GB</text>
<!-- Total label -->
<text x="125" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333">112 GB</text>
<text x="125" y="431" text-anchor="middle" font-size="8.5" fill="#555">1.0× baseline</text>
<!-- ===== ZeRO-1 Column (col x=240..320) =====
Weights: 14 GB = 34 px (full, replicated)
Gradients: 14 GB = 34 px (full, replicated)
Optimizer: 84/64 ≈ 1.3 GB ≈ 4 px → show as thin bar
Total: ~29 GB = 70 px
-->
<!-- Optimizer states (sharded, thin) -->
<text x="125" y="149" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4" style="white-space: pre;">W 14 GB</text>
<text x="125" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333" style="white-space: pre;">112 GB</text>
<text x="125" y="431" text-anchor="middle" font-size="8.5" fill="#555" style="white-space: pre;">1.0× baseline</text>
<rect x="240" y="330" width="80" height="70" fill="#fdebd0" stroke="#c87b2a" stroke-width="1.5" rx="2"/>
<text x="280" y="364" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a">OS/N</text>
<text x="280" y="376" text-anchor="middle" font-size="8.5" fill="#c87b2a">≈1.3 GB</text>
<!-- Gradients (full) -->
<text x="280" y="364" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a" style="white-space: pre;">OS/N</text>
<text x="280" y="376" text-anchor="middle" font-size="8.5" fill="#c87b2a" style="white-space: pre;">≈1.3 GB</text>
<rect x="240" y="296" width="80" height="34" fill="#d4edda" stroke="#3d9e5a" stroke-width="1.5" rx="2"/>
<text x="280" y="317" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a">G 14 GB</text>
<!-- Weights (full) -->
<text x="280" y="317" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a" style="white-space: pre;">G 14 GB</text>
<rect x="240" y="262" width="80" height="34" fill="#cfe2f3" stroke="#4a90c4" stroke-width="1.5" rx="2"/>
<text x="280" y="283" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4">W 14 GB</text>
<!-- Total label -->
<text x="280" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333">~29 GB</text>
<text x="280" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a">~4× reduction</text>
<!-- ===== ZeRO-2 Column (col x=395..475) =====
Weights: 14 GB = 34 px (full, replicated)
Gradients: 14/64 ≈ 0.2 GB → thin
Optimizer: 84/64 ≈ 1.3 GB → thin
Total: ~15.5 GB = 38 px
-->
<!-- Optimizer states (sharded) -->
<text x="280" y="283" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4" style="white-space: pre;">W 14 GB</text>
<text x="280" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333" style="white-space: pre;">~29 GB</text>
<text x="280" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a" style="white-space: pre;">~4× reduction</text>
<rect x="395" y="349" width="80" height="51" fill="#fdebd0" stroke="#c87b2a" stroke-width="1.5" rx="2"/>
<text x="435" y="375" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a">OS/N</text>
<!-- Gradients (sharded) -->
<text x="435" y="375" text-anchor="middle" font-size="9" font-weight="700" fill="#c87b2a" style="white-space: pre;">OS/N</text>
<rect x="395" y="325" width="80" height="24" fill="#d4edda" stroke="#3d9e5a" stroke-width="1.5" rx="2"/>
<text x="435" y="341" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a">G/N</text>
<!-- Weights (full) -->
<text x="435" y="341" text-anchor="middle" font-size="8.5" font-weight="700" fill="#3d9e5a" style="white-space: pre;">G/N</text>
<rect x="395" y="291" width="80" height="34" fill="#cfe2f3" stroke="#4a90c4" stroke-width="1.5" rx="2"/>
<text x="435" y="312" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4">W 14 GB</text>
<!-- Total label -->
<text x="435" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333">~16 GB</text>
<text x="435" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a">~7× reduction</text>
<!-- ===== ZeRO-3 Column (col x=550..630) =====
Everything sharded: W/N + G/N + OS/N ≈ 1.75 GB total → very thin bars
-->
<!-- Optimizer states (sharded) -->
<text x="435" y="312" text-anchor="middle" font-size="8.5" font-weight="700" fill="#4a90c4" style="white-space: pre;">W 14 GB</text>
<text x="435" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333" style="white-space: pre;">~16 GB</text>
<text x="435" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a" style="white-space: pre;">~7× reduction</text>
<rect x="550" y="370" width="80" height="30" fill="#fdebd0" stroke="#c87b2a" stroke-width="1.5" rx="2"/>
<text x="590" y="389" text-anchor="middle" font-size="8.5" font-weight="700" fill="#c87b2a">OS/N</text>
<!-- Gradients (sharded) -->
<text x="590" y="389" text-anchor="middle" font-size="8.5" font-weight="700" fill="#c87b2a" style="white-space: pre;">OS/N</text>
<rect x="550" y="352" width="80" height="18" fill="#d4edda" stroke="#3d9e5a" stroke-width="1.5" rx="2"/>
<text x="590" y="365" text-anchor="middle" font-size="8" font-weight="700" fill="#3d9e5a">G/N</text>
<!-- Weights (sharded) -->
<text x="590" y="365" text-anchor="middle" font-size="8" font-weight="700" fill="#3d9e5a" style="white-space: pre;">G/N</text>
<rect x="550" y="334" width="80" height="18" fill="#cfe2f3" stroke="#4a90c4" stroke-width="1.5" rx="2"/>
<text x="590" y="347" text-anchor="middle" font-size="8" font-weight="700" fill="#4a90c4">W/N</text>
<!-- Total label -->
<text x="590" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333">~1.75 GB</text>
<text x="590" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a">~64× reduction</text>
<!-- ===== Reduction arrows ===== -->
<text x="590" y="347" text-anchor="middle" font-size="8" font-weight="700" fill="#4a90c4" style="white-space: pre;">W/N</text>
<text x="590" y="418" text-anchor="middle" font-size="9.5" font-weight="700" fill="#333" style="white-space: pre;">~1.75 GB</text>
<text x="590" y="431" text-anchor="middle" font-size="8.5" fill="#3d9e5a" style="white-space: pre;">~64× reduction</text>
<defs>
<marker id="arrow-green" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
<path d="M0,0 L8,3 L0,6 Z" fill="#3d9e5a"/>
@@ -129,12 +72,8 @@
<line x1="175" y1="270" x2="230" y2="300" stroke="#3d9e5a" stroke-width="1.2" stroke-dasharray="4,2" marker-end="url(#arrow-green)"/>
<line x1="330" y1="315" x2="385" y2="323" stroke="#3d9e5a" stroke-width="1.2" stroke-dasharray="4,2" marker-end="url(#arrow-green)"/>
<line x1="485" y1="325" x2="540" y2="355" stroke="#3d9e5a" stroke-width="1.2" stroke-dasharray="4,2" marker-end="url(#arrow-green)"/>
<!-- Stage labels on arrows -->
<text x="202" y="280" text-anchor="middle" font-size="8" fill="#3d9e5a">Shard OS</text>
<text x="357" y="314" text-anchor="middle" font-size="8" fill="#3d9e5a">+ Shard G</text>
<text x="512" y="342" text-anchor="middle" font-size="8" fill="#3d9e5a">+ Shard W</text>
<!-- ===== Bottom footnote ===== -->
<text x="340" y="448" text-anchor="middle" font-size="9" fill="#999" font-style="italic">7B parameter model, N=64 GPUs, mixed precision (FP16 weights/gradients, FP32 optimizer states). Total baseline: 112 GB/GPU.</text>
</svg>
<text x="210.536" y="274.838" text-anchor="middle" font-size="8" fill="#3d9e5a" style="white-space: pre; font-size: 8px;">Shard OS</text>
<text x="361.426" y="310.796" text-anchor="middle" font-size="8" fill="#3d9e5a" style="white-space: pre; font-size: 8px;">+ Shard G</text>
<text x="519.67" y="329.811" text-anchor="middle" font-size="8" fill="#3d9e5a" style="white-space: pre; font-size: 8px;">+ Shard W</text>
<text x="340" y="448" text-anchor="middle" font-size="9" fill="#999" font-style="italic" style="white-space: pre;">7B parameter model, N=64 GPUs, mixed precision (FP16 weights/gradients, FP32 optimizer states). Total baseline: 112 GB/GPU.</text>
</svg>


View File

@@ -122,6 +122,10 @@ class FaultToleranceSetup:
Imagine a 10,000-GPU cluster midway through a three-month training run for a new foundation model. Statistically, a GPU will fail every few hours. If the system is not designed to absorb these continuous physical failures, the training process will halt entirely, wasting millions of dollars in compute time. In the **Fleet Stack** (@sec-vol2-introduction), Fault Tolerance acts as this necessary **Immune System** of the Infrastructure Layer.
::: {#fig-fault-tolerance-failure-spectrum fig-env="figure" fig-pos="htb" fig-cap="**The ML Failure Spectrum**. Reliability challenges at scale range from hard hardware failures (crashes) to silent data corruption (SDC) and performance degradation (gray failures). Fault tolerance engineering must provide mechanisms to detect and recover from each type of failure to maintain fleet-scale throughput." fig-alt="Continuum diagram showing Failure Frequency vs Detection Difficulty. Categories: Hard Failure, Gray Failure, Silent Data Corruption."}
![](images/svg/_failure-spectrum.svg){width=100%}
:::
The distributed training systems examined in @sec-distributed-training-systems achieve massive throughput by coordinating thousands of devices, and the collective communication patterns from @sec-collective-communication — AllReduce, AllGather, AllToAll — sustain that coordination through rigidly synchronized exchanges. However, this tight synchronization creates fragility: a failure in any single device can stall the entire fleet. This chapter builds the resilience layer necessary to keep that fleet running.
The transition from small-scale experimentation to large-scale production changes the relationship between systems and failures. A researcher training a model on a single GPU might experience hardware failure once per year. That same researcher scaling to a 1,000 GPU cluster will experience failures multiple times per day. This shift from rare exception to routine occurrence demands different engineering approaches. The mathematical analysis that follows makes this transition precise and quantitative.
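The shift from a once-a-year nuisance to a several-times-a-day routine is just the arithmetic of independent failures. A minimal sketch, assuming independent, exponentially distributed node failures and an illustrative per-node MTBF of one year:

```python
# Fleet MTBF shrinks linearly with node count under independent failures.
node_mtbf_hours = 8760.0                       # one failure per node-year (illustrative)
for n_gpus in (1, 1_000, 10_000):
    fleet_mtbf = node_mtbf_hours / n_gpus
    failures_per_day = 24.0 / fleet_mtbf
    print(f"{n_gpus:>6} GPUs: fleet MTBF = {fleet_mtbf:8.2f} h "
          f"(~{failures_per_day:.2f} failures/day)")
```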
@@ -226,66 +230,15 @@ As @fig-young-daly illustrates, the **Young-Daly formula**[^fn-young-daly-histor
[^fn-young-daly-history]: **Young-Daly Formula**: J. W. Young derived the first-order optimal checkpoint interval in a 1974 *Communications of the ACM* paper; John Daly independently refined it in 2006 with tighter second-order bounds. The formula's square-root relationship ($\tau_{\text{opt}} = \sqrt{2 \cdot T_{\text{write}} \cdot \text{MTBF}}$) means that halving MTBF only increases optimal checkpoint frequency by $\sqrt{2}\approx 1.4\times$, explaining why doubling cluster size does not demand doubling checkpoint I/O bandwidth. \index{Young-Daly Formula!history}
::: {#fig-young-daly fig-env="figure" fig-pos="htb" fig-cap="**The Young-Daly Optimal Checkpoint**. Total wasted work is the sum of *checkpointing overhead* (which decreases with interval $\tau$) and *rework cost* (which increases with $\tau$). The minimum point defines the optimal interval $\tau_{\text{opt}} = \sqrt{2 \cdot T_{\text{write}} \cdot \text{MTBF}}$. For a cluster with 5-hour MTBF and 15-minute write time, the optimal interval is ~1.6 hours." fig-alt="Plot of Overhead vs Checkpoint Interval. A red curve for Rework Cost increases linearly. A blue curve for Checkpoint Overhead decreases hyperbolically. Their sum (green curve) shows a clear minimum point labeled Optimal Interval."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ YOUNG-DALY OPTIMAL CHECKPOINT (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-young-daly — checkpoint overhead vs rework trade-off
# │
# │ Goal: Plot overhead = T_write/τ and rework = τ/(2*\text{MTBF}); show optimal
# │ τ_opt = sqrt(2*T_write*\text{MTBF}) minimizing total waste.
# │ Show: Three curves; minimum point annotation.
# │ How: tau = linspace; over_ckpt + over_rework; viz.set_book_style().
# │
# │ Imports: numpy (np), matplotlib.pyplot (plt), mlsysim.core.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
import matplotlib.pyplot as plt
from mlsysim import viz
viz.set_book_style()
COLORS = viz.COLORS
# Parameters for visualization
mtbf = 5.0 # hours
t_write = 0.25 # hours (15 mins)
tau = np.linspace(0.1, 5.0, 100)
over_ckpt = t_write / tau
over_rework = tau / (2 * mtbf)
total_waste = over_ckpt + over_rework
tau_opt = np.sqrt(2 * t_write * mtbf)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(tau, over_ckpt, label=r'Checkpoint Overhead ($T_{\text{write}}/\tau$)', color=COLORS['BlueLine'], linestyle='--')
ax.plot(tau, over_rework, label=r'Expected Rework ($\tau/2\text{MTBF}$)', color=COLORS['RedLine'], linestyle='--')
ax.plot(tau, total_waste, label='Total Wasted Work', color=COLORS['GreenLine'], linewidth=2.5)
# Optimal point
ax.scatter([tau_opt], [np.sqrt(2*t_write/mtbf)], color='black', zorder=5)
ax.annotate(f'$\\tau_{{opt}} \\approx {tau_opt:.1f}$h',
xy=(tau_opt, np.sqrt(2*t_write/mtbf)),
xytext=(tau_opt + 0.2, 0.6),
arrowprops=dict(facecolor='black', shrink=0.05, width=1, headwidth=5))
ax.set_xlabel('Checkpoint Interval $\\tau$ (Hours)')
ax.set_ylabel('Fraction of Total Time Wasted')
ax.set_ylim(0, 1.0)
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
![](images/svg/_young-daly-optimization.svg){width=100%}
:::
The formula $\tau_{\text{opt}} = \sqrt{2 \cdot T_{\text{write}} \cdot \text{MTBF}}$ reveals a critical scaling property: as clusters grow larger ($\text{MTBF} \downarrow$), we must checkpoint more frequently. This, in turn, demands higher-bandwidth storage systems (@sec-data-storage) to keep $T_{\text{write}}$ small, otherwise the "Checkpoint Tax" will consume most of the cluster's compute capacity.
::: {#fig-checkpoint-tax fig-env="figure" fig-pos="htb" fig-cap="**The Checkpoint Tax**. As cluster size increases, the optimal checkpoint interval shrinks, increasing the fraction of time spent on I/O. For a fixed storage bandwidth, the 'tax' on compute capacity grows with N. Minimizing this tax requires tiered storage staging (NVMe to PFS) to keep $T_{\text{write}}$ independent of cluster scale." fig-alt="Plot showing Checkpoint Overhead percentage vs GPU count. Three curves for different storage bandwidths (1 GB/s, 10 GB/s, 100 GB/s) show how overhead grows with N but is suppressed by bandwidth."}
![](images/svg/_checkpoint-tax.svg){width=100%}
:::
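The interaction in the figure can be reproduced in a few lines: fleet MTBF shrinks with $N$, $\tau_{\text{opt}}$ shrinks with it, and the wasted fraction $T_{\text{write}}/\tau_{\text{opt}} + \tau_{\text{opt}}/(2 \cdot \text{MTBF})$ grows unless storage bandwidth holds $T_{\text{write}}$ down. The checkpoint size, per-node MTBF, and bandwidth tiers below are illustrative assumptions:

```python
import math

ckpt_size_gb = 2000.0            # hypothetical checkpoint size
node_mtbf_hours = 8760.0         # one failure per node-year (illustrative)

for n_gpus in (1_000, 10_000, 100_000):
    for bw_gbps in (1, 10, 100):                    # aggregate storage bandwidth, GB/s
        mtbf = node_mtbf_hours / n_gpus
        t_write = ckpt_size_gb / bw_gbps / 3600.0   # hours per checkpoint write
        tau_opt = math.sqrt(2 * t_write * mtbf)
        waste = t_write / tau_opt + tau_opt / (2 * mtbf)
        # Young-Daly assumes T_write << MTBF; waste near 1 means the regime has collapsed.
        print(f"N={n_gpus:>6}, BW={bw_gbps:>3} GB/s: tau_opt={tau_opt:5.2f} h, "
              f"overhead={min(waste, 1.0):6.1%}")
```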
```{python}
#| label: lego-nines-reliability
#| echo: false
@@ -560,121 +513,7 @@ At the nanometer scale of modern transistors, hardware is not deterministic; it
Facebook documented a pervasive SDC issue where a hardware fault caused a valid file to be reported as "size zero" during decompression [@dixit2021silent]. As @fig-sdc-example illustrates, the system "worked" (no crash), but data was silently deleted. In ML, this manifests as valid-looking but mathematically garbage gradients.
::: {#fig-sdc-example fig-env="figure" fig-pos="htb" fig-cap="**Silent Data Corruption Propagation**. Unexpected faults can return incorrect file sizes, leading to data loss during decompression and propagating errors through distributed querying systems. This example from Facebook emphasizes how silent errors bypass standard exception handlers. Source: Facebook (2021)." fig-alt="System diagram showing data flow from compressed storage through defective CPU to database. Arrows indicate processing stages where file size calculation returns zero, causing missing rows in output."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}\footnotesize]
\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
cube/.style={cylinder, draw,shape border rotate=90, aspect=1.8,inner ysep=0pt,
minimum height=34mm,minimum width=25mm, cylinder uses custom fill,
cylinder body fill=black!07,cylinder end fill=black!25},
Box/.style={,
inner xsep=2pt,
node distance=1.1,
draw=GreenLine,
line width=0.75pt,
font=\usefont{T1}{phv}{m}{n}\small,
align=flush center,
fill=GreenL,
text width=29mm,
minimum width=29mm, minimum height=10mm
},
Box2/.style={helvetica,
inner xsep=2pt,
node distance=0.8,
draw=VioletLine,
line width=0.75pt,
font=\usefont{T1}{phv}{m}{n}\small,
align=flush center,
fill=VioletL2,
text width=32mm,
minimum width=32mm, minimum height=8mm
},
}
\definecolor{CPU}{RGB}{0,120,176}
%%%
\node[Box](B2){Scale math.pow()};
\node[Box,above=of B2](B1){Decompress file size calculation};
\begin{scope}[local bounding box = CPU,shift={($(B2)+(0,-2.6)$)},
scale=0.7, every node/.append style={transform shape}]
\node[fill=CPU,minimum width=56, minimum height=56,
rounded corners=8,outer sep=2pt] (C1) {};
\node[fill=white,minimum width=44, minimum height=44] (C2) {};
\node[fill=CPU!40,minimum width=39, minimum height=39,
align=center,inner sep=0pt,font=\usefont{T1}{phv}{m}{n}
\fontsize{8pt}{9}\selectfont] (C3) {Defective\\CPU};
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=12,
inner sep=0pt,anchor=south](GO\y)at($(C1.north west)!\x!(C1.north east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=3, minimum height=12,
inner sep=0pt,anchor=north](DO\y)at($(C1.south west)!\x!(C1.south east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=12, minimum height=3,
inner sep=0pt,anchor=east](LE\y)at($(C1.north west)!\x!(C1.south west)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=CPU,minimum width=12, minimum height=3,
inner sep=0pt,anchor=west](DE\y)at($(C1.north east)!\x!(C1.south east)$){};
}
\end{scope}
%%
\begin{scope}[local bounding box = CY1,shift={($(B2)+(5,-0.1)$)}]
\node (CA1) [cube] {};
\node (CA2) [cube,minimum height=10pt, fill=CPU!60]at($(CA1.bottom)!0.1!(CA1.top)$) {};
\node (CA3) [cube,minimum height=10pt,fill=red!80]at($(CA2.bottom)+(0,2.6mm)$){};
\node (CA4) [cube,minimum height=10pt,fill=red!80]at($(CA3.bottom)+(0,2.6mm)$){};
\node (CA5) [cube,minimum height=10pt, fill=CPU!60]at($(CA1.bottom)!0.65!(CA1.top)$) {};
\node[align=center]at (CA1){Spark shuffle and\\ merge database};
\end{scope}
%%
\begin{scope}[local bounding box = CY2,shift={($(B2)+(-5,-0.1)$)}]
\node (LCA1) [cube] {};
\node[align=center]at (LCA1){Spark pre-shuffle \\ data store\\(compressed)};
\end{scope}
\node[single arrow, draw=black,thick, fill=VioletL,
minimum width = 15pt, single arrow head extend=3pt,rotate=270,
minimum height=7mm]at($(B2)!0.52!(B1)$) {};
\node[single arrow, draw=black,thick, fill=VioletL,
minimum width = 15pt, single arrow head extend=3pt,rotate=270,
minimum height=7mm]at($(B2)!0.39!(CPU)$) {};
%
\coordinate(DES)at($(DE1)!0.5!(DE6)$);
\coordinate(LEV)at($(LE1)!0.5!(LE6)$);
\node[single arrow, draw=black,thick, fill=VioletL, inner sep=1pt,
minimum width = 14pt, single arrow head extend=2pt,anchor=east,
minimum height=18mm](LS)at($(LEV)+(-0.5,0)$) {};
\node[single arrow, draw=black,thick, fill=VioletL, inner sep=1pt,
minimum width = 14pt, single arrow head extend=2pt,anchor=west,
minimum height=18mm](DS)at($(DES)+(0.5,0)$) {};
%
%fitting
\scoped[on background layer]
\node[draw=violet,inner xsep=6.5mm,inner ysep=6.5mm,outer sep=0pt,
yshift=2mm,fill=none,fit=(CPU)(B1),line width=2.5pt](BB1){};
\node[below=3pt of BB1.north,anchor=north,helvetica]{Shuffle and merge};
%%%
\node[Box2,below left=0.5 of LS](N2){\textbf{2.} Compute (1.1)\textsuperscript{53}};
\node[Box2,below right=0.5 of DS,fill=BlueL,draw=BlueLine](R3){\textbf{3.} Result = 0};
\node[Box2,below right=0.3 and -2.5 of R3,text width=43mm](N3){\textbf{3.} Expected Result = 156.24};
%
\node[Box2,above= of CY2](N1){\textbf{1.} Compute file size for decompression};
\node[Box2,above= of CY1](N4){\textbf{4.} Write file to database if size $>$ 0};
\node[Box2,below right= 0.2 and -1.15of CY1](N5){\textbf{5.} Missing rows in DB};
%
\draw[Line,-latex](N5)|-(CA3.before bottom);
\draw[Line,-latex](N5.50)|-(CA4.6);
\draw[Line](N3.20)|-(R3);
\draw[Line,-latex](LCA1.top)|-(B1);
\draw[Line,latex-](CA1.top)|-(B1);
\end{tikzpicture}
```
![](images/svg/_sdc-propagation.svg){width=100%}
:::
Real-world evidence of SDC in production systems confirms these risks. @fig-sdc-jeffdean shows corrupted data blocks accumulating in a shuffle and merge database at Google, where even a small fraction of corrupted blocks can cascade into significant data quality degradation.
@@ -696,128 +535,7 @@ Google reported that SDC in TPU pods often manifests as sudden, inexplicable spi
Google addresses this by maintaining "Hot Spares": running the same computation on two distinct chips or having a standby ready to take over. If a "Sanity Checker" (monitoring loss/gradients) detects an anomaly, the workload is instantly migrated to the hot spare, and the suspect chip is drained for diagnostics (@fig-sdc-controller). This moves reliability from the *component* (which we cannot trust) to the *system* (which verifies the result).
::: {#fig-sdc-controller fig-env="figure" fig-pos="htb" fig-cap="**Hot Spare Redundancy**. Google's data centers use hot spare cores to maintain uninterrupted ML training despite hardware failures, seamlessly transitioning workloads from defective machines to backup resources. This approach contrasts with parallel redundancy techniques like DMR/TMR by providing a reactive fault tolerance mechanism that minimizes downtime and preserves data integrity during ML training. Source: Jeff Dean, MLSys 2024 Keynote." fig-alt="Four-panel sequence: normal training grid, defective machine marked red, SDC checker detecting fault, workload transferred to hot spare while defective unit sent for repair."}
```{.tikz}
\begin{tikzpicture}[line width=0.75pt,font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{Green}{RGB}{84,180,53}
\definecolor{Red}{RGB}{249,56,39}
\definecolor{Blue}{RGB}{0,97,168}
\definecolor{Siva}{RGB}{161,152,130}
%
\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=2.0pt,black!50,rounded corners=7,-latex},
main/.style={circle, minimum size=5mm, line width=0.7mm,draw=red,keep name},
keep name/.style={prefix after command={\pgfextra{\let\fixname\tikzlastnode}}},
red box/.style={
append after command={
node [rotate=-50,
fit=(\fixname) ,
fill=red,
text width=1.3mm,
inner sep=-\pgflinewidth,
rectangle
] {}
}
}
}
\tikzset{
Box/.style={helvetica,
inner xsep=2pt,
node distance=0.7,
draw=Green,
rounded corners,
fill=Green,
minimum width=11mm, minimum height=6mm
},
Text/.style={%
inner sep=2pt,
draw=none,
line width=0.75pt,
fill=TextColor,
helvetica,
align=flush center,
minimum width=10mm, minimum height=6mm
},
}
\begin{scope}[local bounding box=M1,shift={(0,0)}]
\foreach \x in {1,2,3}{
\foreach \y in {1,2,3}{
\def\br{M1}
\node[Box](R\y\x\br) at (1.3*\x,-0.8*\y) {};
}
}
\node[Box,draw=Blue,fill=Blue]at(R32M1){};
\node[Box,draw=Siva,fill=Siva]at(R33M1){};
\node[below=0.2 of R32M1]{Normal training state};
\end{scope}
\begin{scope}[local bounding box=M1,shift={(4.5,0)}]
\foreach \x in {1,2,3}{
\foreach \y in {1,2,3}{
\def\br{M2}
\node[Box](R\y\x\br) at (1.3*\x,-0.8*\y) {};
}
}
\node[Box,draw=Blue,fill=Blue]at(R32M2){};
\node[Box,draw=Siva,fill=Siva]at(R33M2){};
\node[below=0.2 of R32M2,align=center,
red](DM){Defective machine\\ causes SDC};
\node [main,red box] (c) at (R23M2){};
\draw[Line,red](R23M2)--++(0:1)|-(DM);
\end{scope}
\begin{scope}[local bounding box=M1,shift={(9.0,0)}]
\foreach \x in {1,2,3}{
\foreach \y in {1,2,3}{
\def\br{M3}
\node[Box](R\y\x\br) at (1.3*\x,-0.8*\y) {};
}
}
\node[Box,draw=Blue,fill=Blue]at(R32M3){};
\node[Box,draw=Blue,fill=none,line width=2pt]at(R23M3){};
\node[Box,draw=Siva,fill=Siva]at(R33M3){};
\node[below=0.2 of R32M3,align=center,
Blue](SD){SDC checker\\ automatically\\ identifies SDC};
\node [main,red box] (c) at (R23M3){};
\draw[Line,Blue](R23M3)--++(0:1)|-(SD);
\end{scope}
\begin{scope}[local bounding box=M1,shift={(13.5,0)}]
\foreach \x in {1,2,3}{
\foreach \y in {1,2,3}{
\def\br{M4}
\node[Box](R\y\x\br) at (1.3*\x,-0.8*\y) {};
}
}
\node[Box,draw=Blue,fill=Blue]at(R32M4){};
\node[Box,draw=red,fill=white,line width=2pt]at(R23M4){};
\node[Box,draw=Blue,fill=Green,line width=2pt]at(R33M4){};
\node[below=0.2 of R32M4,align=center,
Blue](SD1){SDC checker moves\\ training to hot spare\\
and sends defective\\ machine for repair};
\node [main,red box] (c) at (R23M4){};
\draw[Line,Blue](R33M4)--++(0:1)|-(SD1);
\end{scope}
\begin{scope}[local bounding box=LE,shift={(3.5,0.4)}]
\node[Box,draw=Green,fill=Green](ZE){};
\node[right=2pt of ZE,font=\small\usefont{T1}{phv}{m}{n}
\footnotesize](L1){Synchronous Training Worker};
\node[Box,draw=Blue,fill=Blue,right=of L1](PL){};
\node[right=2pt of PL,font=\small\usefont{T1}{phv}{m}{n}
\footnotesize](L2){SDC checker};
%
\node[Box,draw=Siva,fill=Siva,right=of L2](SI){};
\node[right=2pt of SI,font=\small\usefont{T1}{phv}{m}{n}
\footnotesize](L3){Hot spare};
\scoped[on background layer]
\node[draw=BackLine,inner xsep=10,inner ysep=6,yshift=0mm,
fill=BackColor!60,fit=(ZE)(L3),line width=0.75pt](BB1){};
\end{scope}
\end{tikzpicture}
```
![](images/svg/_sdc-controller.svg){width=100%}
:::
::: {.callout-checkpoint title="Detecting Silent Corruption"}
@@ -992,7 +710,21 @@ A single bit-flip in a weight matrix illustrates the severity. If a critical wei
The reliability problem is worsening. As the ML fleet scales to sub-5nm process nodes (@sec-compute-infrastructure), the Failures In Time (FIT) rate for Silent Data Corruption (SDC) rises: smaller nodes have lower critical charges, making bit-flips from cosmic rays and thermal noise more frequent, while chips with billions of transistors have higher statistical probabilities of manufacturing defects that manifest only under specific ML workloads [@ma2024challenges]. ML system architects must treat hardware as an **unreliable substrate**, where algorithmic fault tolerance (gradient checksums, weight replication, periodic consistency checks in the MLOps pipeline (@sec-ops-scale)) is a mandatory requirement rather than an HPC specialty.
Hardware faults fall into three categories based on temporal characteristics. **Transient faults** are temporary disruptions caused by external factors such as cosmic rays or electromagnetic interference [@ziegler1996ibm]; they cause incorrect computations without permanent hardware damage and can corrupt gradient updates during training or alter model weights during inference. **Permanent faults** represent irreversible damage from physical defects or component wear-out, such as stuck-at faults or device failures that require hardware replacement; for long-running training jobs, they can mean days or weeks of lost computation. **Intermittent faults** appear and disappear sporadically due to unstable conditions like loose connections or aging components, causing non-deterministic behavior that compromises model validation and reproducibility.
Hardware faults fall into three categories based on temporal characteristics.
::: {#fig-fault-temporal-categories fig-env="figure" fig-pos="htb" fig-cap="**Temporal Taxonomy of Hardware Faults**. (A) Transient: temporary disruptions (e.g., bit flips) that do not cause permanent damage. (B) Intermittent: sporadic errors due to unstable conditions or aging. (C) Permanent: irreversible physical damage (e.g., stuck-at-0/1 faults) requiring replacement." fig-alt="Four-panel diagram showing Transient, Intermittent, and Permanent fault patterns over time, with a dedicated bit-flip illustration."}
::: {layout-ncol=2}
![**Transient Fault (Bit Flip)**](images/svg/_bit-flip.svg)
![**Permanent Fault (Stuck-at)**](images/svg/_stuck-fault.svg)
![**Transient Timing**](images/svg/_transient-fault.svg)
![**Intermittent Timing**](images/svg/_intermittent-fault.svg)
:::
:::
**Transient faults** are temporary disruptions caused by external factors such as cosmic rays or electromagnetic interference [@ziegler1996ibm]; they cause incorrect computations without permanent hardware damage and can corrupt gradient updates during training or alter model weights during inference. **Permanent faults** represent irreversible damage from physical defects or component wear-out, such as stuck-at faults or device failures that require hardware replacement; for long-running training jobs, they can mean days or weeks of lost computation. **Intermittent faults** appear and disappear sporadically due to unstable conditions like loose connections or aging components, causing non-deterministic behavior that compromises model validation and reproducibility.
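To see why a single transient fault can be catastrophic, consider flipping one bit of an FP32 weight. A standard-library sketch; the weight value and bit positions are arbitrary:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an FP32 value (bit 31 = sign, bits 30-23 = exponent)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

w = 0.0125                     # a typical small weight
print(flip_bit(w, 3))          # low mantissa bit: value barely changes
print(flip_bit(w, 30))         # high exponent bit: value jumps to ~4e36
```

A low mantissa bit perturbs the weight imperceptibly, while a high exponent bit inflates it by dozens of orders of magnitude; the latter is exactly the kind of corruption that produces valid-looking but mathematically garbage gradients.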
### Transient Faults {#sec-ft-transient-faults-1455}
@@ -1334,10 +1066,12 @@ Hardware redundancy uses component duplication and voting to detect and mask fau
[^fn-tmr-correction]: **TMR (Triple Modular Redundancy)**: Performs computation three times and takes a majority vote, enabling automatic single-fault correction at 200% hardware overhead. Developed for the Apollo Guidance Computer (1960s), TMR remains the standard for space-grade ML inference where cosmic radiation rates are orders of magnitude higher than at sea level, achieving error rates below $10^{-12}$ per operation. \index{TMR!majority voting}
::: {#fig-tesla-dmr fig-env="figure" fig-pos="htb" fig-cap="**Dual Modular Redundancy**: Tesla's full self-driving computer employs a DMR architecture, replicating critical computations across two independent system-on-chips (socs) to mitigate hardware faults and ensure continuous operation. This redundancy enables the system to mask errors: if one soc fails, the other continues functioning, maintaining safety-critical functions like perception and control. *Source: [Tesla](HTTPS://old.hotchips.org/hc31/HC31_2.3_tesla_hotchips_ppt_final_0817.PDF)*" fig-alt="Block diagram of Tesla self-driving computer with two identical SoCs processing sensor inputs in parallel. Comparator unit validates matching outputs before sending control commands."}
![](./images/png/tesla_dmr.png)
::: {#fig-tesla-dmr fig-env="figure" fig-pos="htb" fig-cap="**Dual Modular Redundancy (DMR)**. Tesla's full self-driving computer employs a DMR architecture, replicating critical computations across two independent system-on-chips (SoCs). A hardware comparator unit validates that both chips produce matching outputs before allowing a control command to reach the vehicle's actuators, ensuring that a single-chip hardware failure or bit flip cannot trigger dangerous driving behavior." fig-alt="Block diagram of Tesla self-driving computer with two identical SoCs processing sensor inputs in parallel. Comparator unit validates matching outputs before sending control commands."}
![](images/svg/tesla-dmr.svg){width=100%}
:::
::: {#fig-error-masking fig-env="figure" fig-pos="htb" fig-cap="**Error Masking via Voting**. In Triple Modular Redundancy (TMR), the same computation runs on three independent units. A voter circuit takes a majority vote of the results. Even if one unit suffers a fault (e.g., bit flip), the system 'masks' the error by selecting the matching outputs from the other two units, allowing continuous operation without interruption." fig-alt="Diagram showing three functional units feeding into a single voter unit. One unit is marked with a red X (fault), but the final output matches the two healthy units."}
![](images/svg/error-masking.svg){width=100%}
:::
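The voting logic itself is trivial, which is why TMR remains attractive despite its 200% hardware overhead. A minimal majority-voter sketch over three redundant results:

```python
def tmr_vote(a, b, c):
    """Majority vote over three redundant results; masks any single fault."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("No majority: more than one unit disagrees (double fault)")

# One corrupted replica is masked transparently:
print(tmr_vote(42.0, 42.0, 1e30))   # -> 42.0
```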
At the software level, distributed ML systems employ runtime monitoring [@francalanza2017foundation; @mahmoud2021issre], anomaly detection (statistical outlier detection, One-Class SVM [@chandola2009anomaly]), consistency checks across distributed model parameters [@lindholm2019data], and heartbeat mechanisms [@kawazoe1997heartbeat] that detect node failures within configurable timeout periods. Software-implemented fault tolerance (SIFT) techniques [@reis2005swift] such as N-version programming and Reed-Solomon error correction codes [@plank1997tutorial] add redundancy at the software level, enabling detection and correction of errors without dedicated hardware. Watchdog timers [@pont2002using] monitor task execution and trigger recovery actions when systems become unresponsive.
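As a concrete example, a heartbeat mechanism is conceptually just a timestamp table plus a timeout check. A minimal sketch; the 30-second timeout and worker naming are illustrative:

```python
import time

class HeartbeatMonitor:
    """Declares a worker failed if no heartbeat arrives within `timeout_s`."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def beat(self, worker: str) -> None:
        self.last_seen[worker] = time.monotonic()

    def failed_workers(self) -> list[str]:
        now = time.monotonic()
        return [w for w, t in self.last_seen.items() if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.beat("worker-17")
print(monitor.failed_workers())   # [] until worker-17 goes silent for 30 s
```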
@@ -1855,6 +1589,54 @@ yshift=-6mm,fill=cyan!10,fit=(PERSON2)(DISPLAY3),line width=0.75pt](BB2){};
:::
## Check-and-Verify: Defending against Silent Data Corruption {#sec-ft-sdc-verification}
As clusters scale to 100,000+ GPUs, the probability of a "Silent Data Corruption" (SDC) event—where an ALU or HBM bit flip occurs without triggering ECC or hardware alerts—approaches certainty during large collective operations. Standard AllReduce algorithms assume that if a node is "alive," its data is correct. In the Machine Learning Fleet, we must transition to a **Byzantine Fault Tolerance** mindset: "Trust, but verify."
```{python}
#| label: sdc-prob-calc
#| echo: false
from mlsysim.fmt import check
from mlsysim.core.constants import P_SDC_PER_GPU_HR
class SDCCollective:
    """The risk of silent corruption in global collectives."""
    # ┌── 1. LOAD ──────────────────────────────────────────
    n_gpus = 100000
    p_sdc_per_gpu_hr = P_SDC_PER_GPU_HR  # 1 in a million chance per hour
    training_step_s = 2
    # ┌── 2. EXECUTE ───────────────────────────────────────
    p_step_sdc = 1 - (1 - p_sdc_per_gpu_hr)**(n_gpus * (training_step_s / 3600))
    # ┌── 3. GUARD ─────────────────────────────────────────
    check(1e-5 < p_step_sdc < 1e-3, f"Prob {p_step_sdc:.6f} unexpected")
    # ┌── 4. OUTPUT ────────────────────────────────────────
    prob_str = f"{p_step_sdc*100:.4f}%"

    @classmethod
    def plot(cls):
        """Visualizes SDC probability risk."""
        from mlsysim import viz
        return viz.bar_compare(
            labels=["No Fault", "SDC Probability"],
            values=[100 - (cls.p_step_sdc*100), cls.p_step_sdc*100],
            title="Silent Data Corruption Risk per Step",
            ylabel="Probability (%)"
        )
```
::: {.callout-notebook title="Napkin Math: The SDC Certainty"}
**Problem**: Calculate the probability that at least one GPU in a 100,000-node cluster experiences a silent ALU error during a single 2-second training step.
1. **Fleet Size**: 100,000 accelerators.
2. **Individual Risk**: $10^{-6}$ per hour (a conservative estimate for SDC).
3. **The Exposure**: In a 2-second window, the fleet has $100,000 \times (2/3600) \approx 55$ "GPU-hours" of exposure.
4. **The Probability**: $P(\text{at least one SDC}) \approx$ **`{python} SDCCollective.prob_str`**.
**The Systems Insight**: At roughly $5 \times 10^{-5}$ per step, a 100k-GPU fleet expects a silent error about every 18,000 steps, or once every 10 hours of training, so a multi-week run absorbs dozens of them. If your AllReduce does not implement **Checksummed Collectives** or **Hash-and-Verify** gradients, your model parameters will silently drift into "Numerical Garbage" long before the run completes. Robustness moves from being a "Restart" problem to a **Verification** problem: the fleet must perform redundant reductions or use parity-protected gradients to catch the silent killer of scale.
:::
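One inexpensive defense is to hash each gradient shard before it is staged into the collective and verify the digest afterward, so corruption is caught at the step where it occurs rather than weeks later in the loss curve. A hedged sketch of the idea, not any particular framework's API:

```python
import hashlib
import numpy as np

def gradient_digest(grad: np.ndarray) -> str:
    """Content hash of a gradient tensor (bytes taken in a canonical layout)."""
    return hashlib.sha256(np.ascontiguousarray(grad).tobytes()).hexdigest()

grad = np.random.default_rng(0).standard_normal((1024, 1024)).astype(np.float32)
reference = gradient_digest(grad)

# The shard is copied into the AllReduce send buffer; re-hash the staged copy
# and compare before launching the collective.
staged = grad.copy()
staged[123, 456] = np.float32(1e30)        # simulate a silent bit flip / ALU error
if gradient_digest(staged) != reference:
    print("Digest mismatch: corrupted shard quarantined before the collective")
```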
## Fault Injection Tools and Frameworks {#sec-ft-fault-injection-tools-frameworks}
Proving that a multi-node training cluster can survive a sudden network partition requires more than hope. Engineers do not wait for a real hurricane to take out a data center; they deliberately sever the network connection in a staging environment and observe what the orchestration layer does. **Fault injection**\index{Fault Injection} is the engineering discipline of deliberately breaking a system — flipping memory bits, dropping network packets, and corrupting API responses — to empirically verify that robustness mechanisms work before real chaos arrives.
@@ -2405,6 +2187,10 @@ Centralized checkpointing works acceptably for small-scale distributed training
In distributed checkpointing, each worker writes its portion of the checkpoint to a shared filesystem or object storage. A coordinator signals when to checkpoint and confirms completion, but state flows directly from workers to storage without aggregation.
::: {#fig-distributed-checkpoint-architecture fig-env="figure" fig-pos="htb" fig-cap="**Distributed Checkpoint Architecture**. Comparison of centralized vs. distributed patterns. (Top) Centralized aggregation creates bottlenecks at the coordinator. (Bottom) Distributed sharding enables every worker to write directly to the Parallel File System (PFS) in parallel, aggregating bandwidth across the storage fabric and minimizing the training pause." fig-alt="Two-panel comparison. Top: Many workers sending data to a single coordinator, which writes to storage. Bottom: Many workers each writing directly to a shared storage cloud in parallel."}
![](images/svg/distributed-checkpoint-architecture.svg){width=100%}
:::
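The essence of the distributed pattern is that each rank writes only its own shard to a path derived from the checkpoint ID, so aggregate write bandwidth scales with the number of workers. A framework-agnostic sketch; the directory layout and naming are illustrative:

```python
import os
import pickle
import tempfile

def write_shard(ckpt_dir: str, ckpt_id: int, rank: int, shard: dict) -> str:
    """Each worker persists its own shard; no coordinator aggregation."""
    path = os.path.join(ckpt_dir, f"ckpt_{ckpt_id:06d}", f"rank_{rank:05d}.pkl")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(shard, f)
    return path

# In production ckpt_dir is the parallel filesystem mount; a temp dir stands in here:
print(write_shard(tempfile.mkdtemp(), ckpt_id=4200, rank=17,
                  shard={"step": 4200, "params": b"...", "optim": b"..."}))
```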
The coordination protocol proceeds in six steps:
1. Coordinator broadcasts checkpoint request with checkpoint ID
@@ -3160,6 +2946,10 @@ ML systems face additional debugging challenges. Silent accuracy degradation pro
Effective distributed debugging requires three observability pillars: metrics, logs, and traces.
::: {#fig-observability-three-pillars fig-env="figure" fig-pos="htb" fig-cap="**The Three Pillars of Observability**. Diagnosis of fleet failures requires correlating three signal types. **Metrics** reveal *when* and *where* a problem occurs (e.g., GPU utilization drops). **Logs** provide the *what* through event context (e.g., Out-of-Memory exception). **Traces** provide the *why* by linking events across the distributed lifecycle (e.g., identifying the specific request that triggered the memory spike)." fig-alt="Venn diagram with three overlapping circles: Metrics, Logs, and Traces."}
![](images/svg/observability-three-pillars.svg){width=100%}
:::
#### Metrics {#sec-fault-tolerance-reliability-reliability-metrics-a1e0}
Metrics are numerical measurements collected over time.
@@ -3303,6 +3093,10 @@ The observability pillars above enable detection of recurring failure patterns t
#### Training Failures {#sec-fault-tolerance-reliability-reliability-training-failures-bc3a}
::: {#fig-training-failure-signatures fig-env="figure" fig-pos="htb" fig-cap="**Training Failure Signatures**. Common patterns of model loss over time that indicate underlying system or algorithmic failures. (A) Loss Spike: transient numerical instability. (B) Divergence: permanent desynchronization or corruption. (C) Hang: deadlock or crashed worker. Identifying these signatures automatically is essential for rapid recovery in autonomous training pipelines." fig-alt="Plot showing three loss curves. One with a sharp spike and recovery, one drifting upward (divergence), and one flat line starting from a point (hang)."}
![](images/svg/training-failure-signatures.svg){width=100%}
:::
A loss spike followed by recovery typically indicates a transient data issue or numerical instability. The spike is usually self-correcting but should be investigated to rule out systematic causes.
A loss spike followed by a plateau signals a more serious problem: the learning rate may be too high, a checkpoint may be corrupted, or a data bug may have been introduced. This pattern requires investigation and potentially rollback to an earlier checkpoint.
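These signatures can be detected mechanically. Below is a rough sketch of a classifier over a trailing loss window; the thresholds and window size are illustrative assumptions that would be tuned per workload.

```python
import statistics

def classify_loss_window(losses: list[float], spike_sigma: float = 6.0) -> str:
    """Crude signature detection over a trailing window of loss values."""
    if len(losses) < 10:
        return "insufficient-data"
    head, tail = losses[:-3], losses[-3:]
    mu = statistics.mean(head)
    sigma = statistics.pstdev(head) or 1e-9
    if all(abs(x - tail[0]) < 1e-12 for x in tail):
        return "hang"            # loss not changing: stalled pipeline or frozen data
    if max(tail) > mu + spike_sigma * sigma:
        # Spike: transient if the latest value has recovered, divergence otherwise.
        return "spike-recovered" if tail[-1] < mu + spike_sigma * sigma else "divergence"
    return "healthy"

history = [2.0, 1.9, 1.8, 1.7, 1.65, 1.6, 1.58, 1.55, 1.53, 9.0, 9.5]
print(classify_loss_window(history))     # -> "divergence"
```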

@@ -64,6 +64,7 @@ Optimizations that shave milliseconds from inference latency or percentage point
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import *
from mlsysim.fmt import fmt, sci, check
from mlsysim.core.constants import LOGIC_WALL_REASONING_STEPS_EXAMPLE
```
@@ -1079,6 +1080,10 @@ Analysis
2. **Batch size 8 achieves stability**: At `{python} GPT3ServingResults.b8_util_pct`% utilization, the system is stable but queuing delays contribute significantly to latency (`{python} GPT3ServingResults.b8_wait_ms` ms); the queueing sketch after this list shows why this delay grows so sharply with utilization.
::: {#fig-queuing-hockey-stick fig-env="figure" fig-pos="htb" fig-cap="**The Queuing Hockey Stick**. Relationship between system utilization and queuing delay. As utilization approaches 100%, the queuing delay increases exponentially according to Kingman's formula. Production systems typically target 70--80% utilization to maintain a 'safe' region where traffic spikes do not cause unbounded latency growth." fig-alt="Plot of Latency vs Utilization. The curve is relatively flat until 70 percent utilization, then rises sharply (hockey stick shape) toward infinity at 100 percent."}
![](images/svg/queuing-hockey-stick.svg){width=100%}
:::
3. **Batch size 32 minimizes total latency**: Despite longer service time (`{python} GPT3ServingResults.b32_service_ms` ms vs `{python} GPT3ServingResults.b8_service_ms` ms), the dramatic reduction in queue wait time (`{python} GPT3ServingResults.b32_wait_ms` ms vs `{python} GPT3ServingResults.b8_wait_ms` ms) yields lower total latency.
4. **Diminishing returns beyond B=32**: Further batch size increases would reduce utilization but memory constraints prevent exploration.
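The hockey stick behind these numbers follows from Kingman's approximation for the mean wait in a single-server queue, $W_q \approx \frac{\rho}{1-\rho} \cdot \frac{c_a^2 + c_s^2}{2} \cdot \tau_s$. A small sketch with an illustrative service time (not the `GPT3ServingResults` values) shows the explosion as utilization approaches 1.

```python
def kingman_wait_ms(utilization: float, service_ms: float,
                    cv_arrival: float = 1.0, cv_service: float = 1.0) -> float:
    """Kingman's G/G/1 approximation for the mean time spent waiting in queue."""
    rho = utilization
    return (rho / (1.0 - rho)) * ((cv_arrival**2 + cv_service**2) / 2.0) * service_ms

service_ms = 50.0   # illustrative batch service time, not a measured value
for util in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    wait = kingman_wait_ms(util, service_ms)
    print(f"utilization {util:4.0%} -> queue wait {wait:8.1f} ms")
```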
@@ -1309,6 +1314,53 @@ The 3$\times$ throughput improvement from continuous batching comes from elimina
:::
## The Logic Wall: Test-Time Compute Scaling {#sec-inference-logic-wall}
As models transition from "Fast Thinking" (instant pattern matching) to "Slow Thinking" (deliberative reasoning), the bottleneck shifts from **HBM Bandwidth** to **Test-Time Compute**. This is the **Logic Wall**: the realization that for complex problems, the fleet must scale compute *per request* proportional to the difficulty of the task, often through search or "Chain-of-Thought" (CoT) unrolling.
```{python}
#| label: logic-scaling-calc
#| echo: false
from mlsysim.fmt import check
class LogicScaling:
"""The latency tax of test-time reasoning."""
# ┌── 1. LOAD ──────────────────────────────────────────
t_std_ms = 100 # standard 1-token decode
reasoning_steps = LOGIC_WALL_REASONING_STEPS_EXAMPLE # tokens of 'thought' before the answer
# ┌── 2. EXECUTE ───────────────────────────────────────
t_reasoning_s = (t_std_ms * reasoning_steps) / 1000
latency_expansion = reasoning_steps
# ┌── 3. GUARD ─────────────────────────────────────────
check(latency_expansion == 128, f"Exp {latency_expansion} unexpected")
# ┌── 4. OUTPUT ────────────────────────────────────────
steps_str = f"{reasoning_steps}"
time_s_str = f"{t_reasoning_s:.1f}"
@classmethod
def plot(cls):
"""Visualizes the Logic Wall latency expansion."""
from mlsysim import viz
return viz.bar_compare(
labels=["Fast (Pattern)", "Slow (Reasoning)"],
values=[0.1, cls.t_reasoning_s], # 100ms vs multi-second
title="Inference Scaling: Reasoning Depth",
ylabel="Latency (Seconds)"
)
```
::: {.callout-notebook title="Napkin Math: Scaling Reasoning Depth"}
**Problem**: Calculate the latency impact of a model that uses `{python} LogicScaling.steps_str` "Thinking Tokens" to solve a complex math proof versus a standard answer.
1. **Standard Response**: 1 token answer = 100 ms.
2. **Reasoning Response**: `{python} LogicScaling.steps_str` tokens of internal search/CoT before the answer.
3. **The Latency**: `{python} LogicScaling.steps_str` $\times$ 100 ms = **`{python} LogicScaling.time_s_str` seconds**.
**The Systems Insight**: Test-time scaling transforms the serving architecture from a **Throughput Factory** to a **Search Engine**. While standard serving optimizes for tokens per second, reasoning-heavy models are constrained by **Steps per Second**. This creates a new "Reasoning SLO": users will wait 12 seconds for a correct proof, but not for a simple greeting. In the Machine Learning Fleet, we are moving toward **Dynamic Compute Allocation**, where the scheduler grants more "Thinking Time" to harder prompts.
:::
### Quantitative Analysis: Traditional vs Continuous Batching {#sec-inference-scale-quantitative-analysis-traditional-vs-continuous-batching-ca70}
The performance gap between traditional and continuous batching becomes quantifiable through careful analysis of wasted compute cycles. The mathematics of batching waste reveals exactly how much throughput is lost and under what conditions continuous batching delivers the greatest improvement.
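The core waste arithmetic fits in a few lines: in a static batch, every slot is reserved until the longest request finishes, so the wasted fraction is one minus the mean-to-max ratio of generation lengths. The request lengths below are the same four used in the batching figure above.

```python
def static_batch_waste(token_counts: list[int]) -> float:
    """Fraction of slot-time wasted when the batch waits for its longest member."""
    longest = max(token_counts)
    used = sum(token_counts)
    reserved = longest * len(token_counts)   # every slot is held until the longest finishes
    return 1.0 - used / reserved

lengths = [50, 200, 100, 150]                # tokens generated per request, as in the figure
print(f"Wasted GPU slot-time: {static_batch_waste(lengths):.1%}")   # -> 37.5%
```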
@@ -1475,115 +1527,8 @@ Average latency comparison
The result is a **`{python} BatchingWorkedExample.latency_red_str`% reduction in average latency**, exactly matching the waste ratio.
::: {.callout-note title="Traditional vs. Continuous Batching"}
![](images/svg/continuous-batching.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.65]
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D1E6F3}
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{OrangeLine}{HTML}{E67817}
\definecolor{OrangeL}{HTML}{FCE4CC}
\definecolor{RedLine}{HTML}{CB202D}
\definecolor{RedL}{HTML}{F5D2D5}
\definecolor{VioletLine}{HTML}{7E317B}
\definecolor{VioletL}{HTML}{E6D4E5}
\definecolor{BrownLine}{HTML}{8B4513}
\definecolor{BrownL}{HTML}{F5DEB3}
\tikzset{
Box/.style={draw=black!70, line width=0.75pt, anchor=south west, minimum height=0.5cm, inner sep=0pt}
}
% Traditional Batching (top)
\node[anchor=west, font=\bfseries] at (0, 6.5) {Traditional Batching};
% Time axis
\draw[->, thick] (0, 0.5) -- (12, 0.5);
\node[anchor=west] at (12, 0.5) {Time};
\foreach \x/\label in {0/0, 2.5/50, 5/100, 7.5/150, 10/200} {
\draw (\x, 0.4) -- (\x, 0.6);
\node[below, font=\scriptsize] at (\x, 0.4) {\label};
}
% R1 row
\node[Box, fill=GreenL, minimum width=2.5cm] at (0, 5.5) {};
\node[Box, fill=black!10, minimum width=7.5cm] at (2.5, 5.5) {};
\node[anchor=west, font=\scriptsize] at (0.1, 5.75) {R1: 50 tokens};
\node[font=\scriptsize, text=black!60] at (6.25, 5.75) {Idle (waiting)};
% R2 row
\node[Box, fill=BlueL, minimum width=10cm] at (0, 4.5) {};
\node[anchor=west, font=\scriptsize, text=black] at (0.1, 4.75) {R2: 200 tokens};
% R3 row
\node[Box, fill=OrangeL, minimum width=5cm] at (0, 3.5) {};
\node[Box, fill=black!10, minimum width=5cm] at (5, 3.5) {};
\node[anchor=west, font=\scriptsize] at (0.1, 3.75) {R3: 100 tokens};
\node[font=\scriptsize, text=black!60] at (7.5, 3.75) {Idle};
% R4 row
\node[Box, fill=VioletL, minimum width=7.5cm] at (0, 2.5) {};
\node[Box, fill=black!10, minimum width=2.5cm] at (7.5, 2.5) {};
\node[anchor=west, font=\scriptsize, text=black] at (0.1, 2.75) {R4: 150 tokens};
\node[font=\scriptsize, text=black!60] at (8.75, 2.75) {Idle};
% Waste annotation
\draw[<->, RedLine, thick] (10.3, 2.5) -- (10.3, 6);
\node[anchor=west, font=\scriptsize, text=RedLine] at (10.5, 4.25) {37.5\% waste};
% Continuous Batching (bottom, shifted down)
\node[anchor=west, font=\bfseries] at (0, -0.5) {Continuous Batching};
% Time axis for continuous
\draw[->, thick] (0, -6.5) -- (12, -6.5);
\node[anchor=west] at (12, -6.5) {Time};
\foreach \x/\label in {0/0, 2.5/50, 5/100, 7.5/150, 10/200} {
\draw (\x, -6.6) -- (\x, -6.4);
\node[below, font=\scriptsize] at (\x, -6.6) {\label};
}
% R1 row (completes early, freed)
\node[Box, fill=GreenL, minimum width=2.5cm] at (0, -1.5) {};
\draw[thick, GreenLine] (2.5, -1.25) circle (0.15);
\node[font=\tiny, GreenLine] at (2.5, -1.25) {\checkmark};
\node[anchor=west, font=\scriptsize] at (0.1, -1.25) {R1};
% R5 joins in R1's slot
\node[Box, fill=RedL, minimum width=4.5cm] at (2.5, -1.5) {};
\node[anchor=west, font=\scriptsize] at (2.6, -1.25) {R5 (new)};
% R2 row (longest, no change)
\node[Box, fill=BlueL, minimum width=10cm] at (0, -2.5) {};
\node[anchor=west, font=\scriptsize, text=black] at (0.1, -2.25) {R2: 200 tokens};
% R3 row (completes at 100)
\node[Box, fill=OrangeL, minimum width=5cm] at (0, -3.5) {};
\draw[thick, GreenLine] (5, -3.25) circle (0.15);
\node[font=\tiny, GreenLine] at (5, -3.25) {\checkmark};
\node[anchor=west, font=\scriptsize] at (0.1, -3.25) {R3};
% R6 joins in R3's slot
\node[Box, fill=RedL, minimum width=4cm] at (5, -3.5) {};
\node[anchor=west, font=\scriptsize] at (5.1, -3.25) {R6 (new)};
% R4 row (completes at 150)
\node[Box, fill=VioletL, minimum width=7.5cm] at (0, -4.5) {};
\draw[thick, GreenLine] (7.5, -4.25) circle (0.15);
\node[font=\tiny, GreenLine] at (7.5, -4.25) {\checkmark};
\node[anchor=west, font=\scriptsize, text=black] at (0.1, -4.25) {R4};
% R7 joins in R4's slot
\node[Box, fill=RedL, minimum width=2.5cm] at (7.5, -4.5) {};
\node[anchor=west, font=\scriptsize] at (7.6, -4.25) {R7};
% Benefit annotation
\draw[<->, GreenLine, thick] (10.3, -4.5) -- (10.3, -1);
\node[anchor=west, font=\scriptsize, text=GreenLine] at (10.5, -2.75) {0\% waste};
\node[anchor=west, font=\scriptsize, text=GreenLine] at (10.5, -3.25) {+3 requests};
\end{tikzpicture}
```
**Traditional vs Continuous Batching**. Top: Traditional batching wastes 37.5% of GPU cycles as completed requests (R1, R3, R4) wait idle for the longest request (R2). Bottom: Continuous batching immediately frees slots upon completion, allowing new requests (R5, R6, R7) to join. This eliminates waste and increases effective throughput.
:::
@@ -1723,6 +1668,10 @@ After implementing these changes, P99 dropped to 185 ms. The Infrastructure Laye
### Feature-Parallel Batching for Recommendation Systems {#sec-inference-scale-featureparallel-batching-recommendation-systems-2a3f}
::: {#fig-feature-parallel-pipeline fig-env="figure" fig-pos="htb" fig-cap="**Feature-Parallel Batching Pipeline**. Requests are batched by feature type (Users, Items, Context) and dispatched to specialized embedding servers in parallel. The retrieved embeddings are then concatenated and processed by the dense ranking head. This architecture enables scaling to trillions of parameters by decoupling embedding storage from dense compute." fig-alt="Flow diagram showing 3 requests entering a batching layer. Batch is split into User, Item, and Context parallel paths. Each path hits a set of sharded embedding servers. Outputs converge into a dense MLP ranking head."}
![](images/svg/feature-parallel-pipeline.svg){width=100%}
:::
Recommendation systems have distinct batching requirements from vision or language models. The computation pattern involves:
1. **Sparse feature lookup**: Retrieve embeddings for user, item, and context features
@@ -1908,7 +1857,31 @@ Verify your understanding of different batching mechanics:
:::
Before selecting a batching strategy, it is essential to understand where latency accumulates across the full request lifecycle. @fig-inference-request-flow maps each stage from client to response, revealing the "serving tax" that serialization, routing, and coordination impose outside of GPU compute.
```{mermaid}
%%| label: fig-inference-request-flow
%%| fig-cap: "**Inference Request Lifecycle**. A sequence diagram showing how a user request traverses the serving stack. The 'Serving Tax' is the time spent in the Router, Queue, and Scheduler before the GPU begins mathematical execution."
sequenceDiagram
participant User
participant Router
participant KV_Cache as KV Cache Manager
participant Scheduler as Iteration Scheduler
participant GPU
User->>Router: POST /v1/chat/completions
Router->>KV_Cache: Check capacity / Reserve slots
KV_Cache-->>Router: Slot handle
Router->>Scheduler: Add to Wait Queue
loop Every Iteration
Scheduler->>GPU: Execute Batch (Prefill/Decode)
GPU-->>Scheduler: Logits / Next Tokens
Scheduler->>KV_Cache: Update block status
end
Scheduler->>User: Stream Result
```
::: {#fig-inference-lifecycle fig-env="figure" fig-pos="htb" fig-cap="**End-to-End Inference Pipeline**. A high-level view of the request lifecycle: Client -> Load Balancer -> Request Queue -> Batch Scheduler -> Model Execution -> Response. This visualization highlights the critical \"Serving Tax\" components (serialization, routing, coordination) that consume latency budget outside of the actual GPU compute time." fig-alt="Flowchart with 5 stages: Client, Load Balancer, Request Queue, Dynamic Batcher, Model Execution. Arrows show request flow with dashed return path. Red annotations below highlight latency sources at each stage."}
@@ -2116,65 +2089,7 @@ PagedAttention [@kwon2023vllm], introduced in vLLM, applies virtual memory conce
@fig-paged-attention illustrates the key concepts including page tables that map logical sequence positions to physical memory pages, block size that defines the number of tokens per page (typically 16 tokens), and physical blocks that provide fixed-size memory allocations assignable to any sequence.
::: {#fig-paged-attention fig-env="figure" fig-pos="htb" fig-cap="**PagedAttention Memory Mapping**. Decoupling the logical view of a sequence's KV cache (contiguous pages) from its physical storage (non-contiguous 16-token blocks). A block table maps logical pages to physical blocks, allowing the system to fill fragmentation gaps with small blocks from any sequence." fig-alt="Three-part diagram: left shows 4 contiguous logical pages (P0-P3), center shows block table mapping logical to physical addresses, right shows 12 scattered physical blocks in HBM with 4 highlighted as mapped."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.9]
\definecolor{BlockColor}{HTML}{D1E6F3}
\definecolor{MappedColor}{HTML}{FCE4CC}
\tikzset{
block/.style={draw=black!70, thick, minimum width=0.8cm, minimum height=0.6cm, font=\tiny, inner sep=0pt},
arrow/.style={->, >=stealth, thick, black!60}
}
% Logical View
\node[anchor=west] (logical_title) at (0, 3.5) {\textbf{Logical KV Cache} (Contiguous)};
\node[block, fill=BlockColor, below=0.2cm of logical_title.south west, anchor=north west] (L0) {P0};
\node[block, fill=BlockColor, right=0cm of L0] (L1) {P1};
\node[block, fill=BlockColor, right=0cm of L1] (L2) {P2};
\node[block, fill=BlockColor, right=0cm of L2] (L3) {P3};
% Page Table
\node[draw, dashed, inner sep=4pt, right=1.5cm of L3, yshift=-0.5cm] (table) {
\begin{tabular}{c|c}
\tiny Logic & \tiny Phys \\
\hline
\tiny P0 & \tiny B7 \\
\tiny P1 & \tiny B2 \\
\tiny P2 & \tiny B9 \\
\tiny P3 & \tiny B4
\end{tabular}
};
\node[above, font=\scriptsize] at (table.north) {Block Table};
% Physical View
\node[anchor=west, right=2cm of table.east, yshift=1.5cm] (phys_title) {\textbf{Physical HBM} (Scattered)};
\node[block, fill=white, below=0.2cm of phys_title.south west, anchor=north west] (physB0) {B0};
\node[block, fill=white, right=0cm of physB0] (physB1) {B1};
\node[block, fill=MappedColor, right=0cm of physB1] (physB2) {B2};
\node[block, fill=white, right=0cm of physB2] (physB3) {B3};
\node[block, fill=MappedColor, below=0cm of physB0] (physB4) {B4};
\node[block, fill=white, right=0cm of physB4] (physB5) {B5};
\node[block, fill=white, right=0cm of physB5] (physB6) {B6};
\node[block, fill=MappedColor, right=0cm of physB6] (physB7) {B7};
\node[block, fill=white, below=0cm of physB4] (physB8) {B8};
\node[block, fill=MappedColor, right=0cm of physB8] (physB9) {B9};
\node[block, fill=white, right=0cm of physB9] (physB10) {B10};
\node[block, fill=white, right=0cm of physB10] (physB11) {B11};
% Mapping arrows
\draw[arrow] (L0) to[out=-90, in=180] (table);
\draw[arrow] (table) to[out=0, in=180] (physB7.west);
\draw[arrow] (table) to[out=0, in=180] (physB2.west);
\draw[arrow] (table) to[out=0, in=180] (physB9.west);
\draw[arrow] (table) to[out=0, in=180] (physB4.west);
\end{tikzpicture}
```
![](images/svg/kv-cache-fragmentation.svg){width=100%}
:::
PagedAttention provides several benefits. It eliminates internal fragmentation by allocating only the pages needed for actual tokens. It eliminates external fragmentation because any free page can be used by any sequence. It enables dynamic growth so sequences can grow without pre-allocation. It supports memory sharing so common prefixes can share physical pages.
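A toy sketch of the bookkeeping PagedAttention relies on: a free list of fixed-size physical blocks plus a per-sequence block table that maps logical pages onto whichever physical blocks are available. Block and pool sizes here are illustrative, and the real vLLM allocator additionally reference-counts blocks to support prefix sharing.

```python
class ToyBlockAllocator:
    """Minimal paged KV-cache allocator: logical pages -> scattered physical blocks."""

    def __init__(self, num_physical_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(num_physical_blocks))   # physical block IDs
        self.block_tables: dict[str, list[int]] = {}   # seq_id -> physical blocks

    def append_tokens(self, seq_id: str, total_tokens: int) -> list[int]:
        """Grow a sequence to total_tokens, grabbing any free block as needed."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) * self.block_tokens < total_tokens:
            table.append(self.free.pop())              # any free block will do
        return table

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))

alloc = ToyBlockAllocator(num_physical_blocks=12)
print(alloc.append_tokens("seq-A", 40))   # 40 tokens -> 3 blocks of 16 tokens
alloc.release("seq-A")                    # blocks become reusable by any sequence
```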
@@ -2803,6 +2718,10 @@ async def swap_to_gpu(sequence_id):
### Disaggregated Serving: Splitting the Workload {#sec-inference-scale-disaggregated-serving-splitting-workload-b8f9}
::: {#fig-disaggregated-serving fig-env="figure" fig-pos="htb" fig-cap="**Disaggregated Serving Architecture**. Separating the Compute-Bound **Prefill Phase** from the Bandwidth-Bound **Decode Phase**. Requests enter the Prefill Pool for prompt processing, and the resulting KV cache is migrated to the Decode Pool for token generation. This allows each phase to run on hardware optimized for its specific bottleneck, improving overall fleet efficiency." fig-alt="System diagram showing Request entering a Prefill Pool of high-compute GPUs. Red dashed arrow labeled KV Cache Transfer points to a Decode Pool of high-memory-bandwidth GPUs. Response exits from the Decode Pool."}
![](images/svg/disaggregated-serving.svg){width=100%}
:::
::: {.callout-definition title="Prefill and Decode Phases"}
***Prefill and Decode Phases***\index{Prefill and Decode!definition} are the two distinct computational regimes of transformer-based LLM inference.
@@ -3121,72 +3040,7 @@ Pipeline parallelism distributes layers across devices sequentially, with each d
For inference, pipeline parallelism creates bubbles differently than in training. @fig-pipeline-bubbles contrasts single-request latency (bubble-dominated) with pipelined throughput (bubble-amortized):
::: {#fig-pipeline-bubbles fig-env="figure" fig-pos="htb" fig-cap="**Pipeline Parallelism Bubbles**. For a single inference request, pipeline parallelism offers no latency benefit as the request must traverse all stages sequentially (top). However, when processing multiple concurrent requests, pipeline bubble utilization improves significantly (bottom), allowing throughput to scale with the number of stages. This makes pipeline parallelism ideal for high-throughput batch processing but less suitable for latency-critical interactive serving." fig-alt="Two timeline diagrams. Top: Single request across 3 GPUs showing sequential processing with idle bubbles. Bottom: Pipelined batch with 4 requests overlapping across stages, minimizing idle time."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{Stage1}{HTML}{D1E6F3}
\definecolor{Stage2}{HTML}{D4EFDF}
\definecolor{Stage3}{HTML}{FCE4CC}
\definecolor{Bubble}{HTML}{F5F5F5}
\tikzset{
block/.style={draw=black!50, minimum height=0.6cm, minimum width=1.4cm, font=\scriptsize},
label/.style={font=\footnotesize\bfseries}
}
% Scenario A: Single Request (Sequential)
\node[label, anchor=west] at (0, 3.5) {A. Single Request (Latency Bound)};
% Dev 0
\node[anchor=east] at (0, 2.8) {GPU 0};
\node[block, fill=Stage1] at (0.8, 2.8) {Req1};
\node[block, fill=Bubble, minimum width=2.8cm] at (3.0, 2.8) {Idle};
% Dev 1
\node[anchor=east] at (0, 2.0) {GPU 1};
\node[block, fill=Bubble, minimum width=1.4cm] at (0.8, 2.0) {Idle};
\node[block, fill=Stage2] at (2.3, 2.0) {Req1};
\node[block, fill=Bubble, minimum width=1.4cm] at (3.7, 2.0) {Idle};
% Dev 2
\node[anchor=east] at (0, 1.2) {GPU 2};
\node[block, fill=Bubble, minimum width=2.8cm] at (1.5, 1.2) {Idle};
\node[block, fill=Stage3] at (3.7, 1.2) {Req1};
% Total Time Arrow
\draw[<->] (0.1, 0.8) -- (4.4, 0.8) node[midway, below, font=\scriptsize] {Total Latency = Sum of Stages};
% Scenario B: Pipelined (Throughput)
\node[label, anchor=west] at (0, -0.5) {B. Pipelined Batch (Throughput)};
% Shift x origin for alignment
\begin{scope}[shift={(0, -3)}]
% Dev 0
\node[anchor=east] at (0, 2.0) {GPU 0};
\node[block, fill=Stage1] at (0.8, 2.0) {Req1};
\node[block, fill=Stage1] at (2.3, 2.0) {Req2};
\node[block, fill=Stage1] at (3.8, 2.0) {Req3};
% Dev 1
\node[anchor=east] at (0, 1.2) {GPU 1};
\node[block, fill=Bubble] at (0.8, 1.2) {Idle};
\node[block, fill=Stage2] at (2.3, 1.2) {Req1};
\node[block, fill=Stage2] at (3.8, 1.2) {Req2};
\node[block, fill=Stage2] at (5.3, 1.2) {Req3};
% Dev 2
\node[anchor=east] at (0, 0.4) {GPU 2};
\node[block, fill=Bubble, minimum width=2.8cm] at (1.5, 0.4) {Idle};
\node[block, fill=Stage3] at (3.8, 0.4) {Req1};
\node[block, fill=Stage3] at (5.3, 0.4) {Req2};
% Note
\node[anchor=west, font=\scriptsize, align=left] at (6.5, 1.2) {High Throughput\\after fill};
\end{scope}
\end{tikzpicture}
```
![](images/svg/serving-pipeline.svg){width=100%}
:::
For a single request, pipeline parallelism provides no latency benefit: the request must traverse all stages sequentially. The pipeline fill time equals the sequential execution time.
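The fill-and-drain penalty can be quantified. Assuming uniform per-stage times, a pipeline with $S$ stages and $B$ requests in flight is idle for a fraction $(S-1)/(B+S-1)$ of its stage-slots; the sketch below evaluates that ratio for the three-stage example.

```python
def pipeline_bubble_fraction(num_stages: int, num_inflight: int) -> float:
    """Idle fraction of a pipeline with uniform stage times (fill/drain bubbles)."""
    return (num_stages - 1) / (num_inflight + num_stages - 1)

for batch in (1, 4, 16, 64):
    frac = pipeline_bubble_fraction(num_stages=3, num_inflight=batch)
    print(f"3 stages, {batch:3d} requests in flight -> {frac:5.1%} bubble")
```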

@@ -1617,7 +1617,13 @@ To synchronize 175B parameters (FP16), a Ring All-Reduce must move approximately
## Foundational Concepts {#sec-vol2-introduction-foundational-concepts}
The preceding sections established a concrete engineering problem: training a 175-billion-parameter model across thousands of accelerators requires balancing computation, communication, and coordination while tolerating routine hardware failure and satisfying regulatory obligations. No single framework captures all of these concerns simultaneously.
::: {#fig-vol2-c3-taxonomy fig-env="figure" fig-pos="htb" fig-cap="**The C$^3$ Taxonomy**. A diagnostic framework for decomposing the wall-clock time of a distributed training step into its constituent components: **Computation** (forward/backward passes), **Communication** (gradient/activation exchange), and **Coordination** (synchronization barriers and collective overhead). Optimizing a distributed system requires identifying which 'C' dominates the step time." fig-alt="Venn diagram or stacked bar chart showing Computation, Communication, and Coordination as the three components of training time."}
![](images/svg/_c3-taxonomy.svg){width=100%}
:::
The C$^3$ Taxonomy diagnoses *where* time is lost in a distributed training step. Scaling laws predict *how much* computation a target capability level demands. Governance constraints define *what* the fleet must never do. Reasoning about these interconnections requires organizing frameworks at different levels of analysis: the **AI Triad at Scale** reveals component interdependencies, the **Five-Pillar Framework** organizes the engineering discipline itself, and the **Fleet Stack** guides layered architectural decisions from silicon to society.
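A napkin version of the C$^3$ diagnosis: given rough estimates of the three components of one step, report each share and see which dominates. The profile numbers below are placeholders for a hypothetical run, not measurements.

```python
def c3_breakdown(t_compute_s: float, t_comm_exposed_s: float, t_coord_s: float) -> dict:
    """Decompose step time into Computation, Communication, and Coordination shares."""
    t_step = t_compute_s + t_comm_exposed_s + t_coord_s
    return {
        "step_s": t_step,
        "computation": t_compute_s / t_step,
        "communication": t_comm_exposed_s / t_step,
        "coordination": t_coord_s / t_step,
    }

# Hypothetical profile: 1.4 s of math, 0.4 s of un-overlapped AllReduce, 0.2 s of barriers.
profile = c3_breakdown(1.4, 0.4, 0.2)
print({k: (f"{v:.0%}" if k != "step_s" else f"{v:.1f} s") for k, v in profile.items()})
```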
@fig-fleet-stack organizes the complexity of this book into **The Fleet Stack**\index{Fleet Stack!definition}, a four-layer framework where engineering decisions at the bottom constrain possibilities at the top.

@@ -92,7 +92,8 @@ from mlsysim.core.constants import (
H100_FLOPS_FP16_TENSOR, H100_TDP, H100_MEM_CAPACITY, H100_MEM_BW,
A100_FLOPS_FP16_TENSOR,
B200_FLOPS_FP16_TENSOR, B200_MEM_BW,
Gbps, GB, TB, second, watt, GB, TFLOPs, flop, byte, NS,
OPTICS_POWER_PLUGGABLE_400G_W, OPTICS_POWER_CPO_400G_W
)
from mlsysim.fmt import fmt, sci, check, md, md_math
@@ -106,7 +107,13 @@ class FECLatency:
Consider the running example that threads through this volume: a 175-billion-parameter language model partitioned across 1,000 GPUs. Each training step requires an AllReduce of 350 GB of gradient data, meaning every GPU must send and receive its share before the next step can begin. If even one link in the fabric is slow, all 999 other GPUs wait. The network is not auxiliary infrastructure; it is the synchronization backbone that determines whether this cluster trains efficiently or wastes millions of dollars in idle compute.
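The cost of that 350 GB AllReduce can be bounded on a napkin. A bandwidth-optimal ring moves roughly $2(N-1)/N$ times the payload through every GPU's slowest link; the 50 GB/s effective inter-node rate below is an assumption for illustration, not a measured fabric speed.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Lower bound for a bandwidth-optimal ring AllReduce, set by the slowest link."""
    gb_per_gpu = 2.0 * (n_gpus - 1) / n_gpus * payload_gb
    return gb_per_gpu / link_gb_per_s

t = ring_allreduce_seconds(payload_gb=350.0, n_gpus=1000, link_gb_per_s=50.0)
print(f"AllReduce floor ~ {t:.1f} s per step at an effective 50 GB/s per link")
```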
In the **Fleet Stack** (@sec-vol2-introduction), network fabrics form the connective tissue binding the Infrastructure Layer into a coherent whole. @sec-compute-infrastructure established the building blocks: accelerators, power delivery, and cooling. Those components define what each node can compute in isolation.
::: {#fig-network-five-level-model fig-env="figure" fig-pos="htb" fig-cap="**The Five-Level Model of ML Networking**. Network design for the ML fleet spans from physical signaling to cluster-scale orchestration. (1) Wire: signaling and link budget. (2) Transport: RDMA and lossless protocols. (3) Topology: fat-trees and rail-optimization. (4) Behavior: congestion control and routing. (5) Cluster: virtualization and isolation. Each level must be co-designed to minimize the synchronization tax." fig-alt="Stacked diagram with five levels: Wire, Transport, Topology, Behavior, Cluster. Arrows show inter-level dependencies."}
![](images/svg/five-level-model.svg){width=100%}
:::
This chapter wires those nodes together, because at scale, communication cost dominates computation cost. The **Law of Distributed Efficiency** (@eq-iron-law-scale) makes this explicit: the $T_{\text{sync}} / T_{\text{compute}}$ ratio in the Scaling Factor is determined almost entirely by the network fabric. The fabric constrains every layer above it in the stack: @sec-distributed-training-systems cannot overlap communication with computation unless the fabric provides sufficient bandwidth, @sec-collective-communication cannot choose optimal algorithms without knowing the topology, and @sec-fault-tolerance-reliability must account for network partitions alongside node failures.
The physical network fabric exists to carry three fundamental collective communication patterns. An **AllReduce**[^fn-allreduce-forward] sums gradients across thousands of GPUs so that every device holds the identical average, forming the heartbeat of synchronous training. An **AllGather**[^fn-collectives-forward] collects different model portions so that every GPU can reconstruct the full model state. An AllToAll, the most demanding pattern, requires every GPU to send unique data to every other GPU, a requirement critical to **expert parallelism**[^fn-moe-forward]. @sec-collective-communication covers the *algorithms* that orchestrate these patterns; this chapter covers the *physics* of the wires and switches that carry them. The distinction matters because the fabric's physical properties (bandwidth, latency, and topology) determine which patterns are efficient and which become bottlenecks.
@@ -401,6 +408,10 @@ In an ML fleet, distance is money. A 10,000-GPU cluster requires ~20,000 optical
:::
::: {#fig-pam4-vs-nrz fig-env="figure" fig-pos="htb" fig-cap="**Signaling Evolution: NRZ vs. PAM4**. To increase bandwidth without doubling clock frequency, modern interconnects transitioned from NRZ (2 voltage levels, 1 bit per symbol) to PAM4 (4 voltage levels, 2 bits per symbol). The trade-off is reduced noise margin: the 'eye' of the signal is smaller, necessitating Forward Error Correction (FEC) and increasing the irreducible $\alpha$ latency of the fabric." fig-alt="Signal waveform comparison. NRZ shows two levels (0, 1). PAM4 shows four levels (00, 01, 10, 11). Shaded regions indicate the reduced noise margin in PAM4."}
![](images/svg/pam4-vs-nrz.svg){width=100%}
:::
### SerDes, Link Budget, and Power {#sec-network-fabrics-serdes}
\index{SerDes}
@@ -435,6 +446,10 @@ RDMA bypasses this entire layer. By offloading the transport logic to the NIC ha
[^fn-gpudirect-rdma]: **GPUDirect RDMA**: Introduced by NVIDIA in 2013 with the Kepler architecture, GPUDirect RDMA enables the NIC to read and write GPU memory directly over PCIe without staging through host RAM. Before GPUDirect, every gradient transfer incurred two extra memory copies (GPU-to-host, host-to-NIC), adding 10--20 $\mu$s per message and consuming CPU memory bandwidth that competes with data loading. Eliminating this bounce path is what makes overlapping communication with backward-pass computation feasible at scale. \index{GPUDirect RDMA!zero-copy}
::: {#fig-gpudirect-data-path fig-env="figure" fig-pos="htb" fig-cap="**GPUDirect RDMA Data Path**. Comparison of traditional vs. GPUDirect data paths. Traditional RDMA (top) requires data to be copied to host RAM before transfer to the NIC. GPUDirect RDMA (bottom) enables the NIC to access GPU memory directly via the PCIe bus, eliminating redundant copies and reducing latency for bulk gradient transfers." fig-alt="Two-panel diagram. Top: Traditional path with GPU to CPU RAM to NIC hops. Bottom: Direct path from GPU to NIC via PCIe."}
![](images/svg/gpudirect-data-path.svg){width=100%}
:::
### InfiniBand and RoCE {#sec-network-fabrics-ib-roce}
\index{InfiniBand}\index{RoCE}
@@ -815,6 +830,10 @@ To mitigate congestion at this edge, network architects maximize the switch **ra
2. **Distinction (Durable):** Unlike a standard tree (where bandwidth at the root is a single bottleneck shared by all leaves), a fat-tree replaces each root with multiple spine switches whose combined uplink capacity matches the total edge bandwidth, eliminating the bottleneck.
3. **Common Pitfall:** A frequent misconception is that a fat-tree's non-blocking guarantee comes for free. Fat-trees require $O(N \log N)$ switches and dense cabling: a 4,096-GPU non-blocking fat-tree needs roughly 2,048 switches, costing \$20--100 million in switching hardware alone.
::: {#fig-fat-tree-detail fig-env="figure" fig-pos="htb" fig-cap="**Non-blocking Fat-Tree Topology**. A three-tier Clos network built from radix-k switches. By ensuring that the number of uplinks at each level matches the number of downlinks, the topology provides full bisection bandwidth between any two pods. The multiple parallel paths between leaf and spine enable hardware-based adaptive routing to spray packets and avoid congestion." fig-alt="Hierarchical switch diagram with three tiers: Leaf, Spine, and Core. Multiple parallel paths connect switches across tiers, illustrating full bisection bandwidth."}
![](images/svg/fat-tree-detail.svg){width=100%}
:::
:::
The fat-tree[^fn-fat-tree-clos] is the industry standard for ML clusters because it strictly guarantees full bisection bandwidth, a non-negotiable requirement for the AllReduce collective, which demands simultaneous, all-to-all communication. The network is constructed in hierarchical tiers: **Leaf** switches (ToR) connect directly to servers, **Spine** switches interconnect all leaves within a locality domain known as a **pod**, and **Core** switches bind multiple pods together.
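For intuition about why the switch count grows so quickly, the textbook $k$-ary fat-tree formulas fit in a few lines: a three-tier non-blocking design built from radix-$k$ switches supports $k^3/4$ hosts and needs $5k^2/4$ switches. Real deployments deviate through oversubscription and rail optimization, so treat this purely as a sizing sketch.

```python
def fat_tree_sizing(radix_k: int) -> dict:
    """Hosts and switch count for a 3-tier non-blocking k-ary fat-tree (Clos)."""
    hosts = radix_k**3 // 4
    edge = agg = radix_k * radix_k // 2        # k pods, k/2 edge and k/2 agg switches each
    core = (radix_k // 2) ** 2
    return {"hosts": hosts, "switches": edge + agg + core}

for k in (32, 64, 128):
    s = fat_tree_sizing(k)
    print(f"radix {k:3d}: {s['hosts']:9,d} hosts, {s['switches']:6,d} switches")
```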
@@ -1029,22 +1048,8 @@ The trade-offs between these topologies become stark when quantified for a large
@fig-network-topologies illustrates the structural differences between these common families.
::: {#fig-network-topologies fig-env="figure" fig-pos="htb" fig-cap="**Network Topologies for ML**. (A) Fat-Tree provides full bisection bandwidth through hierarchical switch layers. (B) Torus connects neighbors, optimizing for local communication patterns such as those in TPU pods. (C) Rail-Optimized designs dedicate switch infrastructure to corresponding accelerator positions across nodes, minimizing hop count for tensor parallelism." fig-alt="Three-panel diagram: Fat-Tree with hierarchical switches, Torus with neighbor connections, Rail-Optimized with dedicated switch rails."}
![](images/svg/rail-optimized.svg){width=100%}
:::
@@ -1255,76 +1260,8 @@ Because BSP enforces a global barrier at each superstep, the entire cluster move
The danger of PFC lies in its cascading nature. When a switch port's buffer fills, it sends a PAUSE frame upstream, which causes *that* switch's buffers to fill, which triggers *another* PAUSE frame further upstream. In theory, this backpressure should throttle the source. In practice, a single slow receiver can propagate pauses across the entire fabric in milliseconds, freezing links that have no direct relationship to the original congestion point. As @fig-congestion-cascade illustrates, this cascading behavior, known as **congestion spreading** or **victim flows**, is the primary operational risk of PFC-based lossless Ethernet.
```{=latex}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\usetikzlibrary{positioning, shapes.geometric}
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D1E6F3}
\definecolor{RedLine}{HTML}{CB202D}
\definecolor{RedL}{HTML}{F5D2D5}
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\tikzset{
switch/.style={draw=black!70, fill=gray!10, line width=0.75pt, rounded corners=2pt, minimum width=1.5cm, minimum height=1cm, font=\bfseries, align=center},
gpu/.style={draw=black!70, circle, fill=black!10, line width=0.75pt, minimum size=0.6cm, inner sep=0pt, font=\bfseries},
dataflow/.style={->, GreenLine, line width=1.2pt},
pauseflow/.style={->, RedLine, dashed, line width=1.0pt},
status/.style={font=\bfseries, RedLine, anchor=north},
time/.style={font=\footnotesize, black!70}
}
% Switches
\node[switch] (swS1) {S1};
\node[switch, right=1.5cm of swS1] (swS2) {S2};
\node[switch, right=1.5cm of swS2] (swS3) {S3};
\node[switch, right=1.5cm of swS3] (swS4) {S4};
% GPUs
\node[gpu, below=1.2cm of swS1] (gpuS1) {G};
\node[gpu, below=1.2cm of swS2] (gpuS2) {G};
\node[gpu, below=1.2cm of swS3] (gpuS3) {G};
\node[gpu, below=1.2cm of swS4] (gpuS4) {G};
% Vertical links
\draw[line width=1.0pt, black!60] (swS1) -- (gpuS1);
\draw[line width=1.0pt, black!60] (swS2) -- (gpuS2);
\draw[line width=1.0pt, black!60] (swS3) -- (gpuS3);
\draw[line width=1.0pt, black!60] (swS4) -- (gpuS4);
% Time labels
\node[time, below=0.1cm of swS1] {t=3 ms};
\node[time, below=0.1cm of swS2] {t=2 ms};
\node[time, below=0.1cm of swS3] {t=1 ms};
\node[time, below=0.1cm of swS4] {t=0};
% Data Flow
\draw[dataflow] ([xshift=-1.5cm]swS1.west) -- (swS1.west);
\draw[dataflow] (swS1.east) -- (swS2.west);
\draw[dataflow] (swS2.east) -- (swS3.west);
\draw[dataflow] (swS3.east) -- (swS4.west);
\node[GreenLine, font=\small\bfseries, above=0.2cm of swS2.east] {Data Flow};
% Pause Flow
\draw[pauseflow] (swS4.south west) to[bend left=20] node[midway, below, font=\scriptsize] {PAUSE} (swS3.south east);
\draw[pauseflow] (swS3.south west) to[bend left=20] node[midway, below, font=\scriptsize] {PAUSE} (swS2.south east);
\draw[pauseflow] (swS2.south west) to[bend left=20] node[midway, below, font=\scriptsize] {PAUSE} (swS1.south east);
% Congestion indicator
\node[star, star points=10, fill=RedL, draw=RedLine, line width=0.75pt, text=black, minimum size=0.8cm, inner sep=0pt, above=0.2cm of swS4] (bang) {\textbf{!}};
\node[RedLine, font=\bfseries, above=0.1cm of bang] {CONGESTION};
% Status labels
\node[status, below=0.2cm of gpuS1] {FROZEN};
\node[status, below=0.2cm of gpuS2] {PAUSED};
\node[status, below=0.2cm of gpuS3] {PAUSED};
\node[status, below=0.2cm of gpuS4] {SOURCE};
\end{tikzpicture}
```
::: {#fig-congestion-cascade fig-env="figure" fig-pos="htb" fig-cap="**PFC Pause Frame Propagation**. Congestion at a tail switch (S4) triggers a backpressure cascade, freezing the entire communication path upstream. This 'congestion spreading' can idle thousands of GPUs across unrelated process groups in a lossless fabric." fig-alt="Diagram showing PFC pause frame propagation from congested switch S4 upstream through the network path."}
![](images/svg/incast-flow.svg){width=100%}
:::
::: {.callout-war-story title="The PFC Storm That Froze a Cluster (2022)"}
@@ -1453,54 +1390,7 @@ In a well-designed cluster, the network fabric acts as a scheduler-aware extensi
The bandwidth staircase shown in @fig-hierarchical-staircase dictates the parallelism strategy to mitigate this cliff. **Tensor Parallelism**, requiring massive bandwidth for frequent activation exchanges, is confined to the NVLink domain within a node. **Pipeline Parallelism**, involving point-to-point transfers of activations between pipeline stages, spans the InfiniBand links between nodes. **Data Parallelism**, tolerant of lower bandwidth through gradient accumulation and overlap, stretches across the full fabric.
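The staircase translates directly into time. Below is a quick sketch of how long a fixed 10 GB transfer takes at each level of the hierarchy; the bandwidth figures mirror the staircase figure that follows and are representative values, not vendor specifications.

```python
transfer_gb = 10.0   # e.g., a pipeline-stage activation block or a gradient bucket

bandwidth_gb_s = {   # representative figures for each rung of the staircase
    "NVLink (intra-node)":     900.0,
    "PCIe (chip-to-host)":      64.0,
    "InfiniBand (inter-node)":  50.0,
    "Ethernet (data center)":   25.0,
}

for level, bw in bandwidth_gb_s.items():
    print(f"{level:26s} {transfer_gb / bw * 1e3:8.1f} ms")
```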
::: {#fig-hierarchical-staircase fig-env="figure" fig-pos="htb" fig-cap="**The Hierarchical Bandwidth Staircase**. Communication bandwidth drops by orders of magnitude as the distance from the chip increases, dictating parallelism strategies." fig-alt="Staircase or stepped diagram showing bandwidth decreasing from on-chip through NVLink, PCIe, and network levels."}
```{=latex}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\usetikzlibrary{positioning, fit}
\tikzset{
bar/.style 2 args={
draw=#1, fill=#2, line width=0.75pt,
minimum height=1cm, inner sep=0pt,
anchor=north west
},
label/.style={anchor=west, font=\bfseries},
value/.style={anchor=east, font=\bfseries},
annot/.style={anchor=west, font=\small\itshape}
}
% NVLink
\node[bar={BlueLine}{BlueL}] (nvlink) at (0, 0) [minimum width=9cm] {};
\node[label, text=BlueLine] at ([xshift=0.2cm]nvlink.west) {NVLink (Intra-Node)};
\node[value] at ([xshift=-0.2cm]nvlink.east) {900 GB/s};
\node[annot] at ([xshift=0.2cm]nvlink.east) {$\leftarrow$ Tensor Parallelism};
% PCIe
\node[bar={GreenLine}{GreenL}, below=0.2cm of nvlink.south west] (pcie) [minimum width=5cm] {};
\node[label, text=GreenLine] at ([xshift=0.2cm]pcie.west) {PCIe (Chip-to-Host)};
\node[value] at ([xshift=-0.2cm]pcie.east) {64 GB/s};
% InfiniBand
\node[bar={OrangeLine}{OrangeL}, below=0.2cm of pcie.south west] (ib) [minimum width=4cm] {};
\node[label, text=OrangeLine] at ([xshift=0.2cm]ib.west) {InfiniBand (Inter-Node)};
\node[value] at ([xshift=-0.2cm]ib.east) {50 GB/s};
\node[annot] at ([xshift=1.2cm]ib.east) {$\leftarrow$ Pipeline Parallelism};
% Ethernet
\node[bar={RedLine}{RedL}, below=0.2cm of ib.south west] (eth) [minimum width=2.5cm] {};
\node[label, text=RedLine] at ([xshift=0.2cm]eth.west) {Ethernet (Data Center)};
\node[value] at ([xshift=-0.2cm]eth.east) {25 GB/s};
\node[annot] at ([xshift=0.7cm]eth.east) {$\leftarrow$ Data Parallelism};
% Dashed line
\draw[dashed, black!50, line width=1.0pt] (nvlink.south east) -- (nvlink.south east |- eth.south);
% Distance arrow
\draw[->, line width=1.2pt, black!70] ([xshift=-0.5cm]nvlink.west) -- ([xshift=-0.5cm]eth.west)
node[midway, left, rotate=90, font=\small] {Increasing Distance / Latency};
\end{tikzpicture}
```
![](images/svg/topology-bandwidth.svg){width=100%}
:::
### Case Study: NVIDIA DGX SuperPOD {#sec-network-fabrics-dgx-superpod}
@@ -1531,7 +1421,54 @@ As cluster sizes push past 100,000 accelerators, the physics tax of current inte
The **Ultra Ethernet Consortium (UEC)** represents a clean-slate overhaul of Ethernet specifically for AI and HPC. Recognizing that TCP/IP is too heavy and RoCEv2's reliance on PFC is too fragile, UEC integrates HPC-native features directly into the standard: native **packet spraying** to use all paths without ECMP hashing collisions, hardware-enforced **in-order delivery** to simplify NIC design, and a new credit-based congestion control mechanism that eliminates the need for Priority Flow Control entirely. The goal is to provide the losslessness of InfiniBand with the ubiquity and multi-vendor economics of Ethernet.
The power consumption of pluggable transceivers is becoming unsustainable at scale. **Co-Packaged Optics (CPO)** addresses this by moving the optical engine from the faceplate directly into the switch ASIC package. By driving optical signals almost directly from the silicon itself, CPO eliminates the power-hungry electrical traces across the PCB, reducing power per port by 30--50%.
```{python}
#| label: optical-dividend-calc
#| echo: false
from mlsysim.fmt import check
class OpticalDividend:
"""Power savings from Co-Packaged Optics (CPO)."""
# ┌── 1. LOAD ──────────────────────────────────────────
n_ports = 128 # 51.2 Tbps switch
p_pluggable_w = OPTICS_POWER_PLUGGABLE_400G_W.m_as(watt)
p_cpo_w = OPTICS_POWER_CPO_400G_W.m_as(watt)
# ┌── 2. EXECUTE ───────────────────────────────────────
total_pluggable_kw = (n_ports * p_pluggable_w) / 1000
total_cpo_kw = (n_ports * p_cpo_w) / 1000
savings_kw = total_pluggable_kw - total_cpo_kw
# ┌── 3. GUARD ─────────────────────────────────────────
check(savings_kw == 1.28, f"Savings {savings_kw:.2f}kW unexpected")
# ┌── 4. OUTPUT ────────────────────────────────────────
savings_kw_str = f"{savings_kw:.2f}"
total_pluggable_kw_str = f"{total_pluggable_kw:.2f}"
@classmethod
def plot(cls):
"""Visualizes the Optical Dividend."""
from mlsysim import viz
return viz.bar_compare(
labels=["Pluggable", "CPO"],
values=[cls.total_pluggable_kw, cls.total_cpo_kw],
title="Optical Power Dividend",
ylabel="Total Optics Power (kW)"
)
```
::: {.callout-notebook title="Napkin Math: The Optical Dividend"}
**Problem**: Calculate the power savings of moving a 51.2 Tbps switch from pluggable transceivers to Co-Packaged Optics (CPO).
1. **Pluggable Architecture**: 128 ports $\times$ `{python} int(OpticalDividend.p_pluggable_w)` W = **`{python} OpticalDividend.total_pluggable_kw_str` kW** for optics alone.
2. **CPO Architecture**: 128 engines $\times$ `{python} int(OpticalDividend.p_cpo_w)` W = **1.28 kW**.
3. **The Dividend**: You save **`{python} OpticalDividend.savings_kw_str` kW** of power *per switch*.
**The Systems Insight**: Across a cluster with 1,000 switches, moving from pluggable optics to CPO reclaims **1.28 Megawatts** that would otherwise be spent just moving light. That dividend is enough power to fuel an additional **1,800 GPUs**. At the Bisection Bandwidth Wall, sustainability is not a choice; it is an architectural requirement driven by the thermal limits of the faceplate.
:::
For a 10,000-GPU cluster, eliminating pluggable modules could save over 100 kW of power—energy that can be redirected to computation. CPO also removes the transceiver as a discrete field-replaceable unit, eliminating a common point of mechanical failure.
The bandwidth march continues to **800G and 1.6T links** (XDR InfiniBand and 800GbE/1.6TbE). A single 1.6 Tbps port delivers the bandwidth of four 400G lanes, allowing next-generation switches to provide 51.2 Tbps of aggregate throughput. The increased density allows architects to flatten the topology: a cluster that previously required three tiers of switches can now be served by two, halving the transceiver count and reducing tail latency by removing an entire hop of switching and FEC overhead.

@@ -801,48 +801,8 @@ Organizations progress through distinct maturity levels as their ML operations c
: **MLOps Maturity Hierarchy**: Four levels of operational capability from Level 0 (manual ad-hoc processes supporting 1-2 models) through Level 3 (enterprise governance with organization-wide automation supporting 500+ models). Most organizations operate at Level 1 with per-model automation; advancing to Levels 2-3 provides superlinear returns on infrastructure investment. {#tbl-ops-scale-maturity}
::: {.callout-note title="Figure: MLOps Maturity Staircase"}
![](images/svg/mlops-maturity.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[
font=\small\usefont{T1}{phv}{m}{n},
node distance=0mm,
step box/.style={
draw,
fill=BlueL,
line width=0.75pt,
align=center,
minimum width=2cm,
minimum height=1cm,
anchor=south west
},
annotation/.style={
font=\scriptsize,
text=black!60,
anchor=north
},
>={Stealth[round]},
thick
]
\definecolor{BlueL}{HTML}{D1E6F3}
% Axes
\draw[->] (0,0) -- (8,0) node[right] {Maturity};
\draw[->] (0,0) -- (0,5) node[above] {Scale / Automation};
% Steps
\node[step box] (l0) at (0,0) {Level 0\\Manual};
\node[step box, minimum height=2cm, right=of l0] (l1) {Level 1\\Per-Model CI/CD};
\node[step box, minimum height=3cm, right=of l1] (l2) {Level 2\\Platform Ops};
\node[step box, minimum height=4cm, right=of l2] (l3) {Level 3\\Enterprise Gov};
% Annotations
\node[annotation, below=2mm of l0] {1-2 Models};
\node[annotation, below=2mm of l1] {10-20 Models};
\node[annotation, below=2mm of l2] {50-200 Models};
\node[annotation, below=2mm of l3] {500+ Models};
\end{tikzpicture}
```
**MLOps Maturity Levels**. Visualizing the progression from manual processes (Level 0) to automated pipelines (Level 1), platform orchestration (Level 2), and enterprise governance (Level 3). Each level reduces operational friction and increases the scale of manageable models.
:::
@@ -1179,47 +1139,8 @@ A complete training pipeline includes data validation, training execution, model
::: {.callout-note title="Figure: ML CI/CD Pipeline"}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=0.5cm and 0.5cm]
\definecolor{StageColor}{HTML}{F5F5F5}
\definecolor{GateColor}{HTML}{FCE4CC}
\definecolor{RedLine}{HTML}{FF0000}
\tikzset{
stage/.style={draw=black!70, line width=0.75pt, rounded corners=2pt, align=center, minimum width=2.2cm, minimum height=1cm, fill=StageColor},
gate/.style={draw=RedLine!70, line width=0.75pt, diamond, aspect=1.5, fill=GateColor, font=\tiny, align=center},
edge/.style={->, line width=1.0pt},
feedback/.style={edge, dashed, RedLine}
}
% Stages and Gates
\node[stage] (Data) {Data Val};
\node[gate, right=of Data] (Gate1) {Schema\\Check};
\node[stage, right=of Gate1] (Train) {Training};
\node[gate, right=of Train] (Gate2) {Loss\\Check};
\node[stage, right=of Gate2] (Eval) {Evaluation};
\node[gate, right=of Eval] (Gate3) {Metric\\Check};
\node[stage, right=of Gate3] (Reg) {Registry};
\node[gate, right=of Reg] (Gate4) {Approval};
\node[stage, right=of Gate4] (Deploy) {Canary};
% Edges
\draw[edge] (Data) -- (Gate1);
\draw[edge] (Gate1) -- (Train);
\draw[edge] (Train) -- (Gate2);
\draw[edge] (Gate2) -- (Eval);
\draw[edge] (Eval) -- (Gate3);
\draw[edge] (Gate3) -- (Reg);
\draw[edge] (Reg) -- (Gate4);
\draw[edge] (Gate4) -- (Deploy);
% Feedback
\draw[feedback] (Gate3) to[bend left=30] node[midway, above, font=\scriptsize] {Fail} (Train);
\end{tikzpicture}
```
::: {#fig-ml-cicd fig-env="figure" fig-pos="htb" fig-cap="**ML CI/CD Pipeline**. The automated workflow transforming code and data into a deployed service. Stages include Data Validation (schema/drift checks), Training, Evaluation (metric gates), Artifact Registration, and Staged Deployment (canary rollout). Feedback loops automatically trigger retrains or alerts if gates fail." fig-alt="Pipeline diagram showing stages from data validation to deployment with gates and feedback loops."}
![](images/svg/ml-cicd.svg){width=100%}
:::
1. **Data Validation**: Verify input data meets schema requirements and statistical expectations
@@ -1627,44 +1548,8 @@ Organizations deploying safety-critical models typically implement coordinator r
Shadow deployment runs the new model in parallel with production, receiving the same inputs and logging outputs, but not affecting user-visible results. This provides the highest fidelity testing environment short of actual production exposure, enabling detection of issues that escape offline validation.
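A minimal sketch of the mirroring pattern: the router returns the production answer as soon as it is ready, fires the shadow call without blocking the user, and records both outputs for offline comparison. The async framing and model stubs are illustrative, not a particular serving framework's API.

```python
import asyncio

async def production_model(x: str) -> str:
    await asyncio.sleep(0.01)            # pretend inference
    return f"prod({x})"

async def shadow_model(x: str) -> str:
    await asyncio.sleep(0.05)            # shadow may be slower; the user never waits on it
    return f"shadow({x})"

async def handle_request(x: str, log: list) -> str:
    shadow_task = asyncio.create_task(shadow_model(x))   # mirror the same input
    response = await production_model(x)                 # user latency = production only
    shadow_task.add_done_callback(
        lambda t: log.append({"input": x, "prod": response, "shadow": t.result()})
    )
    return response                                       # shadow result logged later

async def main():
    log = []
    print(await handle_request("query-1", log))   # returns after ~10 ms
    await asyncio.sleep(0.1)                      # give the shadow path time to finish
    print(log)                                    # record for offline quality comparison

asyncio.run(main())
```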
::: {.callout-note title="Figure: Shadow Deployment Architecture"}
![](images/svg/shadow-deployment.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=1.5cm and 2cm]
\definecolor{ProdColor}{HTML}{D1E6F3}
\definecolor{ShadowColor}{HTML}{EBEBEB}
\tikzset{
router/.style={draw, circle, fill=black!5, line width=0.75pt},
model/.style={draw, minimum width=2.5cm, minimum height=1cm, line width=0.75pt, align=center},
edge/.style={->, line width=1.0pt},
dashed edge/.style={->, dashed, line width=1.0pt},
dotted edge/.style={->, dotted, line width=1.0pt}
}
% Router and Request
\node[router] (Router) {Router};
\node[left=of Router] (Request) {Request};
\draw[edge] (Request) -- (Router);
% Production Path
\node[model, fill=ProdColor, above=of Router] (Prod) {Production Model};
\draw[edge] (Router) |- (Prod);
\node[right=of Prod] (ProdResp) {Response (User)};
\draw[edge] (Prod) -- (ProdResp);
% Shadow Path
\node[model, fill=ShadowColor, below=of Router] (Shadow) {Shadow Model};
\draw[dashed edge] (Router) |- (Shadow);
\node[right=of Shadow] (ShadowResp) {Log (No User)};
\draw[dashed edge] (Shadow) -- (ShadowResp);
% Async Comparison
\node[draw, dashed, inner sep=5pt, line width=0.75pt, right=of Router] (Compare) {Compare};
\draw[dotted edge] (Prod) -- (Compare);
\draw[dotted edge] (Shadow) -- (Compare);
\end{tikzpicture}
```
**Shadow Deployment Architecture**. Production traffic is mirrored to the shadow model asynchronously. The router returns the production response to the user immediately, while both responses are logged for offline quality comparison and operational validation.
:::
@@ -2287,42 +2172,8 @@ Even if 99% of these are deduplicated or auto-resolved, the remaining 144 alerts
The alert fatigue problem demands a fundamentally different approach. The solution is hierarchical monitoring that presents different levels of detail to different audiences and aggregates signals to reduce alert volume while maintaining detection capability.
::: {.callout-note title="Figure: Hierarchical Monitoring Pyramid"}
![](images/svg/monitoring-pyramid.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=0.5cm and 0.5cm]
\definecolor{L1Color}{HTML}{D1E6F3}
\definecolor{L2Color}{HTML}{D4EFDF}
\definecolor{L3Color}{HTML}{FCE4CC}
\definecolor{L4Color}{HTML}{E6D4E5}
\definecolor{RedLine}{HTML}{FF0000}
\definecolor{OrangeLine}{HTML}{FFA500}
\tikzset{
pyramid/.style={draw, line width=0.75pt, minimum width=4cm, minimum height=1cm},
label/.style={align=center},
annotation/.style={anchor=west, font=\scriptsize}
}
% Pyramid Layers
\node[pyramid, fill=L4Color, anchor=south west] (Base) {};
\node[pyramid, fill=L3Color, anchor=south west, below=of Base] (Layer3) {};
\node[pyramid, fill=L2Color, anchor=south west, below=of Layer3] (Layer2) {};
\node[pyramid, fill=L1Color, anchor=south west, below=of Layer2] (Layer1) {};
% Labels
\node[label, anchor=center] at (Base) {\textbf{Business Metrics}\\(Revenue, Engagement)};
\node[label, anchor=center] at (Layer3) {\textbf{Portfolio Metrics}\\(Search, Ads, RecSys)};
\node[label, anchor=center] at (Layer2) {\textbf{Model Metrics}\\(Latency, Accuracy, Drift)};
\node[label, anchor=center] at (Layer1) {\textbf{Infrastructure Metrics}\\(GPU Util, Network)};
% Annotations
\node[annotation, text=RedLine!80, right=of Base] {Alert Execs};
\node[annotation, text=OrangeLine, right=of Layer3] {Alert Product Owners};
\node[annotation, text=black!60, right=of Layer2] {Alert Model Owners};
\node[annotation, text=black!60, right=of Layer1] {Alert Platform Team};
\end{tikzpicture}
```
**Hierarchical Monitoring Architecture**. To prevent alert fatigue, monitoring operates at four abstraction levels. High-level business metrics trigger alarms for broad issues, while lower-level metrics are used primarily for investigation and root cause analysis.
:::
@@ -3515,6 +3366,10 @@ During training, features are computed in batch over historical data. There is n
### Feature Store Architecture {#sec-ml-operations-scale-feature-store-architecture-51da}
::: {#fig-feature-store-architecture fig-env="figure" fig-pos="htb" fig-cap="**Feature Store Architecture**. Resolving the conflict between training (high-throughput batch scans) and serving (low-latency point lookups) through a dual-store architecture. Features are materialized from batch and streaming sources into an Offline Store (for training) and an Online Store (for serving), ensuring consistency across the ML lifecycle." fig-alt="System diagram with Data Sources feeding into a Materialization layer. This layer writes to an Offline Store (Data Lake) and an Online Store (Key-Value). Training reads from Offline; Serving reads from Online."}
![](images/svg/feature-store-architecture.svg){width=100%}
:::
A feature store is not merely a database; it is an architectural pattern designed to resolve the fundamental conflict between the data access patterns of model training and real-time serving. Training requires high-throughput analytical scans over massive historical datasets, while serving requires low-latency point lookups for individual prediction requests. No single database system can efficiently satisfy both constraints, forcing the adoption of a **dual-store architecture** composed of an offline store and an online store.
The **offline store** is the system of record for all historical feature data, often holding petabytes of information. It is optimized for the massive sequential reads characteristic of training data generation, where a single query might scan terabytes of data to build a feature set for millions of examples. The key metric is throughput, not latency. Systems like BigQuery, Snowflake, or data lakes built on S3 with formats like Apache Iceberg are common choices, designed to parallelize these large-scale analytical queries.
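The access-pattern split can be made concrete with a toy sketch: the offline store is an append-only history that training scans in bulk, while the online store keeps only the latest value per entity for point lookups. Store contents and field names below are hypothetical.

```python
from datetime import datetime

# Offline store: append-only history optimized for large scans (here, a plain list).
offline_store = [
    {"user_id": "u1", "ts": datetime(2026, 3, 1), "clicks_7d": 12},
    {"user_id": "u1", "ts": datetime(2026, 3, 2), "clicks_7d": 14},
    {"user_id": "u2", "ts": datetime(2026, 3, 2), "clicks_7d": 3},
]

# Online store: freshest value per entity, optimized for point lookups (a key-value map).
online_store = {"u1": {"clicks_7d": 14}, "u2": {"clicks_7d": 3}}

def training_scan(start, end):
    """Batch path: scan the history to assemble training rows for a time window."""
    return [row for row in offline_store if start <= row["ts"] < end]

def serving_lookup(user_id):
    """Serving path: O(1) lookup of the latest feature vector for one request."""
    return online_store[user_id]

print(len(training_scan(datetime(2026, 3, 1), datetime(2026, 3, 3))))  # 3 rows scanned
print(serving_lookup("u1"))                                            # {'clicks_7d': 14}
```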

View File

@@ -270,86 +270,7 @@ $$ {#eq-ridge-point}
Workloads with $I < I_{\text{ridge}}$ are memory-bound: their performance is limited by how fast data can be loaded, not how fast it can be processed. Workloads with $I > I_{\text{ridge}}$ are compute-bound: the arithmetic units are the bottleneck. @fig-roofline-model illustrates this relationship graphically.
::: {#fig-roofline-model fig-env="figure" fig-pos="htb" fig-cap="**The Roofline Model**. Achievable performance (y-axis) as a function of arithmetic intensity (x-axis) on a log-log plot. The sloped line represents the memory bandwidth ceiling; the flat line represents the compute ceiling. Their intersection is the ridge point. Most transformer inference operations fall in the memory-bound region (left of the ridge point), while large batched GEMMs fall in the compute-bound region (right)." fig-alt="Log-log plot showing roofline model with memory bandwidth ceiling as diagonal line and compute ceiling as horizontal line, meeting at ridge point. Workload types are marked: LLM decode and element-wise ops on the left (memory-bound), large GEMM on the right (compute-bound)."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth, scale=1.0]
\usetikzlibrary{positioning}
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D1E6F3}
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{OrangeLine}{HTML}{E67817}
\definecolor{OrangeL}{HTML}{FCE4CC}
\definecolor{RedLine}{HTML}{CB202D}
\definecolor{RedL}{HTML}{F5D2D5}
\definecolor{VioletLine}{HTML}{7E317B}
\definecolor{VioletL}{HTML}{E6D4E5}
\tikzset{
Box/.style={draw=BlueLine, fill=BlueL, thick, rounded corners=2pt, minimum width=2cm, minimum height=0.6cm, align=center, font=\scriptsize\bfseries}
}
% Axes
\draw[thick, ->] (0,0) -- (10,0) node[below] {Arithmetic Intensity (FLOP/byte)};
\draw[thick, ->] (0,0) -- (0,7) node[above, rotate=90, anchor=south] {Achievable TFLOPS};
% Axis labels (log scale markers)
\node[below] at (1,0) {\small 1};
\node[below] at (3,0) {\small 10};
\node[below] at (5,0) {\small 100};
\node[below] at (7,0) {\small 1000};
\node[left] at (0,1) {\small 1};
\node[left] at (0,3) {\small 100};
\node[left] at (0,5) {\small 989};
\node[left] at (0,6) {\small 1979};
% Memory bandwidth ceiling (slope = bandwidth)
\draw[BlueLine, very thick] (0.5,0.5) -- (5.5,5.5);
% Compute ceiling FP16
\draw[RedLine, very thick] (5.5,5.5) -- (9.5,5.5);
% Compute ceiling FP8 (higher)
\draw[RedLine!50, thick, dashed] (4.8,6.2) -- (9.5,6.2);
\draw[BlueLine!50, thick, dashed] (0.5,0.9) -- (4.8,6.2);
% Ridge point
\node[circle, fill=black, inner sep=1.5pt] (ridge) at (5.5,5.5) {};
\node[above right=0.1cm and 0.1cm of ridge] {\small Ridge Point};
\node[below right=0.1cm and 0.1cm of ridge, font=\scriptsize] {$\sim$295 FLOP/byte};
% FP8 ridge point
\node[circle, fill=black!50, inner sep=1pt] (fp8ridge) at (4.8,6.2) {};
\node[above=0.1cm of fp8ridge, font=\scriptsize] {FP8 Ridge};
% Regions
\node[BlueLine, font=\small, rotate=0] at (2.5,1.5) {Memory-Bound};
\node[RedLine, font=\small] at (7.5,4.8) {Compute-Bound};
% Workload markers
\node[circle, fill=OrangeLine, inner sep=1.5pt] (llm) at (1.2,1.2) {};
\node[right=0.1cm of llm, font=\scriptsize, OrangeLine] {LLM Decode (B=1)};
\node[circle, fill=OrangeLine, inner sep=1.5pt] (elem) at (1.8,1.8) {};
\node[right=0.1cm of elem, font=\scriptsize, OrangeLine] {Element-wise};
\node[circle, fill=GreenLine, inner sep=1.5pt] (gemm) at (7.5,5.5) {};
\node[below=0.1cm of gemm, font=\scriptsize, GreenLine] {Large GEMM};
\node[circle, fill=VioletLine, inner sep=1.5pt] (attn) at (3.8,3.8) {};
\node[right=0.1cm of attn, font=\scriptsize, VioletLine] {Attention};
% Legend
\draw[RedLine, very thick] (0.5,6.8) -- (1.2,6.8);
\node[right] at (1.3,6.8) {\scriptsize FP16 Ceiling (989 TFLOPS)};
\draw[RedLine!50, thick, dashed] (0.5,6.4) -- (1.2,6.4);
\node[right] at (1.3,6.4) {\scriptsize FP8 Ceiling (1979 TFLOPS)};
\draw[BlueLine, very thick] (0.5,6.0) -- (1.2,6.0);
\node[right] at (1.3,6.0) {\scriptsize HBM BW (3.35 TB/s)};
\end{tikzpicture}
```
![](images/svg/_roofline-model.svg){width=100%}
:::
The ridge point of the NVIDIA H100 at FP16 precision is:
@@ -2084,51 +2005,7 @@ Recognizing these pitfalls saves teams from wasting months optimizing the wrong
Performance engineering transforms a model that should be efficient into one that is, by attacking the fundamental bottleneck of modern ML systems: the memory wall. @fig-optimization-hierarchy summarizes how the techniques in this chapter layer from hardware-level primitives to algorithmic innovations.
::: {#fig-optimization-hierarchy fig-env="figure" fig-pos="htb" fig-cap="**The Performance Engineering Hierarchy**. Optimization techniques organized by their level of abstraction, from hardware-level precision engineering at the base to algorithmic innovations at the top. Each layer builds on and benefits from the layers below it. The annotations show the primary mechanism and typical speedup range for each technique." fig-alt="Layered hierarchy diagram showing five optimization levels from bottom to top: Precision Engineering, Operator Fusion, Graph Compilation, Communication Overlap, and Algorithmic Innovation, with arrows showing how they interact."}
```{.tikz}
\begin{tikzpicture}[
font=\small\usefont{T1}{phv}{m}{n},
layer/.style={draw, thick, rounded corners=3pt, minimum width=11cm, minimum height=1.2cm, font=\small},
label/.style={font=\scriptsize, text width=4cm, align=left},
>=stealth
]
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D1E6F3}
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{OrangeLine}{HTML}{E67817}
\definecolor{OrangeL}{HTML}{FCE4CC}
\definecolor{RedLine}{HTML}{CB202D}
\definecolor{RedL}{HTML}{F5D2D5}
\definecolor{VioletLine}{HTML}{7E317B}
\definecolor{VioletL}{HTML}{E6D4E5}
% Layers from bottom to top
\node[layer, draw=BlueLine, fill=BlueL] (hw) at (0,0) {\textbf{Precision Engineering} (FP8, INT4, KV Cache Compression)};
\node[layer, draw=GreenLine, fill=GreenL] (fuse) at (0,1.6) {\textbf{Operator Fusion \& Tiling} (FlashAttention, Fused Kernels)};
\node[layer, draw=OrangeLine, fill=OrangeL] (comp) at (0,3.2) {\textbf{Graph Compilation} (torch.compile, XLA, TensorRT)};
\node[layer, draw=VioletLine, fill=VioletL] (comm) at (0,4.8) {\textbf{Communication Overlap} (Gradient Pipelining, Zero-Bubble)};
\node[layer, draw=RedLine, fill=RedL] (algo) at (0,6.4) {\textbf{Algorithmic Innovation} (Speculative Decoding, MoE)};
% Right-side annotations
\node[label, right] at (6.2,0) {Mechanism: Reduce bytes/value\\Speedup: 2--4$\times$};
\node[label, right] at (6.2,1.6) {Mechanism: Reduce HBM trips\\Speedup: 2--32$\times$};
\node[label, right] at (6.2,3.2) {Mechanism: Automate fusion\\Speedup: 1.1--2$\times$};
\node[label, right] at (6.2,4.8) {Mechanism: Hide latency\\Speedup: 1.1--1.5$\times$};
\node[label, right] at (6.2,6.4) {Mechanism: Change algorithm\\Speedup: 1.5--10$\times$};
% Arrows between layers
\draw[->, thick, black!50] (hw.north) -- (fuse.south);
\draw[->, thick, black!50] (fuse.north) -- (comp.south);
\draw[->, thick, black!50] (comp.north) -- (comm.south);
\draw[->, thick, black!50] (comm.north) -- (algo.south);
% Left-side label
\node[rotate=90, font=\small, anchor=south] at (-6.5,3.2) {Increasing Abstraction $\longrightarrow$};
\end{tikzpicture}
```
![](images/svg/profiling-hierarchy.svg){width=100%}
:::
The **Roofline Model** provides the diagnostic framework, classifying operations as compute-bound or memory-bound based on their arithmetic intensity relative to the hardware's ridge point. For the NVIDIA H100, this ridge point is approximately `{python} RooflineRidgeCalc.h100_fp16_ridge_str` FLOP/byte at FP16, meaning most transformer operations fall in the memory-bound regime.
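As a back-of-the-envelope check, dividing peak FP16 throughput by HBM bandwidth (using the H100 figures quoted in the roofline figure above, assumed rather than measured) reproduces that ridge point:

```python
# Napkin math with the peak numbers quoted in the roofline figure.
peak_fp16_flops = 989e12        # H100 dense FP16 Tensor Core peak, FLOP/s
hbm_bandwidth = 3.35e12         # H100 HBM3 bandwidth, bytes/s

ridge = peak_fp16_flops / hbm_bandwidth
print(f"FP16 ridge point ≈ {ridge:.0f} FLOP/byte")   # ≈ 295; below this, memory-bound
```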

View File

@@ -405,33 +405,8 @@ Goal 1 (Demographic Parity) would be to admit students so that the admitted clas
The impossibility theorem demonstrates that both goals cannot always be satisfied simultaneously. If one group has a higher proportion of qualified applicants, achieving demographic parity (Goal 1) requires rejecting some of their qualified applicants, violating equal opportunity (Goal 2). No mathematical fix exists; the choice is a value judgment about which definition of fairness to prioritize. Satisfying one criterion may preclude satisfying another, reflecting the reality that fairness involves tradeoffs between competing normative goals. Determining which metric to prioritize requires careful consideration of the application context, potential harms, and stakeholder values as detailed in @sec-responsible-ai-normative-pluralism-value-conflicts-d61f [@barocas-hardt-narayanan].
::: {.callout-note title="Figure: Fairness Impossibility"}
![](images/svg/fairness-impossibility.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\tikzset{
NodeParity/.style={draw=BlueLine, line width=0.75pt, circle, fill=BlueL, minimum size=2.5cm, align=center},
NodeOdds/.style={draw=OrangeLine, line width=0.75pt, circle, fill=OrangeL, minimum size=2.5cm, align=center},
NodeCalib/.style={draw=GreenLine, line width=0.75pt, circle, fill=GreenL, minimum size=2.5cm, align=center},
ConnectingLine/.style={black!30, line width=1.0pt},
IncompatibleText/.style={font=\bfseries, text=RedLine}
}
% Nodes at vertices
\node[NodeParity] (parity) {\textbf{Demographic}\\\textbf{Parity}\\$P(\hat{Y}|S)$};
\node[NodeOdds, right=3.5cm of parity] (odds) {\textbf{Equalized}\\\textbf{Odds}\\$P(\hat{Y}|S, Y)$};
\path (parity) -- (odds) coordinate[midway] (mid);
\node[NodeCalib, above=4cm of mid] (calib) {\textbf{Predictive}\\\textbf{Parity}\\$P(Y|\hat{Y}, S)$};
% The Triangle Lines
\draw[ConnectingLine] (parity) -- (odds) node[midway, below=0.1cm, IncompatibleText] {Incompatible};
\draw[ConnectingLine] (parity) -- (calib) node[midway, sloped, above=0.1cm, IncompatibleText] {Incompatible};
\draw[ConnectingLine] (odds) -- (calib) node[midway, sloped, above=0.1cm, IncompatibleText] {Incompatible};
\node[anchor=north, font=\scriptsize, text=black!70, text width=8cm, align=center, below=1cm of mid] {When base rates differ between groups ($P(Y=1|S=a) \neq P(Y=1|S=b)$), it is mathematically impossible to satisfy all three criteria simultaneously.};
\end{tikzpicture}
```
**Fairness Impossibility Theorem**. Visualizing the mathematical conflict between fairness criteria. A single classifier cannot simultaneously satisfy Demographic Parity (equal outcomes), Equalized Odds (equal error rates), and Calibration (equal predictive meaning) unless the groups have identical base rates. This forces engineers to make explicit normative choices based on the application context.
:::
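A toy calculation makes the conflict concrete. Assume two groups with different (hypothetical) base rates and force demographic parity by admitting the same fraction of each; even when every qualified applicant is admitted first, the true-positive rates cannot match:

```python
# Hypothetical cohorts: same size, different base rates of qualified applicants.
groups = {"A": {"n": 1000, "base_rate": 0.60}, "B": {"n": 1000, "base_rate": 0.30}}
admit_rate = 0.40   # demographic parity: admit the same fraction from each group

for name, g in groups.items():
    qualified = g["n"] * g["base_rate"]
    admitted = g["n"] * admit_rate
    qualified_admitted = min(admitted, qualified)   # best case: qualified admitted first
    print(f"Group {name}: TPR under parity = {qualified_admitted / qualified:.2f}")
# Group A: 0.67, Group B: 1.00 -> equal opportunity is violated whenever base rates differ.
```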
@@ -784,49 +759,7 @@ Addressing these challenges requires understanding privacy as a system principle
@fig-privacy-risk-flow outlines key privacy checkpoints in the early stages of a data pipeline, highlighting where core safeguards such as consent acquisition, encryption, and differential privacy should be applied. Actual implementations often involve more nuanced tradeoffs and context-sensitive decisions, but this diagram provides a scaffold for identifying where privacy risks arise and how they can be mitigated through responsible design choices.
::: {#fig-privacy-risk-flow fig-env="figure" fig-pos="htb" fig-cap="**Privacy-Aware Data Flow**: Responsible data governance requires proactive safeguards throughout a machine learning pipeline, including consent acquisition, encryption, and differential privacy mechanisms applied at key decision points to mitigate privacy risks and ensure accountability. This diagram structures these considerations, enabling designers to identify potential vulnerabilities and implement appropriate controls during data collection, processing, and storage." fig-alt="Flowchart with 4 diamond decision points: PII check, consent acquired, log encryption, and differential privacy. Yes paths flow downward to data eligible for training. No paths branch to action boxes for requesting consent, encrypting, or adding DP."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{%
Line/.style={black!30,-{Triangle[width = 5pt, length = 6pt]}, line width = 1.25pt,text=black},
Box/.style={inner xsep=2pt,inner ysep=6pt,
node distance=0.7,
draw=BlueLine,
line width=0.75pt,
fill=BlueL!40,
align=flush center,
text width=35mm,
minimum width=35mm, minimum height=10mm
},
Box2/.style={Box,draw=RedLine,fill=RedL,rounded corners=12pt},
Box3/.style={draw=VioletLine,fill=VioletL2, trapezium,aspect=2,inner xsep=-1ex,
inner ysep=-1ex,text width=30mm,
diamond, minimum width=45mm, align= flush center},
}
\node[Box2](B1){Data Collected};
\node[Box3,below=0.7of B1](B2){Does it include\\ PII?};
\node[Box,right=2 of B2](B22){Proceed with preprocessing};
\node[Box3,below=0.7of B2](B3){Was user\\ consent acquired?};
\node[Box,left=2 of B3](B33){Reject data or request consent};
\node[Box3,below=0.7of B3](B4){Is log access encrypted?};
\node[Box,right=2 of B4](B44){Encrypt or secure logging infrastructure};
\node[Box3,below=0.7of B4](B5){Is DP or LDP implemented?};
\node[Box,below=of B5](B6){Data eligible for model training};
\node[Box,left=2 of B5](B55){Add privacy protections (e.g., DP-SGD, LDP)};
%
\draw[Line](B1)--(B2);
\draw[Line](B2)--node[right]{Yes}(B3);
\draw[Line](B3)--node[right]{Yes}(B4);
\draw[Line](B4)--node[right]{Yes}(B5);
\draw[Line](B5)--node[right]{Yes}(B6);
\draw[Line](B2)--node[above,pos=0.2]{No}(B22);
\draw[Line](B3)--node[above,pos=0.2]{No}(B33);
\draw[Line](B4)--node[above,pos=0.2]{No}(B44);
\draw[Line](B5)--node[above,pos=0.2]{No}(B55);
\draw[Line](B44)|-node[right,pos=0.1]{}(B5);
\end{tikzpicture}
```
![](images/svg/_privacy-flow.svg){width=100%}
:::
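A minimal sketch of the gating logic in the figure, with hypothetical record fields; a production pipeline would attach evidence (consent records, key identifiers, privacy budgets) rather than booleans:

```python
def eligible_for_training(record: dict) -> bool:
    """Walk the decision points from the figure (field names are hypothetical)."""
    if record.get("contains_pii") and not record.get("consent_acquired"):
        return False   # reject, or route back to request consent
    if not record.get("logs_encrypted"):
        return False   # secure the logging infrastructure first
    if not record.get("dp_applied"):
        return False   # add protections such as DP-SGD or local DP before training
    return True

print(eligible_for_training({"contains_pii": True, "consent_acquired": True,
                             "logs_encrypted": True, "dp_applied": True}))    # True
print(eligible_for_training({"contains_pii": True, "consent_acquired": False}))  # False
```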
The consequences of weak data governance are well documented. Systems trained on poorly understood or biased datasets may perpetuate structural inequities or expose sensitive attributes unintentionally. In the COMPAS example introduced earlier, the lack of transparency surrounding data provenance and usage precluded effective evaluation or redress. In clinical applications, datasets frequently reflect artifacts such as missing values or demographic skew that compromise both performance and privacy. Without clear standards for data quality and documentation, such vulnerabilities become systemic.
@@ -1855,84 +1788,7 @@ Privacy preservation does not end at training time. In many real-world systems,
Traditional approaches to data deletion assume that the full training dataset remains accessible and that models can be retrained from scratch after removing the targeted records. @fig-machine-unlearning contrasts traditional model retraining with emerging machine unlearning approaches: while retraining involves reconstructing the model from scratch using a modified dataset, unlearning aims to remove a specific datapoint's influence without repeating the entire learning process.
::: {#fig-machine-unlearning fig-env="figure" fig-pos="htb" fig-cap="**Model Update Strategies**: Retraining reconstructs a model from scratch, while machine unlearning modifies an existing model to remove the influence of specific data points without complete reconstruction, an important distinction for resource-constrained deployments. This approach minimizes computational cost and allows privacy-preserving data deletion after initial model training." fig-alt="Two-panel comparison. Left panel labeled Retraining shows dataset cylinders with data removal flowing to model. Right panel labeled Machine Unlearning shows same flow but asks whether data influence can be removed without full retraining cost."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{%
mycylinder/.style={cylinder, shape border rotate=90, aspect=1.3, draw, fill=white,
minimum width=25mm,minimum height=11mm,line width=\Linewidth,node distance=-0.15},
Box/.style={align=flush center,
inner xsep=2pt,
node distance=3,
draw=BlueLine,
line width=0.75pt,
fill=BlueL,
minimum width=80mm, minimum height=8mm
},
LineA/.style={line width=1.75pt,black!50,-latex,text=black},
LineB/.style={line width=6pt,-{Triangle[width=\the\dimexpr1.8\pgflinewidth,length=\the\dimexpr0.8\pgflinewidth]}},
LineD/.style={VioletLine!60, -{Triangle[width = 8pt, length = 6pt]},
line width = 4pt,shorten >=2mm,shorten <=2mm,text=black},
pics/data/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=STREAMING,scale=\scalefac, every node/.append style={transform shape}]
\node[mycylinder,fill=\channelcolor!50] (A\picname) {};
\node[mycylinder, above=of A\picname,fill=\channelcolor!30] (B\picname) {};
\node[mycylinder, above=of B\picname,fill=\channelcolor!10] (C\picname) {};
\end{scope}
}
}
}
\pgfkeys{
/channel/.cd,
channelcolor/.store in=\channelcolor,
drawchannelcolor/.store in=\drawchannelcolor,
scalefac/.store in=\scalefac,
Linewidth/.store in=\Linewidth,
picname/.store in=\picname,
channelcolor=BrownLine,
drawchannelcolor=BrownLine,
scalefac=1,
Linewidth=1.6pt,
picname=C
}
\begin{scope}[local bounding box=LEFT,shift={($(0,0)+(0,0)$)},
scale=1, every node/.append style={transform shape}]
\pic[shift={(0,0)}] at (0,0){data={scalefac=0.7,picname=1,channelcolor=RedLine, Linewidth=1.0pt}};
\pic[shift={(0,0)}] at (6,0){data={scalefac=0.7,picname=2,channelcolor=GreenD, Linewidth=1.0pt}};
\draw[LineD](B1.east)--node[above](RE){Removing}(B2.west);
\node[below=4pt of A1.south](DA){Dataset};
\node[below=4pt of A2.south](ND1){New dataset};
\node[Box,below=of RE](MO){Model};
\draw[LineA](DA)--node[right]{Machine Learning}(DA|-MO.north);
\draw[LineA](ND1)--node[right]{Retraining}(ND1|-MO.north);
\draw[LineB,green!60!black]([yshift=-1mm]MO.352)--++(0,-0.81)
node[below,align=center,black]{Computational power and\\ time consumption};
\end{scope}
\begin{scope}[local bounding box=RIGHT,shift={($(0,0)+(11,0)$)},
scale=1, every node/.append style={transform shape}]
\pic[shift={(0,0)}] at (0,0){data={scalefac=0.7,picname=1,channelcolor=RedLine, Linewidth=1.0pt}};
\pic[shift={(0,0)}] at (6,0){data={scalefac=0.7,picname=2,channelcolor=OrangeLine, Linewidth=1.0pt}};
\draw[LineD](B1.east)--node[above](RE){Removing}(B2.west);
\node[below=4pt of A1.south](DA){Dataset};
\node[below=4pt of A2.south](ND1){New dataset};
\node[Box,below=of RE](MO){Model};
\draw[LineA](DA)--node[right]{Machine Learning}(DA|-MO.north);
\draw[LineA](ND1)--node[right,align=left]{Machine\\ unlearning}(ND1|-MO.north);
\draw[LineB,orange!80!black]([yshift=-1mm]MO.352)--++(0,-0.81)
node[below,align=center,black]{Can we remove all influence of\\
someone's data when they ask to \\ delete it, but avoid the full cost of\\ retraining from scratch?};
\end{scope}
\node[below=3mm of RIGHT.south](ML){\textbf{b) Machine unlearning}};
\path[red](ML)-|coordinate(S)(LEFT.south);
\node[]at(S){\textbf{a) Machine retraining}};
\end{tikzpicture}
```
![](images/svg/_unlearning-strategies.svg){width=100%}
:::
The distinction between retraining and unlearning becomes critical in systems with tight latency, compute, or privacy constraints, because the assumptions underlying full retraining rarely hold in practice. Many deployed machine learning systems do not retain raw training data due to security, compliance, or cost constraints. In such environments, full retraining is often impractical and operationally disruptive, especially when data deletion must be verifiable, repeatable, and audit-ready.
@@ -2172,7 +2028,13 @@ The implications of such drift extend beyond raw accuracy. Fairness guarantees m
To ensure responsible behavior over time, machine learning systems must incorporate mechanisms for continual monitoring, evaluation, and corrective action. Monitoring involves more than tracking aggregate accuracy, it requires surfacing performance metrics across relevant subgroups, detecting shifts in input distributions, identifying anomalous outputs, and capturing meaningful user feedback. These signals must then be compared to predefined expectations around fairness, robustness, and transparency, and linked to actionable system responses such as model retraining, recalibration, or rollback.
Implementing effective monitoring depends on robust infrastructure. Systems must log inputs, outputs, and contextual metadata in a structured and secure manner. This requires telemetry pipelines that capture model versioning, input characteristics, prediction confidence, and post-inference feedback. These logs support drift detection and provide evidence for retrospective audits of fairness and robustness. Monitoring systems must also be integrated with alerting, update scheduling, and policy review processes to support timely and traceable intervention.
Implementing effective monitoring depends on robust infrastructure. Systems must log inputs, outputs, and contextual metadata in a structured and secure manner.
::: {#fig-monitoring-pipeline fig-env="figure" fig-pos="htb" fig-cap="**Responsible AI Monitoring Pipeline**. End-to-end observability for deployed models. Inferences are sampled and logged; metrics are computed across demographic subgroups; drift detectors identify performance or fairness regressions; and alerts trigger automated retraining or manual review. This continuous feedback loop ensures that responsible AI properties are maintained post-deployment." fig-alt="Flowchart showing Inference logs feeding into a Metric Computation engine. Metrics are split by demographic groups. Output flows to a Drift Detector which connects to an Alert System and a Retraining Trigger."}
![](images/svg/monitoring-pipeline.svg){width=100%}
:::
This requires telemetry pipelines that capture model versioning, input characteristics, prediction confidence, and post-inference feedback. These logs support drift detection and provide evidence for retrospective audits of fairness and robustness. Monitoring systems must also be integrated with alerting, update scheduling, and policy review processes to support timely and traceable intervention.
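A minimal sketch of such a telemetry record, with illustrative field names; a real pipeline would ship this to a log store rather than print it, and would hash or summarize inputs rather than log raw features:

```python
import json, time, uuid

def log_inference(model_version, input_digest, prediction, confidence, subgroup=None):
    """Emit one structured telemetry record (field names are illustrative)."""
    record = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": model_version,
        "input_digest": input_digest,     # hash/summary of features, never raw PII
        "prediction": prediction,
        "confidence": confidence,
        "subgroup": subgroup,             # enables sliced fairness metrics downstream
    }
    print(json.dumps(record))             # stand-in for a telemetry pipeline sink
    return record

log_inference("v2026.03.1", "sha256:demo", "approve", 0.91, subgroup="A")
```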
Monitoring also supports feedback-driven improvement. For example, repeated user disagreement, correction requests, or operator overrides can signal problematic behavior. This feedback must be aggregated, validated, and translated into updates to training datasets, data labeling processes, or model architecture. However, such feedback loops carry risks: biased user responses can introduce new inequities, and excessive logging can compromise privacy. Designing these loops requires careful coordination between user experience design, system security, and ethical governance.
@@ -2200,6 +2062,10 @@ Monitoring mechanisms provide the operational observability required to sustain
The transition from discriminative classification to generative large language models (LLMs) fundamentally alters the engineering surface of responsibility. Fairness is no longer merely a statistical parity metric between labeled groups; it evolves into **Generative Alignment**, the complex optimization problem of constraining open-ended stochastic outputs to remain helpful, harmless, and honest across a combinatorial explosion of possible prompts. This requires a transition from static dataset curation to dynamic behavioral shaping.
::: {#fig-rlhf-pipeline fig-env="figure" fig-pos="htb" fig-cap="**RLHF Alignment Pipeline**. Three-stage process for aligning generative models: (1) Supervised Fine-Tuning (SFT) on high-quality demonstrations; (2) Reward Model training on human preference rankings; (3) Proximal Policy Optimization (PPO) to fine-tune the model to maximize the reward signal. This process 'compiles' human values into model weights." fig-alt="Three-stage horizontal flow. Stage 1: Dataset to SFT Model. Stage 2: Prompts to Model to Rankings to Reward Model. Stage 3: Prompts to Policy Model to Reward to PPO update loop."}
![](images/svg/rlhf-pipeline.svg){width=100%}
:::
The primary mechanism for this shaping, Reinforcement Learning from Human Feedback (RLHF), serves as a sociotechnical bridge between human values and model weights. By training a reward model on human preferences---typically requiring 50,000 to 500,000 pairwise comparisons at a cost of \$0.50 to \$5.00 per label---engineers effectively compile subjective ethics into a differentiable loss function. This alignment process introduces an **alignment tax**, often observed as a 2--8% degradation in standard NLP benchmarks as the model trades raw capability for safety constraints. The reliance on human raters introduces a **representativeness gap**: if the labeling investment reflects only a narrow demographic slice, the resulting "aligned" model will inherently overfit to that specific cultural or socioeconomic context. Constitutional AI offers an alternative engineering path, using a set of high-level principles to guide AI feedback on its own outputs, thereby reducing the dependency on massive-scale human annotation while making the values explicit in the prompt rather than implicit in the rater pool.
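Napkin math on the ranges quoted above bounds the human-feedback budget:

```python
# 50,000-500,000 pairwise comparisons at $0.50-$5.00 per label.
low  = 50_000 * 0.50
high = 500_000 * 5.00
print(f"Preference-labeling budget: ${low:,.0f} to ${high:,.0f}")   # $25,000 to $2,500,000
```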
In Retrieval-Augmented Generation (RAG) architectures (@sec-inference-scale), responsibility becomes decoupled from the core model. An LLM may be perfectly aligned via extensive RLHF, yet still generate toxic or biased responses if the **retrieval layer** surfaces contaminated context. If a retrieval index disproportionately surfaces biased historical documents, the model---conditioned to be faithful to its context---will propagate that bias regardless of its internal safety training. This necessitates **context filtering** as a distinct infrastructure component, validating retrieved chunks for toxicity and bias before they reach the generation context window.
@@ -2229,32 +2095,8 @@ The Sociotechnical Feedback Invariant (Principle \ref{nte-sociotechnical-feedbac
Machine learning systems do not merely observe and model the world; they also shape it. Once deployed, their predictions and decisions often influence the environments they are intended to analyze. This feedback alters future data distributions, modifies user behavior, and affects institutional practices, creating a recursive loop between model outputs and system inputs. Over time, such dynamics can amplify biases, entrench disparities, or unintentionally shift the objectives a model was designed to serve.
::: {.callout-note title="Figure: Bias Feedback Loop"}
![](images/svg/bias-loop.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\tikzset{
block/.style={line width=0.75pt, rounded corners=2pt, align=center, minimum width=2.5cm, minimum height=1.2cm},
arrow/.style={->, line width=1.0pt},
centerText/.style={text=RedLine, font=\bfseries}
}
% The Loop
\node[block, draw=BrownLine, fill=BrownL] (Data) {\textbf{Biased Training Data}\\(Historical Disparities)};
\node[block, draw=BlueLine, fill=BlueL, right=1.5cm of Data] (Model) {\textbf{Model Learning}\\(Captures Correlations)};
\node[block, draw=OrangeLine, fill=OrangeL, below=1.8cm of Model] (Decision) {\textbf{Biased Decisions}\\(e.g., Target Specific Area)};
\node[block, draw=BrownLine, fill=BrownL, below=1.8cm of Data] (NewData) {\textbf{New Observations}\\(Confirms the Bias)};
% Arrows
\draw[arrow] (Data) -- (Model);
\draw[arrow] (Model) -- (Decision);
\draw[arrow] (Decision) -- (NewData);
\draw[arrow] (NewData) -- node[left, align=center, font=\tiny] {Retrain\\on new data} (Data);
\path (Data) -- (Decision) coordinate[midway] (center);
\node[centerText] at (center) {SELF-REINFORCING};
\end{tikzpicture}
```
**Bias Amplification Loop**. Visualizing how a deployed model influences future training data. A model trained on biased data makes biased decisions (e.g., more police patrols in specific areas). These decisions generate new data (more arrests in those areas), which is then used to re-train the model, reinforcing the original bias in a self-fulfilling prophecy.
:::
@@ -2452,52 +2294,7 @@ Establishing effective organizational structures for responsible AI requires mor
The responsibility for ethical system behavior is distributed across multiple constituencies, including industry, academia, civil society, and government. @fig-human-centered-ai maps this distribution across nested layers of accountability, from individual teams implementing technical practices through organizational safety culture to industry-wide certification and government regulation [@schneiderman2020]. Within organizations, this distribution must be mirrored by mechanisms that connect technical design with strategic oversight and operational control. Without these linkages, responsibility becomes diffuse, and well-intentioned efforts may be undermined by systemic misalignment.
::: {#fig-human-centered-ai fig-env="figure" fig-pos="htb" fig-cap="**Stakeholder Responsibility**: Effective human-centered AI implementation requires shared accountability across industry, academia, civil society, and government to address ethical considerations and systemic risks. These diverse groups shape technical design, strategic oversight, and operational control, ensuring responsible AI development and deployment throughout the model lifecycle. " fig-alt="Four nested ellipses representing responsibility layers. Inner to outer: Team with technical practices, Organization with safety culture, Industry with certification and independent oversight, Government regulation at outermost layer."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{%
elli1/.style={draw,line width=1pt,ellipse,inner sep=-2pt,align=left, anchor=south west,fill=cyan!50,align=center,
text width=45mm, minimum height=33mm},
elli2/.style={elli1,fill=cyan!30,align=center, text width=80mm, minimum height=55mm},
elli3/.style={elli1,fill=cyan!15,align=center, text width=110mm, minimum height=75mm},
elli4/.style={elli1,fill=cyan!8,align=center, text width=120mm, minimum height=90mm},
}
\node[elli4](E4)at(-0.35,-0.35){};
\node[align=center,anchor=north, below=9pt of E4.north]{\textbf{GOVERNMENT REGULATION}};
%
\node[elli3](E3)at(-0.2,-0.2){};
\node[align=center,anchor=north, below=4pt of E3.north]{\textbf{INDUSTRY:} \\
\textbf{Trustworthy Certification:} \\ \textbf{External Reviews}};
\node[font=\footnotesize\usefont{T1}{phv}{m}{n},align=left,anchor=east, left=10pt of E3.east,yshift=7mm]{%
\textbf{Independent Oversight:}\\
\quad Auditing Firms, \\
\qquad Insurance Companies, \\
\qquad\quad NGOs \& Civil Society\\
\qquad\qquad Professional Societies};
%
\node[elli2](E2)at(-0.1,-0.1){};
\node[align=center,anchor=north, below=4pt of E2.north]{\textbf{ORGANIZATION:} \\
\textbf{Safety Culture:} \\ \textbf{Organizational Design}};
\node[font=\footnotesize\usefont{T1}{phv}{m}{n},align=left,anchor=east, left=20pt of E2.east]{%
\textbf{Management Strategies:}\\
\quad Leadership Commitment, \\
\qquad Hiring \& Training, \\
\qquad\quad Failures \& Near Misses,\\
\qquad\qquad Internal Reviews\\
\qquad \qquad\quad Industry Standards};
%
\node[elli1](E1)at(0,0){};
\node[align=center,anchor=north, below=4pt of E1.north](TE1){
\textbf{TEAM:}\\ \textbf{Reliable Systems:}\\ \textbf{Software Engineering}};
\node[font=\footnotesize\usefont{T1}{phv}{m}{n},align=left,anchor=east, below=20pt of TE1,yshift=7mm]{%
\textbf{Technical Practices:}\\
\quad Audit Trails, SE Workflows\\
\qquad Verification \& Bias Testing\\
\qquad \quad Explainable UIs};
\end{tikzpicture}
```
![](images/svg/_human-centered-ai.svg){width=100%}
:::
Responsible AI is not merely a question of technical excellence or regulatory compliance. It is a systems-level challenge that requires aligning ethical objectives with the institutional structures through which machine learning systems are designed, deployed, and maintained. Creating and sustaining these structures is important for ensuring that responsibility is embedded not only in the model, but in the organization that governs its use.
@@ -2658,39 +2455,7 @@ One concrete example comes from recommendation systems, where a model trained to
[^fn-reward-hacking]: **Reward Hacking**: When an AI system maximizes its reward function through unintended means that violate designer intent. A Tetris AI learned to pause indefinitely to avoid losing; a cleaning robot knocked over objects to create messes it could then clean up. For production ML systems, reward hacking manifests subtly: recommendation models that maximize engagement by promoting addictive content, or chatbots that maximize helpfulness ratings by being sycophantic rather than accurate. The failure mode scales with model capability. \index{Reward Hacking!value misalignment}
::: {#fig-reward-hacking-loop fig-env="figure" fig-pos="htb" fig-cap="**Reward Hacking Loop**: Maximizing measurable rewards, like clicks, can incentivize unintended model behaviors that undermine the intended goal of user satisfaction. Optimizing for proxy metrics creates misalignment between a system's objective and desired outcomes, posing challenges for value alignment in AI safety." fig-alt="Flowchart showing reward hacking cycle. True goal of user satisfaction leads to ML recommender, which produces clickbait behavior causing misinformation. Proxy reward of maximizing clicks feeds back to agent, bypassing the intended objective."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{%
LineA/.style={line width=1.0pt,black!50,latex-latex},
Line/.style={BrownLine!60, -{Triangle[width = 6pt, length = 6pt]}, line width = 1.7pt,text=black},
Box/.style={inner xsep=2pt,inner ysep=6pt,
node distance=0.9,
draw=RedLine,
line width=0.75pt,
fill=RedL!60,
align=flush center,
text width=59mm,
minimum width=59mm, minimum height=13mm
},
Box2/.style={Box,draw=GreenLine,fill=GreenL},
Box3/.style={Box, draw=VioletLine,fill=VioletL2},
Box4/.style={Box,draw=BlueLine,fill=BlueL!50},
Box5/.style={Box,draw=OrangeLine,fill=OrangeL!50},
}
\node[Box](B1){\textbf{True Goal:}\\ Maximize User Satisfaction};
\node[Box2,below=of B1](B2){\textbf{Agent:}\\ ML Recommender System};
\node[Box3,below=of B2](B3){\textbf{Behavior:}\\ Promote Clickbait or Addictive Content};
\node[Box4,below=of B3](B4){\textbf{Unintended Consequences:}\\ Misinformation, Addiction, Misuse};
\node[Box5,right=3of B4](B5){\textbf{Proxy Reward:}\\ Maximize Clicks};
\draw[Line](B1)--node[right]{Intended Objective}(B2);
\draw[Line](B2)--(B3);
\draw[Line](B3)--(B4);
\draw[Line](B4)--node[above]{Feedback}(B5);
\draw[Line](B5)|-node[above,pos=0.66]{Optimized Instead}(B2);
\end{tikzpicture}
```
![](images/svg/_reward-hacking-loop.svg){width=100%}
:::
In 1960, Norbert Wiener wrote, "if we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively... we had better be quite sure that the purpose put into the machine is the purpose which we desire" [@wiener1960some].

View File

@@ -347,51 +347,8 @@ class RobustAISetup:
Consider how hardware reliability directly impacts ML performance. A single bit flip in a critical neural network weight can degrade ResNet-50 classification accuracy from 76.0% (top-1) to 11% on ImageNet, while memory subsystem failures during training corrupt gradient updates and prevent model convergence. Modern transformer models such as GPT-3 with `{python} RobustAISetup.gpt3_params_b`&nbsp;B parameters execute 10^15 floating-point operations per inference and create over one million opportunities for hardware faults during a single forward pass. GPU memory systems operating at up to `{python} RobustAISetup.v100_mem_bw` GB/s bandwidth (such as V100 HBM2) process 10^11 bits per second, where base error rates of 10^-17 errors per bit translate to multiple potential faults per hour of operation.
::: {.callout-note title="Figure: Weight Corruption via Bit Flip"}
![](images/svg/_weight-corruption.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
bit/.style={draw=BrownLine, line width=0.75pt, minimum width=0.4cm, minimum height=0.6cm, font=\tiny, outer sep=0pt},
signBit/.style={bit, fill=BlueL},
expBit/.style={bit, fill=OrangeL},
mantBit/.style={bit, fill=GreenL},
flipBit/.style={bit, fill=RedL},
crossLine/.style={RedLine, line width=1.0pt},
arrowLine/.style={->, line width=1.0pt, RedLine}
}
% IEEE 754 Layout
\node[signBit] (S) {0};
\node[font=\tiny, above=0.1cm of S] {Sign};
\node[expBit, right=0pt of S] (E1) {0};
\foreach \i [evaluate=\i as \prev using int(\i-1)] in {2,...,8} {
\node[expBit, right=0pt of E\prev] (E\i) {0};
}
\node[font=\tiny, above=0.1cm of E4, xshift=0.2cm] {Exponent (8 bits)};
\node[mantBit, right=0pt of E8] (M9) {1};
\foreach \i [evaluate=\i as \prev using int(\i-1)] in {10,...,12} {
\node[mantBit, right=0pt of M\prev] (M\i) {1};
}
\node[right=0.2cm of M12] (Dots) {...};
\node[font=\tiny, above=0.1cm of M10, xshift=0.2cm] {Mantissa (23 bits)};
% Clean Value
\node[anchor=west, right=0.5cm of Dots] (CleanVal) {$\approx 0.001$};
% Bit flip animation
\draw[crossLine] (E4.south west) -- (E4.north east);
\draw[crossLine] (E4.north west) -- (E4.south east);
\node[flipBit, below=1cm of E4] (E4f) {1};
\draw[arrowLine] (E4) -- (E4f) node[midway, right, font=\tiny] {Flip!};
% Corrupted Value
\node[anchor=west, text=RedLine, below=0.4cm of CleanVal.west] (CorruptVal) {$\approx 1.2 \times 10^{19}$};
\node[anchor=north, font=\scriptsize, text=BrownLine, below=0.5cm of E4f, xshift=2cm] {Exponent bit-flip causes order-of-magnitude explosion.};
\end{tikzpicture}
```
**Weight Corruption via Bit Flip**. Visualization of a floating-point number's bit representation. A single bit flip in the exponent can change a weight from a small value (0.001) to an astronomical magnitude ($10^{19}$), causing neurons to saturate and propagating massive errors that lead to catastrophic misclassification.
:::
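The effect is easy to reproduce with a short sketch that flips one bit of a float32 weight's encoding. The exact corrupted magnitude depends on which exponent bit flips, so the $10^{19}$ in the figure is illustrative:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 single-precision encoding of `value`."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

weight = 0.001
print(flip_bit(weight, 30))   # MSB of exponent: ~3.4e35, dozens of orders of magnitude off
print(flip_bit(weight, 22))   # MSB of mantissa: ~0.0015, a modest relative change
```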
@@ -745,54 +702,8 @@ Detection requires proving that the observed change is statistically unlikely un
:::
::: {.callout-note title="Figure: Types of Distribution Shift"}
![](images/svg/shift-types.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.7, node distance=6cm and 4cm]
\tikzset{
axis/.style={->, line width=1.0pt, BrownLine},
dist/.style={line width=1.0pt, domain=-1.5:1.5, samples=50},
shifted/.style={dashed, line width=1.0pt, domain=-1.5:1.5, samples=50},
box/.style={minimum width=1cm, minimum height=2cm, anchor=south west, inner sep=0pt},
label/.style={align=center, font=\tiny}
}
% 1. Covariate Shift
\begin{scope}[local bounding box=covariate]
\node[axis] (x-axis) at (0,0) {};
\node[axis, above=3cm of x-axis] (y-axis) {};
\draw[axis] (x-axis) -- (4,0) node[right] {$x$};
\draw[axis] (0,0) -- (0,3) node[above] {$P(x)$};
\draw[dist, BlueLine] plot (\x+1.5, {2*exp(-(\x)*(\x))});
\draw[shifted, RedLine] plot (\x+2.5, {2*exp(-(\x)*(\x))});
\node[label, below=1cm of x-axis] {\textbf{Covariate Shift}\\$P(x)$ changes\\$P(y|x)$ constant};
\end{scope}
% 2. Label Shift
\begin{scope}[right=of covariate, local bounding box=labelshift]
\node[axis] (x-axis) at (0,0) {};
\node[axis, above=3cm of x-axis] (y-axis) {};
\draw[axis] (x-axis) -- (4,0) node[right] {$y$};
\draw[axis] (0,0) -- (0,3) node[above] {$P(y)$};
\node[fill=BlueL, box] at (0.5,0) {};
\node[fill=BlueL, minimum height=1cm, box] at (2.5,0) {};
\node[fill=RedL, opacity=0.5, box] at (0.5,0) {};
\node[fill=RedL, opacity=0.5, minimum height=2cm, box] at (2.5,0) {};
\node[label, below=1cm of x-axis] {\textbf{Label Shift}\\$P(y)$ changes\\$P(x|y)$ constant};
\end{scope}
% 3. Concept Drift
\begin{scope}[right=of labelshift, local bounding box=conceptdrift]
\node[axis] (x-axis) at (0,0) {};
\node[axis, above=3cm of x-axis] (y-axis) {};
\draw[axis] (x-axis) -- (4,0) node[right] {$x$};
\draw[axis] (0,0) -- (0,3) node[above] {$y$};
\draw[line width=1.0pt, BlueLine] (0.5, 0.5) -- (3.5, 2.5);
\draw[dashed, line width=1.0pt, RedLine] (0.5, 2.5) -- (3.5, 0.5);
\node[label, below=1cm of x-axis] {\textbf{Concept Drift}\\$P(y|x)$ changes\\Relationship evolves};
\end{scope}
\end{tikzpicture}
```
**Types of Distribution Shift**. Comparison of Covariate Shift ($P(x)$ changes), Label Shift ($P(y)$ changes), and Concept Drift ($P(y|x)$ changes). Understanding the specific type of shift is crucial for selecting the correct adaptation strategy (e.g., importance re-weighting vs. model retraining).
:::
@@ -804,39 +715,8 @@ Distribution shifts occur naturally as environments evolve. User preferences cha
Covariate shift occurs when the input distribution changes while the relationship between inputs and outputs remains constant [@quinonero2009dataset]. Autonomous vehicle perception models trained on daytime images (luminance 1,000-100,000 lux) experience 15--30% accuracy degradation when deployed in nighttime conditions (0.1-10 lux), despite unchanged object recognition tasks. Weather conditions introduce additional covariate shift: rain reduces object detection mAP by 12%, snow by 18%, and fog by 25% compared to clear conditions. These environmental changes effectively shift data points relative to the learned *decision boundary*, causing misclassification without any change to the model itself.
::: {.callout-note title="Decision Boundary Under Shift"}
![](images/svg/boundary-shift.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=2cm,
BlueLine/.style={line width=1.0pt, color=blue},
RedLine/.style={line width=1.0pt, color=red},
GreenL/.style={fill=green},
BrownLine/.style={dashed, color=brown, font=\tiny},
every node/.style={font=\small\usefont{T1}{phv}{m}{n}}
]
% Decision Boundary (Curved)
\draw[BlueLine] plot [smooth, tension=1] coordinates {(-1,4) (2,2) (5,0)};
\node[BlueLine, font=\scriptsize, anchor=south west] at (4, 1.5) {Decision Boundary};
% Regions
\node (ClassA) at (0, 1) {\textbf{Class A} (Stop)};
\node (ClassB) at (4, 3) {\textbf{Class B} (Speed)};
% Clean Point
\node[circle, GreenL, inner sep=2pt, label={below:Training $x$}, anchor=center] (X) at (1, 1.5) {};
% Shifted Point
\node[circle, fill=red, inner sep=2pt, label={above:Shifted $x'$}, anchor=center] (Xadv) at (2.5, 2.5) {};
% Shift Vector
\draw[->, RedLine] (X) -- (Xadv) node[midway, sloped, above, font=\tiny] {Distribution shift $\Delta$};
% Distance constraint
\draw[BrownLine] (X) circle (2.2);
\node[BrownLine, anchor=north] at (1, -0.8) {Shift magnitude $|\|\Delta|\|$};
\end{tikzpicture}
```
**Decision Boundary Under Distribution Shift**. Environmental changes (e.g., daytime to nighttime, clear to foggy) shift data points in the input space. When the shift moves points across the learned decision boundary, the model misclassifies inputs that it would have handled correctly under training conditions. Unlike adversarial perturbations, these shifts arise naturally and affect entire populations of inputs rather than individual examples.
:::
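Detecting such covariate shift in practice often starts with a two-sample test on a monitored input feature. A sketch using a Kolmogorov-Smirnov test on synthetic "daytime" versus "nighttime" illuminance values (numbers illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_lux = rng.normal(loc=10_000, scale=3_000, size=5_000)   # "daytime" training inputs
live_lux  = rng.normal(loc=5, scale=2, size=5_000)            # "nighttime" serving inputs

stat, p_value = ks_2samp(train_lux, live_lux)
if p_value < 0.01:
    print(f"Covariate shift detected: KS statistic {stat:.2f}, p = {p_value:.1e}")
```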
@@ -1289,9 +1169,7 @@ The gradient $\nabla_x J(\theta, x, y)$ quantifies how the loss function changes
@fig-gradient-attack visualizes how this approach generates adversarial examples by taking a single step in the direction that increases the loss most rapidly, moving the input across the decision boundary with minimal perturbation.
::: {#fig-gradient-attack fig-env="figure" fig-pos="htb" fig-cap="**Adversarial Perturbations**: Gradient-based attacks generate subtle, intentionally crafted input noise with magnitude controlled by $\\epsilon$ that maximizes the loss function $J(\\theta, x, y)$ and causes misclassification by the model. These perturbations, imperceptible to humans, exploit model vulnerabilities by moving the input $x$ across the decision boundary. Source: [ivezic](HTTPS://defence.AI/AI-security/gradient-based-attacks/)." fig-alt="Diagram showing FGSM attack process: original input x, gradient computation, epsilon-scaled perturbation, and resulting adversarial example crossing decision boundary."}
![](./images/png/gradient_attack.png)
![](images/svg/gradient-attack.svg){width=100%}
:::
The Projected Gradient Descent (PGD) attack [@madry2017towards] extends FGSM by iteratively applying the gradient update step, producing more refined adversarial examples with higher attack success rates. PGD projects each perturbation step back into a constrained norm ball around the original input, ensuring that the adversarial example remains within a specified distortion limit. The iterative refinement makes PGD a stronger white-box attack and a benchmark for evaluating model robustness.
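A minimal FGSM sketch in PyTorch; PGD simply iterates this step and projects back into the $\epsilon$-ball after each update. The toy linear model and data below are placeholders:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: perturb x by eps in the sign of the input-gradient of the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# Toy usage with a throwaway linear classifier on [0, 1]-scaled inputs.
model = torch.nn.Linear(4, 3)
x, y = torch.rand(2, 4), torch.tensor([0, 2])
x_adv = fgsm(model, x, y, eps=0.03)
print((x_adv - x).abs().max())   # perturbation bounded by eps
```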

View File

@@ -89,24 +89,8 @@ The root cause is the difference between transient processing and persistent lea
Architectural complexity compounds these challenges. A contemporary ML deployment spans data ingestion pipelines, distributed training infrastructure, model serving systems, and continuous monitoring frameworks. Each component introduces distinct vulnerabilities that propagate through the entire computational stack. Continuous adaptation at edge nodes and federated coordination protocols further expand the attack surface while complicating comprehensive security implementation.
::: {.callout-note title="Figure: ML System Attack Surface"}
![](images/svg/attack-surface-taxonomy.svg){width=100%}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=0.5cm]
\tikzset{
layer/.style={draw=GrayLine, line width=0.75pt, rounded corners=2pt, align=center, minimum width=6cm, minimum height=1.2cm},
arrow/.style={->, line width=1.0pt, GrayLine!40}
}
% Layers
\node[layer, fill=GreenL] (Api) {\textbf{API / Interface Layer}\\Adversarial Examples, Model Extraction,\\Membership Inference};
\node[layer, fill=OrangeL, below=of Api] (Model) {\textbf{Model Layer}\\Backdoor Injection, Trojan Weights,\\Parameter Theft};
\node[layer, fill=BlueL, below=of Model] (Data) {\textbf{Data Layer}\\Poisoning, Label Manipulation,\\Data Leakage};
\node[layer, fill=GrayL, below=of Data] (Hard) {\textbf{Hardware / Infrastructure Layer}\\Side-channel Attacks, Fault Injection,\\Supply Chain Compromise};
% Connecting arrows indicating depth
\draw[arrow] ([xshift=-0.5cm]Api.west) -- ([xshift=-0.5cm]Hard.west) node[midway, sloped, above, text=black] {Increasing Access Depth};
\end{tikzpicture}
```
**ML System Attack Surface**. Visualizing entry points for adversarial actions across the ML lifecycle. Defense requires a multi-layered approach: protecting data collection (Data Layer), securing weights and training (Model Layer), hardening inference endpoints (API Layer), and anchoring trust in silicon (Hardware Layer).
:::
@@ -306,18 +290,8 @@ In 2010, the [Stuxnet](https://www.research-collection.ethz.ch/bitstream/handle/
Modern ML supply chains face four analogous vectors: compromised dependencies (malicious packages in PyPI and conda repositories), poisoned datasets on public platforms, backdoored model weights in model repositories, and tampered accelerator firmware. Defense requires cryptographic signing of all model artifacts, immutable provenance logs for training data and code, automated scanning for backdoors before deployment, and controlled dependency management in air-gapped training environments. @fig-stuxnet maps these parallels between the Stuxnet attack chain and modern ML supply chain vulnerabilities.
::: {#fig-stuxnet fig-env="figure" fig-pos="htb" fig-cap="**Stuxnet**: Targets PLCs by exploiting Windows and Siemens software vulnerabilities, demonstrating supply chain compromise that enabled digital malware to cause physical infrastructure damage. Modern ML systems face analogous risks through compromised training data, backdoored dependencies, and tampered model weights." fig-alt="Flowchart showing Stuxnet attack chain: USB infection spreads through Windows vulnerabilities, targets Siemens Step7 software, compromises PLCs, and causes physical centrifuge damage."}
![](images/svg/stuxnet.svg){width=100%}
:::
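A sketch of the provenance side of that defense: content-address each model artifact and append the record to an append-only log. In practice the digest would also be signed (for example with Sigstore or GPG) rather than merely recorded; names below are illustrative.

```python
import hashlib, json, time

def record_provenance(path, signer="release-bot", log_path="provenance.log"):
    """Content-address a model artifact and append the record to an append-only log."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    entry = {"artifact": path, "sha256": digest.hexdigest(),
             "signer": signer, "ts": time.time()}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry

# record_provenance("model.safetensors")   # example invocation
```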
@@ -898,20 +872,11 @@ Consider these threat priority categories:
The framework guides resource allocation: the most common and accessible threats (model theft, data poisoning, and adversarial attacks) come first, followed by more specialized hardware and infrastructure vulnerabilities. Implementing defenses in this sequence maximizes security benefit per invested effort.
::: {.callout-note title="Figure: Threat Prioritization Matrix"}
![](images/svg/threat-prioritization-matrix.svg){width=100%}
**Threat Prioritization Matrix**. A 2x2 matrix classifying ML threats by Likelihood and Impact. **Critical** threats (e.g., Data Poisoning, Prompt Injection) require high-priority, automated defenses. **Specialized** threats (e.g., Hardware Side-channels) require deep engineering but may be less frequent. **Common** threats (e.g., Model Extraction) are often addressed through rate limiting and API design.
:::
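The prioritization can be sketched as a likelihood-times-impact score; the scores below are illustrative placeholders rather than calibrated estimates:

```python
# Illustrative (uncalibrated) likelihood/impact scores on a 1-5 scale.
threats = {
    "data poisoning":        {"likelihood": 4, "impact": 5},
    "adversarial examples":  {"likelihood": 4, "impact": 4},
    "model extraction":      {"likelihood": 3, "impact": 3},
    "hardware side-channel": {"likelihood": 1, "impact": 4},
}
for name, s in sorted(threats.items(),
                      key=lambda kv: kv[1]["likelihood"] * kv[1]["impact"],
                      reverse=True):
    print(f"{name:<22} priority = {s['likelihood'] * s['impact']}")
```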
@@ -1031,21 +996,8 @@ Understanding when and where different attacks occur in the ML lifecycle helps p
The lifecycle perspective reveals that different threats require different defensive strategies. Data validation protects the collection phase, secure training environments protect the training phase, access controls and API design protect deployment, and input validation protects inference. Mapping attacks to lifecycle stages allows security teams to implement appropriate defenses at the right architectural layers.
::: {#fig-ml-lifecycle-threats fig-env="figure" fig-pos="htb" fig-cap="**ML Lifecycle Threats**: Model theft, data poisoning, and adversarial attacks target distinct stages of the machine learning lifecycle (from data ingestion to model deployment and inference), creating unique vulnerabilities at each step. Understanding these lifecycle positions clarifies attack surfaces and guides the development of targeted defense strategies for robust AI systems." fig-alt="Vertical flowchart with four ML lifecycle stages. Threat arrows point to each: poisoning targets collection, backdoors target training, model theft targets deployment, adversarial examples target inference."}
![](images/svg/_ml-lifecycle-threats.svg){width=100%}
:::
@@ -1115,21 +1067,8 @@ Model theft can target two distinct objectives: extracting exact model propertie
@fig-model-theft-types distinguishes two distinct attack paths. In exact model theft, the attacker gains access to the model's internal components, including serialized files, weights, and architecture definitions, and reproduces the model directly. In contrast, approximate model theft relies on observing the model's input-output behavior, typically through a public API. By repeatedly querying the model and collecting responses, the attacker trains a surrogate that mimics the original model's functionality. The first approach compromises the model's internal design and training investment, while the second threatens its predictive value and can facilitate further attacks such as adversarial example transfer or model inversion.
::: {#fig-model-theft-types fig-env="figure" fig-pos="htb" fig-cap="**Model Theft Strategies**: Attackers can target either a model's internal parameters or its external behavior to create a stolen copy. Direct theft extracts model weights and architecture, while approximate theft trains a surrogate model by querying the original's input-output behavior, potentially enabling further attacks despite lacking direct access to internal components." fig-alt="Two parallel flowcharts. Left shows approximate theft: API access, crafted queries, record responses, train surrogate, replicate predictions. Right shows exact theft: access model file, extract parameters, reconstruct model, use proprietary IP."}
![](images/svg/_model-theft-types.svg){width=100%}
:::
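A toy sketch of the approximate path: repeatedly query a black-box "oracle" (here a stand-in function), record its answers, and fit a surrogate to them. The oracle, query strategy, and model choice are all hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oracle(X):
    """Stand-in for the victim model's prediction API."""
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

rng = np.random.default_rng(0)
queries = rng.uniform(0, 1, size=(5_000, 2))          # attacker-chosen query batch
labels = oracle(queries)                              # recorded API responses
surrogate = DecisionTreeClassifier(max_depth=5).fit(queries, labels)
print(f"Surrogate fidelity on queries: {surrogate.score(queries, labels):.2f}")
```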
@@ -1623,22 +1562,8 @@ These threat types span different stages of the ML lifecycle and demand distinct
The appropriate defense for a given threat depends on its type, attack vector, and where it occurs in the ML lifecycle. Matching threats to defenses becomes clearer through the decision flow in @fig-threat-mitigation-flow, which connects common threat categories, such as model theft, data poisoning, and adversarial examples, to corresponding defensive strategies. While real-world deployments may require more nuanced combinations of defenses as discussed in our layered defense framework, this flowchart serves as a conceptual guide for aligning threat models with practical mitigation techniques.
::: {#fig-threat-mitigation-flow fig-env="figure" fig-pos="H" fig-cap="**Threat Mitigation Flow**: This diagram maps common machine learning threats to corresponding defense strategies, guiding selection based on attack vector and lifecycle stage. By following this flow, practitioners can align threat models with practical mitigation techniques, such as secure model access and data sanitization, to build more robust AI systems." fig-alt="Three-column flowchart mapping threats to defenses. Model theft: secure access, encrypt artifacts. Data poisoning: validate data, provenance checks. Adversarial examples: input validation, adversarial training."}
![](images/svg/_threat-mitigation-flow.svg){width=100%}
:::
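As a minimal illustration of this decision flow, the sketch below encodes the figure's threat-to-defense mapping as a lookup table. The category names and defense lists paraphrase the figure; the fallback behavior is an assumption.

```python
# Minimal sketch of the threat-to-defense mapping shown in the figure.
# Categories and defenses paraphrase the figure; the fallback is an assumption.
DEFENSES = {
    "model theft": ["secure model access", "encrypt model artifacts"],
    "data poisoning": ["validate training data", "provenance checks"],
    "adversarial examples": ["input validation", "adversarial training"],
}

def recommend_defenses(threat):
    """Return baseline mitigations for a known threat category."""
    return DEFENSES.get(threat.lower(), ["apply layered defenses; revisit the threat model"])

print(recommend_defenses("Data poisoning"))
```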

View File

@@ -614,75 +614,7 @@ A central ethical challenge lies in balancing technological progress with ecolog
The ethical imperative extends beyond sustainability to encompass broader concerns related to transparency, fairness, and accountability. @fig-ethical-ai illustrates the ethical challenges associated with AI development, linking different types of concerns, including inscrutable evidence, unfair outcomes, and traceability, to issues like opacity, bias, and automation bias [@coe2023ethical]. These concerns extend to sustainability, as the environmental trade-offs of AI development are often opaque and difficult to quantify. The lack of traceability in energy consumption and carbon emissions can lead to unjustified actions, where companies prioritize performance gains without fully understanding or disclosing the environmental costs.
::: {#fig-ethical-ai fig-env="figure" fig-pos="htb" fig-cap="**Ethical AI Concerns**: AI systems introduce ethical challenges across transparency, fairness, and sustainability; these concerns interrelate and stem from issues like opacity, bias, and a lack of traceability in resource consumption. Addressing these challenges requires proactive design choices that prioritize accountability and minimize negative societal and environmental impacts. " fig-alt="Flowchart linking 6 types of AI concerns on left to 12 ethical challenges on right. Evidence concerns connect to opacity and bias. Unfair outcomes link to discrimination. Traceability connects to responsibility and auditing issues."}
![](images/svg/_ethical-concerns.svg){width=100%}
:::
Addressing these concerns demands greater transparency and accountability from AI companies. Large technology firms operate extensive cloud infrastructures that power modern AI applications, yet their environmental impact remains opaque. Organizations must measure, report, and reduce their carbon footprint throughout the AI lifecycle, from hardware manufacturing to model training and inference. Voluntary self-regulation provides an initial step, but policy interventions and industry-wide standards may be necessary to ensure long-term sustainability. Reported metrics such as energy consumption, carbon emissions, and efficiency benchmarks can hold organizations accountable.
@@ -917,53 +849,8 @@ The carbon impact of electricity consumption depends critically on the energy ge
Geographic optimization can reduce carbon emissions by 10-50$\times$ through strategic training location selection.
::: {.callout-note title="Figure: Carbon Intensity Variation"}
![](images/svg/_regional-carbon.svg){width=100%}
**Geographic Carbon Intensity**. The carbon footprint of a training job depends critically on *where* it runs. Regions with hydro or nuclear power (e.g., Quebec, France) have carbon intensities 10-50$\times$ lower than regions reliant on coal (e.g., Poland, West Virginia). Carbon-aware scheduling exploits this variance by moving non-urgent jobs to cleaner grids.
:::
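The 40$\times$ claim in the figure follows directly from the grid intensities shown. The sketch below redoes that arithmetic; the carbon intensities come from the figure, while the 1,000 MWh training-energy figure is an assumed example, not a measured workload.

```python
# Napkin math: grid carbon intensities (g CO2/kWh) are taken from the figure;
# the 1,000 MWh training-energy figure is an assumed example, not a measured run.
CARBON_INTENSITY_G_PER_KWH = {
    "Quebec (hydro)": 20,
    "Texas (gas + wind)": 350,
    "Poland (coal)": 800,
}
TRAINING_ENERGY_MWH = 1_000  # assumed energy for one large training run

for region, intensity in CARBON_INTENSITY_G_PER_KWH.items():
    tonnes_co2e = TRAINING_ENERGY_MWH * 1_000 * intensity / 1e6  # kWh x g/kWh -> tonnes
    print(f"{region:<20s} {tonnes_co2e:8,.0f} t CO2e")

ratio = CARBON_INTENSITY_G_PER_KWH["Poland (coal)"] / CARBON_INTENSITY_G_PER_KWH["Quebec (hydro)"]
print(f"Quebec vs. Poland: {ratio:.0f}x lower emissions for the same job")
```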
@@ -1037,47 +924,8 @@ $$AI_{crossover} = \frac{100 \text{ pJ/byte}}{10 \text{ pJ/FLOP}} = 10 \text{ FL
The energy roofline model visualizes this relationship between arithmetic intensity and energy efficiency, revealing how different workload types are constrained by different bottlenecks.
::: {.callout-note title="Figure: Energy Roofline Model"}
![](images/svg/energy-roofline.svg){width=100%}
**Energy Roofline Model**. Just as performance rooflines limit FLOPs/sec based on bandwidth, energy rooflines limit FLOPs/Joule. Workloads with low arithmetic intensity (left) are dominated by memory energy ($E_{byte}$), while compute-heavy workloads (right) are limited by arithmetic energy ($E_{flop}$). Optimizing the wrong metric yields diminishing returns.
:::
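A short sketch makes the crossover concrete. It uses the 100 pJ/byte and 10 pJ/FLOP reference costs from the text and models total energy per operation as $E_{flop} + E_{byte}/AI$, a simplifying assumption that ignores static power and cache effects.

```python
# Energy roofline sketch using the reference costs from the text.
# Simplifying assumption: energy per operation = E_flop + E_byte / AI
# (static power and cache effects are ignored).
E_BYTE_PJ = 100.0  # energy to move one byte (pJ)
E_FLOP_PJ = 10.0   # energy for one arithmetic operation (pJ)

AI_CROSSOVER = E_BYTE_PJ / E_FLOP_PJ  # = 10 FLOP/byte

def flops_per_joule(ai):
    """Achievable energy efficiency (FLOP/J) at arithmetic intensity `ai` (FLOP/byte)."""
    pj_per_flop = E_FLOP_PJ + E_BYTE_PJ / ai
    return 1e12 / pj_per_flop  # pJ/FLOP -> FLOP/J

for ai in (0.25, 1, 10, 100):  # element-wise ops sit on the left, MatMul on the right
    print(f"AI = {ai:6.2f} FLOP/B -> {flops_per_joule(ai):.2e} FLOP/J")
print(f"Crossover at {AI_CROSSOVER:.0f} FLOP/byte")
```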
@@ -1841,46 +1689,7 @@ The environmental impact of AI workloads has emerged as a concern, with carbon e
[^fn-nas-carbon-cost]: **Neural Architecture Search (NAS) Carbon Cost**: The 284,000 kg CO₂ figure from Strubell et al. (2019) represents evaluating 12,800 architecture configurations, equivalent to the annual emissions of 140 average Americans. This extreme cost catalyzed efficient NAS research: weight-sharing methods like DARTS reduced search cost by 1,000$\times$, demonstrating that the meta-optimization of *how* we search for architectures is itself a sustainability lever. \index{Neural Architecture Search!carbon cost}
::: {#fig-carbonfootprint fig-env="figure" fig-pos="htb" fig-cap="**Carbon Footprint Benchmarks**: Training large AI models generates carbon emissions that dwarf familiar benchmarks such as long-distance travel, underscoring the environmental impact of increasingly complex machine learning workloads. Comparing against roundtrip flights, average annual per-person emissions, and a car's lifetime footprint contextualizes just how high the energy demands of training a transformer model with neural architecture search are." fig-alt="Horizontal bar chart of CO2 emissions in kg. From lowest to highest: NY-SF flight at 900, human life at 5,000, American life at 16,400, US car lifetime at 57,150, Transformer with NAS at 284,000."}
![](images/svg/_carbon-benchmarks.svg){width=100%}
:::
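The same comparison can be reproduced as a napkin-math block with the `bar_compare` helper added to the plotting utilities in this commit. The import path below is an assumption (it presumes the helper is exposed as `mlsysim.viz`); the values are the lbs-of-CO2e figures from the original chart.

```python
from mlsysim.viz import bar_compare  # assumed module path for the helper added in this commit

labels = ["NY-SF flight", "Human life (1 yr)", "American life (1 yr)",
          "US car (lifetime)", "Transformer w/ NAS"]
values_lbs_co2e = [1_984, 11_023, 36_156, 126_000, 626_155]

fig = bar_compare(labels, values_lbs_co2e,
                  title="Common carbon footprint benchmarks",
                  ylabel="lbs of CO2 equivalent")
```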
The training phase of large natural language processing models produces carbon dioxide emissions comparable to hundreds of transcontinental flights. When examining the broader industry impact, AI's aggregate computational carbon footprint is approaching parity with the commercial aviation sector. As AI applications scale to serve billions of users globally, the cumulative emissions from continuous inference operations may ultimately exceed those generated during training.
@@ -1888,124 +1697,8 @@ The training phase of large natural language processing models produces carbon d
@fig-meta-analysis provides a detailed analysis of carbon emissions across various large-scale machine learning tasks at Meta, illustrating the environmental impact of different AI applications and architectures. This quantitative assessment of AI's carbon footprint underscores the need for more sustainable approaches to machine learning development and deployment, grounding mitigation strategies in measured environmental costs rather than estimates.
::: {#fig-meta-analysis fig-env="figure" fig-pos="htb" fig-cap="**Operational Carbon Footprint of Large-Scale ML Tasks**: Emissions for Meta's production models span offline training, online training, and inference, while the open-source models shown report training footprints only. For the recommendation models, inference accounts for a large share of total emissions, showing that the operational footprint of deployed models can rival or exceed the cost of training them." fig-alt="Stacked bar chart of CO2 emissions for 13 ML models. GPT-3 shows highest at nearly 1 million kg. Facebook recommendation models show significant inference portions. OSS models show training-only footprints."}
![](images/svg/carbon-lifecycle.svg){width=100%}
:::
@@ -2043,9 +1736,7 @@ The GHG Protocol[^fn-ghg-protocol-standard] framework [@ghgprotocol2023] provide
- **Scope 3 (Value Chain Emissions)**: Extend beyond direct control---semiconductor manufacturing, hardware transportation, end-of-life disposal of AI accelerators.
::: {#fig-ghg-protocol fig-env="figure" fig-pos="htb" fig-cap="**GHG Emission Scopes**: Organizations categorize carbon emissions into scope 1 (direct), scope 2 (purchased energy), and scope 3 (value chain) to comprehensively assess environmental impact. Source: Circularise." fig-alt="Diagram showing three concentric emission scopes: Scope 1 for direct emissions from owned sources, Scope 2 for purchased energy emissions, and Scope 3 for value chain emissions including manufacturing and disposal."}
![](images/svg/ghg-protocol.svg){width=100%}
:::
Categorizing these emissions into Scope 1, 2, and 3 frameworks provides a standardized vocabulary for corporate environmental reporting. Correctly applying this framework in practice requires classifying the various hidden emission sources across a typical ML platform's operational lifecycle.
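As a minimal sketch of that classification exercise, the block below assigns a handful of typical ML-platform emission sources to scopes. The Scope 3 entries mirror the examples above; the Scope 1 and Scope 2 entries are illustrative additions.

```python
# Illustrative scope classification for typical ML-platform emission sources.
# Scope 3 entries mirror the examples above; Scope 1 and 2 entries are illustrative.
EMISSION_SOURCES = {
    "On-site diesel backup generators": "Scope 1",
    "Purchased grid electricity for GPU clusters": "Scope 2",
    "Semiconductor manufacturing of accelerators": "Scope 3",
    "Transportation of hardware to datacenters": "Scope 3",
    "End-of-life disposal of retired accelerators": "Scope 3",
}

for scope in ("Scope 1", "Scope 2", "Scope 3"):
    sources = [name for name, s in EMISSION_SOURCES.items() if s == scope]
    print(f"{scope}: {'; '.join(sources)}")
```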
@@ -2190,6 +1881,10 @@ Unlike traditional software applications with fixed energy footprints, inference
#### The Energy Inefficiency of the Decode Phase {#sec-sustainable-ai-energy-inefficiency-decode-phase-24da}
::: {#fig-prefill-decode-energy fig-env="figure" fig-pos="htb" fig-cap="**Prefill vs. Decode Energy Intensity**. During the prefill phase, the GPU achieves high arithmetic intensity and high energy efficiency (pJ/FLOP). In the decode phase, the system becomes memory-bandwidth bound, reading the full weight set for every token. This results in significant static power waste as compute units sit idle while waiting for memory, making decode 10-50x less energy-efficient than prefill per operation." fig-alt="Two roofline plots. Left (Prefill) shows dot high on the compute ceiling. Right (Decode) shows dot low on the bandwidth slope. Annotation shows 'Static Power Waste' area representing idle compute units drawing power."}
![](images/svg/prefill-decode-energy.svg){width=100%}
:::
The distinction between "Prefill" and "Decode" established in @sec-inference-scale extends beyond latency into energy efficiency. Recent analysis [@ma2024challenges] reveals that autoregressive generation is inherently energy-wasteful compared to batch processing.
- **Prefill (Compute-Bound)**: High arithmetic intensity allows the GPU to perform thousands of operations for every byte read from memory, achieving near-peak energy efficiency (pJ/FLOP).
@@ -2439,157 +2134,8 @@ Life Cycle Assessments reveal that discarding functional hardware purely for mod
Each of the four primary lifecycle stages contributes to an AI system's total environmental footprint. @fig-ai_lca visualizes this progression from design through disposal, highlighting the interdependencies between phases and the environmental impact categories associated with each stage.
::: {#fig-ai_lca fig-env="figure" fig-pos="htb" fig-cap="**AI System Lifecycle**: Analyzing AI systems across design, manufacture, use, and disposal stages exposes the full environmental impact beyond operational energy consumption, encompassing resource depletion and electronic waste. This lifecycle assessment allows targeted interventions to improve sustainability throughout the entire AI system's existence." fig-alt="Four connected arrow boxes showing AI lifecycle phases: Design with computer icon, Manufacture with factory icon, Use with mobile device icon, and Disposal with recycling bin icon. Labeled as Life Cycle Analysis."}
![](images/svg/ai-lca.svg){width=100%}
:::
:::
@@ -2963,23 +2509,8 @@ Carbon-aware scheduling is fundamentally a **load shifting software problem**. T
Google's carbon-intelligent computing platform[^fn-carbon-scheduling-results] demonstrated this approach at scale, achieving a 40% reduction in carbon footprint by shifting workloads between datacenters globally.
::: {.callout-note title="Carbon-Aware Scheduling Framework"}
![](images/svg/intervention-cascade.svg){width=100%}
**Carbon-Aware Workload Scheduling**. Conceptual diagram showing the decision logic for temporal and geographic shifting. The scheduler evaluates urgency and grid carbon intensity to decide whether to execute immediately, delay until renewable energy is abundant (temporal shifting), or route the job to a cleaner region (geographic shifting).
:::
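The decision logic in the diagram reduces to a few comparisons, sketched below. The 200 g CO2/kWh threshold and the per-region intensities are assumed example values, not operational parameters from Google's system.

```python
# Sketch of the scheduling decision above. The threshold and per-region
# intensities are assumed example values, not operational parameters.
CARBON_THRESHOLD_G_PER_KWH = 200
REGION_INTENSITY_G_PER_KWH = {"quebec": 20, "oregon": 120, "texas": 350, "poland": 800}

def schedule(job_urgent, local_intensity_g_per_kwh):
    """Decide whether to run now, shift to a cleaner region, or delay."""
    if job_urgent:
        return "execute immediately"
    if local_intensity_g_per_kwh < CARBON_THRESHOLD_G_PER_KWH:
        return "execute immediately"
    cleaner = [r for r, g in REGION_INTENSITY_G_PER_KWH.items()
               if g < CARBON_THRESHOLD_G_PER_KWH]
    if cleaner:
        return f"shift to cleaner region: {min(cleaner, key=REGION_INTENSITY_G_PER_KWH.get)}"
    return "delay until grid carbon intensity drops below the threshold"

print(schedule(job_urgent=False, local_intensity_g_per_kwh=350))
```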

View File

@@ -245,6 +245,11 @@ NETWORK_10G_BW = 10 * Gbps
NETWORK_100G_BW = 100 * Gbps
NETWORK_5G_ENERGY_PER_MB_MJ = 100 * ureg.millijoule / MB
# Optical Interconnects (2025-2026 Reference)
OPTICS_POWER_PLUGGABLE_400G_W = 20 * watt
OPTICS_POWER_CPO_400G_W = 10 * watt
OPTICS_POWER_LPO_400G_W = 12 * watt # Linear Pluggable Optics
# Intra-node interconnects
NVLINK_V100_BW = 300 * GB / second # NVLink 2.0 (V100, 6 links × 50 GB/s)
NVLINK_A100_BW = 600 * GB / second # NVLink 3.0 (A100, 12 links × 50 GB/s)
@@ -281,6 +286,11 @@ ENERGY_ADD_INT8_PJ = 0.03 * ureg.picojoule
# Network transfer energy (reference)
NETWORK_ENERGY_1KB_PJ = 1_000_000 * ureg.picojoule # ~1 microjoule for 1KB
# --- Infrastructure & Grid ---
LEAD_TIME_GPU_MONTHS = 6
LEAD_TIME_SUBSTATION_MONTHS = 24
GRID_INTERCONNECTION_QUEUE_US_GW = 2000
# --- Physics ---
SPEED_OF_LIGHT_FIBER_KM_S = 200000 * ureg.kilometer / second
@@ -386,6 +396,13 @@ INT8_BITS = 8
MNIST_IMAGE_WIDTH = 28
MNIST_IMAGE_HEIGHT = 28
# Synthetic Data Constraints
SYNTHETIC_PROVENANCE_OVERHEAD = 0.4
SYNTHETIC_VERIFICATION_PASSES = 3
# Inference Scaling
LOGIC_WALL_REASONING_STEPS_EXAMPLE = 128
# Statistics
KS_TEST_COEFFICIENT = 1.36
@@ -489,10 +506,13 @@ GPU_MTTF_HOURS = 50_000 # Single GPU die (datacenter, steady-state)
NIC_MTTF_HOURS = 150_000 # Network interface card
PSU_MTTF_HOURS = 100_000 # Power supply unit
PCIE_SWITCH_MTTF_HOURS = 200_000 # PCIe switch/bridge
CABLE_MTTF_HOURS = 50_000 # Optical cable / transceiver (lowered for SDC analysis)
TOR_SWITCH_MTTF_HOURS = 300_000 # Top-of-rack switch
HBM_MTTF_HOURS = 200_000 # HBM memory module
# Silent Data Corruption (SDC) Assumptions
P_SDC_PER_GPU_HR = 1e-6
# Recovery time assumptions (seconds)
HEARTBEAT_TIMEOUT_S = 30 # Failure detection latency
RESCHEDULE_TIME_S = 60 # Time to allocate replacement node
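# Illustrative napkin math (editorial sketch; the 10,000-GPU fleet size is an
# assumption, not a library constant): with the figures above, such a fleet sees
# roughly 10_000 / GPU_MTTF_HOURS = 0.2 hardware failures per hour (one every
# ~5 hours) and 10_000 * P_SDC_PER_GPU_HR = 0.01 silent corruptions per hour
# (one every ~100 hours), motivating the detection and reschedule constants above.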

View File

@@ -24,6 +24,20 @@ class PerformanceProfile:
    peak_bw_actual: Q_
    feasible: bool
    def plot(self, mode="latency"):
        """Generates a visualization of this profile.
        Args:
            mode (str): 'latency' for breakdown, 'roofline' for roofline plot.
        """
        from ..viz import plot_latency_breakdown, plot_roofline
        if mode == "latency":
            return plot_latency_breakdown(self)
        elif mode == "roofline":
            return plot_roofline(self)
        else:
            raise ValueError(f"Unknown plot mode: {mode}")
class Engine:
    """
    Unified solver for ML Systems trade-offs.

View File

@@ -5,10 +5,12 @@
try:
    import matplotlib.pyplot as plt
    import numpy as np
    _viz_available = True
except ImportError:
    plt = None
    np = None
    _viz_available = False
# --- Brand & Book Palette ---
COLORS = {
@@ -27,16 +29,11 @@ COLORS = {
}
def set_book_style():
    """Applies the global matplotlib style configuration."""
    if not _viz_available:
        raise ImportError(
            "matplotlib and numpy are required for plot generation. "
            "Install them with: pip install matplotlib numpy"
        )
    plt.rcParams.update({
        'font.family': 'sans-serif',
@@ -79,23 +76,82 @@ def set_book_style():
        'figure.autolayout': True
    })
# --- Font Size Convention for Diagram Figures ---
# All diagram figures (flowcharts, pipelines, etc.) should use:
# - Node/box labels: fontsize=9, fontweight='bold'
# - Edge/arrow labels: fontsize=8
# - Step/annotation: fontsize=8
# - Supplementary text: fontsize=7 (italic gray for minor labels)
# - In-plot headings: fontsize=10-12, fontweight='bold'
# Data plot text inherits from rcParams (axes: 11, ticks: 9, legend: 9).
# --- Lightweight helpers ---
def setup_plot(figsize=None):
    """One-line plot setup for QMD blocks."""
    set_book_style()
    fig, ax = plt.subplots(figsize=figsize)
    return fig, ax, COLORS, plt
def bar_compare(labels, values, title, ylabel, goal_line=None, colors=None):
    """Creates a standard comparison bar chart with value labels."""
    fig, ax, COLORS, plt = setup_plot()
    if colors is None:
        colors = [COLORS['BlueLine'], COLORS['GreenLine'], COLORS['OrangeLine'], COLORS['VioletLine']]
    bars = ax.bar(labels, values, color=colors[:len(labels)], alpha=0.8, edgecolor='white', linewidth=1)
    ax.set_title(title)
    ax.set_ylabel(ylabel)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + (max(values)*0.02),
                f'{height:.2f}', ha='center', va='bottom', fontsize=9, fontweight='bold')
    if goal_line:
        ax.axhline(y=goal_line, color=COLORS['RedLine'], linestyle='--', linewidth=1.5, label='Constraint')
        ax.legend()
    return fig
def plot_latency_breakdown(profile, title="Latency Breakdown"):
    """Plots a stacked bar showing Compute vs Memory vs Overhead."""
    fig, ax, COLORS, plt = setup_plot(figsize=(6, 5))
    comp = profile.latency_compute.m_as('ms')
    mem = profile.latency_memory.m_as('ms')
    ovh = profile.latency_overhead.m_as('ms')
    labels = ['Latency']
    ax.bar(labels, [comp], label='Compute', color=COLORS['BlueLine'], alpha=0.8)
    ax.bar(labels, [mem], bottom=[comp], label='Memory', color=COLORS['OrangeLine'], alpha=0.8)
    ax.bar(labels, [ovh], bottom=[comp+mem], label='Overhead', color=COLORS['RedLine'], alpha=0.8)
    ax.set_title(title)
    ax.set_ylabel("Time (ms)")
    ax.legend(loc='upper right')
    total = comp + mem + ovh
    ax.text(0, total/2, f"Total: {total:.2f} ms", ha='center', fontweight='bold', bbox=dict(facecolor='white', alpha=0.8))
    return fig
def plot_roofline(profile, title="System Roofline Analysis"):
    """Plots the hardware roofline and the current workload point."""
    fig, ax, COLORS, plt = setup_plot()
    max_perf = profile.peak_flops_actual.m_as('GFLOPs/s')
    max_bw = profile.peak_bw_actual.m_as('GB/s')
    ridge_point = max_perf / max_bw
    ai = profile.arithmetic_intensity.m_as('flop/byte')
    achieved_perf = min(ai * max_bw, max_perf)
    # X axis: Arithmetic Intensity
    x = np.logspace(np.log10(ridge_point/100), np.log10(ridge_point*100), 100)
    y = np.minimum(x * max_bw, max_perf)
    ax.plot(x, y, color=COLORS['primary'], linewidth=2, label='Roofline')
    ax.scatter([ai], [achieved_perf], color=COLORS['RedLine'], s=100, zorder=5, label=f'Workload (AI={ai:.1f})')
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('Arithmetic Intensity (FLOP/Byte)')
    ax.set_ylabel('Performance (GFLOPs/s)')
    ax.set_title(title)
    ax.grid(True, which="both", ls="-", alpha=0.2)
    ax.legend()
    return fig
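# --- Illustrative usage (editorial sketch, not part of the module API) ---
# Example of how a napkin-math block might call bar_compare; the normalized
# bytes-per-parameter values reflect FP32 = 4 B, FP16 = 2 B, INT8 = 1 B.
if __name__ == "__main__":
    example_fig = bar_compare(
        labels=["FP32", "FP16", "INT8"],
        values=[1.0, 0.5, 0.25],
        title="Relative memory footprint per parameter",
        ylabel="Bytes (normalized to FP32)",
    )
    example_fig.savefig("bar_compare_example.png")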