mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
vol2: finalize visual narrative (added power path, 3D parallelism, continuous batching, and carbon sankey diagrams)
@@ -1218,6 +1218,42 @@ For capacity planning, the sustained throughput rate, not the peak rate, should
The Memory Wall constrains how fast data reaches the compute units; the Roofline Model diagnoses whether compute or memory is the binding constraint; and Tensor Cores maximize the arithmetic value of every byte fetched. A further physical constraint limits the accelerator's performance: the heat generated by all this computation. Every FLOP dissipates energy, and the faster we compute, the more heat we must remove. This is the Power Wall.

::: {#fig-power-path fig-env="figure" fig-pos="htb" fig-cap="**The Power Delivery Path**. The journey of energy from the high-voltage grid to the low-voltage transistor. Each stage involves conversion losses (quantified by PUE) and requires stabilizing infrastructure to handle the massive current ramps of ML training. The critical engineering challenge is the Rack PDU to VRM transition, where 10--40 kW of power must be delivered within a single cabinet." fig-alt="Flowchart showing power journey from Grid Substation to Datacenter UPS, to Rack PDU, to Server PSU, to Voltage Regulator Module, finally to GPU Die. Arrows show power flow."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=1.2cm]
\definecolor{GridColor}{RGB}{200,200,200}
\definecolor{FacilityColor}{RGB}{200,240,255}
\definecolor{RackColor}{RGB}{255,240,210}
\definecolor{SiliconColor}{RGB}{255,220,220}

\tikzset{
stage/.style={draw=black!70, thick, rounded corners=2pt, align=center, minimum width=2.8cm, minimum height=1.0cm}
}

% Nodes
\node[stage, fill=GridColor] (Sub) {Grid\\Substation\\(115 kV+)};
\node[stage, fill=FacilityColor, below=of Sub] (UPS) {Facility UPS\\\& Transformer\\(480 V)};
\node[stage, fill=RackColor, below=of UPS] (PDU) {Rack PDU\\(208--415 V)};
\node[stage, fill=RackColor, below=of PDU] (PSU) {Server PSU\\(12 V / 48 V)};
\node[stage, fill=SiliconColor, below=of PSU] (VRM) {On-Board VRM\\(0.8--1.2 V)};
\node[stage, fill=SiliconColor, below=of VRM] (Die) {\textbf{GPU Die}\\{(1,000+ Amps)}};

% Flows
\draw[->, ultra thick] (Sub) -- (UPS);
\draw[->, ultra thick] (UPS) -- (PDU);
\draw[->, ultra thick] (PDU) -- (PSU);
\draw[->, ultra thick] (PSU) -- (VRM);
\draw[->, ultra thick] (VRM) -- (Die);

% Annotations
\node[right=0.5cm of UPS, text=BlueLine] {\textbf{Facility Level}};
\node[right=0.5cm of PSU, text=OrangeLine] {\textbf{Rack Level}};
\node[right=0.5cm of Die, text=RedLine] {\textbf{Silicon Level}};

\end{tikzpicture}
```
:::
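The current values in the figure follow directly from $P = V \cdot I$: the same power delivered at successively lower voltages requires proportionally more current. A minimal sketch (the 1 kW die power and the representative per-stage voltages are illustrative assumptions, not measured values):

```python
# Current required at each stage of the power delivery path for a
# fixed load, from I = P / V. Voltages follow the figure; the 1 kW
# load is an assumed round number for a modern training GPU.
DIE_POWER_W = 1_000  # assumed GPU power draw

stages = [
    ("Rack PDU (415 V)", 415.0),
    ("Server PSU rail (48 V)", 48.0),
    ("VRM output (0.9 V)", 0.9),
]

def current_amps(power_w: float, voltage_v: float) -> float:
    """I = P / V, ignoring conversion losses between stages."""
    return power_w / voltage_v

for name, volts in stages:
    print(f"{name:>22}: {current_amps(DIE_POWER_W, volts):8.1f} A")
```

At roughly 0.9 V the same kilowatt demands over 1,100 A, which is why the figure singles out the VRM-to-die transition as the critical engineering challenge.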
## Thermal Design Power {#sec-compute-tdp}

\index{TDP}
File diff suppressed because it is too large
@@ -1153,6 +1153,58 @@ Continuous batching (also called iteration-level batching) decouples batch membe
**Archetype A (GPT-4 / Llama-3)** (@sec-vol2-introduction-archetypes) relies on continuous batching to solve its primary efficiency paradox. The decode phase is memory-bandwidth bound, meaning the GPU compute cores sit idle waiting for weights to load. Continuous batching saturates this bandwidth by processing unrelated requests together. Without this technique, serving Archetype A models would be economically unviable due to low GPU utilization.

:::
Unlike static or dynamic batching, which group requests at the *request* level, continuous batching operates at the *iteration* level, dynamically reshaping the batched compute tensor at each generation step.

::: {#fig-continuous-batching-comparison fig-env="figure" fig-pos="htb" fig-cap="**Static vs. Continuous Batching**. In Static Batching (A), all requests in a batch must wait for the longest request to complete before the GPU can begin the next batch, leading to significant idle compute time (shaded gray). Continuous Batching (B) allows new requests to enter the batch as soon as any request finishes, keeping the GPU saturated and dramatically improving throughput." fig-alt="Two-panel timeline comparison. Left: Static batching shows requests of different lengths with large white space representing idle time. Right: Continuous batching shows requests filling the gaps as soon as one ends."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, xscale=0.8, yscale=0.7]
\definecolor{StaticColor}{RGB}{200,200,200}
\definecolor{ReqColor}{RGB}{0,99,149} % BlueLine
\definecolor{WaitColor}{RGB}{240,240,240}

\tikzset{
req/.style={fill=ReqColor!60, draw=black!60, thick},
idle/.style={fill=black!5, draw=black!30, dashed}
}

% Panel A: Static Batching
\begin{scope}
\node[anchor=west, crimson] at (0, 4.5) {\textbf{A. Static Batching}};
% Batch 1
\draw[req] (0, 3) rectangle (2, 3.8) node[midway, white] {R1};
\draw[req] (0, 2) rectangle (5, 2.8) node[midway, white] {R2};
\draw[req] (0, 1) rectangle (3, 1.8) node[midway, white] {R3};
% Idle regions
\draw[idle] (2, 3) rectangle (5, 3.8);
\draw[idle] (3, 1) rectangle (5, 1.8);
% Vertical barrier
\draw[thick, red, dashed] (5, 0.5) -- (5, 4.2) node[above, font=\tiny] {Batch Barrier};

% Batch 2 starts after barrier
\draw[req] (5.2, 3) rectangle (8, 3.8) node[midway, white] {R4};
\node at (2.5, -0.5) {Idle Compute};
\end{scope}

% Panel B: Continuous Batching
\begin{scope}[shift={(10,0)}]
\node[anchor=west, crimson] at (0, 4.5) {\textbf{B. Continuous Batching}};
% R1 ends at 2, R4 enters immediately
\draw[req] (0, 3) rectangle (2, 3.8) node[midway, white] {R1};
\draw[req, fill=OrangeLine!60] (2.1, 3) rectangle (5, 3.8) node[midway, white] {R4};
\draw[req, fill=OrangeLine!60] (5.1, 3) rectangle (8, 3.8) node[midway, white] {R5};

\draw[req] (0, 2) rectangle (5, 2.8) node[midway, white] {R2};
\draw[req, fill=GreenLine!60] (5.1, 2) rectangle (9, 2.8) node[midway, white] {R6};

\draw[req] (0, 1) rectangle (3, 1.8) node[midway, white] {R3};
\draw[req, fill=PurpleLine!60] (3.1, 1) rectangle (7, 1.8) node[midway, white] {R7};

\node at (4.5, -0.5) {Continuous Utilization};
\end{scope}
\end{tikzpicture}
```
:::
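The scheduling policy in panel B can be sketched as a simple loop: at every generation step, finished requests leave the batch and queued requests immediately take their slots. This is a toy model (the `Request` class, the queue, and the token counts are invented for illustration), not the scheduler of any particular serving system:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps remaining (illustrative)

def continuous_batching(queue: deque, max_batch: int) -> list[str]:
    """Toy iteration-level scheduler: refill the batch every step."""
    batch: list[Request] = []
    timeline: list[str] = []  # which requests ran at each step
    while queue or batch:
        # Admit new requests into any free slots (iteration-level entry).
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
        # One decode iteration for every active request.
        for r in batch:
            r.tokens_left -= 1
        timeline.append("+".join(r.name for r in batch))
        # Retire finished requests immediately; no batch barrier.
        batch = [r for r in batch if r.tokens_left > 0]
    return timeline

q = deque([Request("R1", 2), Request("R2", 5), Request("R3", 3), Request("R4", 3)])
steps = continuous_batching(q, max_batch=3)
print(steps)  # R4 enters the moment R1 retires
```

With the same four requests, static batching would hold R4 behind the barrier until the longest member of the first batch (R2, five steps) finished, taking eight iterations instead of five.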
### Continuous Batching Throughput Analysis {#sec-inference-scale-continuous-batching-throughput-analysis-ecd2}
Continuous batching's dynamic batch management maintains high GPU utilization regardless of sequence-length variance; the throughput improvement over static batching, however, grows with that variance. For a sequence-length distribution with coefficient of variation $CV = \sigma / \mu$, the gain is approximately @eq-continuous-batching-gain:
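The dependence on $CV$ can also be checked empirically: static batching pays for the longest sequence in each batch, while continuous batching pays only for the total token count. This toy Monte Carlo sketch (the lognormal length distribution and batch size are arbitrary modeling choices, not from the text) estimates the gain rather than applying the closed-form approximation:

```python
import math
import random

def simulated_gain(mean: float, cv: float, batch: int = 32,
                   n_batches: int = 2000, seed: int = 0) -> float:
    """Ratio of static-batching cost to continuous-batching cost.

    Static cost per batch: batch_size * max(lengths), since every slot
    waits for the longest request. Continuous cost: sum(lengths), since
    slots are refilled the moment a request finishes. Lengths are drawn
    from a lognormal with the requested mean and CV (an assumption).
    """
    rng = random.Random(seed)
    sigma2 = math.log(1 + cv**2)        # lognormal params from mean and CV
    mu = math.log(mean) - sigma2 / 2
    static_cost = continuous_cost = 0.0
    for _ in range(n_batches):
        lengths = [rng.lognormvariate(mu, math.sqrt(sigma2))
                   for _ in range(batch)]
        static_cost += len(lengths) * max(lengths)
        continuous_cost += sum(lengths)
    return static_cost / continuous_cost

for cv in (0.1, 0.5, 1.0):
    print(f"CV={cv}: static/continuous cost ratio ~ {simulated_gain(100, cv):.2f}")
```

Near-uniform lengths ($CV \approx 0$) yield almost no gain; heavy-tailed workloads make the static batch pay repeatedly for its slowest member.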
File diff suppressed because it is too large
@@ -1357,6 +1357,43 @@ The geographic choice alone produces a `{python} emissions_ratio_str`-fold diffe
#### Embodied Carbon Assessment {#sec-sustainable-ai-embodied-carbon-assessment-9de0}
::: {#fig-carbon-sankey fig-env="figure" fig-pos="htb" fig-cap="**The Total Carbon of Ownership (TCO)**. Sankey-style flow visualizing how carbon emissions accumulate across the AI lifecycle. For a typical datacenter deployment, operational energy (training and serving) dominates total emissions. However, as the grid shifts to renewables, the **Embodied Carbon** from semiconductor fabrication and datacenter construction becomes the binding sustainability constraint, making hardware longevity a critical engineering lever." fig-alt="Sankey diagram showing three input flows: Raw Materials, Semiconductor Fab, and Grid Energy. These merge into Training and Serving phases, ending in AI Model Value. Widths show relative carbon impact."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, line join=round]
\definecolor{EmbodiedColor}{RGB}{204,85,0} % OrangeLine
\definecolor{OpsColor}{RGB}{0,99,149} % BlueLine
\definecolor{ValueColor}{RGB}{0,143,69} % GreenLine

% Input flows
\fill[EmbodiedColor!30] (0, 3) -- (3, 2.5) -- (3, 1.5) -- (0, 2) -- cycle;
\node[left, align=right] at (0, 2.5) {\textbf{Raw Materials}\\Extraction \& Log.};

\fill[EmbodiedColor!50] (0, 1.5) -- (3, 1.5) -- (3, 0.5) -- (0, 1) -- cycle;
\node[left, align=right] at (0, 1.25) {\textbf{Chip Fabrication}\\(Embodied Carbon)};

\fill[OpsColor!40] (0, 0) -- (3, 0.5) -- (3, -2.5) -- (0, -2) -- cycle;
\node[left, align=right] at (0, -1) {\textbf{Grid Energy}\\(Coal/Gas/Renew.)};

% Lifecycle phases
\fill[OpsColor!60] (3, 2.5) -- (6, 2.5) -- (6, 0.5) -- (3, 0.5) -- cycle;
\node at (4.5, 1.5) {\textbf{Model Training}};

\fill[OpsColor!80] (3, 0.5) -- (6, 0.5) -- (6, -2.5) -- (3, -2.5) -- cycle;
\node at (4.5, -1) {\textbf{Inference / Serving}};

% Output flow
\fill[ValueColor!40] (6, 2.5) -- (9, 1.5) -- (9, -0.5) -- (6, -2.5) -- cycle;
\node[right, align=left] at (9, 0.5) {\textbf{AI Fleet Intelligence}\\(System Value)};

% Labels
\node[above, EmbodiedColor] at (1.5, 3) {\textit{Upstream}};
\node[above, OpsColor] at (4.5, 2.5) {\textit{Operational}};
\node[above, ValueColor] at (7.5, 2) {\textit{Downstream}};

\end{tikzpicture}
```
:::
Embodied carbon encompasses emissions from raw material extraction, semiconductor fabrication, assembly, transportation, and end-of-life disposal. For AI hardware, manufacturing emissions are dominated by the energy-intensive nature of advanced semiconductor processes.
A single NVIDIA H100 GPU embodies approximately 150 to 200 kg CO2eq from manufacturing, including wafer fabrication at advanced process nodes, high-bandwidth memory production, and packaging. @eq-embodied-daily amortizes this embodied carbon over the hardware lifetime to compute per-use emissions:
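The amortization is a simple division of embodied emissions by expected useful operating hours. A sketch using the 150--200 kg CO2eq figure from the text (the five-year lifetime and 80% utilization are assumptions, not values from the text):

```python
def amortized_embodied_g_per_hour(embodied_kg_co2eq: float,
                                  lifetime_years: float = 5.0,
                                  utilization: float = 0.8) -> float:
    """Embodied carbon charged to each hour of useful operation.

    Dividing by utilization attributes the idle hours' share of the
    embodied carbon to the productive hours. The default lifetime and
    utilization are assumptions for illustration.
    """
    useful_hours = lifetime_years * 365 * 24 * utilization
    return embodied_kg_co2eq * 1000 / useful_hours  # grams per hour

for kg in (150, 200):
    print(f"{kg} kg CO2eq over 5 y at 80% utilization ~ "
          f"{amortized_embodied_g_per_hour(kg):.1f} g CO2eq/hour")
```

Extending the service life from five to seven years cuts this per-hour charge by roughly 30%, which is the hardware-longevity lever highlighted in the figure caption.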