Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-28 08:39:14 -05:00)

Commit: vol2: comprehensive standalone transformation (Appendix A foundations review, progressive disclosure pass, and term decoupling)
@@ -1,58 +1,38 @@

# Appendix A: System Architectures & Foundations {#sec-appendix-systems}

# Appendix A: Foundations Review {#sec-appendix-systems}

## Purpose {.unnumbered}

_What “lineage” does a large-scale ML system inherit, and what does that imply for its architecture?_

_What foundational knowledge is required to understand machine learning systems at scale?_

At scale, ML infrastructure looks like a hybrid: training resembles high-performance computing (tight coupling, bandwidth-hungry collectives), while inference resembles warehouse-scale computing (service reliability, tail latency, elastic fleets). When we treat ML as “just HPC” or “just web services,” we often inherit the wrong assumptions about scheduling, fault tolerance, and networking.

This appendix provides a compact reference for the three paradigms used throughout this book—HPC, WSC, and the ML Fleet—and a short recap of the foundations from the preceding book that this one builds on.

Large-scale ML infrastructure is a hybrid of high-performance computing (tight coupling, bandwidth-hungry collectives) and warehouse-scale computing (service reliability, tail latency, elastic fleets). This appendix provides a compact reference for these architectural paradigms and a recap of the machine learning system fundamentals --- hardware hierarchies, the Iron Law, and the D·A·M taxonomy --- that underpin the scale-focused topics in this book.

## How to Use This Appendix {.unnumbered}

Use this appendix as a **reference** when you need to quickly map a workload to its closest paradigm and anticipate the corresponding design trade-offs (checkpointing vs. redundancy, gang scheduling vs. elasticity, latency vs. throughput).

Conventions used here follow the book-wide notation established earlier in the text (see “Notation and Conventions” in the preceding book).

This appendix provides a reference for the architectural paradigms and foundational concepts that underpin the distributed machine learning systems discussed in this book. It contrasts the emerging "ML Fleet" architecture with traditional High-Performance Computing (HPC) and Warehouse-Scale Computing (WSC) models, and recaps core ML systems concepts from the preceding book.

Use this appendix as a **Foundational Review** when you need to quickly refresh your intuition on single-machine bottlenecks or map a distributed workload to its architectural lineage.

## The Three System Paradigms {#sec-appendix-systems-three-paradigms}

Machine Learning Systems at scale represent a convergence of two distinct computing lineages: the raw computational power of HPC and the fault-tolerant scalability of WSC. Understanding how ML systems differ from their predecessors is crucial for designing infrastructure that meets the unique demands of large-scale training and inference.

Machine Learning Systems at scale represent a convergence of two distinct computing lineages: the raw computational power of High-Performance Computing (HPC) and the fault-tolerant scalability of Warehouse-Scale Computing (WSC).

### 1. High-Performance Computing (HPC) {#sec-appendix-systems-hpc}

*The lineage of Supercomputers.*

* **Philosophy**: "Maximize FLOPs per second."
* **Workload**: Tightly coupled simulations (e.g., weather, nuclear physics).
* **Hardware**: Specialized interconnects (InfiniBand), low-latency fabrics, homogeneous nodes.
* **Fault Tolerance**: **Checkpoint/Restart**. If one node fails, the entire job stops, rolls back to the last checkpoint, and restarts.
* **Scheduling**: Batch scheduling (Slurm). Jobs request rigid resource shapes (e.g., "512 nodes for 24 hours").
* **The "Pets" Model**: Nodes are individually important. A specific node failure is a critical event.

### 2. Warehouse-Scale Computing (WSC) {#sec-appendix-systems-wsc}

*The lineage of the Web.*

* **Philosophy**: "Maximize Queries per second (QPS)."
* **Workload**: Loosely coupled services (e.g., Search, Gmail, Social Media).
* **Hardware**: Commodity Ethernet, varying generations of hardware, heterogeneous nodes.
* **Fault Tolerance**: **Redundancy**. If one node fails, the load balancer reroutes traffic to another replica. The user never notices.
* **Scheduling**: Service orchestration (Kubernetes, Borg). Jobs are dynamic, elastic, and bin-packed.
* **The "Cattle" Model**: Nodes are interchangeable and expendable. Failures are routine background noise.

### 3. The Machine Learning Fleet {#sec-appendix-systems-ml-fleet}

*The Hybrid Architecture.*

ML Systems require the **throughput** of HPC (to train massive models) but must operate on the **scale** and **unreliability** of WSC.

* **Philosophy**: "Maximize Model Quality per Dollar/Watt."
* **Workload**:
    * *Training*: Synchronous, bandwidth-heavy (like HPC) but long-running and resilient (like WSC).
    * *Inference*: Latency-sensitive (like WSC) but computationally heavy (like HPC).
* **Hardware**: A fusion. TCP/IP for control planes, InfiniBand/NVLink for data planes. GPU-centric nodes.
* **Fault Tolerance**: **Elasticity**. Training jobs can shrink/expand or pause/resume without full restarts. Inference services degrade gracefully.
* **Scheduling**: Gang scheduling (all-or-nothing allocation) combined with dynamic preemption and replacement.

| **Feature** | **HPC (Supercomputer)** | **WSC (Web Cloud)** | **ML Fleet (AI Cluster)** |
|:---------------|:------------------------|:----------------------|:---------------------------------------|
| **Coupling** | Tight (MPI) | Loose (RPC/HTTP) | Hybrid (NCCL + RPC) |
@@ -60,24 +40,26 @@ ML Systems require the **throughput** of HPC (to train massive models) but must
| **Network** | Latency-optimized | Bandwidth-optimized | Bisection-Bandwidth critical |
| **Bottleneck** | Compute (FLOPs) | I/O (Disk/Net) | Memory Bandwidth (HBM) |
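
The fault-tolerance contrast above (Checkpoint/Restart for HPC, Redundancy for WSC, Elasticity for the ML Fleet) has a quantitative side. For HPC-style checkpoint/restart, a standard back-of-the-envelope rule is Young's approximation for the checkpoint interval, $\tau \approx \sqrt{2\,\delta\,M_{\text{job}}}$, where $\delta$ is the time to write a checkpoint and $M_{\text{job}}$ is the job-level mean time between failures. The sketch below applies it with illustrative numbers (the checkpoint write time, node MTBF, and cluster size are assumptions, not measurements):

```python
import math

def optimal_checkpoint_interval(checkpoint_write_s: float,
                                node_mtbf_s: float,
                                num_nodes: int) -> float:
    """Young's approximation: tau ~ sqrt(2 * delta * MTBF_job).

    The job-level MTBF shrinks as 1/N because any single node failure
    stops a tightly coupled (HPC-style) job.
    """
    job_mtbf_s = node_mtbf_s / num_nodes
    return math.sqrt(2 * checkpoint_write_s * job_mtbf_s)

# Illustrative assumptions: a 30 s checkpoint write, nodes with a
# 5-year MTBF, and a 4,096-node synchronous training job.
tau = optimal_checkpoint_interval(
    checkpoint_write_s=30,
    node_mtbf_s=5 * 365 * 24 * 3600,
    num_nodes=4096,
)
print(f"checkpoint roughly every {tau / 60:.0f} minutes")
```

At fleet scale the interval shrinks with $\sqrt{1/N}$, which is one reason the ML Fleet leans on elasticity and fast, frequent checkpointing rather than the infrequent full-job restarts typical of classic HPC.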

## Foundations of ML Systems (Recap) {#sec-appendix-systems-vol1-recap}

## Recap of ML System Fundamentals {#sec-appendix-systems-vol1-recap}

This section provides a condensed reference of the foundational concepts from the preceding book. Those fundamentals—spanning the ML lifecycle, hardware foundations, and optimization techniques—form the building blocks upon which the scale-focused topics here are constructed.

### The Machine Learning Lifecycle {#sec-appendix-systems-lifecycle}

1. **Data Engineering**: ETL pipelines, Feature Stores, Validation.
2. **Model Development**: Architecture search, Hyperparameter tuning.
3. **Deployment**: Serving, Monitoring, Incident Response.
4. **Operations (MLOps)**: CI/CD for data/models, Drift detection.

Distributed systems reasoning depends on a precise understanding of the single-machine constraints that they attempt to overcome.

### Hardware Foundations {#sec-appendix-systems-hardware}

* **CPUs**: Sequential logic, preprocessing, orchestration.
* **GPUs**: Parallel tensor arithmetic (Matrix Multiply), HBM.
* **TPUs/NPUs**: Systolic arrays for dense matrix ops, low power.
* **Memory Hierarchy**: HBM (Fast/Small) $\to$ SRAM (Faster/Tiny) $\to$ NVMe (Slow/Huge).
* **GPUs**: Parallel tensor arithmetic engines optimized for Matrix Multiplication. They hide memory latency by maintaining thousands of threads in flight (SIMT).
* **High Bandwidth Memory (HBM)**: 3D-stacked DRAM that provides the terabytes-per-second bandwidth required to feed parallel arithmetic units.
* **Memory Hierarchy**: The fundamental gap in speed and capacity between **Registers** (a few MB in aggregate, single-cycle access), **SRAM** (tens of MB, tens of TB/s), **HBM** (tens of GB, a few TB/s), and **NVMe** (TBs, GB/s); the sketch below illustrates how this gap translates into transfer time.
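
To make the hierarchy concrete, the snippet below estimates how long it takes to stream a fixed volume of model weights through each tier at order-of-magnitude bandwidths. The 70 GB payload and the per-tier bandwidth figures are assumptions chosen for illustration, not specifications of any particular device:

```python
# Rough time to stream 70 GB of model weights through each memory tier.
# Bandwidths are order-of-magnitude assumptions; capacities also differ
# wildly (SRAM cannot hold 70 GB), which is the other half of the hierarchy.
weights_gb = 70

tier_bandwidth_gb_per_s = {
    "SRAM (on-chip)": 20_000,  # tens of TB/s aggregate
    "HBM":             3_000,  # a few TB/s
    "NVMe SSD":            7,  # GB/s
}

for tier, bw in tier_bandwidth_gb_per_s.items():
    print(f"{tier:>16}: {weights_gb / bw * 1e3:9.1f} ms")
```

The spread of several orders of magnitude between these times is why the optimization techniques recapped below are framed as ways to avoid touching the slower tiers.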

### Optimization Fundamentals {#sec-appendix-systems-optimization}

* **Precision**: FP32 (Training) vs. BF16 (Mixed) vs. INT8 (Inference).
* **Sparsity**: Structured (N:M) vs. Unstructured pruning.
* **Distillation**: Teacher $\to$ Student knowledge transfer.
* **The Iron Law**: $L = \frac{D}{B} + \frac{Ops}{P} + L_{lat}$. Performance is bounded by Data movement or Compute.

### The Iron Law of ML Systems {#sec-appendix-systems-iron-law}

The time $T$ required for an ML operation is the sum of three terms representing the system's physical limits:

$$T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak}} + L_{lat}$$

Where:

- $D_{vol}$: Volume of data moved.
- $BW$: System bandwidth.
- $O$: Total operations (FLOPs).
- $R_{peak}$: Peak throughput (FLOPS).
- $L_{lat}$: Fixed latency (speed of light, software overhead).
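
As a concrete illustration, the sketch below evaluates the three terms for a single large FP16 matrix multiplication. The accelerator figures (about 2 TB/s of HBM bandwidth, about 300 TFLOP/s of peak throughput, a few microseconds of launch overhead) are assumed round numbers, not a specific device:

```python
def iron_law_time_s(data_bytes, bandwidth_Bps, flops, peak_flops, fixed_latency_s):
    """T = D_vol / BW + O / R_peak + L_lat."""
    return data_bytes / bandwidth_Bps + flops / peak_flops + fixed_latency_s

# Illustrative: C = A @ B with 8192 x 8192 FP16 matrices.
n = 8192
flops = 2 * n ** 3            # one multiply and one add per output term
data_bytes = 3 * n * n * 2    # read A and B, write C, 2 bytes per value

t = iron_law_time_s(
    data_bytes=data_bytes,
    bandwidth_Bps=2e12,       # assumed HBM bandwidth
    flops=flops,
    peak_flops=300e12,        # assumed peak FP16 throughput
    fixed_latency_s=5e-6,     # assumed launch / software overhead
)
print(f"T is approximately {t * 1e3:.2f} ms")
```

For this shape the compute term dominates; shrink $n$ (or move to memory-bound operators such as attention decode) and the $D_{vol}/BW$ term takes over, which is the regime most of the techniques below target.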

### Optimization Techniques {#sec-appendix-systems-optimization}

* **Quantization**: Reducing numerical precision (FP32 $\to$ FP16 $\to$ INT8) to shrink $D_{vol}$ and increase effective $BW$.
* **Pruning/Sparsity**: Removing weights to reduce both $D_{vol}$ and $O$.
* **Operator Fusion**: Combining operations to keep data in fast SRAM, minimizing trips to HBM.
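
To see why fusion matters, the sketch below counts HBM traffic for a three-operation elementwise chain, y = relu(x * scale + bias), executed as three separate kernels versus one fused kernel. The tensor size and the specific chain are assumptions for illustration:

```python
# HBM traffic for y = relu(x * scale + bias) over N FP16 elements.
N = 1 << 30            # ~1.07e9 elements, an assumed activation size
bytes_per_elem = 2     # FP16

# Unfused: each of the 3 kernels reads its input from HBM and writes
# its output back, so the full tensor crosses HBM six times.
unfused_bytes = 3 * 2 * N * bytes_per_elem

# Fused: one kernel reads x once and writes y once; intermediates
# stay in registers / SRAM.
fused_bytes = 2 * N * bytes_per_elem

print(f"unfused: {unfused_bytes / 1e9:.1f} GB of HBM traffic, "
      f"fused: {fused_bytes / 1e9:.1f} GB "
      f"({unfused_bytes // fused_bytes}x reduction)")
```

Because elementwise kernels are bandwidth-bound, the 3x traffic reduction translates almost directly into a 3x speedup for this segment of the model.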

@@ -4767,7 +4767,7 @@ Scaling global infrastructure aggressively is highly effective, but it is also e

What if we could take a 140 GB language model that requires two A100 GPUs to serve, and crush it down to 35 GB so it runs blisteringly fast on a single GPU? Weight quantization for serving achieves this by brutally compressing the precision of the model's neural connections from 16-bit floats to 8-bit or 4-bit integers, trading a negligible fraction of accuracy for massive operational savings.

::: {.callout-note title="Prior Knowledge: Quantization"}

This section assumes familiarity with quantization fundamentals including **post-training quantization (PTQ)**, which quantizes a trained model without retraining, and **quantization-aware training (QAT)**, which incorporates quantization effects during training for higher accuracy. Readers should also understand precision formats (FP32, FP16, INT8) and the basic trade-off between numerical precision and memory/compute efficiency. These concepts are covered in standard deep learning optimization references and in the preceding book's treatment of model compression.

This section assumes familiarity with quantization fundamentals including **post-training quantization (PTQ)**, which quantizes a trained model without retraining, and **quantization-aware training (QAT)**, which incorporates quantization effects during training for higher accuracy. Readers should also understand precision formats (FP32, FP16, INT8) and the basic trade-off between numerical precision and memory/compute efficiency. These concepts are covered in standard deep learning optimization references and in the foundational summary in @sec-appendix-systems.

:::

Quantization reduces numerical precision of model weights and activations, decreasing memory footprint by 2-4$\times$ while increasing decode throughput, which is memory-bandwidth limited rather than compute limited. While these quantization fundamentals are established techniques, serving at scale introduces distinct challenges: models must be quantized after training without access to training data, quality must be preserved across diverse inputs, and hardware deployment targets vary from datacenter GPUs to edge accelerators. The following discussion covers quantization techniques specifically designed for production inference.
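
As a minimal sketch of the post-training idea, the snippet below applies per-tensor absmax quantization to a random weight matrix and reports the footprint change and reconstruction error. Production serving quantizers (per-channel or per-group scales, calibration data, 4-bit packing) are considerably more sophisticated; the shapes and the 0.02 weight scale here are arbitrary assumptions:

```python
import numpy as np

def absmax_quantize_int8(w: np.ndarray):
    """Per-tensor absmax PTQ: one FP32 scale plus an INT8 payload."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

q, s = absmax_quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()

print(f"FP32 footprint: {w.nbytes / 1e6:.0f} MB, "
      f"INT8 footprint: {q.nbytes / 1e6:.0f} MB, "
      f"mean abs error: {err:.1e}")
```

Halving or quartering the bytes per weight is what turns a decode step that is memory-bandwidth limited into one that moves proportionally less data per token.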

@@ -302,7 +302,7 @@ Throughout this book, recurring *lighthouse archetypes* ground abstract principl

**Systems Perspectives** continue to appear as sidebars, now focusing on the physics of data centers, network topology, and distributed consistency (CAP theorem).

:::

This exponential growth has transformed ML from a discipline where algorithms dominate to one where systems engineering determines success. A sophisticated algorithm that cannot scale often provides less practical value than a simpler algorithm deployed efficiently across scalable infrastructure.

This exponential growth in **Federated Learning**[^fn-federated-forward] has transformed ML from a discipline where algorithms dominate to one where systems engineering determines success. A sophisticated algorithm that cannot scale often provides less practical value than a simpler algorithm deployed efficiently across scalable infrastructure.

The transition from single-machine to distributed training introduces qualitative changes in system behavior. The unit of compute is no longer a single server, but a **Machine Learning Fleet**—a massive, interconnected distributed system that must act as a single coherent engine.

@@ -875,7 +875,7 @@ Scale and distribution do not just create engineering challenges; they amplify i

#### Security and the Fleet Threat

ML systems face unique security threats that intensify at production scale. **Model Extraction** attacks can steal proprietary intellectual property through API queries. **Data Poisoning** can inject backdoors into models that remain dormant until triggered by a specific input. *When* you operate at fleet scale, these threats become economically attractive targets for sophisticated attackers. Defending the fleet requires systematic approaches: access controls, differential privacy, and continuous behavioral monitoring that go far beyond traditional perimeter security.

ML systems face unique security threats that intensify at production scale. **Model Extraction** attacks can steal proprietary intellectual property through API queries. **Data Poisoning** can inject backdoors into models that remain dormant until triggered by a specific input. *When* you operate at fleet scale, these threats become economically attractive targets for sophisticated attackers. Defending the fleet requires systematic approaches: access controls, **differential privacy**[^fn-dp-forward], and continuous behavioral monitoring that go far beyond traditional perimeter security.
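
As a minimal illustration of the "calibrated noise" idea behind differential privacy (the full DP-SGD machinery used for training is more involved), the sketch below releases a simple fleet statistic under the Laplace mechanism. The statistic, sensitivity, and epsilon value are assumptions chosen for illustration:

```python
import numpy as np

def laplace_release(true_value: float, sensitivity: float, epsilon: float,
                    rng: np.random.Generator) -> float:
    """Release a statistic with epsilon-DP by adding Laplace(sensitivity/epsilon) noise."""
    return true_value + rng.laplace(scale=sensitivity / epsilon)

# Illustrative: publish how many users hit a model fallback path today.
# Adding or removing one user changes the count by at most 1, so sensitivity = 1.
rng = np.random.default_rng(0)
noisy_count = laplace_release(true_value=1_423, sensitivity=1.0, epsilon=0.5, rng=rng)
print(f"released count: {noisy_count:.0f}")
```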

#### The Regulatory Wall

@@ -1336,7 +1336,8 @@ The **Linear Scaling Fallacy** occurs when teams extrapolate resource requiremen

This volume opened with a fundamental challenge: the principles that enable success on single machines become the obstacles that prevent success at scale. We have moved from the laboratory to the **Machine Learning Fleet**, where communication costs dominate computation, failures are a statistical certainty, and societal impact demands rigorous governance.

The transition from building systems that *work* to building systems that *scale* represents the next frontier of engineering. The principles established in the preceding book—measure everything, optimize the bottleneck, design for failure—remain essential, but their application changes fundamentally when the "system" spans thousands of nodes. Network topology becomes as important as memory hierarchy, and distributed consensus replaces local synchronization.

The transition from building systems that *work* to building systems that *scale* represents the next frontier of engineering. The core principles of machine learning systems --- measure everything, optimize the bottleneck, design for failure --- remain essential, but their application changes fundamentally when the "system" spans thousands of nodes.
Network topology becomes as important as memory hierarchy, and distributed consensus replaces local synchronization.

::: {.callout-takeaways title="Scale Changes the Rules"}

@@ -1357,5 +1358,14 @@ We have established the *requirements* of scale: we know *why* we must distribut

:::

[^fn-zero-forward]: **ZeRO (Zero Redundancy Optimizer)**: A memory optimization technique that partitions model state (parameters, gradients, optimizer states) across workers rather than replicating it. @sec-distributed-training-systems examines how ZeRO enables training trillion-parameter models that would otherwise exceed single-device memory. \index{ZeRO!forward reference}

[^fn-flashattention-forward]: **FlashAttention**: An I/O-aware attention algorithm that uses tiling to minimize memory traffic between GPU SRAM and HBM. @sec-performance-engineering develops the performance engineering principles that allow FlashAttention to achieve 2--4$\times$ speedups by overcoming the memory bandwidth bottleneck. \index{FlashAttention!forward reference}

[^fn-federated-forward]: **Federated Learning**: A distributed learning paradigm where models are trained across millions of decentralized edge devices holding local data. @sec-edge-intelligence analyzes the privacy and coordination challenges of federated fleets. \index{Federated Learning!forward reference}

[^fn-dp-forward]: **Differential Privacy (DP)**: A mathematical framework for adding calibrated noise to computations to bound information leakage about individual data points. @sec-security-privacy examines how DP protects user privacy in large-scale ML systems. \index{Differential Privacy!forward reference}

::: { .quiz-end }
:::

@@ -179,7 +179,7 @@ Consider the running example that threads through this volume: a 175-billion-par

In the **Fleet Stack** (@sec-vol2-introduction), network fabrics form the connective tissue binding the Infrastructure Layer into a coherent whole. @sec-compute-infrastructure established the building blocks: accelerators, power delivery, and cooling. Those components define what each node can compute in isolation. This chapter wires those nodes together, because at scale, communication cost dominates computation cost. The **Law of Distributed Efficiency** (@eq-distributed-efficiency) makes this explicit: the $T_{\text{sync}} / T_{\text{compute}}$ ratio in the Scaling Factor is determined almost entirely by the network fabric. The fabric constrains every layer above it in the stack: @sec-distributed-training-systems cannot overlap communication with computation unless the fabric provides sufficient bandwidth, @sec-collective-communication cannot choose optimal algorithms without knowing the topology, and @sec-fault-tolerance-reliability must account for network partitions alongside node failures.

The physical network fabric exists to carry three fundamental collective communication patterns. An AllReduce sums gradients across thousands of GPUs so that every device holds the identical average—the heartbeat of synchronous training. An AllGather collects different model portions so that every GPU can reconstruct the full model state. An AllToAll, the most demanding pattern, requires every GPU to send unique data to every other GPU, a requirement critical to expert parallelism. While @sec-collective-communication covers the *algorithms* that orchestrate these patterns, this chapter covers the *physics* of the wires and switches that carry them. Understanding the distinction matters because the fabric's physical properties—bandwidth, latency, topology—determine which patterns are efficient and which become bottlenecks.

The physical network fabric exists to carry three fundamental collective communication patterns. An **AllReduce**[^fn-allreduce-forward] sums gradients across thousands of GPUs so that every device holds the identical average—the heartbeat of synchronous training. An **AllGather**[^fn-collectives-forward] collects different model portions so that every GPU can reconstruct the full model state. An AllToAll, the most demanding pattern, requires every GPU to send unique data to every other GPU, a requirement critical to **expert parallelism**[^fn-moe-forward]. While @sec-collective-communication covers the *algorithms* that orchestrate these patterns, this chapter covers the *physics* of the wires and switches that carry them. Understanding the distinction matters because the fabric's physical properties—bandwidth, latency, topology—determine which patterns are efficient and which become bottlenecks.
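
The $\alpha$-$\beta$ cost model discussed later in this chapter makes these patterns quantitative. The sketch below estimates the time for a ring AllReduce, which takes 2(p-1) steps and moves roughly 2(p-1)/p of the gradient buffer across each link; the message size, GPU count, per-step latency, and usable link bandwidth are assumed round numbers for illustration:

```python
def ring_allreduce_time_s(msg_bytes: float, num_gpus: int,
                          alpha_s: float, beta_s_per_byte: float) -> float:
    """Alpha-beta estimate: 2(p-1) steps, each sending msg_bytes / p per link."""
    p = num_gpus
    steps = 2 * (p - 1)
    return steps * (alpha_s + (msg_bytes / p) * beta_s_per_byte)

# Illustrative assumptions: 1 GB of gradients, 64 GPUs, 5 us per-step
# latency, 50 GB/s of usable per-link bandwidth.
t = ring_allreduce_time_s(
    msg_bytes=1e9,
    num_gpus=64,
    alpha_s=5e-6,
    beta_s_per_byte=1 / 50e9,
)
print(f"ring AllReduce takes roughly {t * 1e3:.0f} ms")
```

The bandwidth term is nearly independent of p while the latency term grows linearly with it, which is why large fleets care about both the fabric's bisection bandwidth and its per-hop latency.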

::: {.callout-perspective title="The Network as a Gradient Bus"}

@@ -1460,3 +1460,19 @@ Equally important, the $\alpha$-$\beta$ cost model provides a quantitative vocabu

The network fabric now binds compute nodes into a fleet, and every byte of gradient data and activation tensor flows through it. The fleet, however, also needs to move massive datasets and enormous checkpoints. @sec-data-storage examines the parallel storage systems and data-loading architectures that keep the fleet supplied with data.

:::

[^fn-allreduce-forward]: **AllReduce**: A collective operation that sums data from all nodes and distributes the result back to all nodes. @sec-collective-communication develops the mathematical cost models for ring and tree-based AllReduce implementations. \index{AllReduce!forward reference}

[^fn-collectives-forward]: **Collective Primitives**: Higher-level communication patterns involving groups of nodes. While @sec-collective-communication derives the algorithms for **AllGather**, **ReduceScatter**, and **AllToAll**, this chapter addresses the physical fabric requirements (bisection bandwidth, switch radix) that enable them at scale. \index{Collective Primitives!forward reference}

[^fn-moe-forward]: **Mixture-of-Experts (MoE)**: An architecture that activates only a subset of model "experts" per input, necessitating AllToAll communication. @sec-inference-scale and @sec-performance-engineering examine how MoE decouples model capacity from per-token compute cost ($O$). \index{MoE!forward reference}

::: { .quiz-end }
:::