style: Vol2 register pass — eliminate rhetorical questions, second person, vague intensifiers

Systematic register audit and fix across all 13 non-clean Vol2 chapters.
Clean chapters (compute_infrastructure, network_fabrics, inference, responsible_ai,
frontmatter, backmatter) required no edits.

Violations fixed by chapter:
- introduction: 14 fixes (rhetorical Qs, second person, vague intensifiers)
- collective_communication: 27 fixes (rhetorical Qs, contractions, second person, intensifiers)
- distributed_training: 7 fixes (all rhetorical questions → declarative statements)
- ops_scale: 6 fixes (intensifiers, second person, rhetorical Q, announcement transition)
- performance_engineering: 3 fixes (rhetorical Q, second person, announcement transition)
- robust_ai: 4 fixes (hedging, second person in callout-notebook)
- sustainable_ai: 4 fixes (rhetorical Q, second person, bold starter in callout)
- fleet_orchestration: 4 fixes (rhetorical questions)
- security_privacy: 4 fixes (banned phrase, second person, rhetorical Q)
- edge_intelligence: 4 fixes (rhetorical Q, vague intensifiers)
- fault_tolerance: 1 fix (second person in callout-notebook)
- data_storage: 1 fix (sentence-starting "But,")
- conclusion: 2 fixes (first-person "We have climbed", "To conclude" opener)

Pre-commit rendering/inline-refs failures are pre-existing on this branch
(77 files, 116 rendering issues, 179 inline-ref errors in unrelated files).
None of the 13 edited files have rendering violations.
This commit is contained in:
Vijay Janapa Reddi
2026-02-24 19:46:16 -05:00
parent e881d92625
commit 1ddf9bd5e3
13 changed files with 189 additions and 192 deletions


@@ -289,11 +289,11 @@ Expert Parallelism refers to the Mixture of Experts (MoE)[^fn-moe-sparsity] arch
## Mapping the Terrain: Network Performance Modeling {#sec-communication-collective-operations-collective-operations-network-performance-modeling-0d8e}
-If you ask a datacenter engineer how long it takes to send ten megabytes across a cluster, they cannot give you a valid answer without knowing two distinct variables: the fixed startup tax and the per-byte transit fee. As our gradient begins its journey, it immediately encounters the physical reality of the datacenter network.
+A datacenter engineer cannot predict how long it takes to send ten megabytes across a cluster without knowing two distinct variables: the fixed startup tax and the per-byte transit fee. As our gradient begins its journey, it immediately encounters the physical reality of the datacenter network.
### The α-β Reality: When Physics Fights Back {#sec-communication-collective-operations-collective-operations-alphabeta-model-f9b4}
-Every step our gradient takes is governed by the linear cost model $T(n) = \alpha + n/\beta$. This is not just a formula; it is the "Physics of Failure" for distributed systems.
+Every step our gradient takes is governed by the linear cost model $T(n) = \alpha + n/\beta$. This is not merely a formula; it is the "Physics of Failure" for distributed systems.
::: {.callout-definition title="α-β Model (Hockney Model)"}
@@ -306,7 +306,7 @@ $$ T(n) = \alpha + \frac{n}{\beta} $$
1. **Significance (Quantitative):** It maps directly to the **Iron Law**, where **α** (alpha) represents the fixed **Startup Latency ($L_{\text{lat}}$)** and **β** (beta) represents the **Network Bandwidth ($BW$)**. It separates the **Latency-Bound Regime** (small messages) from the **Bandwidth-Bound Regime** (large messages).
2. **Distinction (Durable):** Unlike **Idealized Throughput Models**, the α-β model captures the **Fixed Penalty** of communication, explaining why many small messages are 100$\times$ more expensive than one large message of the same total size.
-3. **Common Pitfall:** A frequent misconception is that $\alpha$ is just "network delay." In reality, $\alpha$ includes **Software Overhead** (kernel context switches, stack traversal, and library synchronization) that can often exceed the physical wire delay.
+3. **Common Pitfall:** A frequent misconception is that $\alpha$ is merely "network delay." In reality, $\alpha$ includes **Software Overhead** (kernel context switches, stack traversal, and library synchronization) that can often exceed the physical wire delay.
:::
@@ -362,7 +362,7 @@ The following comparison shows how the bottleneck shifts between these regimes:
* Latency Time: $2\ \mu s$.
* **Total: 20,002 μs**. Latency is 0.01%—completely negligible.
-**The Systems Conclusion**: For Data Parallelism (large gradients), we optimize for $\beta$—compress gradients, add NICs. For Pipeline/Expert Parallelism (small activations), we fight for every microsecond of $\alpha$—kernel bypass, topology optimization. The α-β model tells you *which fight to pick*.
+**The Systems Conclusion**: For Data Parallelism (large gradients), we optimize for $\beta$—compress gradients, add NICs. For Pipeline/Expert Parallelism (small activations), we fight for every microsecond of $\alpha$—kernel bypass, topology optimization. The α-β model identifies *which constraint to optimize*.
:::
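The worked numbers above can be reproduced with a short sketch of the α-β model. The values α = 2 μs and β = 500 MB/s are illustrative choices consistent with the ten-megabyte example, not measured constants:

```python
def transfer_time_us(n_bytes, alpha_us=2.0, beta_bytes_per_us=500.0):
    """Hockney alpha-beta model: T(n) = alpha + n / beta."""
    return alpha_us + n_bytes / beta_bytes_per_us

# Critical message size where the latency and bandwidth terms are equal.
n_critical = 2.0 * 500.0            # alpha * beta = 1000 bytes

t_large = transfer_time_us(10e6)    # 10 MB: 20,002 us total
t_small = transfer_time_us(100)     # 100 B: 2.2 us total
latency_share_large = 2.0 / t_large # alpha is ~0.01% of the large transfer
```

Below `n_critical` the fixed startup tax dominates; above it, the per-byte fee does.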
### The LogP Model {#sec-communication-collective-operations-collective-operations-logp-model-e45d}
@@ -389,7 +389,7 @@ The key insight of LogP is distinguishing **network latency** (L, which can be h
**Effective time**: $\max(500, 100 + 100) = 500\ \mu s$ (communication hidden!).
-**The Systems Insight**: α-β tells you the total communication time. LogP tells you **how much of it you can hide**. When designing pipelined training, optimize for low $o$ (kernel bypass, GPUDirect) rather than just high $\beta$.
+**The Systems Insight**: α-β tells you the total communication time. LogP tells you **how much of it you can hide**. When designing pipelined training, optimize for low $o$ (kernel bypass, GPUDirect) rather than high $\beta$ alone.
:::
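The overlap arithmetic in the example can be sketched as follows. This is an idealized model in which the entire transfer (overhead $o$ plus latency $L$) overlaps with computation; the numbers mirror the 500/100/100 μs case above:

```python
def effective_step_time_us(t_compute_us, o_us, L_us):
    # If the compute kernel is long enough, the transfer (o + L)
    # hides behind it; otherwise the excess communication is exposed.
    return max(t_compute_us, o_us + L_us)

hidden = effective_step_time_us(500, 100, 100)   # communication fully hidden
exposed = effective_step_time_us(150, 100, 100)  # compute too short to cover it
```

In the second case the step is gated by communication, which is exactly the regime where reducing $o$ (not buying bandwidth) pays off.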
The choice between models depends on the analysis context:
@@ -516,7 +516,7 @@ The alpha-beta model provides useful first-order predictions, but real communica
The table reveals two critical lessons. First, the alpha-beta model underestimates small-message latency by 7--8$\times$ because it accounts only for wire-level propagation, not the software stack overhead. For latency-sensitive operations (tensor parallelism AllReduce, MoE token routing), the effective alpha is 5--10$\times$ higher than the physical wire latency. Second, for large messages the model is accurate to within 8--15%, confirming that bandwidth is the binding constraint and that NCCL's internal optimizations (channel pipelining, kernel fusion) successfully saturate the available links.
-This reality gap has practical consequences for algorithm selection. The crossover point between Ring and Tree AllReduce shifts upward in practice because the effective alpha is larger than the wire-level value. Engineers who use textbook alpha values will underestimate latency costs and may choose Ring when Tree would perform better. A robust practice is to measure the effective alpha on your specific cluster by benchmarking small-message AllReduce latency, then use that measured value in all subsequent calculations.
+This reality gap has practical consequences for algorithm selection. The crossover point between Ring and Tree AllReduce shifts upward in practice because the effective alpha is larger than the wire-level value. Engineers who use textbook alpha values will underestimate latency costs and may choose Ring when Tree would perform better. A robust practice is to measure the effective alpha on the specific cluster by benchmarking small-message AllReduce latency, then use that measured value in all subsequent calculations.
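The measurement practice described here reduces to a two-point fit of the α-β line. The timings below are hypothetical placeholders, not benchmark results:

```python
def fit_alpha_beta(n1_bytes, t1_us, n2_bytes, t2_us):
    # Fit T(n) = alpha + n/beta from two (message size, latency) samples:
    # the slope gives 1/beta, the intercept the effective alpha.
    inv_beta = (t2_us - t1_us) / (n2_bytes - n1_bytes)
    alpha_us = t1_us - n1_bytes * inv_beta
    return alpha_us, 1.0 / inv_beta

# Hypothetical small- and large-message AllReduce timings (microseconds).
alpha_eff_us, beta_bytes_per_us = fit_alpha_beta(1_000, 12.0, 1_001_000, 2_012.0)
```

With these inputs the effective alpha comes out to 10 μs even though the wire-level latency might be 1--2 μs, illustrating the 5--10$\times$ gap discussed above.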
## Choosing the Vehicle: Collective Operation Primitives {#sec-communication-collective-operations-collective-operations-collective-operation-vocabulary-fdc7}
@@ -528,7 +528,7 @@ If a GPU simply opens a socket and sends a massive gradient to another GPU, the
1. **Significance (Quantitative):** They are the primary mechanisms for managing **Data Movement** across the fleet. Within the **Iron Law**, collective operations (e.g., AllReduce) determine the total **Communication Overhead ($L_{\text{lat}}$)** and whether training becomes **Bandwidth-Bound ($BW$)**.
2. **Distinction (Durable):** Unlike **Point-to-Point Communication** (one-to-one), Collective Operations are **Group-Aware** (one-to-all or all-to-all), using optimized algorithms (e.g., Ring, Tree) to minimize redundant transfers.
-3. **Common Pitfall:** A frequent misconception is that collectives are "just library calls." In reality, they are **Synchronization Barriers**: every participant must reach the collective call point before the operation can complete, making them the primary source of **Straggler-Induced Stalls**.
+3. **Common Pitfall:** A frequent misconception is that collectives are merely library calls. In reality, they are **Synchronization Barriers**: every participant must reach the collective call point before the operation can complete, making them the primary source of **Straggler-Induced Stalls**.
:::
@@ -802,7 +802,7 @@ The problem with this pattern becomes clear on real hardware. The round $k=2$ pa
### Double Binary Tree {#sec-communication-double-tree}
-NCCL's default algorithm for many message sizes is actually a **Double Binary Tree**, which addresses the bandwidth inefficiency of a standard binary tree without requiring the non-local communication of Butterfly. The idea is to construct two independent binary trees that together cover all links, then run both trees simultaneously, each carrying half the data.
+NCCL's default algorithm for many message sizes is a **Double Binary Tree**, which addresses the bandwidth inefficiency of a standard binary tree without requiring the non-local communication of Butterfly. The idea is to construct two independent binary trees that together cover all links, then run both trees simultaneously, each carrying half the data.
In a standard binary tree, at each level only half the links are active (the other half are idle because those nodes are receiving, not sending). By constructing a second, complementary tree (rooted at a different node, with edges that cover the links unused by the first tree), both trees can operate in parallel. Each tree carries $M/2$ bytes, and since their link utilization is complementary, the aggregate link utilization approaches 100%, matching Ring's bandwidth efficiency while retaining Tree's $O(\log N)$ latency.
@@ -882,7 +882,7 @@ fig = plt.gcf()
### The Algorithm Crossover Point {#sec-communication-collective-operations-algorithm-crossover}
-When should we use Ring vs. Tree? We can derive the crossover point by setting $T_{\text{ring}} = T_{\text{tree}}$ and solving for $M$.
+The choice between Ring and Tree depends on the crossover point derived by setting $T_{\text{ring}} = T_{\text{tree}}$ and solving for $M$.
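Before the full derivation, the trade-off can be checked numerically. The helper names are ours, and β = 10 GB/s is an assumed illustrative value; N = 64 and α = 10 μs match the Ring-vs-Tree example used elsewhere in this chapter:

```python
def ring_latency_penalty_us(n_gpus, alpha_us):
    # Ring AllReduce serializes 2(N-1) hops, each paying the startup tax alpha.
    return 2 * (n_gpus - 1) * alpha_us

def crossover_bytes(n_gpus, alpha_s, beta_bytes_per_s):
    # Chapter rule of thumb: M_crossover ~ N * alpha * beta.
    # Below this size Tree wins (latency-bound); above it Ring wins.
    return n_gpus * alpha_s * beta_bytes_per_s

penalty_us = ring_latency_penalty_us(64, 10)    # 1,260 us of pure latency
m_cross = crossover_bytes(64, 10e-6, 10e9)      # ~6.4 MB under assumed beta
```

A 1 MB gradient sits well below this crossover, so Tree is the better choice at N = 64 despite its bandwidth inefficiency.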
#### Step 1: Write the Full Time Equations {.unnumbered}
@@ -1352,7 +1352,7 @@ The interaction between rail-optimized routing and hierarchical AllReduce is wor
::: {.callout-war-story}
## The NCCL Topology Discovery
-How does a software library running deep inside a GPU actually know whether another GPU is sitting right next to it on a high-speed NVLink switch, or sitting 100 meters away across an InfiniBand fabric? NVIDIA's Collective Communications Library (NCCL) performs a critical "topology detection" phase at initialization to map these physical realities.
+A software library running deep inside a GPU must determine whether another GPU sits on a local NVLink switch or 100 meters away across an InfiniBand fabric. NVIDIA's Collective Communications Library (NCCL) performs a critical "topology detection" phase at initialization to map these physical realities.
:::
Misalignment between logical and physical topology is a common source of performance degradation that is difficult to diagnose without careful profiling. If process ranks are assigned arbitrarily (for example, by the job scheduler without topology awareness), the hierarchical AllReduce may route cross-node traffic through the wrong NICs, creating hotspots that reduce effective bandwidth by 2--4$\times$. Production deployments use topology-aware rank assignment (configured through NCCL's `CUDA_VISIBLE_DEVICES` and the scheduler's GPU binding policies) to ensure alignment.
@@ -1361,9 +1361,9 @@ The hierarchical approach, combined with topology-aware routing, represents the
## The Last Resort: Gradient Compression {#sec-communication-collective-operations-collective-operations-gradient-compression-7a5c}
-What happens when you have perfectly tuned your topology, selected the optimal AllReduce algorithm, and you are still fundamentally bottlenecked by the physical speed of light across your datacenter? You must send fewer bits. The previous section attacked the communication bottleneck from the algorithm side; now we look at payload compression.
+Even after perfectly tuning topology and selecting the optimal AllReduce algorithm, a system may remain fundamentally bottlenecked by the physical speed of light across the datacenter. The only remaining option is to send fewer bits. The previous section attacked the communication bottleneck from the algorithm side; now we look at payload compression.
-This situation arises most acutely in two scenarios: training across datacenters connected by wide-area networks (where bandwidth is 10--100$\times$ lower than InfiniBand), and training on cloud instances with commodity Ethernet networking (where RDMA is unavailable and effective bandwidth is 10--25 GB/s). In these bandwidth-constrained settings, gradient compression techniques can reduce communication volume by 4--1000$\times$, at the cost of introducing noise into the optimization process. The central question becomes: how much noise can the optimization process tolerate before convergence is compromised?
+This situation arises most acutely in two scenarios: training across datacenters connected by wide-area networks (where bandwidth is 10--100$\times$ lower than InfiniBand), and training on cloud instances with commodity Ethernet networking (where RDMA is unavailable and effective bandwidth is 10--25 GB/s). In these bandwidth-constrained settings, gradient compression techniques can reduce communication volume by 4--1000$\times$, at the cost of introducing noise into the optimization process. The central tension is between bandwidth reduction and the noise tolerance of the optimization process — how much compression the optimizer can absorb before convergence is compromised.
### Quantization: Reducing Precision {#sec-communication-collective-operations-collective-operations-quantization-reducing-precision-815b}
@@ -1426,7 +1426,7 @@ We solve the conflict between compression and convergence with **Error Feedback*
***Error Feedback***\index{Error Feedback!definition} is a convergence-preserving technique that maintains a local accumulator to store the residual information lost during gradient compression.
1. **Significance (Quantitative):** It allows for aggressive **Gradient Compression** (e.g., Top-K, 1-bit quantization) without sacrificing convergence. By re-injecting the residual ($e_t$) into the next gradient update, it ensures that small but consistent signals are eventually transmitted, maintaining an **Unbiased Estimator** over time.
-2. **Distinction (Durable):** Unlike **Lossy Compression** (which permanently discards information), Error Feedback is a **Delayed Transmission** strategy: it ensures that information is not lost, but simply deferred until it accumulates above the compression threshold.
+2. **Distinction (Durable):** Unlike **Lossy Compression** (which permanently discards information), Error Feedback is a **Delayed Transmission** strategy: it ensures that information is not lost, but deferred until it accumulates above the compression threshold.
3. **Common Pitfall:** A frequent misconception is that Error Feedback is "built into" all compressors. In reality, without explicit error tracking, biased compressors (like Top-K) can lead to **Divergence** or poor generalization because small gradient components are never applied.
:::
@@ -1475,7 +1475,7 @@ This property—that compression error telescopes across time—is why error fee
1. **No information is lost**: The sum (transmitted + error buffer) always equals the cumulative true gradient.
2. **Small gradients accumulate**: Individual gradients of 0.3--0.4 were too small to transmit alone, but they accumulated until crossing the threshold.
3. **Error oscillates around zero**: The error buffer $e_t$ does not grow unboundedly—it oscillates as gradients are "paid back" through transmission.
-4. **Convergence preserved**: Over time, the model receives approximately the correct total gradient, just with some delay.
+4. **Convergence preserved**: Over time, the model receives approximately the correct total gradient, with some delay.
:::
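The accounting in the callout above can be reproduced with a scalar sketch of error feedback, using the update rule $e_{t+1} = (g_t + e_t) - v_t$ from this section; the threshold of 1.0 is an illustrative choice:

```python
def error_feedback_step(g, e, threshold=1.0):
    # e_{t+1} = (g_t + e_t) - v_t: transmit the error-corrected gradient
    # only once it crosses the compressor's threshold; defer the rest.
    corrected = g + e
    v = corrected if abs(corrected) >= threshold else 0.0
    return v, corrected - v

grads = [0.3, 0.4, 0.4, 0.3]     # each individually below the threshold
e, sent = 0.0, []
for g in grads:
    v, e = error_feedback_step(g, e)
    sent.append(v)

# Invariant: transmitted total + residual buffer == cumulative true gradient.
conserved = abs(sum(sent) + e - sum(grads)) < 1e-12
```

Nothing is transmitted until step three, when the accumulated 1.1 crosses the threshold; the conservation invariant holds at every step.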
### 1-bit Adam: Compression-Aware Optimization {#sec-communication-1bit-adam}
@@ -1505,9 +1505,9 @@ Verify your understanding of when and how to apply gradient compression:
Gradient compression is not free; it trades reduced communication for increased variance in the optimization process.
-| **Method** | **Compression Ratio** | **Convergence Impact** | **Best Use Case** |
-|:--------------------------|----------------------:|-------------------------------:|:-------------------------------|
-| **FP16** | 2$\times$ | Negligible | Default for all training |
+| **Method** | **Compression Ratio** | **Convergence Impact** | **Best Use Case** |
+|:--------------------------|----------------------:|--------------------------------:|:-------------------------------|
+| **FP16** | 2$\times$ | Negligible | Default for all training |
| **INT8 + Error FB** | 4$\times$ | Minor slowdown (~5--10%) | Bandwidth-constrained clusters |
| **Top-K (1%) + Error FB** | 100$\times$ | Moderate slowdown (~10--20%) | Cross-datacenter training |
| **1-bit + Error FB** | 32$\times$ | Significant slowdown (~20--30%) | Extreme bandwidth constraints |
@@ -1518,15 +1518,15 @@ Gradient compression is not free; it trades reduced communication for increased
- **Avoid aggressive compression** when compute time dominates—the convergence slowdown is not worth the communication savings when you are not bottlenecked on communication.
- **Always use Error Feedback** with any compression beyond FP16. Without it, convergence is not guaranteed.
-The $\alpha$-$\beta$ analysis from @sec-communication-collective-operations-collective-operations-network-performance-modeling-0d8e helps determine when compression pays off: if your gradients are large enough to be bandwidth-bound ($M > n$), compression directly reduces wall-clock time. If they are latency-bound ($M < n$), compression will not help because the latency term dominates regardless of message size.
+The $\alpha$-$\beta$ analysis from @sec-communication-collective-operations-collective-operations-network-performance-modeling-0d8e helps determine when compression pays off: if the gradients are large enough to be bandwidth-bound ($M > n$), compression directly reduces wall-clock time. If they are latency-bound ($M < n$), compression will not help because the latency term dominates regardless of message size.
The decision of whether compression is worthwhile requires comparing the communication time savings against the convergence penalty. Consider a concrete scenario: a training run requires 100,000 steps to converge without compression, with each step taking 500 ms (300 ms compute, 200 ms communication). The total training time is 50,000 seconds. Applying INT8 compression with error feedback reduces communication time by 4$\times$ (from 200 ms to 50 ms per step) but increases the required steps by 10% (from 100,000 to 110,000). The new total time is $110{,}000 \times 0.35 = 38{,}500$ seconds, a 23% improvement. The compression is worthwhile because the per-step communication savings (150 ms) outweigh the additional steps required.
-Now consider the same model on a faster network where communication takes only 30 ms per step. Compression reduces this to 7.5 ms (saving 22.5 ms per step) but still adds 10% more steps. The new total time is $110{,}000 \times 0.3075 = 33{,}825$ seconds versus $100{,}000 \times 0.33 = 33{,}000$ seconds without compression. The compression actually *increases* total training time by 2.5%, because the per-step savings (22.5 ms) are too small relative to the convergence penalty (10,000 extra steps at 307.5 ms each). This example illustrates why compression should only be applied when the communication-to-computation ratio ($T_{\text{comm}}/T_{\text{compute}}$) exceeds a threshold that depends on the specific convergence penalty of the chosen method.
+Now consider the same model on a faster network where communication takes only 30 ms per step. Compression reduces this to 7.5 ms (saving 22.5 ms per step) but still adds 10% more steps. The new total time is $110{,}000 \times 0.3075 = 33{,}825$ seconds versus $100{,}000 \times 0.33 = 33{,}000$ seconds without compression. The compression *increases* total training time by 2.5%, because the per-step savings (22.5 ms) are too small relative to the convergence penalty (10,000 extra steps at 307.5 ms each). This example illustrates why compression should only be applied when the communication-to-computation ratio ($T_{\text{comm}}/T_{\text{compute}}$) exceeds a threshold that depends on the specific convergence penalty of the chosen method.
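Both worked scenarios reduce to one line of arithmetic per case, sketched here with the step counts and per-step times from the text:

```python
def total_time_s(steps, compute_ms, comm_ms):
    # Total wall-clock training time for a fixed per-step cost.
    return steps * (compute_ms + comm_ms) / 1000.0

# Slow network: INT8 cuts 200 ms comm to 50 ms but adds 10% more steps.
slow_baseline = total_time_s(100_000, 300, 200)     # 50,000 s
slow_int8 = total_time_s(110_000, 300, 50)          # 38,500 s: 23% faster

# Fast network: the same 4x compression saves only 22.5 ms per step.
fast_baseline = total_time_s(100_000, 300, 30)      # 33,000 s
fast_int8 = total_time_s(110_000, 300, 7.5)         # 33,825 s: 2.5% slower
```

The sign of the outcome flips purely because the communication-to-computation ratio changed, not because the compressor did.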
## The Communication Library Landscape {#sec-communication-collective-operations-collective-operations-communication-libraries-nccl-5307}
-Writing a highly optimized, topology-aware, hierarchical Ring AllReduce from scratch in C++ would take a dedicated team of engineers months of effort. Fortunately, you do not have to. The preceding sections developed three complementary strategies for taming communication cost, all of which are encapsulated in modern communication libraries.
+Writing a highly optimized, topology-aware, hierarchical Ring AllReduce from scratch in C++ would take a dedicated team of engineers months of effort. Fortunately, this is unnecessary. The preceding sections developed three complementary strategies for taming communication cost, all of which are encapsulated in modern communication libraries.
### NCCL: Why It Dominates GPU Workloads {#sec-communication-nccl}
@@ -1608,7 +1608,7 @@ When a distributed training job runs slower than expected, the communication lib
1. **Profile with NCCL debug logging**: Set `NCCL_DEBUG=INFO` to see which algorithm (Ring, Tree) and protocol (Simple, LL, LL128) NCCL selects for each collective. Unexpected algorithm choices often indicate topology mis-detection.
-2. **Measure bare collective performance**: Run NCCL's built-in benchmarks (`nccl-tests`) with your cluster's exact topology to establish the achievable bandwidth baseline. If `nccl-tests` achieves 90% of theoretical bandwidth but your training achieves only 50%, the bottleneck is in how your training framework invokes collectives, not in the communication library itself.
+2. **Measure bare collective performance**: Run NCCL's built-in benchmarks (`nccl-tests`) with the cluster's exact topology to establish the achievable bandwidth baseline. If `nccl-tests` achieves 90% of theoretical bandwidth but the training job achieves only 50%, the bottleneck is in how your training framework invokes collectives, not in the communication library itself.
3. **Check for stragglers**: Use `torch.cuda.synchronize()` before and after each collective to measure per-operation time. A collective that takes 2$\times$ longer than expected often indicates one GPU is delayed (thermal throttling, ECC error recovery, or unbalanced data loading), which stalls the entire barrier.
@@ -1620,7 +1620,7 @@ When a distributed training job runs slower than expected, the communication lib
## Communication-Computation Overlap {#sec-communication-overlap}
-What if you could make a 20-millisecond network delay effectively disappear? You do not achieve this by upgrading the physical network; you achieve it by hiding the communication behind the arithmetic. The preceding sections addressed communication cost from three physical and algorithmic angles.
+A 20-millisecond network delay can effectively disappear — not by upgrading the physical network, but by hiding the communication behind arithmetic. The preceding sections addressed communication cost from three physical and algorithmic angles.
### Layer-by-Layer Overlap {#sec-communication-layer-overlap}
@@ -1720,7 +1720,6 @@ overlap_savings_pct_str = OverlapBudgetCalc.overlap_savings_pct_str
exposed_per_layer_ms_str = OverlapBudgetCalc.exposed_per_layer_ms_str
```
**Problem**: A 32-layer transformer model (7B parameters) is trained on 64 GPUs. Each layer's backward pass takes 15 ms. The hierarchical AllReduce for each layer's gradients (~880 MB per layer) takes `{python} allreduce_per_layer_ms_str` ms using 100 MB buckets. What is the step time with and without overlap?
**Without overlap (sequential)**:
@@ -1741,7 +1740,7 @@ The combination of hierarchical AllReduce (reducing the volume of data to commun
::: {.callout-notebook}
## Napkin Math: The Overlap Budget
-Can we hide the communication cost of a 70B parameter model on 8 GPUs? Let's check the physics.
+The following analysis determines whether the communication cost of a 70B parameter model on 8 GPUs can be hidden behind computation. The physics determines the answer.
1. **Gradient Size:** 70B params$\times$ 2 bytes (FP16) = 140 GB.
2. **Communication Time:** Using Ring AllReduce on NVLink (900 GB/s), time is $\approx \frac{2 \cdot 140\text{GB}}{900\text{GB/s}} = 0.31\text{s}$.
3. **Compute Time:** A typical forward/backward pass for a block of this size takes $\approx 2.1\text{s}$.
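The napkin math above can be checked directly; the 2.1 s compute time is the estimate from item 3, not a measurement:

```python
grad_bytes = 70e9 * 2                 # 70B params in FP16: 140 GB of gradients
nvlink_bw = 900e9                     # NVLink bandwidth, bytes/s
t_comm = 2 * grad_bytes / nvlink_bw   # Ring AllReduce moves ~2x the payload
t_compute = 2.1                       # forward/backward estimate, seconds
comm_fraction = t_comm / t_compute    # ~0.15: comm fits under compute
```

With communication at roughly 15% of compute, the overlap budget closes comfortably.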
@@ -1752,19 +1751,19 @@ Since communication takes only 15% of the compute time, it fits comfortably with
## Fallacies and Pitfalls {#sec-communication-collective-operations-collective-operations-fallacies-pitfalls-9cd0}
Fallacy: ***Bandwidth is the only metric that matters.***
-For small messages (pipeline parallelism activations, MoE tokens), **latency** ($\alpha$) dominates. Buying 400G networking will not help if your message takes 5 μs to serialize in software. The critical message size $n = \alpha \cdot \beta$ determines which metric to optimize—below $n$, reduce latency; above it, increase bandwidth.
+For small messages (pipeline parallelism activations, MoE tokens), **latency** ($\alpha$) dominates. Buying 400G networking will not help if the message takes 5 μs to serialize in software. The critical message size $n = \alpha \cdot \beta$ determines which metric to optimize—below $n$, reduce latency; above it, increase bandwidth.
Pitfall: ***Assuming AllReduce works for everything.***
AllReduce creates a global barrier and assumes all participants contribute identical data shapes. In **Expert Parallelism (MoE)** and **Recommendation Systems**, where each worker needs to send distinct data to every other worker, AllReduce is fundamentally wrong. These workloads require AlltoAll, which has $O(N^2)$ logical connections and hits network contention limits at much smaller cluster sizes.
Fallacy: ***Ring AllReduce is always optimal.***
-Ring achieves bandwidth-optimal $2\frac{N-1}{N}\frac{M}{\beta}$ but pays $O(N)$ latency. For a 1 MB gradient across 64 GPUs with $\alpha = 10\ \mu s$, Tree AllReduce actually wins because Ring's 1,260 μs latency penalty exceeds Tree's 1,000 μs bandwidth penalty. The crossover formula $M_{\text{crossover}} \approx N \cdot \alpha \cdot \beta$ determines when to switch algorithms.
+Ring achieves bandwidth-optimal $2\frac{N-1}{N}\frac{M}{\beta}$ but pays $O(N)$ latency. For a 1 MB gradient across 64 GPUs with $\alpha = 10\ \mu s$, Tree AllReduce wins because Ring's 1,260 μs latency penalty exceeds Tree's 1,000 μs bandwidth penalty. The crossover formula $M_{\text{crossover}} \approx N \cdot \alpha \cdot \beta$ determines when to switch algorithms.
Pitfall: ***Compressing gradients without error feedback.***
Top-K sparsification can achieve 99% compression, but naively discarding small gradients causes divergence. Without error feedback ($e_{t+1} = (g_t + e_t) - v_t$), gradients below the threshold are lost forever, accumulating systematic bias that eventually destabilizes training.
Fallacy: ***Async collectives always hide latency.***
-Python's `dist.all_reduce(..., async_op=True)` only returns control to the CPU. The LogP model distinguishes network latency $L$ (overlappable) from processor overhead $o$ (non-overlappable). If the GPU compute kernel is shorter than the communication overhead, the GPU still stalls. You can only hide communication when $T_{\text{compute}} > o$.
+Python's `dist.all_reduce(..., async_op=True)` only returns control to the CPU. The LogP model distinguishes network latency $L$ (overlappable) from processor overhead $o$ (non-overlappable). If the GPU compute kernel is shorter than the communication overhead, the GPU still stalls. Communication can only be hidden when $T_{\text{compute}} > o$.
Pitfall: ***Silent data corruption in the network.***
Networks are not perfect. A bad cable can flip bits. Unlike TCP (which checksums everything), high-speed RDMA protocols sometimes have weaker guarantees or buggy NIC firmware. At 10,000 nodes running 24/7, "rare" bit flips (1 in $10^{15}$) happen multiple times per day, corrupting gradients without any error signal.
@@ -1787,11 +1786,11 @@ When even the fastest wires are not enough, **Gradient Compression** provides th
::: {.callout-takeaways title="Every Byte Has a Travel Cost"}
-* **The $\alpha$-$\beta$ model reveals the bottleneck**: The critical message size $n = \alpha \cdot \beta$ determines whether you should optimize software latency (small messages) or hardware bandwidth (large payloads). In practice, NCCL's effective $\alpha$ is 5--10$\times$ higher than wire-level latency.
+* **The $\alpha$-$\beta$ model reveals the bottleneck**: The critical message size $n = \alpha \cdot \beta$ determines whether to optimize software latency (small messages) or hardware bandwidth (large payloads). In practice, NCCL's effective $\alpha$ is 5--10$\times$ higher than wire-level latency.
* **Algorithm choice is scale-dependent**: Ring AllReduce is bandwidth-optimal but pays $O(N)$ latency; Tree AllReduce is latency-optimal ($O(\log N)$) but bandwidth-inefficient. Use the crossover formula $M_{\text{crossover}} \approx N \alpha \beta$ to choose.
* **Hierarchical algorithms multiply bandwidth**: By performing local reductions over fast NVLink before crossing the slow InfiniBand bridge, hierarchical collectives effectively multiply inter-node bandwidth by the number of GPUs per node.
* **AlltoAll is the contention king**: Unlike AllReduce, AlltoAll creates $O(N^2)$ logical connections. This makes Expert Parallelism (MoE) and Recommendation Systems fundamentally harder to scale than dense LLMs.
-* **Error Feedback makes lossy compression safe**: You can throw away 99% of gradients via sparsification, provided you accumulate the "error" locally and add it to the next step. This turns biased estimators into unbiased ones over time.
+* **Error Feedback makes lossy compression safe**: Sparsification can discard 99% of gradients, provided the "error" is accumulated locally and added to the next step. This turns biased estimators into unbiased ones over time.
* **Overlap is the final multiplier**: Communication-computation pipelining through gradient hooks and bucket fusion can hide 90--95% of communication time behind backward pass computation for deep models.
* **Topology discovery is not optional**: Modern libraries like NCCL dynamically map logical rings to physical wires to avoid "hot spots" and maximize bisection bandwidth. Misaligned rank-to-GPU mapping can degrade performance by 2--4$\times$.
@@ -1802,11 +1801,11 @@ Throughout this chapter, we have seen how these communication traffic patterns c
::: {.callout-lighthouse title="Communication Archetype Patterns"}
The "Travel Manifest" for a gradient depends on the system's objective function and constraint regime. Each lighthouse archetype faces a distinct communication challenge, and the techniques developed in this chapter map to those challenges differently:
| **Archetype** | **Primary Collective** | **Dominant Friction** | **Optimization Strategy** |
|:-------------------------------------|:-----------------------|:--------------------------------|:---------------------------------------------|
| **Archetype A (GPT-4 / Llama-3)** | AllReduce | Bandwidth ($\beta$) | Hierarchical AllReduce; Rail-optimization |
| **Archetype B (DLRM at Scale)** | AllToAll | Latency ($\alpha$) & Contention | Topology-aware routing; token load-balancing |
| **Archetype C (Federated MobileNet)**| P2P / Async | Connectivity & Latency | Aggressive quantization; Error Feedback |
| **Archetype** | **Primary Collective** | **Dominant Friction** | **Optimization Strategy** |
|:--------------------------------------|:-----------------------|:--------------------------------|:---------------------------------------------|
| **Archetype A (GPT-4 / Llama-3)** | AllReduce | Bandwidth ($\beta$) | Hierarchical AllReduce; Rail-optimization |
| **Archetype B (DLRM at Scale)** | AllToAll | Latency ($\alpha$) & Contention | Topology-aware routing; token load-balancing |
| **Archetype C (Federated MobileNet)** | P2P / Async | Connectivity & Latency | Aggressive quantization; Error Feedback |
The key insight is that Archetype A and Archetype B, despite both operating at datacenter scale, face fundamentally different bottlenecks. **Archetype A (GPT-4 / Llama-3)** is bandwidth-bound and benefits from hierarchical AllReduce combined with communication-computation overlap. **Archetype B (DLRM at Scale)** is latency-bound and benefits from low-latency switches and topology-aware AllToAll routing. Applying the wrong optimization to the wrong archetype wastes engineering effort without improving performance.



@@ -29,7 +29,7 @@
This chapter's position in the book's organizing framework, *the Fleet Stack*, clarifies how the principles developed across every layer integrate into a unified discipline for engineering intelligence at scale.
::: {.callout-note title="Connection: The Fleet Stack"}
This is the final synthesis. We have climbed the Fleet Stack layer by layer: from **Infrastructure** (Part I: The Fleet) to **Distribution** (Part II: Distributed ML), **Serving** (Part III: Deployment at Scale), and **Governance** (Part IV: The Responsible Fleet). Now, we step back to see the whole structure. This conclusion integrates every principle we have studied into a single, cohesive discipline for engineering intelligence at global scale.
This is the final synthesis. The preceding chapters built the Fleet Stack layer by layer: from **Infrastructure** (Part I: The Fleet) to **Distribution** (Part II: Distributed ML), **Serving** (Part III: Deployment at Scale), and **Governance** (Part IV: The Responsible Fleet). This conclusion steps back to see the whole structure, integrating every principle into a single, cohesive discipline for engineering intelligence at global scale.
:::
## Synthesizing Distributed ML Systems {#sec-conclusion-synthesizing-distributed-ml-systems-bac9}
@@ -216,7 +216,7 @@ class FermiEstimate:
n_gpus = 25000
tflops_per_gpu = 1000
w_per_gpu = 700
brain_synapses = 1e14
brain_firing_hz = 100
brain_power_w = 20
@@ -224,11 +224,11 @@ class FermiEstimate:
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
machine_ops = n_gpus * tflops_per_gpu * 1e12
machine_power_mw = (n_gpus * w_per_gpu) / 1e6
brain_ops = brain_synapses * brain_firing_hz
throughput_ratio = machine_ops / brain_ops
machine_eff = machine_ops / (machine_power_mw * 1e6)
brain_eff = brain_ops / brain_power_w
efficiency_gap = brain_eff / machine_eff
@@ -282,7 +282,7 @@ Go build systems that scale. Go build systems that endure. Go build systems that
*Prof. Vijay Janapa Reddi, Harvard University*
To conclude, here are the essential insights from this book.
The essential insights from this textbook are:
::: {.callout-takeaways}


@@ -163,7 +163,7 @@ class StorageHierarchyAnalysis:
# Weights (FP16) + Optimizer (FP32 x 2)
bytes_per_param_ckpt = BYTES_PER_FP16 + (2 * BYTES_PER_FP32)
ckpt_total_gb_val = (gpt3_params * bytes_per_param_ckpt) / BILLION
# Per-node shard (assume 256 nodes)
n_nodes = 256
node_shard_gb = ckpt_total_gb_val / n_nodes
@@ -252,7 +252,7 @@ pcie5_bw_gbs = f"{PCIE_GEN5_BW.m_as(GB/second):.0f}"
## The Fuel Line {#sec-storage-fuel-line}
The previous two chapters built the engine and wired it together. @sec-compute-infrastructure established that a modern accelerator node packs eight GPUs delivering petaFLOPS of aggregate compute, and @sec-network-fabrics showed how InfiniBand fabrics connect thousands of such nodes at hundreds of gigabits per second. But an engine without fuel is expensive sculpture. The fleet also needs *fuel*: training data, model weights, optimizer state, and intermediate checkpoints. This chapter asks how to deliver that fuel fast enough that 1,000 accelerators never starve.
The previous two chapters built the engine and wired it together. @sec-compute-infrastructure established that a modern accelerator node packs eight GPUs delivering petaFLOPS of aggregate compute, and @sec-network-fabrics showed how InfiniBand fabrics connect thousands of such nodes at hundreds of gigabits per second. An engine without fuel is expensive sculpture. The fleet also needs *fuel*: training data, model weights, optimizer state, and intermediate checkpoints. This chapter asks how to deliver that fuel fast enough that 1,000 accelerators never starve.
::: {.callout-note title="Connection: The Fleet Stack"}


@@ -103,7 +103,7 @@ from mlsys.formatting import fmt, sci, check
## Why Distribution Is Necessary {#sec-distributed-training-systems-systems-multimachine-scaling-fundamentals-ff96}
Part I built the physical fleet: @sec-compute-infrastructure established the accelerator hierarchy, @sec-network-fabrics wired nodes into a high-bandwidth fabric, and @sec-data-storage completed the infrastructure with storage pipelines that keep the fleet fed. With the physical foundation in place, we now confront the algorithmic question that defines Part II: how do we split a single training job across this hardware?
Part I built the physical fleet: @sec-compute-infrastructure established the accelerator hierarchy, @sec-network-fabrics wired nodes into a high-bandwidth fabric, and @sec-data-storage completed the infrastructure with storage pipelines that keep the fleet fed. With the physical foundation in place, the algorithmic challenge that defines Part II is splitting a single training job across this hardware.
If you could purchase a single GPU with 100 terabytes of memory and an exaflop of compute, distributed training would not exist. Because the laws of physics prevent this, we are forced to shatter our models across thousands of independent chips. In the **Fleet Stack** framework (@sec-vol2-introduction), Distributed Training represents the **Distribution Layer** — the logic that partitions the mathematical workload across the physical fleet.
@@ -558,7 +558,7 @@ These coordination requirements shape the four main approaches to distributed tr
## Data Parallelism {#sec-distributed-training-systems-systems-data-parallelism-6132}
What is the simplest way to use eight GPUs to process a massive dataset? You give each GPU a complete, identical copy of the model, but only assign it one-eighth of the data. Data parallelism represents the most straightforward distributed approach and the natural starting point for understanding how distributed training works in practice.
The simplest approach gives each GPU a complete, identical copy of the model and assigns it one-eighth of the data. Data parallelism represents the most straightforward distributed approach and the natural starting point for understanding how distributed training works in practice.
::: {.callout-definition title="Data Parallelism"}
@@ -1001,7 +1001,7 @@ class Scaling8GPU:
# For large N, ~2 * Params * 2 = 4 * Params.
sync_size_gb = (params_b * BILLION * 4) / BILLION
comm_8gpu_ms_val = (sync_size_gb / nvlink_bw) * 1000
total_8gpu_ms_val = compute_8gpu_ms + comm_8gpu_ms_val
speedup_8gpu_val = single_gpu_step_s / (total_8gpu_ms_val / 1000)
efficiency_8gpu_val = speedup_8gpu_val / gpus_per_node * 100
@@ -1084,13 +1084,13 @@ class Scaling32GPU:
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
sync_size_gb = (params_b * BILLION * 4) / BILLION
inter_node_ms_val = (sync_size_gb / ib_bw) * 1000
comm_32gpu_ms_val = inter_node_ms_val + intra_node_ms
total_32gpu_ms_val = compute_32gpu_ms + comm_32gpu_ms_val
compute_pct_val = compute_32gpu_ms / total_32gpu_ms_val * 100
comm_pct_val = comm_32gpu_ms_val / total_32gpu_ms_val * 100
speedup_32gpu_val = single_gpu_step_s / (total_32gpu_ms_val / 1000)
efficiency_32gpu_val = speedup_32gpu_val / 32 * 100
training_32gpu_hours_val = training_hours_1gpu / speedup_32gpu_val
@@ -1148,7 +1148,6 @@ Communication dominates and becomes the bottleneck.
Better Approach: 8 GPUs with Gradient Accumulation
```{python}
#| label: grad-accum-calc
#| echo: false
@@ -1215,7 +1214,6 @@ compute_8gpu_ms = GradAccumScenario.compute_8gpu_ms
Key Insights
1. NVLink enables efficient scaling within single nodes (97% efficiency)
2. Inter-node communication kills efficiency (drops to 13%)
3. Gradient accumulation beats naive scaling for memory-bound models
@@ -1472,7 +1470,7 @@ These efficiency metrics directly influence the choice of parallelism strategy.
### Convergence Guarantees for Distributed Optimization {#sec-distributed-training-systems-systems-convergence-guarantees-distributed-optimization-350e}
Hardware efficiency metrics govern throughput, but convergence theory determines whether distributed training reaches the same solution quality as single-device training. Three questions arise: how does parallelism affect optimization convergence, when does adding workers help versus hurt, and how must learning rates be tuned for large-batch training?
Hardware efficiency metrics govern throughput, but convergence theory determines whether distributed training reaches the same solution quality as single-device training. Parallelism affects optimization convergence in three ways: convergence rate changes with batch size, adding workers yields diminishing returns beyond the critical batch size, and learning rates must scale with batch size.
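The learning-rate requirement can be made concrete with a minimal sketch of the linear scaling rule with warmup; the base learning rate, base batch size, and warmup length below are illustrative assumptions:

```python
def scaled_lr(base_lr, base_batch, global_batch):
    # Linear scaling rule: grow the learning rate in proportion
    # to the global batch size
    return base_lr * global_batch / base_batch

def warmup_lr(step, warmup_steps, target_lr):
    # Ramp linearly from zero to the scaled rate to avoid
    # early-training divergence at large batch sizes
    return target_lr * min(1.0, step / warmup_steps)

target = scaled_lr(0.1, 256, 8192)   # 32x batch -> 32x learning rate
halfway = warmup_lr(250, 500, target)
```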
### Convergence Rate for Synchronous Data Parallel SGD {#sec-distributed-training-systems-systems-convergence-rate-synchronous-data-parallel-sgd-2ed5}
@@ -1786,7 +1784,7 @@ The choice among these methods depends on the specific bottleneck. When network
## Model Parallelism {#sec-distributed-training-systems-systems-model-parallelism-7437}
What do you do when your model is so massive that even a single layer's weights exceed the memory capacity of your largest GPU? Data parallelism entirely collapses under these constraints. The memory optimization techniques examined in the previous section extend data parallelism's reach, but eventually, we must partition the model itself.
When a model is so massive that even a single layer's weights exceed the memory capacity of a GPU, data parallelism entirely collapses. The memory optimization techniques examined in the previous section extend data parallelism's reach, but eventually, we must partition the model itself.
Even with ZeRO-3 fully deployed, sharding optimizer states, gradients, and parameters across workers, some architectures remain intractable. A `{python} gpt3_params_b`B parameter model using FSDP across 64 GPUs still requires 700 GB / 64 = 11 GB of parameters per GPU before accounting for activations. For long-context transformers where activation memory dominates, a 2048-token sequence through `{python} gpt3_params_b`B parameters generates 200+ GB of intermediate activations, and no amount of optimizer sharding addresses this constraint. Model parallelism addresses these limitations by splitting the model architecture itself across devices, rather than replicating it with sharded state.
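The per-GPU shard arithmetic in this paragraph can be reproduced directly; the sketch assumes the 700 GB figure corresponds to 175B parameters held as FP32 master weights (4 bytes per parameter):

```python
params = 175e9        # assumed parameter count behind the 700 GB figure
bytes_per_param = 4   # FP32 master weights (assumption)
n_gpus = 64

total_gb = params * bytes_per_param / 1e9   # 700 GB of parameters
shard_gb = total_gb / n_gpus                # ~11 GB per GPU under FSDP
```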
@@ -2349,7 +2347,7 @@ This three-precision approach (FP32 master weights, FP8 GEMMs, FP16 accumulation
## Hybrid Parallelism {#sec-distributed-training-systems-systems-hybrid-parallelism-12a0}
How do you train a frontier model when data parallelism runs out of memory and model parallelism runs out of network bandwidth? You must orchestrate them simultaneously across three dimensions. The preceding sections revealed a fundamental tension: data parallelism scales throughput but demands massive memory, while model parallelism enables large models but starves the compute.
Training a frontier model when data parallelism runs out of memory and model parallelism runs out of network bandwidth requires orchestrating both strategies simultaneously across three dimensions. The preceding sections revealed a fundamental tension: data parallelism scales throughput but demands massive memory, while model parallelism enables large models but starves the compute.
Hybrid parallelism resolves this tension by applying both strategies orthogonally: model parallelism splits the architecture to fit available memory, while data parallelism scales throughput across multiple model replicas. Training a `{python} gpt3_params_b` billion parameter language model on a dataset of 300 billion tokens demonstrates this approach in practice. The neural network layers distribute across multiple GPUs through model parallelism, while data parallelism enables different GPU groups to process separate batches. This dual strategy addresses both memory constraints from model size and computational demands from dataset scale simultaneously, and it is precisely this combination that defines *Archetype A* training at frontier scale.
@@ -2558,14 +2556,14 @@ Total static memory: $35 + 4.4 = 39.4$ GB per GPU — identical to PPO's static
PPO's two-phase design (generate then train) introduces a fundamental throughput penalty. If generation consumes 60% of the step time (typical for autoregressive decoding of long sequences), the training hardware achieves only 40% utilization during an RLHF step. DPO, operating as a standard training loop, achieves the same 40--55% MFU as pre-training.
| **Metric** | **PPO (70B policy)** | **DPO (70B policy)** |
|:-------------------------------|:-------------------------|:-------------------------|
| **Models required** | 4 (policy, ref, reward, value) | 2 (policy, reference) |
| **Total parameter memory** | $\sim$2,406 GB | $\sim$1,260 GB |
| **Minimum GPUs (memory)** | 64--128 | 32--64 |
| **Generation phase** | Yes (sequential, BW-bound) | None |
| **Effective training MFU** | 15--25% | 40--55% |
| **Data pipeline** | Online generation | Static preference pairs |
| **Metric** | **PPO (70B policy)** | **DPO (70B policy)** |
|:---------------------------|:-------------------------------|:------------------------|
| **Models required** | 4 (policy, ref, reward, value) | 2 (policy, reference) |
| **Total parameter memory** | $\sim$2,406 GB | $\sim$1,260 GB |
| **Minimum GPUs (memory)** | 64--128 | 32--64 |
| **Generation phase** | Yes (sequential, BW-bound) | None |
| **Effective training MFU** | 15--25% | 40--55% |
| **Data pipeline** | Online generation | Static preference pairs |
**Systems Conclusion**: DPO halves the GPU requirement and doubles the effective compute utilization compared to PPO by eliminating two models and the autoregressive generation phase. The choice between them is not a modeling preference but an infrastructure constraint: organizations with limited GPU budgets adopt DPO because the alternative physically does not fit on their cluster.
:::
@@ -2580,7 +2578,7 @@ Production RLHF systems mitigate this through a combination of sequence packing
## Parallelism Strategy Comparison {#sec-distributed-training-systems-systems-parallelism-strategy-comparison-d92a}
If a colleague asks whether to implement pipeline or tensor parallelism for a new 50-billion parameter model, how do you systematically weigh the architectural and hardware trade-offs? @tbl-parallelism-compare contrasts data, model, pipeline, and hybrid parallelism across six critical dimensions.
To systematically weigh the architectural and hardware trade-offs for a new 50-billion parameter model, practitioners must compare data, model, pipeline, and hybrid parallelism across six critical dimensions. @tbl-parallelism-compare presents this comparison.
| **Aspect** | **Data Parallelism** | **Model Parallelism** | **Pipeline Parallelism** | **Hybrid Parallelism** |
|:----------------------------------|:-----------------------------------------------------------------|:---------------------------------------------------------------------------|:---------------------------------------------------------------------------|:------------------------------------------------------------------|
@@ -2692,7 +2690,7 @@ These primitives compose into the communication patterns that define each parall
## Fallacies and Pitfalls {#sec-distributed-training-systems-systems-fallacies-pitfalls-e2bc}
Why do so many engineering teams scale their cluster capacity by 4x, only to find that their training iterations are actually taking longer to complete? Distributed training involves counterintuitive behavior that leads to common misconceptions, capturing errors that waste compute resources and delay research.
Many engineering teams that scale cluster capacity by 4x find that their training iterations take longer to complete, reflecting a fundamental misunderstanding of distributed training physics. The following misconceptions capture errors that waste compute resources and delay research.
Fallacy: ***Linear speedup is achievable with sufficient engineering effort.***
@@ -2819,7 +2817,7 @@ Ultimately, the choice of parallelism is a **loop transformation** applied by th
\definecolor{DataColor}{RGB}{0,99,149} % BlueLine
\definecolor{TensorColor}{RGB}{204,85,0} % OrangeLine
\definecolor{PipeColor}{RGB}{0,143,69} % GreenLine
\tikzset{
face/.style={fill=white, opacity=0.8, draw=black!60, thick},
label/.style={font=\bfseries, align=center}
@@ -2840,7 +2838,7 @@ Ultimately, the choice of parallelism is a **loop transformation** applied by th
\draw[face, fill=DataColor!10] (0,0,0) -- (1,0,0) -- (1,1,0) -- (0,1,0) -- cycle;
\draw[face, fill=TensorColor!10] (1,0,0) -- (1,0,1) -- (1,1,1) -- (1,1,0) -- cycle;
\draw[face, fill=PipeColor!10] (0,1,0) -- (1,1,0) -- (1,1,1) -- (0,1,1) -- cycle;
\node at (0.5, 0.5, 0.5) {\tiny GPU};
\end{scope}
}


@@ -579,7 +579,7 @@ fig = plt.gcf()
Memory consumption during training is not static. It fluctuates dynamically, reaching a maximum during the backward pass when activations, gradients, and optimizer states must coexist. This peak memory usage determines whether a model can be trained on a device. For a 10M parameter model on a smartphone with 6 GB RAM, the 40 MB of FP32 weights might spike to over 200 MB during backpropagation as activations, gradients, and optimizer states accumulate---competing directly with the operating system and foreground applications for the device's limited memory. Techniques like gradient checkpointing mitigate this by discarding intermediate activations during the forward pass and recomputing them on-demand during the backward pass. This approach trades increased computation (20 to 30%) for a dramatic reduction in peak memory, often achieving 3 to 4 times reduction.
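The peak-memory arithmetic in this paragraph can be sketched as follows; the 60 MB activation footprint is an assumed value chosen to illustrate the spike, and the 3.5x saving reflects the 3 to 4 times range quoted above:

```python
params = 10e6
fp32_bytes = 4

weights_mb = params * fp32_bytes / 1e6   # 40 MB of FP32 weights
grads_mb = weights_mb                    # gradients mirror the weights
adam_mb = 2 * weights_mb                 # Adam's m and v moment buffers
activations_mb = 60                      # assumed activation footprint

# Peak during the backward pass, when all four coexist
peak_mb = weights_mb + grads_mb + adam_mb + activations_mb

# Gradient checkpointing recomputes activations on demand,
# shrinking that term at the cost of ~20-30% extra compute
ckpt_activations_mb = activations_mb / 3.5
```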
These amplifications explain why simply applying standard optimization techniques to training workloads is insufficient. Each constraint category shapes on-device learning system design, requiring approaches that build on but extend beyond inference-focused methods.
These amplifications explain why standard optimization techniques fail when applied to training workloads without modification. Each constraint category shapes on-device learning system design, requiring approaches that build on but extend beyond inference-focused methods.
@fig-ondevice-pretraining illustrates how the complete training pipeline combines offline pre-training with online adaptive learning on resource-constrained IoT devices. The system first undergoes meta-training with generic data. During deployment, device-specific constraints such as data availability, compute, and memory shape the adaptation strategy by ranking and selecting layers and channels to update. This allows efficient on-device learning within limited resource envelopes.
@@ -999,7 +999,7 @@ Each solution pillar extends model optimization principles and distributed syste
## Model Adaptation {#sec-edge-intelligence-model-adaptation-6a82}
If updating the full weights of a billion-parameter model requires 12 GB of VRAM, how can we possibly personalize that model on a smartwatch with only 500 MB of total system memory? The answer is that we do not update the whole model. We freeze the vast majority of the network and only train a tiny, strategically placed set of new parameters. This is the essence of resource-efficient model adaptation.
Personalizing a billion-parameter model on a smartwatch with 500 MB of total system memory is impossible if every weight must be updated. Instead, the vast majority of the network is frozen and only a tiny, strategically placed set of new parameters is trained. This is the essence of resource-efficient model adaptation.
The engineering challenge centers on navigating a fundamental trade-off space: adaptation expressivity versus resource consumption. At one extreme, updating all parameters provides maximum flexibility but exceeds edge device capabilities. At the other extreme, no adaptation preserves resources but fails to capture user-specific patterns. Effective on-device learning systems must operate in the middle ground, selecting adaptation strategies based on three key engineering criteria.
@@ -1530,7 +1530,7 @@ Task-adaptive sparse updates introduce several important system-level considerat
Second, the stability of the adaptation process becomes important when working with sparse updates. If too few parameters are selected for updating, the model may underfit the target distribution, failing to capture important local variations. This suggests the need for careful validation of the selected parameter subset before deployment, potentially incorporating minimum thresholds for adaptation capacity.
Third, the selection of updatable parameters must account for hardware-specific characteristics of the target platform. Beyond just considering gradient magnitudes, the system must evaluate the actual execution cost of updating specific layers on the deployed hardware. Some parameters might show high contribution scores but prove expensive to update on certain architectures, requiring a more nuanced selection strategy that balances statistical utility with runtime efficiency.
Third, the selection of updatable parameters must account for hardware-specific characteristics of the target platform. Beyond considering gradient magnitudes alone, the system must evaluate the actual execution cost of updating specific layers on the deployed hardware. Some parameters might show high contribution scores but prove expensive to update on certain architectures, requiring a more nuanced selection strategy that balances statistical utility with runtime efficiency.
Despite these tradeoffs, task-adaptive sparse updates provide a practical mechanism to scale adaptation to diverse deployment contexts, from microcontrollers to mobile devices [@diao2023sparse].
@@ -2249,7 +2249,7 @@ These mathematical aggregation protocols prove that decentralized learning is th
## Federated Learning: Systems at Scale {#sec-edge-intelligence-federated-systems}
In a textbook algorithm, all 100 edge devices compute their updates perfectly and report back to the server exactly at the same time. In reality, a federated learning round spanning a million smartphones must deal with devices overheating, losing Wi-Fi, entering power-saving mode, or simply turning off. Building federated systems at scale means engineering an orchestrator that embraces chaos, asynchronous dropouts, and extreme stragglers as standard operating conditions.
In a textbook algorithm, all 100 edge devices compute their updates perfectly and report back to the server exactly at the same time. In reality, a federated learning round spanning a million smartphones must deal with devices overheating, losing Wi-Fi, entering power-saving mode, or powering off. Building federated systems at scale means engineering an orchestrator that embraces chaos, asynchronous dropouts, and extreme stragglers as standard operating conditions.
### Client Scheduling {#sec-edge-intelligence-client-scheduling-f675}


@@ -301,7 +301,7 @@ The formula $\tau_{\text{opt}} = \sqrt{2 \cdot T_{\text{write}} \cdot \text{\tex
2. **Cluster Success Probability ($P_{cluster}$)**: $P_{cluster} = (P_{1h})^{10,000}$.
3. **Calculation**: $0.99988^{10000} \approx \mathbf{0.30}$.
**The Systems Conclusion**: Even with 99.99% reliable hardware, a 10k GPU cluster has only a **30% chance** of surviving a single hour without a failure. Relying on hardware reliability is mathematically impossible at scale. You *must* handle failure in software.
**The Systems Conclusion**: Even with 99.99% reliable hardware, a 10k GPU cluster has only a **30% chance** of surviving a single hour without a failure. Relying on hardware reliability alone is mathematically untenable at scale; software must handle failures automatically.
:::
This linear relationship between component count and failure rate has a profound implication: *scale transforms failure* from a rare event into a continuous condition.
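Both calculations, the cluster survival probability and the optimal checkpoint interval from Young's approximation $\sqrt{2 \cdot T_{\text{write}} \cdot \text{MTBF}}$, can be sketched directly; the two-minute checkpoint write and four-hour cluster MTBF are assumed values:

```python
import math

p_gpu_1h = 0.99988   # per-GPU one-hour survival probability
n_gpus = 10_000

# Independent failures compound multiplicatively across the fleet
p_cluster = p_gpu_1h ** n_gpus   # ~0.30

# Young's approximation for the checkpoint interval
t_write_s = 120        # assumed checkpoint write time
mtbf_s = 4 * 3600      # assumed cluster-level MTBF
tau_opt_s = math.sqrt(2 * t_write_s * mtbf_s)   # ~31 minutes
```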


@@ -190,7 +190,7 @@ This chapter examines the algorithms and trade-offs for ML cluster orchestration
### The Scheduler's Challenge {#sec-fleet-orchestration-schedulers-challenge}
Why is cluster scheduling so hard? After all, operating systems have scheduled processes on CPUs for decades. The answer lies in a qualitative shift that occurs at fleet scale: the scheduling problem transitions from an optimization challenge (finding good assignments) to a distributed systems challenge (maintaining consistent state and coordinating decisions across thousands of machines). Before examining specific scheduling systems, understanding why cluster scheduling is intrinsically hard clarifies the design constraints that every solution must navigate.
Cluster scheduling is fundamentally harder than single-machine process scheduling because the problem transitions from an optimization challenge to a distributed systems challenge of maintaining consistent state across thousands of machines. Operating systems have scheduled processes on CPUs for decades, yet the qualitative shift at fleet scale invalidates the assumptions that make single-machine scheduling tractable. Understanding why cluster scheduling is intrinsically hard clarifies the design constraints that every solution must navigate.
### Distributed Scheduling Complexity {#sec-fleet-orchestration-distributed-scheduling}
@@ -374,15 +374,15 @@ fig = plt.gcf()
\node[anchor=west, crimson] at (0, 3.5) {\textbf{A. Independent Scheduling (Deadlock)}};
\draw (0, 0) -- (8, 0) node[right] {Time};
\draw (0, 0) -- (0, 2.5) node[above] {Cluster Nodes};
% Job A holds some, waits for some
\fill[JobAColor!40, draw=black!60] (0, 1.25) rectangle (4, 2.2) node[midway] {Job A (Hold)};
\draw[JobAColor, dashed, ->] (4, 1.7) -- (6, 1.7) node[right] {Wait};
% Job B holds the rest, waits for Job A's
\fill[JobBColor!40, draw=black!60] (1, 0.2) rectangle (5, 1.15) node[midway] {Job B (Hold)};
\draw[JobBColor, dashed, ->] (5, 0.7) -- (7, 0.7) node[right] {Wait};
\node[deadlock] at (4.5, 1.2) {\textbf{DEADLOCK}};
\end{scope}
@@ -391,16 +391,16 @@ fig = plt.gcf()
\node[anchor=west, crimson] at (0, 3.5) {\textbf{B. Gang Scheduling (Atomic)}};
\draw (0, 0) -- (8, 0) node[right] {Time};
\draw (0, 0) -- (0, 2.5) node[above] {Cluster Nodes};
% Job A gets everything at once
\fill[JobAColor!60, draw=black!60] (0, 0.2) rectangle (3, 2.2) node[midway, white] {Job A (All)};
% Job B waits entirely in queue
\draw[thick, gray, latex-] (3.2, 1.2) -- (4.2, 1.2) node[right, black] {Queue};
% Job B starts after A finishes
\fill[JobBColor!60, draw=black!60] (4.5, 0.2) rectangle (7.5, 2.2) node[midway, white] {Job B (All)};
\node[text=GreenLine] at (3.75, -0.5) {\textbf{SUCCESS}};
\end{scope}
\end{tikzpicture}
@@ -454,7 +454,7 @@ The tension between gang scheduling's safety guarantees and backfill's utilizati
## Orchestration Paradigms {#sec-fleet-orchestration-paradigms}
When a researcher needs 64 GPUs, should they submit a script detailing exactly how many nodes they want (imperative), or should they declare their desired state and let a control loop figure out the provisioning (declarative)? Two dominant philosophies have emerged for cluster orchestration, the HPC-oriented Slurm and the cloud-native Kubernetes, each reflecting the culture and requirements of its origin domain.
Researchers face a choice between two paradigms: submitting a script specifying exact resource needs (imperative), or declaring desired state and letting a control loop determine provisioning (declarative). Two dominant philosophies have emerged for cluster orchestration, the HPC-oriented Slurm and the cloud-native Kubernetes, each reflecting the culture and requirements of its origin domain.
The fundamental distinction is between *imperative* and *declarative* resource management. In an imperative system, the user specifies exactly what resources they need and the system allocates them directly. In a declarative system, the user specifies what they want running and the system figures out how to make it happen. This distinction, familiar from programming language design, has profound consequences for how ML workloads are scheduled, monitored, and recovered from failure.
@@ -1398,7 +1398,7 @@ Optimizing isolation for a single deployment is straightforward, but maintaining
## Multi-Tenancy and Quotas {#sec-fleet-orchestration-multi-tenancy}
What happens when the computer vision team hoards 500 GPUs for a week "just in case" they need to run an experiment, while the NLP team sits completely blocked trying to push a critical bug fix to production? Without strict governance, shared clusters devolve into a tragedy of the commons. Multi-tenancy and quotas provide the organizational and technical firewalls needed to ensure fair access and high overall utilization.
When the computer vision team hoards 500 GPUs for a week while the NLP team sits blocked trying to push a critical bug fix to production, shared clusters devolve into a tragedy of the commons. Without strict governance, this pattern repeats across every organization with shared infrastructure. Multi-tenancy and quotas provide the organizational and technical firewalls needed to ensure fair access and high overall utilization.
Without explicit resource management policies, resource allocation degrades to a tragedy of the commons. Teams with the most aggressive job submission rates, the largest jobs, or the most persistent resubmission scripts consume disproportionate resources, while teams with more modest or intermittent needs find the cluster perpetually occupied. This creates organizational friction, political escalation to management, and underinvestment in slower-moving but potentially higher-value projects whose teams lack the engineering effort to compete for resources. The multi-tenancy and quota systems discussed in this section prevent this degradation by establishing formal policies for resource allocation, borrowing, and reclamation.
@@ -1428,7 +1428,7 @@ The appropriate over-commitment ratio depends on workload characteristics and re
### Resource Accounting and Observability {#sec-fleet-orchestration-resource-accounting}
Who pays for the 64 GPUs that sat idle for a weekend because a researcher forgot to cancel a reservation? In a dedicated cluster, the waste is visible: the lights are on, but the fans are quiet. In a multi-tenant cloud fleet, this waste is invisible -- and expensive. Effective governance requires moving beyond simple "allocation" metrics to a tiered accounting model that distinguishes between what was requested, what was used, and what was useful.
The 64 GPUs that sat idle for a weekend because a researcher forgot to cancel a reservation represent invisible waste in a multi-tenant cloud fleet. In a dedicated cluster, such waste is at least visible: the lights are on, but the fans are quiet. In a shared environment, the cost is hidden -- and compounds across thousands of jobs. Effective governance requires moving beyond simple "allocation" metrics to a tiered accounting model that distinguishes between what was requested, what was used, and what was useful.
Three tiers of measurement expose this gap. Allocated capacity measures what the scheduler has reserved for a job -- the resources that are unavailable to others. Compute utilization measures the percentage of time the GPU kernels are active. Productive utilization measures the fraction of time the GPU is advancing the model state, excluding data loading pauses, communication overhead, and checkpointing. The distinction is financial, not just technical. A team might be "allocated" 500 GPUs but only using 350 (70 percent compute utilization) and only "productively using" 280 (56 percent of allocation). If the organization pays for allocation but measures success by training progress, this 44 percent gap represents pure burn.
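The three-tier arithmetic can be expressed as a short accounting function. This is a sketch using the illustrative numbers from the paragraph above; the field names are hypothetical, not from any real accounting system.

```python
def utilization_report(allocated: int, active: int, productive: int) -> dict:
    """Three-tier accounting: what was reserved, what ran, and what
    actually advanced the model state."""
    return {
        "compute_utilization": active / allocated,        # kernels busy
        "productive_utilization": productive / allocated,  # model advancing
        "burn_fraction": 1 - productive / allocated,       # paid-for, unproductive
    }

# The text's example: 500 GPUs allocated, 350 active, 280 productive.
report = utilization_report(allocated=500, active=350, productive=280)
```

On these inputs the function reproduces the 70 percent compute utilization, 56 percent productive utilization, and 44 percent burn described above.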

View File

@@ -320,7 +320,7 @@ The transition from single-machine to distributed training introduces qualitativ
:::
As systems scale beyond a single node, a fundamental physical constraint emerges: the *bisection bandwidth wall*, which limits how fast data can cross the network midpoint. This constraint explains why networking, not just compute, often determines the "speed" of your model.
As systems scale beyond a single node, a fundamental physical constraint emerges: the *bisection bandwidth wall*, which limits how fast data can cross the network midpoint. This constraint explains why networking, not just compute, often determines model throughput.
On a single GPU, training proceeds deterministically: the same code, data, and random seed produce identical results. At the scale of thousands of GPUs, new phenomena emerge. Network partitions can split clusters into groups that train independently, causing model divergence. Stragglers (workers that process data slower than peers due to hardware variation or thermal throttling) can bottleneck entire training runs. Hardware failures that occur once per machine-year become daily events when operating 10,000 machines[^fn-failure-rates-fleet]. Systems must checkpoint frequently enough that losing a day's progress becomes acceptable rather than catastrophic.
@@ -370,7 +370,7 @@ These scale-induced challenges drive infrastructure investment by the largest AI
## A Breed Apart: The ML Workload Character {#sec-vol2-introduction-breed-apart}
*Why* can we not simply use existing distributed systems like Apache Spark or standard web microservices to run the Machine Learning Fleet? While the underlying hardware—network, compute, storage—is identical, the **workload characteristics** of ML systems are fundamentally different from traditional distributed systems.
Existing distributed systems like Apache Spark and standard web microservices cannot run the Machine Learning Fleet because the **workload characteristics** of ML systems are fundamentally different from traditional distributed systems, even though the underlying hardware—network, compute, storage—is identical.
### Traditional vs. ML Fleet Dynamics
@@ -509,7 +509,7 @@ gpt3_training_ops_sci = Gpt3StatsScenario.gpt3_training_ops_sci
a100_fp16_tflops = Gpt3StatsScenario.a100_fp16_tflops
```
GPT-3 (2020) expanded to `{python} gpt3_params_b` billion parameters and demonstrated sophisticated text generation across diverse domains. Each increase in model size brought dramatically improved capabilities, but at exponentially increasing costs.
GPT-3 (2020) expanded to `{python} gpt3_params_b` billion parameters and demonstrated sophisticated text generation across diverse domains. Each increase in model size brought substantially improved capabilities, but at exponentially increasing costs.
This pattern extends beyond language models. In computer vision, doubling neural network size typically yields consistent accuracy gains when training data increases proportionally. AlexNet (2012) had 60 million parameters, VGG-16 (2014) scaled to 138 million, and large modern vision transformers can exceed 600 million parameters. Each generation achieved better image recognition accuracy but required proportionally more computational resources and training data.
@@ -933,7 +933,7 @@ The transition from single-machine ML to the Machine Learning Fleet introduces c
\index{Reliability Gap}
Machine learning systems already face the **Verification Gap**: the impossibility of testing a high-dimensional model against every possible input. At fleet scale, a more physical challenge emerges: the **Reliability Gap**\index{Reliability Gap!definition}.
In traditional software, we treat hardware as a reliable abstraction. A single server typically has an **Availability** of "four nines" (99.99%), meaning it fails for only a few minutes a year. The Machine Learning Fleet, however, operates at a scale where this abstraction collapses. *When* you coordinate 25,000 GPUs, the probability that the entire system is healthy ($P_{\text{fleet}}$) is the product of the individual probabilities:
In traditional software, we treat hardware as a reliable abstraction. A single server typically has an **Availability** of "four nines" (99.99%), meaning it fails for only a few minutes a year. The Machine Learning Fleet, however, operates at a scale where this abstraction collapses. *When* a fleet coordinates 25,000 GPUs, the probability that the entire system is healthy ($P_{\text{fleet}}$) is the product of the individual probabilities:
$$ P_{\text{fleet}} = (P_{\text{node}})^N $$ {#eq-reliability-gap}
@@ -980,7 +980,7 @@ prob_1k_str = ReliabilityGap.prob_1k_str
prob_10k_str = ReliabilityGap.prob_10k_str
```
If each node in your cluster is `{python} "99.9%"` reliable, a 1,000-node cluster is healthy only **`{python} prob_1k_str`%** of the time. Scale that to a 10,000-node fleet, and the probability of the entire system being healthy at any given second drops to **`{python} prob_10k_str`%**.
If each node in a cluster is `{python} "99.9%"` reliable, a 1,000-node cluster is healthy only **`{python} prob_1k_str`%** of the time. Scale that to a 10,000-node fleet, and the probability of the entire system being healthy at any given second drops to **`{python} prob_10k_str`%**.
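This calculation can be reproduced in a few lines. The sketch below mirrors the chapter's notebook values using @eq-reliability-gap directly.

```python
def fleet_health_probability(p_node: float, n_nodes: int) -> float:
    """P_fleet = (P_node)^N: the probability that every node in the
    fleet is healthy at the same instant."""
    return p_node ** n_nodes

# 99.9%-reliable nodes: a 1,000-node cluster is fully healthy ~36.8%
# of the time; a 10,000-node fleet only ~0.0045% of the time.
p_1k = fleet_health_probability(0.999, 1_000)
p_10k = fleet_health_probability(0.999, 10_000)
```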
The engineering lesson: **Failure is the common case.** At scale, we stop trying to prevent failure and start engineering for **Resilience**. We trade *uptime* for *recovery speed*. This shift in mindset—from "How do I keep it running?" to "How do I ensure it self-heals?"—is the defining challenge of @sec-fault-tolerance-reliability.
@@ -994,7 +994,7 @@ We define **Communication Intensity ($CI$)** as the ratio of data moved across t
$$ CI = \frac{\text{Bytes Transferred (Network)}}{\text{FLOPs Executed (Local)}} $$ {#eq-ci-ratio}
* **Low CI (< 0.01)**: The workload is **Compute-Heavy**. The GPUs spend most of their time doing math. Scaling is easy.
* **High CI (> 0.1)**: The workload is **Network-Bound**. The system is limited by bisection bandwidth. Adding more GPUs may actually *slow down* the training.
* **High CI (> 0.1)**: The workload is **Network-Bound**. The system is limited by bisection bandwidth. Adding more GPUs may *slow down* the training.
Every optimization in this volume—from **Gradient Sparsification** to **3D Parallelism**—is an attempt to lower the CI ratio so that the Machine Learning Fleet acts as a single, massive computer rather than a collection of idling processors waiting for the wire.
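A minimal classifier for this ratio follows directly from @eq-ci-ratio and the regimes listed above. The thresholds are the chapter's rough guidelines, not hard constants, and the byte and FLOP counts in the usage comment are illustrative.

```python
def communication_intensity(bytes_transferred: float, flops_executed: float) -> str:
    """CI = network bytes transferred / local FLOPs executed."""
    ci = bytes_transferred / flops_executed
    if ci < 0.01:
        return f"CI={ci:.4f}: compute-heavy, scales easily"
    if ci > 0.1:
        return f"CI={ci:.4f}: network-bound, limited by bisection bandwidth"
    return f"CI={ci:.4f}: mixed regime"

# Illustrative step: 1 GB of gradients exchanged per 1 TFLOP of local work.
regime = communication_intensity(bytes_transferred=1e9, flops_executed=1e12)
```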
@@ -1012,7 +1012,7 @@ Scale forces distribution: no single machine provides the compute required for f
#### The CAP Theorem Reality
The **CAP Theorem**[^fn-cap-theorem-ml]\index{CAP Theorem} establishes that distributed systems can provide at most two of three properties: **Consistency**, **Availability**, and **Partition Tolerance**. *How* does this apply to ML?
The **CAP Theorem**[^fn-cap-theorem-ml]\index{CAP Theorem} establishes that distributed systems can provide at most two of three properties: **Consistency**, **Availability**, and **Partition Tolerance**. The ML fleet encounters all three constraints, forcing explicit trade-offs.
[^fn-cap-theorem-ml]: **CAP Theorem** (Brewer's Theorem): Formulated by Eric Brewer in 2000, the CAP theorem proves that in the face of a network partition (P), a system must choose between absolute Consistency (C) and high Availability (A). For the ML Fleet, this manifests as a choice between **Synchronous Training** (Consistency prioritized; the job stalls if a node fails) and **Asynchronous Training** (Availability prioritized; training continues but with stale gradients). \index{CAP Theorem!ML trade-off}
@@ -1020,7 +1020,7 @@ Distributed ML systems make different trade-offs depending on their requirements
#### Edge Distribution Complexity
The coordination challenges discussed so far assume datacenter distribution, where machines run in managed facilities. *How* do these challenges change at the network edge?
The coordination challenges discussed so far assume datacenter distribution, where machines run in managed facilities. At the network edge, these challenges intensify along every dimension.
Edge distribution amplifies every challenge. Billions of smartphones and IoT devices operate in uncontrolled environments with unreliable connectivity and limited power. *When* raw data cannot leave the device for privacy reasons, federated learning becomes mandatory. *When* connectivity is intermittent, the system must tolerate asynchronous updates spanning days. These constraints require architectural approaches—Differential Privacy, On-Device Inference, and Model Compression—that differ fundamentally from datacenter ML.
@@ -1036,7 +1036,7 @@ ML systems face unique security threats that intensify at production scale. **Mo
#### The Regulatory Wall
Systems operating at scale inevitably attract regulatory attention. From the **EU AI Act** to local privacy laws, the Machine Learning Fleet must be architected for **Auditability**. *How* do you prove that a model trained on 10,000 GPUs did not ingest prohibited data? *How* do you generate a human-interpretable explanation for a sub-millisecond recommendation? Meeting these requirements demands technical capabilities—audit trails, bias testing, and consent management—that must be designed into the infrastructure from day one.
Systems operating at scale inevitably attract regulatory attention. From the **EU AI Act** to local privacy laws, the Machine Learning Fleet must be architected for **Auditability**. Proving that a model trained on 10,000 GPUs did not ingest prohibited data requires end-to-end data lineage tracking. Generating a human-interpretable explanation for a sub-millisecond recommendation requires explainability techniques built into the serving architecture. Meeting these requirements demands technical capabilities—audit trails, bias testing, and consent management—that must be designed into the infrastructure from day one.
#### Responsibility as an Invariant
@@ -1056,7 +1056,7 @@ When we scale from one machine to a fleet, three new fundamental resources compe
2. **Communication ($C_2$ - The Wire)**\index{C$^3$ Taxonomy!communication}: The movement of data across the network fabric. This is governed by bisection bandwidth and the speed of light. At scale, communication becomes the primary system bottleneck, often taking more time than the math itself.
3. **Coordination ($C_3$ - The Logic)**\index{C$^3$ Taxonomy!coordination}: The synchronous management of state across thousands of nodes. This is the domain of collective algorithms (All-Reduce), fault tolerance, and distributed consensus. Coordination is the "Software Tax" that determines how efficiently $N$ independent nodes can act as a single computer.
These three dimensions form the **Triad of Distributed Efficiency**. If your fleet spends too much time on *Communication* or *Coordination*, the expensive *Computation* capacity sits idle. The central challenge here is engineering the fleet to minimize the "C$^3$ Gap"—the difference between theoretical hardware peak and actual distributed throughput.
These three dimensions form the **Triad of Distributed Efficiency**. If the fleet spends too much time on *Communication* or *Coordination*, the expensive *Computation* capacity sits idle. The central challenge here is engineering the fleet to minimize the "C$^3$ Gap"—the difference between theoretical hardware peak and actual distributed throughput.
### The Fleet Law {#sec-vol2-introduction-fleet-law}
@@ -1117,11 +1117,11 @@ where:
2. **Communication Overhead ($T_{\text{comm}}(N)$)**: The "Coordination Tax." As $N$ grows, the time spent synchronizing gradients or activations ($T_{\text{comm}}$) grows or stays constant, threatening to dominate the step time.
3. **Communication Hiding ($T_{\text{overlap}}$)**: The "Efficiency Gain." Advanced systems overlap communication with the computation of the next layer to hide the coordination tax.
The **Scaling Efficiency** ($\eta_{\text{scale}}$) of the fleet is simply the ratio of ideal parallel compute to the actual step time:
The **Scaling Efficiency** ($\eta_{\text{scale}}$) of the fleet is the ratio of ideal parallel compute to the actual step time:
$$ \eta_{\text{scale}} = \frac{T_{\text{compute}}}{N \times T_{\text{step}}} $$
::: {.callout-perspective title="Amdahl's Distributed Pitfall"}
The Iron Law of Scale is a specialized form of **Amdahl's Law**\index{Amdahl's Law!distributed}. It tells us that the maximum speedup of a distributed system is limited by its most tightly coupled component—usually the network synchronization. If your model spends 20% of its time waiting for the network ($T_{\text{comm}}$), no amount of faster GPUs can ever make it more than 5$\times$ faster, regardless of how many you add. Scale is limited by **coordination**, not just by **calculation**.
The Iron Law of Scale is a specialized form of **Amdahl's Law**\index{Amdahl's Law!distributed}. It tells us that the maximum speedup of a distributed system is limited by its most tightly coupled component—usually the network synchronization. If a model spends 20% of its time waiting for the network ($T_{\text{comm}}$), no amount of faster GPUs can ever make it more than 5$\times$ faster, regardless of how many you add. Scale is limited by **coordination**, not just by **calculation**.
:::
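These definitions can be sketched numerically. The additive step-time form below, $T_{\text{step}} = T_{\text{compute}}/N + T_{\text{comm}} - T_{\text{overlap}}$, is an assumption inferred from the three terms listed above; the chapter's notebook remains the authoritative version.

```python
def step_time(t_compute_total: float, n: int,
              t_comm: float, t_overlap: float) -> float:
    """Assumed additive Iron Law: per-worker compute plus the
    communication that overlap failed to hide."""
    return t_compute_total / n + t_comm - t_overlap

def scaling_efficiency(t_compute_total: float, n: int,
                       t_comm: float, t_overlap: float) -> float:
    """eta_scale = T_compute / (N * T_step), as defined above."""
    return t_compute_total / (n * step_time(t_compute_total, n, t_comm, t_overlap))

# With zero communication cost, efficiency is perfect (1.0); un-hidden
# communication that is 20% of the step drags efficiency to 0.8.
eta_ideal = scaling_efficiency(100.0, 10, t_comm=0.0, t_overlap=0.0)
eta_taxed = scaling_efficiency(100.0, 10, t_comm=2.5, t_overlap=0.0)
```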
The following notebook applies this law to a real-world cluster to calculate the "Coordination Tax" of a GPT-3 training run.
@@ -1410,7 +1410,7 @@ The **Six Systems Engineering Principles**\index{Six Systems Engineering Princip
5. *Design Cost-Consciously*: Scale makes efficiency improvements worth millions of dollars.
6. *Co-Design for Hardware*: Distributed hardware introduces network topology and storage hierarchy as primary co-design considerations.
These frameworks assume familiarity with single-machine ML systems: how models are trained, optimized, and deployed on individual devices. This textbook teaches you to **scale, distribute, and govern** them across the global Machine Learning Fleet.
These frameworks assume familiarity with single-machine ML systems: how models are trained, optimized, and deployed on individual devices. This textbook teaches engineers to **scale, distribute, and govern** them across the global Machine Learning Fleet.
### Three Systems Archetypes {#sec-vol2-introduction-archetypes}
@@ -1458,7 +1458,7 @@ This textbook organizes around the **Fleet Stack**\index{Fleet Stack!textbook st
### Part III: Deployment at Scale (The Serving Pipeline)
**The Impediment**: *Operational Economics*. Inference costs eventually dwarf training costs, requiring a fundamental shift from throughput to latency optimization.
* **Inference at Scale (@sec-inference-scale)**: The interface. Serving models to millions of users simultaneously while managing the "KV Cache Wall."
* **Performance Engineering (@sec-performance-engineering)**: The efficiency frontier. Closing the gap between hardware peak and actual throughput through kernel fusion and compilation. Innovations like **FlashAttention**\index{FlashAttention} [@dao2022flashattention] demonstrate how I/O-aware kernels can dramatically improve performance on memory-bound workloads.
* **Performance Engineering (@sec-performance-engineering)**: The efficiency frontier. Closing the gap between hardware peak and actual throughput through kernel fusion and compilation. Innovations like **FlashAttention**\index{FlashAttention} [@dao2022flashattention] demonstrate how I/O-aware kernels can improve performance 24$\times$ on memory-bound workloads.
* **Edge Intelligence (@sec-edge-intelligence)**: The frontier. Moving intelligence from the datacenter to the user's device, constrained by milliwatt power budgets.
* **Operations at Scale (@sec-ops-scale)**: The control plane. Monitoring the fleet's health, drift, and performance across global deployments.

View File

@@ -72,7 +72,7 @@ We are now at the **Management Layer** of the Fleet Stack. While Parts I and II
## From Single-Model to Platform Operations {#sec-ml-operations-scale-singlemodel-platform-operations-db8e}
Consider a team of five engineers maintaining a single recommendation model. When the model drifts, they manually retrain it. When the API latency spikes, they manually scale the instances. Now, scale that same team to support five hundred models across dozens of product surfaces. Manual intervention is no longer just inefficient; it is mathematically impossible. The transition from single-model to platform operations is fundamentally about replacing human-in-the-loop maintenance with automated, systemic governance.
Consider a team of five engineers maintaining a single recommendation model. When the model drifts, they manually retrain it. When the API latency spikes, they manually scale the instances. Now, scale that same team to support five hundred models across dozens of product surfaces. Manual intervention is no longer merely inefficient; it is mathematically impossible. The transition from single-model to platform operations is fundamentally about replacing human-in-the-loop maintenance with automated, systemic governance.
@sec-inference-scale established the distributed serving architectures that handle massive request volumes, and @sec-edge-intelligence pushed deployment to its physical limits — smartphones, microcontrollers, and federated fleets spanning billions of heterogeneous devices. The question is what happens when organizations must sustain not one but hundreds of such systems across this entire spectrum. The transition from managing individual models to operating enterprise-scale ML platforms represents a fundamental shift in operational complexity.
@@ -508,13 +508,13 @@ Effective platforms present high-level dashboards by default and enable investig
Beyond scale considerations, different model types require fundamentally different operational patterns. The practices appropriate for deploying a large language model are entirely inappropriate for a fraud detection system, and vice versa. Examine @tbl-ops-scale-model-types: LLMs demand staged rollouts over days to weeks with hours-long rollback windows, while fraud detection requires hourly updates with seconds-fast rollback to address adversarial dynamics.
| **Model Type** | **Update Frequency** | **Deployment Pattern** | **Primary Risk** | **Rollback Speed** |
|:----------------------------|:---------------------|:----------------------------|:---------------------------|:-------------------|
| **Model Type** | **Update Frequency** | **Deployment Pattern** | **Primary Risk** | **Rollback Speed** |
|:----------------------------------|:---------------------|:----------------------------|:---------------------------|:-------------------|
| **Archetype A (GPT-4 / Llama-3)** | Monthly to quarterly | Staged, careful | Quality regression, safety | Hours to days |
| **Archetype B (DLRM at Scale)** | Daily to weekly | Shadow, interleaving | Engagement drop | Minutes |
| **Fraud Detection** | Hourly to daily | Rapid with instant rollback | False negatives | Seconds |
| **Vision (Classification)** | Weekly to monthly | Canary | Accuracy regression | Minutes |
| **Search Ranking** | Daily | A/B with holdout | Relevance degradation | Minutes |
| **Fraud Detection** | Hourly to daily | Rapid with instant rollback | False negatives | Seconds |
| **Vision (Classification)** | Weekly to monthly | Canary | Accuracy regression | Minutes |
| **Search Ranking** | Daily | A/B with holdout | Relevance degradation | Minutes |
: **Model-Type Operational Requirements**: Update frequency, deployment patterns, and rollback speeds vary by model type due to differing risk profiles. LLMs require monthly staged rollouts with hours-to-days rollback due to quality regression risks, while fraud detection demands hourly updates with seconds-fast rollback to counter adversarial dynamics. {#tbl-ops-scale-model-types}
@@ -813,10 +813,10 @@ Deploying ensemble updates requires coordination that single-model deployments d
| **Deployment Stage** | **Actions** | **Duration** | **Rollback Trigger** |
|:---------------------|:---------------------------------------------------------------------|-------------:|:---------------------------------------|
| **Shadow** | New model scores alongside production, results logged but not served | 24--48 hours | Quality metrics below threshold |
| **Canary** | 1% traffic receives new model results | 4--8 hours | Statistical significance of regression |
| **Staged Rollout** | 5% → 25% → 50% → 100% | 24--72 hours | Business metric degradation |
| **Soak** | Full traffic, extended monitoring | 7--14 days | Delayed effects emerge |
| **Shadow** | New model scores alongside production, results logged but not served | 24--48 hours | Quality metrics below threshold |
| **Canary** | 1% traffic receives new model results | 4--8 hours | Statistical significance of regression |
| **Staged Rollout** | 5% → 25% → 50% → 100% | 24--72 hours | Business metric degradation |
| **Soak** | Full traffic, extended monitoring | 7--14 days | Delayed effects emerge |
: **Staged Ensemble Deployment Pattern**: Four-phase rollout for updating recommendation ensemble components. Shadow deployment (24--48 hours) validates behavior without user impact, canary (1% traffic, 4--8 hours) enables statistical regression detection, staged rollout progressively increases exposure, and soak period (7--14 days) catches delayed interaction effects. {#tbl-ops-scale-ensemble-deploy}
@@ -1152,10 +1152,10 @@ Production models must meet latency requirements. Validation should measure infe
| **Model Type** | **p50 Target** | **p99 Target** | **Gate Action if Exceeded** |
|:--------------------|---------------:|---------------:|:---------------------------------------|
| **LLM** | 500 ms | 2000 ms | Block deployment, require optimization |
| **Recommendation** | 10 ms | 50 ms | Block deployment |
| **Fraud Detection** | 5 ms | 20 ms | Block deployment, high priority |
| **Vision** | 50 ms | 200 ms | Warning, conditional approval |
| **LLM** | 500 ms | 2000 ms | Block deployment, require optimization |
| **Recommendation** | 10 ms | 50 ms | Block deployment |
| **Fraud Detection** | 5 ms | 20 ms | Block deployment, high priority |
| **Vision** | 50 ms | 200 ms | Warning, conditional approval |
: **Latency Gate Thresholds by Model Type**: Production latency requirements (p50 and p99) and gate actions when thresholds are exceeded. Fraud detection enforces the strictest requirements (5 ms p50, 20 ms p99) with high-priority blocking, reflecting the real-time nature of transaction processing. LLMs accept broader bounds (500 ms p50, 2000 ms p99) while requiring optimization before deployment approval. {#tbl-ops-scale-latency-gates}
@@ -1175,7 +1175,7 @@ $$\text{Equalized Odds: } |P(\hat{Y}=1|Y=y, A=a) - P(\hat{Y}=1|Y=y, A=b)| < \eps
where $A$ represents the protected attribute, $\hat{Y}$ is the model prediction, and $Y$ is the true outcome.
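This metric can be computed directly from logged predictions. The sketch below is illustrative; the function name and call shape are hypothetical, and it evaluates the gap for a single outcome slice $Y = y$.

```python
def equalized_odds_gap(y_true, y_pred, group, y_value, group_a, group_b):
    """|P(yhat=1 | Y=y_value, A=group_a) - P(yhat=1 | Y=y_value, A=group_b)|."""
    def positive_rate(g):
        idx = [i for i in range(len(y_true))
               if y_true[i] == y_value and group[i] == g]
        return sum(y_pred[i] for i in idx) / len(idx)
    return abs(positive_rate(group_a) - positive_rate(group_b))
```

A gate would compare this gap against $\epsilon$ for both $y = 0$ and $y = 1$ and for every pair of protected-attribute values.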
Fairness gates should evaluate multiple fairness definitions since different contexts require different definitions, compare against historical baselines rather than just thresholds, flag improvements as well as regressions for review, and integrate with human review for borderline cases.
Fairness gates should evaluate multiple fairness definitions since different contexts require different definitions, compare against historical baselines rather than thresholds alone, flag improvements as well as regressions for review, and integrate with human review for borderline cases.
#### Data Quality Gates
@@ -1609,7 +1609,7 @@ For 20 tests: $\alpha' = 1 - 0.95^{1/20} = 0.00256$, slightly more lenient than
$$p_{(i)} \leq \frac{i}{k} \times \alpha$$
Reject all hypotheses $H_{(1)}, \ldots, H_{(i)}$. This procedure is more powerful than Bonferroni when running many tests.
Reject all hypotheses $H_{(1)}, \ldots, H_{(i)}$. This procedure yields higher statistical power than Bonferroni when running many tests.
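The step-up rule, commonly known as the Benjamini-Hochberg procedure, translates directly into code. A minimal sketch:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of rejected hypotheses under the step-up rule
    p_(i) <= (i/k) * alpha, where i is the 1-based rank among k tests."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * alpha:
            cutoff = rank  # largest rank satisfying the condition
    return sorted(order[:cutoff])

# Four tests at alpha=0.05: the step-up rule rejects the first three,
# while Bonferroni (threshold alpha/k = 0.0125) would reject only the first.
rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
```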
#### Sequential Testing and Early Stopping
@@ -1776,11 +1776,11 @@ This framework suggests risk mitigation strategies:
#### Risk Categories
| **Category** | **$P_{\text{regression}}$** | **$I_{\text{regression}}$** | **Rollout Strategy** |
|:-------------|:---------------------|:---------------------|:-------------------------------|
| **Low** | Minor code fix | Limited user impact | Fast canary |
| **Medium** | Retrained model | Engagement effects | Standard canary |
| **High** | New architecture | Revenue impact | Extended shadow + slow canary |
| **Critical** | Core model change | Safety implications | Shadow + human review + staged |
|:-------------|:----------------------------|:----------------------------|:-------------------------------|
| **Low** | Minor code fix | Limited user impact | Fast canary |
| **Medium** | Retrained model | Engagement effects | Standard canary |
| **High** | New architecture | Revenue impact | Extended shadow + slow canary |
| **Critical** | Core model change | Safety implications | Shadow + human review + staged |
: **Risk-Based Rollout Strategy Selection**: Four risk categories mapped to deployment strategies. Low-risk minor fixes (risk score under 0.1) proceed through fast canary rollout, while critical core model changes (risk score above 0.75) require full shadow deployment, human approval, and staged multi-week rollout with extended monitoring. {#tbl-ops-scale-risk-categories}
@@ -1863,7 +1863,7 @@ Fraud models balance quality validation against deployment urgency:
The full cycle may complete in 4--12 hours, with the ability to deploy emergency updates in under 1 hour when new fraud patterns emerge.
A mature CI/CD pipeline ensures that only healthy, verified models reach production, completing the deployment cycle in hours rather than weeks. However, deployment is not the finish line—it is the starting line. Once a model is safely deployed, how do we know it is actually doing what we expect it to do as the world around it changes? Answering this requires a monitoring architecture capable of scaling alongside our automated deployments.
A mature CI/CD pipeline ensures that only healthy, verified models reach production, completing the deployment cycle in hours rather than weeks. However, deployment is not the finish line—it is the starting line. Once a model is safely deployed, the question of whether it continues to perform as expected under shifting conditions becomes the primary operational concern. Answering this requires a monitoring architecture capable of scaling alongside our automated deployments.
## Monitoring at Scale {#sec-ml-operations-scale-monitoring-scale-73c5}
@@ -2374,7 +2374,7 @@ Integrating cost monitoring with the hierarchical monitoring architecture ensure
## Platform Engineering {#sec-ml-operations-scale-platform-engineering-17be}
If every data science team in your organization has to independently figure out how to provision a GPU cluster, configure a model registry, and wire up alerting dashboards, your company is paying a massive "undifferentiated heavy lifting" tax. Platform engineering solves this by treating the ML infrastructure itself as a product, providing paved roads that allow product teams to focus entirely on modeling rather than infrastructure plumbing.
Organizations where every data science team independently provisions GPU clusters, configures model registries, and wires up alerting dashboards pay a massive duplication tax on undifferentiated infrastructure work. Platform engineering solves this by treating the ML infrastructure itself as a product, providing paved roads that allow product teams to focus entirely on modeling rather than infrastructure plumbing.
Platform engineering for machine learning creates shared infrastructure that enables model teams to develop, deploy, and operate models without managing underlying complexity. Effective platforms balance self-service capabilities that accelerate development against governance requirements that ensure consistency and reliability.
@@ -2638,11 +2638,11 @@ ML platform costs span multiple categories with different optimization strategie
| **Cost Category** | **Typical Share** | **Primary Drivers** | **Optimization Lever** |
|:----------------------|------------------:|:-----------------------------------|:-----------------------------------|
| **Training compute** | 40--60% | GPU hours, experiment volume | Spot instances, early stopping |
| **Serving compute** | 20--40% | Traffic volume, latency SLOs | Autoscaling, model optimization |
| **Storage** | 10--20% | Dataset size, checkpoint frequency | Tiered storage, retention policies |
| **Network** | 5--15% | Multi-region, data transfer | Caching, compression |
| **Platform overhead** | 5--10% | Team size, tooling | Automation, self-service |
| **Training compute** | 40--60% | GPU hours, experiment volume | Spot instances, early stopping |
| **Serving compute** | 20--40% | Traffic volume, latency SLOs | Autoscaling, model optimization |
| **Storage** | 10--20% | Dataset size, checkpoint frequency | Tiered storage, retention policies |
| **Network** | 5--15% | Multi-region, data transfer | Caching, compression |
| **Platform overhead** | 5--10% | Team size, tooling | Automation, self-service |
: **ML Platform Cost Breakdown**: Five cost categories with typical budget share and optimization levers. Training compute dominates (40--60%) driven by GPU hours and experiment volume; spot instances and early stopping provide primary savings. Serving compute (20--40%) scales with traffic; autoscaling and model optimization reduce costs while maintaining latency SLOs. {#tbl-ops-scale-cost-breakdown}
@@ -3640,7 +3640,7 @@ At Spotify, non-negotiable latency constraints force a design tradeoff where com
## Production Debugging and Incident Response {#sec-ml-operations-scale-production-debugging-incident-response-9449}
It is 3:00 AM, and PagerDuty alerts you that the revenue from the core recommendation system has dropped by 15% in the last hour. The servers are healthy, the latency is normal, and there are no exception logs. In traditional software, a silent failure of this magnitude is rare; in machine learning systems, it is the expected reality. Debugging production ML systems requires fundamentally different investigative frameworks because the failures reside in the data and the mathematics, not just the code.
At 3:00 AM, PagerDuty alerts the on-call engineer that revenue from the core recommendation system has dropped 15% in the last hour. The servers are healthy, the latency is normal, and there are no exception logs. In traditional software, a silent failure of this magnitude is rare; in machine learning systems, it is the expected reality. Debugging production ML systems requires fundamentally different investigative frameworks because the failures reside in the data and the mathematics, not just the code.
Engineers spend 30-50% of their time debugging production issues. At platform scale, the complexity multiplies: failures may originate in data pipelines, model code, infrastructure, or emergent interactions between components. Effective incident response requires systematic approaches that go beyond single-model debugging techniques.


@@ -100,7 +100,7 @@ Performance Engineering is the **Optimization Layer** of the Fleet Stack. While
## The Memory Wall and the Efficiency Frontier {#sec-performance-engineering-memory-wall}
Why does an H100 GPU capable of 989 teraFLOPS often sit 95% idle while generating text from a large language model? The processor is starving for data. Performance engineering operates within a constrained optimization space defined by the Memory Wall, where the speed of moving bytes from memory to compute units fundamentally caps our operational throughput.
An H100 GPU capable of 989 teraFLOPS often sits 95% idle while generating text from a large language model because the processor is starving for data. Performance engineering operates within a constrained optimization space defined by the Memory Wall, where the speed of moving bytes from memory to compute units fundamentally caps our operational throughput.
Part II established the distributed logic of the fleet: parallelism strategies (@sec-distributed-training-systems), communication patterns (@sec-collective-communication), fault recovery (@sec-fault-tolerance-reliability), and resource orchestration (@sec-fleet-orchestration). Those chapters ensured that workloads reach the right hardware and survive failures along the way. This chapter ensures that each workload *uses* that hardware efficiently, extracting maximum throughput from every accelerator cycle.
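The ridge point, the arithmetic intensity at which a kernel transitions from memory-bound to compute-bound, follows directly from dividing peak throughput by memory bandwidth. A minimal sketch, assuming the commonly cited H100 SXM figures (989 TFLOPS FP16 tensor throughput, 3.35 TB/s HBM3 bandwidth); the chapter's own calculator derives these values programmatically:

```python
# Ridge point = peak compute / peak memory bandwidth (FLOP/byte).
# Assumed H100 SXM figures: 989 TFLOPS FP16 tensor, 3.35 TB/s HBM3.
PEAK_FLOPS_FP16 = 989e12  # FLOP/s
HBM_BANDWIDTH = 3.35e12   # byte/s

ridge_fp16 = PEAK_FLOPS_FP16 / HBM_BANDWIDTH
print(f"H100 FP16 ridge point: {ridge_fp16:.0f} FLOP/byte")
# Kernels below this intensity are memory-bound; above it, compute-bound.
```

An LLM decode step at roughly 1--2 FLOP/byte sits two orders of magnitude below this ridge, which is why the GPU idles during generation.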
@@ -485,7 +485,7 @@ class RooflineRidgeCalc:
a100_ridge_str = f"{a100_ridge_val:.0f}"
h100_fp16_ridge_str = f"{h100_ridge_fp16_val:.0f}"
h100_fp8_ridge_str = f"{h100_ridge_fp8_val:.0f}"
a100_fp16_str = f"{a100_f16.m_as(TFLOPs/second):.0f}"
a100_hbm_bw_str = f"{a100_bw.m_as(TB/second):.0f}"
h100_fp8_str = f"{h100_f8.m_as(TFLOPs/second):.0f}"
@@ -699,7 +699,7 @@ class WorkloadIntensityCalc:
prefill_intensity = (2 * prefill_batch) / (2 * bytes_per_elem) # Approx for large hidden
# Token Rate Physics (Llama-3 70B sharded across 8 GPUs)
p_shard = 17.5 * BILLION
p_shard = 17.5 * BILLION
weight_bytes_val = p_shard * bytes_per_elem
decode_step_flops_val = 2 * p_shard
t_decode_ms = (weight_bytes_val / h100_bw.m_as(byte/second)) * 1000
@@ -717,17 +717,17 @@ class WorkloadIntensityCalc:
gemm_intensity_str = f"{gemm_intensity:.0f}"
gemm_flops_str = f"{gemm_flops/BILLION:.0f} billion"
gemm_data_mb_str = f"{gemm_bytes/MILLION:.0f} MB"
elem_intensity_str = f"{elem_intensity:.1f}"
gelu_flops_per_val_str = f"{elem_flops_per_val}"
decode_intensity_str = f"{decode_intensity_val:.1f}"
h100_ridge_str = f"{h100_ridge:.0f}"
prefill_intensity_str = f"{prefill_intensity:.0f}"
decode_t_ms_str = f"{t_decode_ms:.1f}"
decode_util_pct_str = f"{util_pct:.1f}"
decode_tokens_sec_str = f"{tokens_sec:.0f}"
decode_step_param_b_str = f"{p_shard/BILLION:.1f}"
decode_step_weight_gb_str = f"{weight_bytes_val/BILLION:.0f}"
decode_step_flops_str = f"{decode_step_flops_val/BILLION:.0f}"
@@ -757,7 +757,7 @@ Autoregressive LLM decoding at batch size one represents the extreme case. Each
| **Operation** | **Arithmetic Intensity** | **H100 FP16 Regime** | **Primary Bottleneck** |
|:-----------------------------------|:-------------------------|:---------------------|:-----------------------|
| **GEMM ($4096\times4096$)** | ~1,365 FLOP/byte | Compute-bound | Tensor core throughput |
| **GEMM ($4096\times4096$)** | ~1,365 FLOP/byte | Compute-bound | Tensor core throughput |
| **Self-Attention (seq=2048)** | ~50--200 FLOP/byte | Memory-bound | HBM bandwidth |
| **Element-wise (GELU, LayerNorm)** | ~1--3 FLOP/byte | Memory-bound | HBM bandwidth |
| **LLM Decode (batch=1)** | ~1--2 FLOP/byte | Memory-bound | HBM bandwidth |
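The GEMM entry in the table follows from a counting argument: a square $N \times N$ matrix multiply performs $2N^3$ FLOPs while touching three $N^2$ operands. A minimal sketch, assuming FP16 (2 bytes per element) and that each operand crosses HBM exactly once:

```python
# Arithmetic intensity of a square FP16 GEMM, assuming each of the
# three N x N operands (A, B, C) crosses HBM exactly once.
N = 4096
BYTES_PER_ELEM = 2  # FP16

flops = 2 * N**3                          # multiply-accumulate count
bytes_moved = 3 * N**2 * BYTES_PER_ELEM   # read A, read B, write C

intensity = flops / bytes_moved           # simplifies to N / 3
print(f"{intensity:.0f} FLOP/byte")       # ~1365, matching the table row
```

In practice, tiling keeps sub-blocks resident in SRAM, but this minimal-traffic figure is the intensity the roofline model uses for classification.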
@@ -1029,7 +1029,7 @@ class FlashAttentionSavings:
naive_mb_str = f"{naive_attn_bytes_total / MILLION:.0f}"
flash_mb_str = f"{flash_attn_bytes_total / MILLION:.0f}"
savings_str = f"{savings_ratio:.0f}"
head_n_str = f"{seq_len}"
head_d_str = f"{head_dim}"
head_bytes_str = f"{head_dim * bytes_per_elem / MILLION * seq_len:.0f}"
@@ -1989,7 +1989,7 @@ class FleetEfficiencyCalc:
n_gpus = 128
r_peak_per_gpu = H100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second) * TRILLION
tokens_per_step = 2048 * 32 # seq_len * global_batch
# Observed times
t_local_step_ms = 180.0 # 8-GPU node baseline
t_fleet_step_ms = 245.0 # 128-GPU cluster observed
@@ -1997,11 +1997,11 @@ class FleetEfficiencyCalc:
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# Useful FLOPs per step = 6 * P * Tokens
flops_per_step = 6 * p_params * tokens_per_step
# MFU = useful_flops / (n_gpus * r_peak * time)
local_mfu = flops_per_step / (8 * r_peak_per_gpu * (t_local_step_ms / 1000))
global_mfu = flops_per_step / (n_gpus * r_peak_per_gpu * (t_fleet_step_ms / 1000))
scaling_tax = (1 - (global_mfu / local_mfu)) * 100
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
@@ -2025,7 +2025,7 @@ fleet_scaling_tax_str = FleetEfficiencyCalc.fleet_scaling_tax_str
```
::: {.callout-notebook title="The Scaling Tax"}
**Scenario**: Training a **`{python} fleet_params_b_str`** parameter model across a cluster of **`{python} fleet_nodes_str` H100 GPUs**.
**Scenario**: Training a **`{python} fleet_params_b_str`** parameter model across a cluster of **`{python} fleet_nodes_str` H100 GPUs**.
* **Local Node Baseline**: A single 8-GPU node achieves **`{python} fleet_local_mfu_str`% MFU**.
* **Fleet Performance**: At 128 GPUs, the step time increases to **`{python} fleet_t_step_ms_str` ms**, dropping MFU to **`{python} fleet_global_mfu_str`%**.
@@ -2042,9 +2042,10 @@ At scale, the system is non-linear. A code change that introduces a minor memory
3. **Gray Failure Detection**: Monitoring the distribution of step times across the fleet. A single "straggler" node that is 10% slower due to thermal throttling can reduce the entire cluster's MFU by 10% in a synchronous data-parallel workload.
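The straggler arithmetic in point 3 follows from the synchronous barrier: every step completes only when the slowest node finishes. A minimal sketch with illustrative step times (the 180 ms baseline and 10% throttle are assumptions for illustration):

```python
# Synchronous data parallelism: effective step time = max over nodes.
# Illustrative: 127 healthy nodes at 180 ms, one node throttled 10% slower.
healthy_ms, straggler_ms = 180.0, 198.0
step_times = [healthy_ms] * 127 + [straggler_ms]

effective_step = max(step_times)           # barrier waits for the straggler
mfu_retained = healthy_ms / effective_step
print(f"Fleet MFU retained: {mfu_retained:.1%}")  # ~90.9%, ~9% lost to one node
```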
::: {.callout-note title="Benchmark vs. Reality: The Hero Run Tax"}
Industry benchmarks like **MLPerf** are often "Hero Runs"—highly tuned configurations where logging is disabled, safety checks are bypassed, and the hardware is freshly rebooted.
Industry benchmarks like **MLPerf** are often "Hero Runs"—highly tuned configurations where logging is disabled, safety checks are bypassed, and the hardware is freshly rebooted.
In production, your achieved MFU will typically sit **10--20% lower** than these hero numbers. This "Reality Tax" is consumed by essential operational overhead:
* **Observability**: Metrics collection and logging.
* **Reliability**: Checkpointing and health heartbeats.
* **Entropy**: Thermal throttling, memory fragmentation, and multi-tenant network noise.
@@ -2052,11 +2053,11 @@ In production, your achieved MFU will typically sit **1020% lower** than thes
When planning capacity, engineers must budget for the Reality Tax. If a benchmark says you can train in 30 days, your production plan should assume 35--40 days.
:::
With the measurement hierarchy established—from node-level traces to fleet-wide MFU—we now turn to the tactical execution. The optimization playbook translates these global measurements into a surgical sequence of interventions.
Measurement without action is overhead. The optimization playbook translates node-level traces and fleet-wide MFU into a surgical sequence of interventions, each targeting the specific bottleneck that the measurement hierarchy identified.
## The Optimization Playbook: A 70B LLM Case Study {#sec-performance-engineering-playbook}
You are handed a raw, unoptimized 70-billion parameter PyTorch model and told it must serve 1,000 tokens per second in production by next week. Where do you start? You cannot simply throw every technique at the wall. The optimization playbook requires a systematic, prioritized attack: first unblocking the memory wall, then fusing operators, and finally applying algorithmic techniques like speculative decoding in a specific, compounding sequence.
Consider a raw, unoptimized 70-billion parameter PyTorch model that must serve 1,000 tokens per second in production by next week. The optimization sequence begins with baseline measurement, followed by roofline classification to identify the dominant bottleneck. Throwing every technique at the wall does not work. The optimization playbook requires a systematic, prioritized attack: first unblocking the memory wall, then fusing operators, and finally applying algorithmic techniques like speculative decoding in a specific, compounding sequence.
### The Diagnostic Sequence {#sec-performance-engineering-diagnostic-sequence}


@@ -480,7 +480,7 @@ class SilentErrorProbability:
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
p_per_hr_meta = 1e-4 # Meta reported rate
n_gpus_large = 10 * THOUSAND
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# P(at least one) = 1 - (1-p)^N
p_at_least_one_val = 1 - (1 - p_per_hr_meta) ** n_gpus_large
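The compounding probability in this hunk, $P(\text{at least one}) = 1 - (1-p)^N$, produces counterintuitive results at fleet scale. A quick check using the reported per-GPU-hour rate:

```python
# Probability of at least one silent error per hour across a fleet,
# using the reported per-GPU-hour rate of 1e-4.
p_per_gpu_hour = 1e-4
n_gpus = 10_000

p_at_least_one = 1 - (1 - p_per_gpu_hour) ** n_gpus
print(f"P(>=1 silent error per fleet-hour): {p_at_least_one:.1%}")
# A per-device rarity becomes a ~63% per-hour near-certainty at 10k GPUs.
```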
@@ -795,13 +795,13 @@ class DistributionShiftMetrics:
n_samples_mmd = 10 * THOUSAND
mmd_latency_ms = 150 # ms on H100
n_samples_ks = THOUSAND
psi_warn = 0.10
psi_critical = 0.25
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# (Simplified benchmarks for this scenario)
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
check(psi_critical > psi_warn, "Critical threshold must be above warning")
@@ -1165,7 +1165,7 @@ The most direct and widely studied category comprises gradient-based attacks, wh
The key insight behind gradient-based attacks is that neural networks compute gradients to understand how changes to their inputs affect their outputs. During training, gradients guide weight updates to minimize prediction errors. For attacks, these same gradients reveal which input modifications would maximize prediction errors—essentially running the training process in reverse.
To illustrate this concept, consider an image classification model that correctly identifies a cat in a photo. The gradient with respect to the input image shows how sensitive the model's prediction is to changes in each pixel. An attacker can use this gradient information to determine the most effective way to modify specific pixels to change the model's prediction, perhaps causing it to misclassify the cat as a dog while keeping the changes imperceptible to human observers.
To illustrate this concept, consider an image classification model that correctly identifies a cat in a photo. The gradient with respect to the input image shows how sensitive the model's prediction is to changes in each pixel. An attacker can use this gradient information to determine the most effective way to modify specific pixels to change the model's prediction, causing it to misclassify the cat as a dog while keeping the changes imperceptible to human observers.
###### Fast Gradient Sign Method (FGSM)
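The FGSM update itself fits in three lines. A minimal NumPy sketch, assuming the input gradient has already been computed by a framework's autograd; the gradient values below are hypothetical placeholders, not from any real model:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """One-step FGSM: move each pixel by eps in the sign of the loss gradient."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # keep a valid image range

# Hypothetical 2x2 "image" and loss gradient for illustration.
x = np.array([[0.50, 0.20], [0.80, 0.99]])
grad = np.array([[0.4, -1.2], [0.0, 0.7]])
adv = fgsm_perturb(x, grad)
# Each nonzero-gradient pixel moves by exactly eps; the zero-gradient
# pixel stays put, and 0.99 + 0.03 clips at the valid bound of 1.0.
```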
@@ -1880,7 +1880,7 @@ acc_drop_rn50 = RobustnessTaxAnalysis.acc_drop_str
```
::: {.callout-notebook title="The Robustness Tax"}
**Problem**: You want to make your ResNet-50 "unhackable" by adversarial patches. What does this cost you in normal performance?
**Problem**: The goal is to make a ResNet-50 "unhackable" by adversarial patches. What does this cost you in normal performance?
**The Data**:
@@ -1890,7 +1890,7 @@ acc_drop_rn50 = RobustnessTaxAnalysis.acc_drop_str
**The Systems Conclusion**: To gain robustness against rare adversarial attacks, you sacrifice **`{python} acc_drop_rn50`% accuracy** on normal inputs.
* **Why?** The model must learn to ignore "non-robust features" (like high-frequency textures) that are predictive but brittle.
* **Implication**: You cannot simply "turn on robustness" for free. It is a fundamental trade-off between **Average-Case Performance** and **Worst-Case Reliability**.
* **Implication**: Robustness cannot simply be "turned on" for free. It is a fundamental trade-off between **Average-Case Performance** and **Worst-Case Reliability**.
:::
::: {#nte-robustness-compute-penalty .callout-principle icon=false title="The Robustness Compute Penalty"}
@@ -1997,7 +1997,7 @@ While adversarial training and certified defenses provide a strong perimeter aga
## Data Poisoning Defenses {#sec-robust-ai-data-poisoning-defenses-d070}
Imagine a hedge fund training a sentiment analysis model on financial tweets. If a rival firm coordinates a network of bots to systematically associate the word "growth" with negative sentiment during the training window, your newly deployed trading algorithm will aggressively short stocks on positive earnings reports. Data poisoning attacks target the supply chain of machine learning, manipulating the raw material of intelligence before the model even begins to learn.
Imagine a hedge fund training a sentiment analysis model on financial tweets. If a rival firm coordinates a network of bots to systematically associate the word "growth" with negative sentiment during the training window, a newly deployed trading algorithm will aggressively short stocks on positive earnings reports. Data poisoning attacks target the supply chain of machine learning, manipulating the raw material of intelligence before the model even begins to learn.
::: {#fig-adversarial-attack-injection fig-env="figure" fig-pos="htb" fig-cap="**Data Poisoning Attack**: Adversaries inject malicious data into the training set to manipulate model behavior, potentially causing misclassification or performance degradation during deployment. This attack emphasizes the vulnerability of machine learning systems to compromised data integrity and the need for robust data validation techniques. *Source: [li](HTTPS://www.mdpi.com/2227-7390/12/2/247)*" fig-alt="Matrix showing user-item ratings with attacker injecting red malicious rows. Defender analyzes and cleans data before training. Poisoned model cube shows compromised output."}
```{.tikz}
@@ -2362,7 +2362,6 @@ Several strategies can incorporate self-supervised learning into robust system d
While promising, self-supervised learning for robustness remains an active research area with important limitations. Current SSL methods may still be vulnerable to adversarial attacks, particularly when attackers understand the pretext tasks. The theoretical understanding of why and when SSL improves robustness remains incomplete. Computational overhead for SSL pre-training can be substantial, requiring careful consideration of resource constraints.
## Fallacies and Pitfalls {#sec-robust-ai-fallacies-pitfalls-087e}
This chapter has examined robustness across environmental shifts, adversarial attacks, and data poisoning. Each threat domain introduces misconceptions that can lead to inadequate defenses or misallocated engineering resources.


@@ -71,7 +71,7 @@ This chapter's position in the book's organizing framework, *the Fleet Stack*, c
::: {.callout-note title="Connection: The Fleet Stack"}
We have entered **Part IV: The Responsible Fleet**, the **Governance Layer** of the Fleet Stack. Having built the fleet (Part I), the distributed logic (Part II), and the serving infrastructure (Part III), we now focus on **protection**. This layer does not add new features; it wraps the entire stack in armor, ensuring that our powerful global machine cannot be hijacked, poisoned, or exploited by adversaries.
We have entered **Part IV: The Responsible Fleet**, the **Governance Layer** of the Fleet Stack. Having built the fleet (Part I), the distributed logic (Part II), and the serving infrastructure (Part III), we now focus on **protection**. This layer does not add new features; it wraps the entire stack in armor, ensuring that the global fleet cannot be hijacked, poisoned, or exploited by adversaries.
:::
@@ -209,7 +209,7 @@ error_small_str = DPCostAnalysis.error_small_str
```
::: {.callout-notebook title="The Cost of Differential Privacy"}
**Problem**: You want to compute the average salary of **`{python} n_emp_dp` employees** while guaranteeing privacy ($\epsilon=`{python} eps_dp`). The salaries range from \$0 to \$`{python} sensitivity_dp_str`. How much noise must you add?
**Problem**: Consider computing the average salary of **`{python} n_emp_dp` employees** while guaranteeing privacy ($\epsilon=`{python} eps_dp`). The salaries range from \$0 to \$`{python} sensitivity_dp_str`. How much noise must the mechanism add?
**The Math**:
@@ -219,7 +219,7 @@ error_small_str = DPCostAnalysis.error_small_str
4. **Impact on Mean**: The noise added to the *sum* has magnitude $\approx `{python} noise_scale_dp_str`.
* Noise per person (average) = `{python} noise_scale_dp_str` / `{python} n_emp_dp` = **\$`{python} error_per_person_str`**.
**The Systems Conclusion**: To protect one outlier, you introduce a **\$`{python} error_per_person_str` error** to the average. For a dataset of $N=`{python} n_small_dp`, the error would be **\$`{python} error_small_str`**! Differential Privacy kills utility for small $N$. It only works at scale where $1/N$ dampens the noise.
**The Systems Conclusion**: Protecting one outlier introduces a **\$`{python} error_per_person_str` error** to the average. For a dataset of $N=`{python} n_small_dp`, the error would be **\$`{python} error_small_str`**! Differential Privacy kills utility for small $N$. It only works at scale where $1/N$ dampens the noise.
:::
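The Laplace-mechanism arithmetic behind this callout can be reproduced with illustrative numbers (the chapter computes its actual values inline; the salary bound and $\epsilon$ here are assumptions): the noise scale is $b = \Delta/\epsilon$, and the expected noise on the *mean* shrinks as $1/N$.

```python
# Laplace mechanism for a DP sum: noise scale b = sensitivity / epsilon.
# Illustrative assumptions: salaries bounded by $1,000,000, epsilon = 1.0.
sensitivity = 1_000_000
epsilon = 1.0
b = sensitivity / epsilon  # expected |noise| on the SUM is exactly b

for n in (100, 10_000):
    error_on_mean = b / n
    print(f"N={n:>6}: expected error on the average ~ ${error_on_mean:,.0f}")
# N=100 yields ~$10,000 of error (useless); N=10,000 yields ~$100 (tolerable).
```

The $1/N$ dampening is exactly why the callout concludes that differential privacy only preserves utility at scale.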
### Security versus Privacy {#sec-security-privacy-security-versus-privacy-e0b8}
@@ -842,7 +842,7 @@ These three incidents establish a common structure: an attacker exploits a speci
## Systematic Threat Analysis and Risk Assessment {#sec-security-privacy-systematic-threat-analysis-risk-assessment-3ef1}
How do you protect a system when the attack surface includes every image it will ever see and every word it will ever process? Traditional cybersecurity focuses on securing networks and authenticating users, but ML systems introduce attack surfaces at the algorithmic layer: training data can be manipulated to embed backdoors, input perturbations can exploit learned decision boundaries, and systematic API queries can extract proprietary model knowledge. Each of these vectors requires a formal threat model that specifies the adversary's capability (what they can access), the adversary's goal (what they seek to compromise), and the defender's information (what signals are observable). Systematic threat analysis maps these surfaces and quantifies the cost-benefit calculus that determines which threats are economically viable for an attacker to mount -- and therefore which defenses are worth engineering.
Protecting a system whose attack surface includes every image it will ever see and every word it will ever process demands a fundamentally different approach than traditional cybersecurity. Network security and user authentication remain necessary, but ML systems introduce attack surfaces at the algorithmic layer: training data can be manipulated to embed backdoors, input perturbations can exploit learned decision boundaries, and systematic API queries can extract proprietary model knowledge. Each of these vectors requires a formal threat model that specifies the adversary's capability (what they can access), the adversary's goal (what they seek to compromise), and the defender's information (what signals are observable). Systematic threat analysis maps these surfaces and quantifies the cost-benefit calculus that determines which threats are economically viable for an attacker to mount -- and therefore which defenses are worth engineering.
### Threat Prioritization Framework {#sec-security-privacy-threat-prioritization-framework-f2d5}
@@ -2361,17 +2361,17 @@ These mechanisms work together to create comprehensive protection that begins in
These four complementary hardware primitives work together to create comprehensive protection, as summarized in @tbl-hardware-security-mechanisms. Each mechanism addresses different security challenges but works most effectively when combined: secure boot establishes initial trust, TEEs provide runtime isolation, HSMs handle cryptographic operations, and PUFs enable device-unique authentication. Trusted Execution Environments (TEEs) provide isolated runtime environments for sensitive computations. Secure Boot ensures system integrity from power-on, creating the trusted foundation that TEEs depend upon. Hardware Security Modules (HSMs) offer specialized cryptographic processing and tamper-resistant key storage, often required for regulatory compliance. Physical Unclonable Functions (PUFs) provide device-unique identities that enable lightweight authentication and cannot be cloned or extracted.
| **Mechanism** | **Fortress Analogy and Function** |
|:----------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Mechanism** | **Fortress Analogy and Function** |
|:----------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Secure Boot** | Functions like a trusted gatekeeper checking credentials of everyone entering the fortress at dawn. Before your system runs any code, Secure Boot cryptographically verifies that the firmware and operating system have not been tampered with. |
| **Trusted Execution** | Create secure, windowless rooms deep inside the fortress where you |
| **Environments** | handle your most sensitive operations. When your ML model processes |
| **(TEEs)** | private medical data or proprietary algorithms, the TEE isolates these computations from the rest of the system. |
| **Hardware Security** | Serve as specialized, impenetrable vaults designed specifically for |
| **Modules (HSMs)** | storing and using your most valuable cryptographic keys. Rather than keeping encryption keys in regular computer memory where they might be stolen, HSMs provide tamper-resistant storage. |
| **Physical** | Give each device a unique biometric fingerprint at the silicon |
| **Unclonable** | level. Just as human fingerprints cannot be perfectly replicated, |
| **Functions (PUFs)** | PUFs exploit tiny manufacturing variations in each chip to create device-unique identifiers that cannot be cloned. |
| **Trusted Execution** | Create secure, windowless rooms deep inside the fortress where you |
| **Environments** | handle your most sensitive operations. When your ML model processes |
| **(TEEs)** | private medical data or proprietary algorithms, the TEE isolates these computations from the rest of the system. |
| **Hardware Security** | Serve as specialized, impenetrable vaults designed specifically for |
| **Modules (HSMs)** | storing and using your most valuable cryptographic keys. Rather than keeping encryption keys in regular computer memory where they might be stolen, HSMs provide tamper-resistant storage. |
| **Physical** | Give each device a unique biometric fingerprint at the silicon |
| **Unclonable** | level. Just as human fingerprints cannot be perfectly replicated, |
| **Functions (PUFs)** | PUFs exploit tiny manufacturing variations in each chip to create device-unique identifiers that cannot be cloned. |
: **Hardware Security Mechanisms**: Each primitive provides distinct defensive capabilities that work together to create comprehensive protection from hardware-level threats. {#tbl-hardware-security-mechanisms}
@@ -2694,25 +2694,25 @@ Together, these hardware primitives form the foundation of a defense-in-depth st
Implementing secure multi-party computation and gradient compression establishes a robust architecture for collaborative training without exposing raw data. The appropriate defense for any given deployment, however, depends on three interacting factors: the threat model (who attacks and with what capabilities), the deployment context (what computational and latency budgets exist), and the regulatory environment (what legal mandates constrain design). A healthcare system training federated diagnostic models faces fundamentally different threats than a public-facing LLM chatbot, and each demands a distinct combination of the mechanisms surveyed in this chapter. @tbl-defense-selection-framework maps seven common deployment contexts to their primary threats, recommended defense combinations, and the quantified trade-offs that each combination imposes.
| **Deployment Context** | **Primary Threats** | **Recommended Defenses** | **Key Trade-offs** |
|:----------------------------|:--------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------|
| **Healthcare ML** | Data leakage (HIPAA violation), | • Differential Privacy ($\epsilon \leq 4$) for training | 2--5% accuracy loss acceptable for compliance; |
| **(Federated diagnostic** | membership inference, | • Federated Learning across hospitals | 50--100 ms inference latency from TEE overhead |
| **models)** | unauthorized access | • TEE for inference on sensitive data<br>• Audit logging and access control (RBAC) | |
| **Financial ML** | Model theft (IP loss), | • Model encryption (AES-256) at rest | HSM adds \$10--50K capital cost; rate limiting |
| **(Fraud detection API)** | adversarial evasion, data poisoning | • HSM for cryptographic key management<br>• Adversarial training (PGD-based)<br>• Input validation + rate limiting (100 req/min)<br>• Output confidence monitoring | may impact legitimate high-frequency users |
| **Edge ML** | Physical access, | • Secure Boot (verified firmware) | TEE memory limits constrain model size |
| **(Mobile/IoT devices)** | side-channel attacks, model extraction | • ARM TrustZone or similar TEE<br>• Model quantization + obfuscation<br>• Encrypted model storage<br>• Anti-tampering hardware (PUF) | $<$50 MB; quantization required for large models; 15--30% power overhead from encryption |
| **Cloud ML Training** | Data poisoning, | • Secure data pipelines (provenance tracking) | Training time increases 30--120% with DP; |
| **(Multi-tenant platform)** | backdoor injection, gradient leakage | • Differential Privacy (DP-SGD, $\epsilon \approx 1$--$10$)<br>• Gradient verification and anomaly detection<br>• Secure aggregation (if federated)<br>• Model watermarking for IP protection | gradient verification adds 10--15% compute overhead; federated aggregation requires secure communication protocols |
| **Public-Facing LLM** | Prompt injection, | • Input sanitization (prompt filtering) | Aggressive filtering may block 5--10% of |
| **(Chatbot/API)** | data extraction (training leakage), abuse/overuse | • Output monitoring (PII detection)<br>• Rate limiting (per-user quotas)<br>• Response watermarking<br>• Confidence thresholding (abstention) | legitimate requests; response time increases 50--100 ms for content filtering; watermarking may be detectable by sophisticated users |
| **Multi-Party ML** | Data sharing restrictions, | • Federated Learning (no raw data sharing) | Communication overhead: 10--100$\times$ more rounds |
| **(Cross-organizational** | honest-but-curious participants, | • SMPC for secure aggregation | than centralized training; SMPC adds $1{,}000\times$+ |
| **training)** | privacy compliance (GDPR) | • Differential Privacy ($\epsilon \leq 1$)<br>• Homomorphic Encryption (for sensitive ops) | compute cost; accuracy may degrade 5--15%; requires legal agreements for liability |
| **Critical Infrastructure** | Supply chain compromise, | • Hardware attestation (TPM/PUF) | Development cost: 6--18 months additional |
| **(Autonomous vehicles,** | real-time adversarial attacks, | • Secure Boot + runtime integrity checks | engineering; 20--40% higher hardware costs; |
| **power grids)** | safety-critical failures | • Redundant model validation<br>• Fault injection detection<br>• Fail-safe fallback mechanisms | latency constraints limit cryptographic defenses; requires certified hardware |
| **Deployment Context** | **Primary Threats** | **Recommended Defenses** | **Key Trade-offs** |
|:----------------------------|:--------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------|
| **Healthcare ML** | Data leakage (HIPAA violation), | • Differential Privacy ($\epsilon \leq 4$) for training | 2--5% accuracy loss acceptable for compliance; |
| **(Federated diagnostic** | membership inference, | • Federated Learning across hospitals | 50--100 ms inference latency from TEE overhead |
| **models)** | unauthorized access | • TEE for inference on sensitive data<br>• Audit logging and access control (RBAC) | |
| **Financial ML** | Model theft (IP loss), | • Model encryption (AES-256) at rest | HSM adds \$10--50K capital cost; rate limiting |
| **(Fraud detection API)** | adversarial evasion, data poisoning | • HSM for cryptographic key management<br>• Adversarial training (PGD-based)<br>• Input validation + rate limiting (100 req/min)<br>• Output confidence monitoring | may impact legitimate high-frequency users |
| **Edge ML** | Physical access, | • Secure Boot (verified firmware) | TEE memory limits constrain model size |
| **(Mobile/IoT devices)** | side-channel attacks, model extraction | • ARM TrustZone or similar TEE<br>• Model quantization + obfuscation<br>• Encrypted model storage<br>• Anti-tampering hardware (PUF) | $<$50 MB; quantization required for large models; 15--30% power overhead from encryption |
| **Cloud ML Training** | Data poisoning, | • Secure data pipelines (provenance tracking) | Training time increases 30--120% with DP; |
| **(Multi-tenant platform)** | backdoor injection, gradient leakage | • Differential Privacy (DP-SGD, $\epsilon \approx 1$--$10$)<br>• Gradient verification and anomaly detection<br>• Secure aggregation (if federated)<br>• Model watermarking for IP protection | gradient verification adds 10--15% compute overhead; federated aggregation requires secure communication protocols |
| **Public-Facing LLM** | Prompt injection, | • Input sanitization (prompt filtering) | Aggressive filtering may block 5--10% of |
| **(Chatbot/API)** | data extraction (training leakage), abuse/overuse | • Output monitoring (PII detection)<br>• Rate limiting (per-user quotas)<br>• Response watermarking<br>• Confidence thresholding (abstention) | legitimate requests; response time increases 50--100 ms for content filtering; watermarking may be detectable by sophisticated users |
| **Multi-Party ML** | Data sharing restrictions, | • Federated Learning (no raw data sharing) | Communication overhead: 10--100$\times$ more rounds |
| **(Cross-organizational** | honest-but-curious participants, | • SMPC for secure aggregation | than centralized training; SMPC adds $1{,}000\times$+ |
| **training)** | privacy compliance (GDPR) | • Differential Privacy ($\epsilon \leq 1$)<br>• Homomorphic Encryption (for sensitive ops) | compute cost; accuracy may degrade 5--15%; requires legal agreements for liability |
| **Critical Infrastructure** | Supply chain compromise, | • Hardware attestation (TPM/PUF) | Development cost: 6--18 months additional |
| **(Autonomous vehicles,** | real-time adversarial attacks, | • Secure Boot + runtime integrity checks | engineering; 20--40% higher hardware costs; |
| **power grids)** | safety-critical failures | • Redundant model validation<br>• Fault injection detection<br>• Fail-safe fallback mechanisms | latency constraints limit cryptographic defenses; requires certified hardware |
: **Defense Selection Framework**: Maps deployment contexts to threat-specific defensive strategies with quantified trade-offs. The framework provides starting points for security architecture design, highlighting primary threats, recommended defense combinations, and key implementation trade-offs across seven common ML system deployment scenarios. {#tbl-defense-selection-framework}

View File

@@ -175,7 +175,7 @@ emissions_ratio_low_carbon_str = CarbonCostGPT3.emissions_ratio_str
**Archetype A (GPT-4)** is the primary driver of the industry's exponential energy growth. A single 25,000-GPU cluster drawing 700 W per chip requires 17.5 MW of continuous power for training. This is not just a cost problem; it is a **Grid Capacity** problem. Organizations operating Archetype A models are increasingly forced to build their own power infrastructure or relocate to regions with excess renewable energy, making **Carbon-Aware Scheduling** and **Geographic Optimization** as critical as learning rate tuning.
:::
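The Carbon-Aware Scheduling and Geographic Optimization levers described above reduce to a simple selection rule at job-submission time. A minimal sketch, with hypothetical region names and illustrative carbon-intensity figures (gCO~2~/kWh), not measured values:

```python
# Carbon-aware placement: pick the region whose grid is cleanest
# at job-submission time. Region names and intensity values are
# illustrative assumptions.
REGION_CARBON_INTENSITY = {  # gCO2 per kWh (hypothetical snapshot)
    "us-central": 450.0,
    "eu-north": 45.0,    # hydro-heavy grid
    "asia-east": 620.0,
}

def pick_region(energy_kwh: float) -> tuple[str, float]:
    """Return the lowest-carbon region and the job's emissions there (kg CO2)."""
    region = min(REGION_CARBON_INTENSITY, key=REGION_CARBON_INTENSITY.get)
    kg_co2 = energy_kwh * REGION_CARBON_INTENSITY[region] / 1000.0
    return region, kg_co2

region, kg = pick_region(10_000.0)  # a 10 MWh training job
print(region, round(kg, 1))  # eu-north: 450 kg CO2 vs 4,500 kg in us-central
```

Real schedulers consult live grid-intensity feeds rather than a static table, but the decision structure is the same: placement, not model architecture, sets the emissions multiplier.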
AI systems consume resources at industrial scales that rival traditional heavy industries.
Training a single large language model consumes thousands of megawatt-hours of electricity, equivalent to powering hundreds of households for months.[^fn-household-energy] Data centers that include AI workloads are projected to account for 8% of global power consumption by 2030, surpassing aviation at 2.1% and approaching cement production at 4% [@oecd2023blueprint].[^fn-datacenter-industrial-scale] Computational demands increased 350,000$\times$ from 2012 to 2019 [@schwartz2020green], while hardware efficiency improved at a far slower rate, creating an unsustainable growth trajectory.
[^fn-household-energy]: **Household Energy Baseline**: The average U.S. household consumes 10,500 kWh annually. GPT-3's verified 1,287 MWh training run equals 122 households' annual electricity, and frontier models have grown 25$\times$ in compute since then. This comparison anchors an otherwise abstract energy figure to physical infrastructure: a single training run draws more grid capacity than a residential neighborhood. \index{Household Energy!sustainability baseline}
@@ -406,7 +406,7 @@ Y,Date
### The Energy Wall: Divergent Scaling {#sec-sustainable-ai-energy-wall}
*Why* is AI sustainability a unique engineering challenge? It is a race between two fundamentally different physics: the **exponential scaling of logic** and the **linear scaling of energy infrastructure**.
AI sustainability presents a unique engineering challenge because it is a race between two fundamentally different physics: the **exponential scaling of logic** and the **linear scaling of energy infrastructure**.
```{python}
#| label: energy-wall-scenario
@@ -438,7 +438,7 @@ class EnergyWallScenario:
# Compound growth: (1 + r)^n
battery_gain = (1 + battery_annual_growth) ** n_years
grid_gain = (1 + grid_annual_growth) ** n_years
# Gap between compute growth and infrastructure growth
energy_wall_gap = compute_growth_factor / battery_gain
@@ -632,7 +632,7 @@ The convergence of exponential computational demands with hard physical efficien
## Energy Measurement and Modeling {#sec-sustainable-ai-part-ii-measurement-assessment-fb0b}
You cannot optimize what you cannot measure. If your cluster consumes five megawatts of power during a large language model training run, how much of that power actually went into matrix multiplications, and how much was wasted spinning cooling fans to remove the resulting heat? Effective energy modeling requires decomposing the monolithic datacenter power bill into granular, component-level metrics that engineers can actually target for optimization.
Engineers cannot optimize what they cannot measure. When a cluster consumes five megawatts of power during a large language model training run, only a fraction of that power drives matrix multiplications; the remainder is wasted spinning cooling fans to remove the resulting heat. Effective energy modeling requires decomposing the monolithic datacenter power bill into granular, component-level metrics that engineers can target for optimization.
The datacenter infrastructure foundations from @sec-compute-infrastructure established power and cooling as dominant engineering constraints. Systematic measurement now transforms these constraints into sustainability metrics. This part develops quantitative frameworks for three critical areas: energy consumption tracking during training and inference, carbon footprint analysis across system lifecycles, and resource usage assessment for hardware and infrastructure. These measurement tools enable engineers to identify optimization opportunities, compare alternative designs, and validate that sustainability improvements achieve their intended effects. Just as performance engineering requires profiling before optimization, sustainable AI engineering requires measurement before mitigation.
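This decomposition is commonly summarized by Power Usage Effectiveness (PUE), the ratio of total facility power to IT equipment power. A minimal sketch, with the megawatt figures below chosen as illustrative assumptions:

```python
# Decompose a facility's power draw into IT load and overhead.
# PUE = total facility power / IT equipment power.
# The wattages below are illustrative assumptions.
def pue(it_power_mw: float, cooling_mw: float, other_mw: float) -> float:
    total = it_power_mw + cooling_mw + other_mw
    return total / it_power_mw

# A 5 MW facility: 3.8 MW to servers, 0.9 MW to cooling,
# 0.3 MW to power delivery and lighting.
ratio = pue(3.8, 0.9, 0.3)
print(round(ratio, 2))  # 1.32: each IT joule costs 1.32 joules at the meter
```

A PUE of 1.32 means cooling and power delivery add 32% on top of every joule the chips consume, which is exactly the overhead that component-level measurement makes visible and targetable.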
@@ -1367,7 +1367,7 @@ The geographic choice alone produces a `{python} emissions_ratio_str`-fold diffe
% Input flows
\fill[EmbodiedColor!30] (0, 3) -- (3, 2.5) -- (3, 1.5) -- (0, 2) -- cycle;
\node[left, align=right] at (0, 2.5) {\textbf{Raw Materials}\\Extraction \& Log.};
\fill[EmbodiedColor!50] (0, 1.5) -- (3, 1.5) -- (3, 0.5) -- (0, 1) -- cycle;
\node[left, align=right] at (0, 1.25) {\textbf{Chip Fabrication}\\(Embodied Carbon)};
@@ -1464,7 +1464,7 @@ share_clean_pct_str = EmbodiedCarbonAmort.share_clean_pct_str
```
::: {.callout-perspective title="Embodied Carbon Amortization"}
**The Hidden Cost**: When you rent a GPU for an hour, you are not just paying for electricity; you are "paying off" the carbon debt of its manufacturing.
A hidden cost emerges when renting a GPU for an hour: the hourly fee pays not only for electricity but also amortizes a share of the carbon debt incurred during manufacturing.
**Formula**:
$$ C_{total} = C_{operational} + \left( \frac{C_{manufacturing}}{T_{lifetime}} \times T_{job} \right) $$
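The formula translates directly into code. A minimal sketch, where the manufacturing carbon, lifetime, and job duration are assumed illustrative figures rather than measured values:

```python
def total_carbon(operational_kg: float,
                 manufacturing_kg: float,
                 lifetime_hours: float,
                 job_hours: float) -> float:
    """C_total = C_operational + (C_manufacturing / T_lifetime) * T_job."""
    return operational_kg + (manufacturing_kg / lifetime_hours) * job_hours

# Assumed figures: a 24-hour job on an accelerator carrying 150 kg CO2
# of embodied manufacturing emissions, amortized over a 5-year
# (43,800 h) service life, plus 12 kg CO2 of operational emissions.
print(round(total_carbon(12.0, 150.0, 43_800.0, 24.0), 2))  # 12.08
```

Note the lever this exposes: extending $T_{lifetime}$ shrinks the amortized share of every job, which is why hardware longevity is itself a sustainability intervention.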
@@ -2195,10 +2195,10 @@ These power budgets reflect the physical constraints of battery capacity, therma
| **Platform Category** | **Idle Power** | **Active Power** | **Peak Power** | **Example Devices** |
|:-----------------------|---------------:|-----------------:|---------------:|:--------------------------------------------------|
| **TinyML (MCU)** | 1--100 \mu W | 1--50 mW | 100 mW | Arduino Nano 33, STM32H7, Nordic nRF5340 |
| **Mobile NPU** | 10--100 mW | 0.5--5 W | 10 W | Pixel Tensor, Apple Neural Engine, Snapdragon NPU |
| **Edge GPU/TPU** | 1--5 W | 5--30 W | 75 W | NVIDIA Jetson Orin, Google Edge TPU, RPi AI Kit |
| **Autonomous Vehicle** | 10--50 W | 50--200 W | 500 W | Tesla FSD Computer, Mobileye EyeQ, NVIDIA Drive |
: **Edge AI Power Budget Categories**: Edge platforms span five orders of magnitude in power consumption, from sub-milliwatt TinyML systems to automotive compute platforms approaching datacenter power levels. Sustainable deployment requires matching workload requirements to appropriate power tiers. {#tbl-edge-power-budgets}
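Matching a workload to a tier reduces to comparing its sustained draw against the active-power ceilings above. A hedged sketch, with the ceilings transcribed from the table and the tier-selection logic an illustrative assumption:

```python
# Select the smallest platform tier whose active-power ceiling covers
# a workload's sustained draw. Ceilings (in watts) follow the table's
# upper active-power bounds.
TIERS = [  # (tier name, max sustained/active power in W)
    ("TinyML (MCU)", 0.05),
    ("Mobile NPU", 5.0),
    ("Edge GPU/TPU", 30.0),
    ("Autonomous Vehicle", 200.0),
]

def smallest_tier(sustained_watts: float) -> str:
    for name, ceiling in TIERS:
        if sustained_watts <= ceiling:
            return name
    return "Datacenter-class (exceeds edge power budgets)"

print(smallest_tier(0.01))  # TinyML (MCU)
print(smallest_tier(12.0))  # Edge GPU/TPU
```

Oversizing (running a milliwatt-scale workload on a watt-scale platform) wastes the idle-power gap between tiers, which the table shows can span several orders of magnitude.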
@@ -2758,7 +2758,7 @@ Every model optimization and efficiency technique is not just a performance opti
These optimization techniques represent a direct bridge between performance engineering and environmental responsibility. When we optimize a model to run faster or use less memory, we simultaneously reduce its carbon footprint. When we design efficient architectures or implement hardware-software co-design, we create systems that are both high-performing and environmentally sustainable.
This connection reveals a fundamental insight: **sustainable AI is not separate from efficient AI; it is efficient AI**. The same engineering principles that enable systems to scale, perform better, and cost less to operate also make them more environmentally responsible. Understanding this relationship transforms sustainability from an additional constraint into an integral part of good systems engineering.
This connection reveals a fundamental insight: **sustainable AI engineering is the same discipline as efficient AI engineering**. The same engineering principles that enable systems to scale, perform better, and cost less to operate also make them more environmentally responsible. Understanding this relationship transforms sustainability from an additional constraint into an integral part of good systems engineering.
:::
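The coupling between efficiency and emissions is quantifiable: at fixed grid intensity, any reduction in energy per inference cuts carbon one-for-one. A minimal sketch, with the deployment volume, per-inference energies, and grid intensity all assumed for illustration:

```python
# Convert an efficiency gain into avoided emissions.
# 1 kWh = 3.6e6 J; grid intensity in gCO2/kWh.
def annual_co2_kg(energy_j_per_inference: float,
                  inferences_per_year: float,
                  grid_g_co2_per_kwh: float) -> float:
    kwh = energy_j_per_inference * inferences_per_year / 3.6e6
    return kwh * grid_g_co2_per_kwh / 1000.0

# Assumed deployment: 1 billion inferences/year on a 400 gCO2/kWh grid.
baseline = annual_co2_kg(2.0, 1e9, 400.0)   # 2 J per inference
optimized = annual_co2_kg(0.5, 1e9, 400.0)  # 4x energy reduction
print(round(baseline - optimized, 1))  # 166.7 kg CO2 avoided per year
```

The same arithmetic scales linearly with traffic: the larger the deployment, the more an optimization pass behaves like a carbon intervention.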
### Lifecycle-Aware Development Methodologies {#sec-sustainable-ai-lifecycleaware-development-methodologies-bf40}