fix(citations): update Volume I bibliography files and add cross-references

This commit includes:
- Bibliography reformatting across all Volume I chapters
- Updated cross-references in Vol II chapters
- Added 'fpr' to codespell ignore list
- Updated symlink to point to vol1 PDF config

Changes span both volumes as part of ongoing volume restructure work.
This commit is contained in:
Vijay Janapa Reddi
2026-01-10 16:10:14 -05:00
parent 3f9b41a0a3
commit adcbed3ed3
18 changed files with 443 additions and 194 deletions

View File

@@ -33,6 +33,7 @@ socio-economic
rin
rouge
FPR
fpr
Clos
Marz
Pease

View File

@@ -1 +1 @@
config/_quarto-html.yml
config/_quarto-pdf-vol1.yml

View File

@@ -2166,7 +2166,7 @@ $$
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2
$$
Then normalize and apply learnable scale and shift:
Then normalize and apply learnable scale and shift. @eq-batchnorm-normalize shows the normalization step, and @eq-batchnorm-transform shows the subsequent affine transformation:
$$
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}
$$ {#eq-batchnorm-normalize}
@@ -2205,6 +2205,8 @@ $$
\mu_L = \frac{1}{H}\sum_{i=1}^{H} x_i \qquad \sigma_L^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu_L)^2
$$
The complete layer normalization operation is given by @eq-layernorm:
$$
\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}
$$ {#eq-layernorm}
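A quick numerical check of @eq-layernorm in plain Python (hypothetical input values; with $\boldsymbol{\gamma}=\mathbf{1}$ and $\boldsymbol{\beta}=\mathbf{0}$ the operation reduces to standardization over the feature dimension):

```{.python}
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # mu_L and sigma_L^2 are computed over the feature dimension H
    H = len(x)
    mu = sum(x) / H
    var = sum((xi - mu) ** 2 for xi in x) / H
    # normalize, then apply elementwise scale (gamma) and shift (beta)
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
# output has (approximately) zero mean and unit variance
```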
@@ -2215,7 +2217,7 @@ This architectural difference explains why Transformers universally adopt layer
**Comparative Analysis: When to Use Each Variant**
The choice between normalization variants depends on the computational context:
The choice between normalization variants depends on the computational context, as summarized in @tbl-normalization-comparison:
+---------------------------+---------------------------+--------------------------+-----------------------+
| **Characteristic** | **BatchNorm** | **LayerNorm** | **RMSNorm** |

View File

@@ -2745,7 +2745,7 @@ Transfer overhead depends on data size and interconnect bandwidth. For a 1000x10
: Transfer overhead for 4 MB tensor across different interconnects. NVLink bandwidth is bidirectional (300 GB/s per direction). PCIe transfers are significantly slower than on-device memory access, making device placement critical for performance. {#tbl-device-transfer-overhead}
These numbers demonstrate why keeping data on-device is essential. A simple model forward pass might take 0.5 ms on GPU, but transferring inputs and outputs over PCIe 3.0 adds 0.5 ms overhead, doubling total latency. For small batches or lightweight models, transfer overhead can exceed computation time entirely.
@tbl-device-transfer-overhead demonstrates why keeping data on-device is essential. A simple model forward pass might take 0.5 ms on GPU, but transferring inputs and outputs over PCIe 3.0 adds 0.5 ms overhead, doubling total latency. For small batches or lightweight models, transfer overhead can exceed computation time entirely.
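The arithmetic behind these figures is direct; the sketch below assumes PCIe 3.0 x16 at roughly 16 GB/s and uses the per-direction NVLink bandwidth stated in the table caption:

```{.python}
def transfer_ms(size_bytes, bandwidth_gb_s):
    # time = data volume / interconnect bandwidth (ignores latency setup cost)
    return size_bytes / (bandwidth_gb_s * 1e9) * 1e3

tensor_bytes = 1000 * 1000 * 4            # 1000x1000 FP32 tensor = 4 MB
pcie3 = transfer_ms(tensor_bytes, 16)     # PCIe 3.0 x16, ~16 GB/s (assumed)
nvlink = transfer_ms(tensor_bytes, 300)   # NVLink, 300 GB/s per direction
round_trip = 2 * pcie3                    # inputs in + outputs out
# ~0.25 ms each way over PCIe 3.0, so ~0.5 ms round trip, matching the
# overhead figure quoted above for a 0.5 ms forward pass
```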
Profiling tools reveal transfer bottlenecks. PyTorch's profiler captures CPU-GPU transfers:
@@ -2809,7 +2809,7 @@ Nsight Compute reports metrics that explain why kernels achieve their observed p
: Key Nsight Compute metrics for ML kernel optimization. Low values indicate specific optimization opportunities. {#tbl-nsight-metrics}
The combination of both tools follows a standard optimization workflow: use Nsight Systems to identify which kernels or operations dominate runtime, then use Nsight Compute to understand why those specific kernels underperform. This two-level approach prevents optimizing the wrong operations (improving a kernel that consumes 1% of runtime) and provides actionable guidance for the kernels that matter.
@tbl-nsight-metrics summarizes the key metrics for kernel optimization. The combination of both tools follows a standard optimization workflow: use Nsight Systems to identify which kernels or operations dominate runtime, then use Nsight Compute to understand why those specific kernels underperform. This two-level approach prevents optimizing the wrong operations (improving a kernel that consumes 1% of runtime) and provides actionable guidance for the kernels that matter.
#### Domain-Specific Data Organizations {#sec-ai-frameworks-domainspecific-data-organizations-ef92}
@@ -4047,7 +4047,7 @@ model.eval() # Sets self.training = False for all modules
```
:::
Parameter freezing provides fine-grained control over which weights update during training. Setting `requires_grad=False` on specific parameters excludes them from gradient computation, effectively freezing those weights. This technique enables transfer learning workflows where pretrained feature extractors remain fixed while newly initialized classification layers train on target datasets. The implementation achieves computational savings by excluding frozen parameters from backward pass computation.
Parameter freezing provides fine-grained control over which weights update during training. Setting `requires_grad=False` on specific parameters excludes them from gradient computation, effectively freezing those weights. This technique enables transfer learning workflows where pretrained feature extractors remain fixed while newly initialized classification layers train on target datasets, as demonstrated in @lst-parameter_freezing. The implementation achieves computational savings by excluding frozen parameters from backward pass computation.
::: {#lst-parameter_freezing lst-cap="**Parameter Freezing**: Demonstrates selective parameter freezing for transfer learning, where pretrained layers remain fixed while new layers train."}
```{.python}

View File

@@ -1188,7 +1188,7 @@
journal = {J. Mach. Learn. Res.},
booktitle = {Journal of Machine Learning Research},
volume = {18},
pages = {185:1185:52},
pages = {185:1--185:52},
url = {https://jmlr.org/papers/v18/16-558.html},
source = {DBLP},
}

View File

@@ -378,11 +378,14 @@
% Continuous Batching and LLM Serving
@inproceedings{yu2022orca,
title = {Orca: A Distributed Serving System for Transformer-Based Generative Models},
title = {Orca: A Distributed Serving System for Transformer-Based Generative Models.},
author = {Yu, Gyeong-In and Jeong, Joo Seong and Kim, Geon-Woo and Kim, Soojeong and Chun, Byung-Gon},
year = {2022},
journal = {OSDI},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
pages = {521--538},
url = {https://www.usenix.org/conference/osdi22/presentation/yu},
source = {DBLP},
organization = {USENIX Association},
}

View File

@@ -228,7 +228,7 @@ This equation reveals why serving systems exhibit nonlinear behavior. At 50% uti
: **Utilization-Latency Relationship**: Average wait time as a multiple of service time for an M/M/1 queue. The nonlinear relationship explains why systems that perform well at moderate load can suddenly violate SLOs when traffic increases. {#tbl-utilization-latency}
The M/M/1 model assumes exponentially distributed service times, but ML inference typically has near-constant service time for fixed batch sizes, making the M/D/1 (deterministic service) model more accurate in practice. We use M/M/1 here because it yields closed-form solutions and produces conservative estimates. For M/D/1 queues, average wait time is approximately half of M/M/1 at the same utilization, which matters for capacity planning: M/M/1 analysis will slightly over-provision, erring on the side of meeting SLOs rather than violating them.[^fn-queuing-models]
The M/M/1 model assumes exponentially distributed service times, but ML inference typically has near-constant service time for fixed batch sizes, making the M/D/1 (deterministic service) model more accurate in practice. We use M/M/1 here because it yields closed-form solutions and produces conservative estimates. As @tbl-utilization-latency illustrates, average wait time grows rapidly as utilization approaches 100%. For M/D/1 queues, average wait time is approximately half of M/M/1 at the same utilization, which matters for capacity planning: M/M/1 analysis will slightly over-provision, erring on the side of meeting SLOs rather than violating them.[^fn-queuing-models]
[^fn-queuing-models]: **Queuing Model Assumptions**: The M/M/1 model assumes Poisson arrivals and exponential service times. ML inference typically has near-constant service time for fixed batch sizes, making M/D/1 (deterministic service) more accurate. We use M/M/1 because it produces conservative estimates and closed-form solutions, erring on the side of meeting SLOs. For deeper treatment including multi-server models, see Harchol-Balter's *Performance Modeling and Design of Computer Systems* [@harchol2013performance].
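These closed-form relationships can be computed directly (mean queueing delay $W_q = \frac{\rho}{1-\rho} S$ for M/M/1; the M/D/1 value is exactly half; service time below is illustrative):

```{.python}
def mm1_wait(rho, service_time):
    # mean time waiting in queue for an M/M/1 system at utilization rho
    return rho / (1.0 - rho) * service_time

def md1_wait(rho, service_time):
    # deterministic service times halve the expected wait
    return mm1_wait(rho, service_time) / 2.0

S = 10.0  # ms per request (hypothetical)
waits = {rho: mm1_wait(rho, S) for rho in (0.5, 0.8, 0.9, 0.99)}
# at 50% utilization the wait equals one service time (10 ms);
# at 99% it is 99 service times -- the nonlinear blow-up near saturation
```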
@@ -523,7 +523,7 @@ where $L_{\text{wait}}$ is the time spent waiting in the batching queue and $L_{
**Queue Waiting Time Analysis**
For Poisson arrivals with rate $\lambda$ and batching window $T$, requests arrive uniformly within the window. A request arriving at time $t$ within the window waits $T - t$ for the batch to close. The average wait time is half the window:
For Poisson arrivals with rate $\lambda$ and batching window $T$, requests arrive uniformly within the window. A request arriving at time $t$ within the window waits $T - t$ for the batch to close. The average wait time, shown in @eq-avg-wait, is half the window:
$$E[L_{\text{wait}}] = \frac{T}{2}$$ {#eq-avg-wait}
@@ -531,7 +531,7 @@ This simple relationship has profound implications. A 20ms batching window adds
**Batch Size Distribution**
The number of requests collected during window $T$ follows a Poisson distribution with mean $\lambda T$:
The number of requests collected during window $T$ follows a Poisson distribution with mean $\lambda T$, as expressed in @eq-batch-distribution:
$$P(\text{batch size} = k) = \frac{(\lambda T)^k e^{-\lambda T}}{k!}$$ {#eq-batch-distribution}
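@eq-batch-distribution can be evaluated directly; the arrival rate and window below are hypothetical:

```{.python}
import math

def batch_size_pmf(k, lam, T):
    # P(batch size = k) for Poisson arrivals at rate lam over window T
    mean = lam * T
    return mean ** k * math.exp(-mean) / math.factorial(k)

lam, T = 1.0, 8.0  # e.g., 1 request/ms with an 8 ms window -> mean 8
probs = [batch_size_pmf(k, lam, T) for k in range(50)]
mean_k = sum(k * p for k, p in enumerate(probs))
# the pmf sums to ~1 over this range, and its mean is lam * T = 8
```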
@@ -550,7 +550,7 @@ This distribution reveals batch size variability. @tbl-batch-variability shows h
**Throughput Maximization Strategy**
Throughput optimization requires maximizing the number of requests processed per unit time. For a system with service time $S(b)$ for batch size $b$, throughput follows:
Throughput optimization requires maximizing the number of requests processed per unit time. For a system with service time $S(b)$ for batch size $b$, throughput follows @eq-batch-throughput:
$$\text{Throughput}(b) = \frac{b}{T + S(b)}$$ {#eq-batch-throughput}
@@ -572,7 +572,7 @@ For ResNet-50 on a V100 GPU, service time scales as $S(b) = 5\text{ms} + 0.6b$ (
**Latency-Constrained Optimization**
When latency SLOs provide the binding constraint, the optimization problem becomes finding the maximum batch size that meets the SLO. For SLO $L_{\text{SLO}}$ and average wait time $T/2$:
When latency SLOs provide the binding constraint, the optimization problem becomes finding the maximum batch size that meets the SLO. For SLO $L_{\text{SLO}}$ and average wait time $T/2$, @eq-latency-constrained-batch defines the maximum allowable batch size:
$$b_{\text{max}} = \max\{b : \frac{T}{2} + S(b) \leq L_{\text{SLO}}\}$$ {#eq-latency-constrained-batch}
@@ -590,11 +590,11 @@ Consider a 50ms p95 latency SLO for ResNet-50 serving:
- Maximum batch size: 48
- Achieved throughput: ~1,280 img/s (batch=48)
The aggressive window achieves only 12% higher throughput but increases average latency by 10ms and p99 latency by 25ms. For latency-sensitive applications, the conservative window provides better user experience at modest throughput cost.
The aggressive window achieves only 12% higher throughput but increases average latency by 10ms and p99 latency by 25ms. @tbl-batching-throughput summarizes these trade-offs across different batch sizes. For latency-sensitive applications, the conservative window provides better user experience at modest throughput cost.
**SLO Violation Analysis**
Batch size variability causes SLO violations even when mean latency appears safe. The p99 latency includes both worst-case wait time (full window) and worst-case batch size (governed by Poisson tail):
Batch size variability causes SLO violations even when mean latency appears safe. The p99 latency includes both worst-case wait time (full window) and worst-case batch size (governed by Poisson tail). @eq-p99-batch-latency captures this relationship:
$$L_{p99} \approx T + S(b_{p99})$$ {#eq-p99-batch-latency}
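The batching equations above can be exercised with the service-time model quoted for ResNet-50 ($S(b) = 5\text{ms} + 0.6b$). The 10 ms window and 50 ms SLO below are illustrative, so the resulting numbers are a sketch under this simple model rather than the measured values cited in the text:

```{.python}
def S(b):
    # service-time model from the text: 5 ms fixed cost + 0.6 ms per image
    return 5.0 + 0.6 * b

def throughput(b, T):
    # requests per ms: b / (window + service time)   (eq-batch-throughput)
    return b / (T + S(b))

def b_max(T, slo_ms):
    # largest batch whose average latency T/2 + S(b) still meets the SLO
    b = 0
    while T / 2 + S(b + 1) <= slo_ms:
        b += 1
    return b

T, slo = 10.0, 50.0        # 10 ms window, 50 ms SLO (illustrative)
b = b_max(T, slo)          # 66 under this simple model
imgs_per_s = throughput(b, T) * 1000
p99 = T + S(b)             # full window + service   (eq-p99-batch-latency)
```

Note that $L_{p99} \approx 54.6$ ms here exceeds the 50 ms SLO even though the average latency constraint is satisfied, which is exactly the violation mechanism described above.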

View File

@@ -2613,7 +2613,7 @@ The key steps in gradient accumulation are:
4. Repeat steps 1-3 for all micro-batches in the effective batch.
5. Update the model parameters using the accumulated gradients after all micro-batches are processed.
**Mathematical Equivalence**: The key insight is that gradient accumulation produces mathematically identical results to training with larger batches. For an effective batch size $B = k \times b$ where $k$ is the number of accumulation steps and $b$ is the micro-batch size, the accumulated gradient equals the true batch gradient:
**Mathematical Equivalence**: The key insight is that gradient accumulation produces mathematically identical results to training with larger batches. For an effective batch size $B = k \times b$ where $k$ is the number of accumulation steps and $b$ is the micro-batch size, the accumulated gradient equals the true batch gradient, as shown in @eq-gradient-accumulation-equivalence:
$$
\nabla L_B = \frac{1}{B}\sum_{i=1}^{B} \nabla L_i = \frac{1}{k}\sum_{j=1}^{k}\left(\frac{1}{b}\sum_{i \in \text{batch}_j} \nabla L_i\right)
$$ {#eq-gradient-accumulation-equivalence}
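This equivalence is easy to verify numerically; the sketch below models per-example gradients as scalars (illustrative values):

```{.python}
import random

random.seed(0)
B, k = 32, 4                    # effective batch B = k micro-batches of size b
b = B // k
grads = [random.gauss(0.0, 1.0) for _ in range(B)]  # per-example gradients

full_batch = sum(grads) / B     # (1/B) * sum of all per-example gradients
accumulated = sum(              # (1/k) * sum of micro-batch means
    sum(grads[j * b:(j + 1) * b]) / b for j in range(k)
) / k
# the two averages agree up to floating-point rounding
```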
@@ -2627,7 +2627,7 @@ This equivalence holds because gradients are linear operators. The right-hand si
- **Computation**: Unchanged total FLOPs, as all $B$ examples are still processed
- **Time**: $k$ forward and backward passes execute before each optimizer step, introducing synchronization overhead
The time overhead per accumulation step is typically 2-5%, arising from the additional synchronization and gradient buffer management. For $k$ accumulation steps with micro-batch time $T_{\text{micro}}$ and synchronization overhead $T_{\text{sync}}$, the effective time per update is:
The time overhead per accumulation step is typically 2-5%, arising from the additional synchronization and gradient buffer management. For $k$ accumulation steps with micro-batch time $T_{\text{micro}}$ and synchronization overhead $T_{\text{sync}}$, @eq-gradient-accumulation-overhead gives the effective time per update:
$$
T_{\text{effective}} = k \times T_{\text{micro}} + (k-1) \times T_{\text{sync}}
$$ {#eq-gradient-accumulation-overhead}
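Plugging representative (illustrative) numbers into @eq-gradient-accumulation-overhead:

```{.python}
def effective_update_time(k, t_micro, t_sync):
    # k forward/backward passes plus k-1 synchronization gaps per update
    return k * t_micro + (k - 1) * t_sync

t = effective_update_time(k=4, t_micro=100.0, t_sync=5.0)  # milliseconds
overhead = t / (4 * 100.0) - 1.0  # fractional overhead vs. pure compute
```

With these numbers the overhead is 3.75%, inside the 2-5% range cited above.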

View File

@@ -143,6 +143,14 @@ As the backward pass proceeds from the last layer to the first, gradients for th
If the computation time for a bucket exceeds its communication time, the network cost is effectively zero because it is fully hidden. If communication is slower, the training time becomes determined solely by network speed. Tuning bucket sizes is a key optimization: too small, and latency overhead dominates; too large, and the pipeline stalls waiting for data.
::: {.callout-note title="Figure Placeholder: Communication-Computation Overlap" collapse="true"}
```{.tikz}
% TODO: Timeline showing sequential (Compute -> Comm) vs Overlapped (Compute | Comm) execution.
\node[draw, align=center] {Pipelining Strategy\\Sequential vs Overlap};
```
**Hiding Communication Latency**. (Top) Naive execution serializes computation and communication, exposing full network latency. (Bottom) Overlapped execution pipelines gradient communication: as soon as the last layer's gradients are computed, they are sent while the GPU computes the second-to-last layer. If $T_{compute} > T_{comm}$, network cost is effectively zero.
:::
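The overlap argument can be made concrete with a toy timeline model (hypothetical per-bucket times; assumes a bucket's transfer can start once that bucket is computed and the link handles one transfer at a time):

```{.python}
def exposed_comm_time(compute_per_bucket, comm_per_bucket, n_buckets):
    # Walk the backward pass bucket by bucket. A bucket's gradients can be
    # sent once it is computed, but the link is busy until the previous
    # transfer finishes. Returns wall time spent beyond pure computation.
    t_compute_done = 0.0
    t_comm_done = 0.0
    for _ in range(n_buckets):
        t_compute_done += compute_per_bucket
        t_comm_done = max(t_comm_done, t_compute_done) + comm_per_bucket
    return t_comm_done - n_buckets * compute_per_bucket

hidden = exposed_comm_time(2.0, 1.0, 10)  # compute dominates: only the final
                                          # bucket's transfer is exposed
bound = exposed_comm_time(1.0, 2.0, 10)   # comm dominates: network sets the pace
```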
### Bandwidth-Bound versus Latency-Bound Communication {#sec-bandwidth-latency-regimes}
The $\alpha$-$\beta$ model reveals two distinct communication regimes that require different optimization strategies:
@@ -895,6 +903,14 @@ The bandwidth term matches other collectives, but the latency term is $O(N)$ rat
**Embedding Table Exchange in Recommendation Systems**
::: {.callout-note title="Figure Placeholder: AlltoAll Communication Pattern" collapse="true"}
```{.tikz}
% TODO: Diagram showing AlltoAll traffic. Each node sends different colored blocks to every other node.
\node[draw, align=center] {AlltoAll Pattern\\Personalized Exchange};
```
**AlltoAll vs AllReduce**. In AllReduce (left), all nodes compute the same global sum. In AlltoAll (right), each node sends distinct data to every other node (transposing the data matrix). This pattern is critical for recommendation systems where GPU 0 needs embeddings from GPU 1, while GPU 1 needs different embeddings from GPU 0.
:::
Recommendation models like DLRM have embedding tables that are too large for single-GPU memory. Tables are sharded across workers, and each training batch requires fetching embeddings from multiple shards.
Consider a batch with $B$ items where each item requires $E$ embedding lookups. Embeddings have dimension $D$ and are distributed across $N$ workers. Each worker first identifies which embeddings it needs from each other worker. An AlltoAll operation exchanges embedding requests containing the indices. Workers then look up the requested embeddings from their local shards. A second AlltoAll operation exchanges the embedding values back to the requesters.
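Under the uniform-sharding assumption above, the per-worker AlltoAll volume for the value exchange can be estimated directly (hypothetical DLRM-style sizes; the index exchange is omitted here as comparatively small):

```{.python}
def alltoall_bytes_per_worker(B, E, D, N, bytes_per_elem=4):
    # Each worker needs B*E embedding lookups; with uniform sharding a
    # fraction (N-1)/N of those live on remote workers, and each returned
    # embedding carries D values.
    remote_lookups = B * E * (N - 1) / N
    return remote_lookups * D * bytes_per_elem

# illustrative: 4096-item batch, 26 lookups/item, 128-dim embeddings, 8 workers
vol = alltoall_bytes_per_worker(B=4096, E=26, D=128, N=8)  # ~47.7 MB per worker
```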
@@ -961,6 +977,14 @@ $$
Each token is sent to $k$ experts. If experts are uniformly distributed, each worker sends $(N-1)/N$ of its routed tokens to other workers.
::: {.callout-note title="Figure Placeholder: MoE Token Routing" collapse="true"}
```{.tikz}
% TODO: Visual of tokens being routed from a batch to specific experts (E1, E4, etc.)
\node[draw, align=center] {MoE Routing\\Dynamic Dispatch};
```
**Mixture-of-Experts Token Routing**. Tokens in a batch are dynamically routed to specific expert networks based on gating weights. This creates a data-dependent AlltoAll communication pattern where traffic volume varies based on input data, potentially creating hotspots if "popular" experts receive too many tokens.
:::
**Load Balancing Challenge**: MoE communication is sensitive to routing decisions. If routing is unbalanced (many tokens go to few experts), some workers receive disproportionate communication while others sit idle. This creates both communication hotspots and computation imbalance.
Auxiliary load balancing losses encourage uniform routing:
@@ -1058,6 +1082,14 @@ When network bandwidth limits training throughput, reducing the volume of data t
### The Case for Gradient Compression {#sec-compression-motivation}
::: {.callout-note title="Figure Placeholder: Gradient Compression Techniques" collapse="true"}
```{.tikz}
% TODO: Visual comparison of FP32 vs INT8 Quantization vs Top-K Sparsification
\node[draw, align=center] {Compression Methods\\Quantization (Precision) vs Sparsification (Count)};
```
**Gradient Compression Strategies**. Comparison of bandwidth reduction techniques. (A) Standard FP32 transmission. (B) Quantization reduces precision (e.g., INT8), shrinking each element. (C) Sparsification (Top-K) transmits only the largest magnitude gradients and their indices, discarding the majority of values. Error feedback (D) accumulates discarded residuals to ensure eventual convergence.
:::
Gradient compression addresses the bandwidth bottleneck by reducing the size of gradient messages. The potential benefit is straightforward: if communication time is $T_{comm} = M/\beta$, halving $M$ halves communication time. However, compression introduces three costs. Compression overhead consumes CPU or GPU time to compress gradients before sending. Decompression overhead requires time to reconstruct gradients after receiving. Accuracy loss occurs because compressed gradients approximate the true gradient, potentially affecting convergence.
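A minimal sketch of Top-$k$ sparsification combined with error feedback, the pairing described in the caption above (pure Python; gradient values are illustrative):

```{.python}
def topk_with_error_feedback(grad, residual, k):
    # add back residuals discarded in earlier rounds, keep the k
    # largest-magnitude entries, and bank the rest for next time
    corrected = [g + r for g, r in zip(grad, residual)]
    ranked = sorted(range(len(corrected)),
                    key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(ranked[:k])
    sent = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual

grad = [0.5, -0.2, 0.01, 0.8, -0.05]
sent, res = topk_with_error_feedback(grad, [0.0] * 5, k=2)
# only the two largest-magnitude values (0.8 and 0.5) are transmitted;
# the discarded -0.2, 0.01, -0.05 accumulate in the residual
```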
Compression is worthwhile when:

View File

@@ -503,6 +503,14 @@ To maintain consistency across the distributed system, the gradients computed by
With 8 GPUs sharing gradients for a 100 MB model, ring all-reduce requires only 7 communication steps instead of the 56 steps needed for naive all-to-all synchronization. The ring topology creates bottlenecks where the slowest link in the ring determines the overall synchronization time, and network partitions can halt the entire training process. Alternative algorithms like tree-reduce achieve O(log n) latency at the cost of increased bandwidth requirements on root nodes. Modern systems often implement hierarchical topologies using high-speed links within nodes and lower-bandwidth connections between nodes to optimize these trade-offs.
::: {.callout-note title="Figure Placeholder: Collective Communication Patterns" collapse="true"}
```{.tikz}
% TODO: Compare Ring AllReduce (circular) vs Tree AllReduce (hierarchical) vs Parameter Server (star)
\node[draw, align=center] {Communication Patterns\\Ring vs Tree vs Star};
```
**Gradient Synchronization Topologies**. Visual comparison of communication patterns. (A) Parameter Server uses a central node, creating bottlenecks. (B) Ring AllReduce distributes bandwidth evenly but has O(N) latency. (C) Tree AllReduce reduces latency to O(log N) but may congest root links. (D) Hierarchical AllReduce combines intra-node NVLink and inter-node InfiniBand.
:::
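The ring algorithm described above can be sketched as a small simulation: a reduce-scatter phase followed by an all-gather phase, each taking $N-1$ steps that move $1/N$ of the data (pure Python; a real implementation overlaps these steps with backward computation):

```{.python}
def ring_allreduce(data):
    # data[w][c]: worker w's local value for chunk c (N workers, N chunks).
    # Reduce-scatter: each worker passes a growing partial sum to its right
    # neighbor, so worker w ends up owning the global sum of chunk (w+1) mod N.
    # All-gather: N-1 more steps circulate the finished chunks to everyone.
    N = len(data)
    buf = [list(row) for row in data]
    for step in range(N - 1):                  # reduce-scatter
        snap = [list(row) for row in buf]      # all sends happen "at once"
        for w in range(N):
            src = (w - 1) % N
            c = (src - step) % N
            buf[w][c] += snap[src][c]
    for step in range(N - 1):                  # all-gather
        snap = [list(row) for row in buf]
        for w in range(N):
            src = (w - 1) % N
            c = (src + 1 - step) % N
            buf[w][c] = snap[src][c]
    return buf

grads = [[float(w * 10 + c) for c in range(4)] for w in range(4)]
out = ring_allreduce(grads)
# every worker now holds identical per-chunk sums
```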
#### Synchronization Models {#sec-distributed-training-sync-models}
Distributed training systems operate under explicit synchronization models that govern when workers observe each other's updates. Understanding these models is essential for reasoning about correctness and performance.
@@ -994,6 +1002,14 @@ Column-parallel linear layers split weights along columns. For input $X$ and wei
$$Y = XW = X[W_1 | W_2] = [XW_1 | XW_2]$$
Each GPU computes its partition independently. Outputs are concatenated (no communication needed if followed by row-parallel layer).
::: {.callout-note title="Figure Placeholder: Tensor Parallelism Matrix Split" collapse="true"}
```{.tikz}
% TODO: Visual showing Matrix A split by columns, B split by rows.
\node[draw, align=center] {Megatron-LM Partitioning\\Column Split vs Row Split};
```
**Tensor Parallelism - Matrix Partitioning**. Illustration of Megatron-LM style tensor parallelism. The first linear layer (e.g., QKV) is split column-wise $[W_1 | W_2]$. The second layer (e.g., Output) is split row-wise $[W_1 ; W_2]$. This arrangement allows the output of the first layer to flow directly into the second without synchronization, requiring only one AllReduce after the second layer.
:::
Row-parallel linear layers split weights along rows. For $W = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}$:
$$Y = XW = X_1 W_1 + X_2 W_2$$
Each GPU computes a partial sum. Outputs require AllReduce to combine.
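Both partitioning identities can be verified numerically; this sketch simulates two GPUs with tiny nested-list matrices (all values illustrative):

```{.python}
def matmul(A, B):
    # naive dense matrix multiply on nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0], [3.0, 4.0]]
W = [[5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

# column-parallel: W = [W1 | W2]; each GPU computes X @ Wi, outputs concatenate
W1 = [row[:2] for row in W]
W2 = [row[2:] for row in W]
col_parallel = [a + b for a, b in zip(matmul(X, W1), matmul(X, W2))]

# row-parallel: W = [W1 ; W2], X split by columns; partial products are
# summed, which is the AllReduce in a real implementation
X1 = [row[:1] for row in X]
X2 = [row[1:] for row in X]
row_parallel = [[p + q for p, q in zip(r1, r2)]
                for r1, r2 in zip(matmul(X1, [W[0]]), matmul(X2, [W[1]]))]

assert col_parallel == matmul(X, W) == row_parallel
```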
@@ -1124,7 +1140,15 @@ The primary advantage lies in simultaneous scaling across model size and dataset
Hardware utilization improves substantially over single-strategy approaches. Pure model parallelism leaves devices idle during pipeline bubbles (20-30% of training time). Pure data parallelism leaves devices waiting during gradient synchronization (10-40% of training time). Hybrid approaches overlap these operations: while one pipeline stage computes, another synchronizes gradients within its data parallel group. Megatron-LM demonstrates 52% MFU (Model FLOPS Utilization) on 1024 A100 GPUs using 3D parallelism, compared to 35% MFU for pure data parallelism at similar scale.
Communication overhead reduces through hierarchical structuring. Tensor parallelism restricts high-frequency communication to NVLink-connected GPUs within nodes (600 GB/s bandwidth). Pipeline parallelism limits cross-node communication to activation transfers at stage boundaries (requiring 100-200 GB/s per stage). Data parallelism performs AllReduce across replica groups using InfiniBand (200 GB/s aggregate). This hierarchy matches communication patterns to available bandwidth at each level, avoiding the scenario where all communication competes for the slowest interconnect.
Communication overhead reduces through hierarchical structuring. Tensor parallelism restricts high-frequency communication to NVLink-connected GPUs within nodes (600 GB/s bandwidth). Pipeline parallelism limits cross-node communication to activation transfers at stage boundaries (requires 100-200 GB/s per stage). Data parallelism performs AllReduce across replica groups using InfiniBand (200 GB/s aggregate). This hierarchy matches communication patterns to available bandwidth at each level, avoiding the scenario where all communication competes for the slowest interconnect.
::: {.callout-note title="Figure Placeholder: Hybrid Parallelism Physical Mapping" collapse="true"}
```{.tikz}
% TODO: Diagram showing a cluster of nodes. Tensor Parallel within Node (NVLink). Pipeline Parallel across groups of Nodes. Data Parallel across replicas.
\node[draw, align=center] {Hybrid Parallelism Mapping\\TP(Intra) -> PP(Inter) -> DP(Replicas)};
```
**3D Parallelism Physical Mapping**. How Hybrid Parallelism maps to hardware topology. Tensor Parallelism uses high-bandwidth NVLink within a node. Pipeline Parallelism connects adjacent nodes via high-speed InfiniBand. Data Parallelism replicas can be placed anywhere, synchronizing gradients via the inter-node network.
:::
Hybrid parallelism enables training at scales that would otherwise be impossible. GPT-4 training required an estimated 10,000+ H100 GPUs for months. Without hybrid approaches, memory constraints would limit training to smaller models, throughput constraints would extend training time beyond practical limits, and communication overhead would waste the majority of available compute. The combination of tensor, pipeline, and data parallelism transforms these theoretical capabilities into practical training systems.

View File

@@ -1204,9 +1204,13 @@ class Adapter(nn.Module):
def __init__(self, dim, bottleneck_dim):
super().__init__()
# Project from full dimension to bottleneck (e.g., 768 -> 16)
self.down = nn.Linear(dim, bottleneck_dim) # W1: learns compression
self.down = nn.Linear(
dim, bottleneck_dim
) # W1: learns compression
# Project back to original dimension (e.g., 16 -> 768)
self.up = nn.Linear(bottleneck_dim, dim) # W2: learns expansion
self.up = nn.Linear(
bottleneck_dim, dim
) # W2: learns expansion
self.activation = nn.ReLU()
def forward(self, x):
@@ -1285,7 +1289,7 @@ This pattern can be extended with profiling logic to select layers based on cont
for name, param in model.named_parameters():
if "conv2" in name or "fc" in name:
param.requires_grad = True # Train high-impact layers
param.requires_grad = True # Train high-impact layers
else:
param.requires_grad = False # Freeze low-impact layers
@@ -1404,7 +1408,10 @@ class ReplayBuffer:
if len(self.buffer) < self.capacity:
self.buffer.append((feature_vec, label)) # Fill phase
else:
self.buffer[self.index] = (feature_vec, label) # Overwrite oldest
self.buffer[self.index] = (
feature_vec,
label,
) # Overwrite oldest
self.index = (self.index + 1) % self.capacity # Wrap around
def sample(self, k):
@@ -1962,6 +1969,18 @@ Here, $n_k$ is the number of local training examples at client $k$, and $n_{\mat
However, communication-efficient updates can introduce tradeoffs. Compression may degrade gradient fidelity, selective updates can limit model capacity, and split architectures may complicate coordination. As a result, effective federated learning requires careful balancing of bandwidth constraints, privacy concerns, and convergence dynamics—a balance that depends heavily on the capabilities and variability of the client population.
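The weighted aggregation referenced here (FedAvg-style, weighting each client's update by its share of the total examples) can be sketched in a few lines with hypothetical client sizes:

```{.python}
def fedavg(client_updates, client_sizes):
    # weight each client's parameter vector by n_k / n_total
    n_total = sum(client_sizes)
    dim = len(client_updates[0])
    return [sum(u[i] * n / n_total
                for u, n in zip(client_updates, client_sizes))
            for i in range(dim)]

updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2-parameter toy model
sizes = [100, 300, 600]                         # n_k per client (hypothetical)
global_update = fedavg(updates, sizes)
# clients with more local data pull the global update toward their values
```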
::: {.callout-note title="Figure Placeholder: Gradient Compression Techniques" collapse="true"}
```{.tikz}
% TODO: Create a 3-part comparison diagram showing standard vs. compressed updates
% Part 1: Standard Update (Full dense gradient vector, e.g., [0.5, -0.2, 0.01, 0.8])
% Part 2: Quantization (Map values to low-precision buckets, e.g., FP32 -> INT8 bins)
% Part 3: Sparsification (Top-k selection, only sending indices and values of magnitudes > threshold)
% Visual style: "Data packets" moving from Phone to Server with size reduction emphasized.
\node[draw, align=center] {Gradient Compression Visualizer};
```
**Gradient Compression Techniques**. Strategies to reduce communication overhead in federated learning. **(a)** Standard full-precision updates transmit all gradient values. **(b)** Quantization maps values to lower-precision discrete buckets (e.g., INT8), reducing bit-width. **(c)** Sparsification (Top-$k$) transmits only the most significant gradients, capitalizing on the observation that many updates are near-zero and redundant.
:::
#### Federated Personalization {#sec-edge-intelligence-federated-personalization-3c73}
While compression and communication strategies improve scalability, they do not address an important limitation of the global federated learning paradigm: its inability to capture user-specific variation. In real-world deployments, devices often observe distinct and heterogeneous data distributions. A one-size-fits-all global model may underperform when applied uniformly across diverse users. This motivates the need for personalized federated learning, where local models are adapted to user-specific data without compromising the benefits of global coordination.
@@ -2013,6 +2032,18 @@ Each of these strategies reflects a different point in the tradeoff space. These
Selecting the appropriate personalization method depends on deployment constraints, data characteristics, and the desired balance between accuracy, privacy, and computational efficiency. In practice, hybrid approaches that combine elements of multiple strategies, including local finetuning atop a personalized head, are often employed to achieve robust performance across heterogeneous devices.
::: {.callout-note title="Figure Placeholder: Federated Personalization Architectures" collapse="true"}
```{.tikz}
% TODO: Diagram illustrating three personalization depths:
% 1. Full Model Adaptation: All weights updated (High compute, high personalization)
% 2. Head-Only Adaptation: Backbone frozen, only classifier head updated (Low compute)
% 3. Adapter-Based (PEFT): Small adapter modules inserted between frozen layers (Efficient)
% Use color coding: Grey=Frozen, Blue=Trainable.
\node[draw, align=center] {Personalization Architectures};
```
**Federated Personalization Architectures**. Architectural strategies for adapting global models to local data. **(a) Full Fine-tuning** updates all model parameters, offering maximum expressivity but high compute cost. **(b) Head-Only Adaptation** updates only the final classifier layers while keeping the feature extractor frozen, suitable for resource-constrained devices. **(c) Adapter-Based Learning** (e.g., LoRA) inserts small trainable modules into a frozen backbone, balancing efficiency with the ability to adapt deep representations.
:::
#### Federated Privacy {#sec-edge-intelligence-federated-privacy-a1ed}
While federated learning is often motivated by privacy concerns, as it involves keeping raw data localized instead of transmitting it to a central server, the paradigm introduces its own set of security and privacy risks. Although devices do not share their raw data, the transmitted model updates (such as gradients or weight changes) can inadvertently leak information about the underlying private data. Techniques such as model inversion attacks[^fn-model-inversion] and membership inference attacks[^fn-membership-inference] demonstrate that adversaries may partially reconstruct or infer properties of local datasets by analyzing these updates.
@@ -2027,6 +2058,18 @@ To mitigate such risks, modern federated ML systems commonly employ protective m
While these techniques enhance privacy, they introduce additional system complexity and tradeoffs between model utility, communication cost, and robustness. A deeper exploration of these attacks, defenses, and their implications requires dedicated coverage of security principles in distributed ML systems.
::: {.callout-note title="Figure Placeholder: Secure Aggregation Protocol" collapse="true"}
```{.tikz}
% TODO: Conceptual diagram of the Secure Aggregation protocol (Secret Sharing)
% Step 1: Client A and Client B generate shared secret pair (Mask AB)
% Step 2: Client A adds Mask AB to update; Client B subtracts Mask AB from update
% Step 3: Server sees (Update A + Mask) and (Update B - Mask).
% Step 4: Summing them cancels the mask, revealing (Update A + Update B) without seeing individual updates.
\node[draw, align=center] {Secure Aggregation Logic};
```
**Secure Aggregation Protocol**. A simplified view of how cryptographic masking protects individual updates. Pairs of clients agree on shared random masks that are added by one and subtracted by the other. The central server sums the masked updates; the masks mathematically cancel out in the aggregate, revealing the global update sum without ever exposing the raw value of any single client's contribution.
:::
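The mask-cancellation arithmetic at the heart of secure aggregation can be illustrated with a toy sketch. This uses scalar updates and locally generated random masks; real protocols derive pairwise masks from key exchange and must also handle client dropouts:

```python
import random

def pairwise_masked_updates(updates, modulus=2**31):
    """Each pair of clients (i, j) with i < j shares one random mask:
    client i adds it, client j subtracts it. Individual masked values
    look random, but all masks cancel in the modular sum."""
    masked = list(updates)
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            mask = random.randrange(modulus)
            masked[i] = (masked[i] + mask) % modulus
            masked[j] = (masked[j] - mask) % modulus
    return masked

clients = [12, 7, 30]                  # private scalar updates (toy example)
masked = pairwise_masked_updates(clients)
# The server only ever sees `masked`; summing recovers the aggregate.
assert sum(masked) % 2**31 == sum(clients) % 2**31
```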
### Large-Scale Device Orchestration {#sec-edge-intelligence-largescale-device-orchestration-1360}
Federated learning transforms machine learning into a massive distributed systems challenge that extends far beyond traditional algorithmic considerations. Coordinating thousands or millions of heterogeneous devices with intermittent connectivity requires sophisticated distributed systems protocols that handle Byzantine failures, network partitions, and communication efficiency at unprecedented scale. These challenges fundamentally differ from the controlled environments of data center distributed training, where high-bandwidth networks and reliable infrastructure enable straightforward coordination protocols.
@@ -2123,6 +2166,19 @@ Federated A/B testing enables validation of new adaptation strategies or model a
These operational transformations necessitate new tooling and infrastructure that systematically extends traditional MLOps practices. The CI/CD pipelines, monitoring dashboards, A/B testing frameworks, and incident response procedures established for centralized deployments form the foundation for on-device learning operations. The federated learning protocols (@sec-edge-intelligence-federated-learning-6e7e) provide coordination mechanisms for distributed training, while monitoring challenges (@sec-edge-intelligence-distributed-system-observability-2270) address observability gaps created by decentralized adaptation.
::: {.callout-note title="Figure Placeholder: Shadow Model Drift Detection" collapse="true"}
```{.tikz}
% TODO: Architecture diagram for On-Device Validation with Shadow Modes
% Component 1: Input Data Stream (User interaction)
% Component 2: Shadow Model (Frozen Baseline, Version N) - processes input
% Component 3: Active Model (Locally Adapted, Version N+1) - processes input
% Component 4: Comparator/Arbiter - compares confidence/output. Sits between models and UI.
% Logic: If Active Model confidence < Shadow Model - Margin => Rollback or Warning.
\node[draw, align=center] {Shadow Mode Validation};
```
**On-Device Shadow Validation**. To detect model drift without ground truth labels, a reliable "Shadow Model" (the frozen, known-good base version) runs in parallel with the "Active Model" (the locally adapted version). An on-device arbiter compares their predictions and confidence scores on the same live data. If the adapted model consistently diverges or shows lower confidence than the baseline, the system detects "Personalization Drift" and can trigger a rollback to the safe shadow version.
:::
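The arbiter logic described above reduces to a simple statistical comparison. A minimal sketch (the margin, window, and tolerance values are illustrative assumptions, not production settings):

```python
def detect_personalization_drift(shadow_conf, active_conf,
                                 margin=0.05, window=20, tolerance=0.5):
    """Flag drift when the adapted model's confidence trails the frozen
    baseline by more than `margin` on most recent inputs."""
    recent = list(zip(shadow_conf, active_conf))[-window:]
    worse = sum(1 for s, a in recent if a < s - margin)
    return worse / len(recent) > tolerance

shadow = [0.9] * 10     # frozen baseline confidences on live data
healthy = [0.88] * 10   # within margin: personalization looks fine
drifted = [0.70] * 10   # consistently worse: trigger rollback

assert detect_personalization_drift(shadow, healthy) is False
assert detect_personalization_drift(shadow, drifted) is True
```

A real deployment would also smooth over input distribution shifts before triggering a rollback, since both models may legitimately lose confidence on novel data.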
Successful on-device learning deployments build upon proven MLOps methodologies while adapting them to the unique challenges of distributed, heterogeneous learning environments. This evolutionary approach ensures operational reliability while enabling the benefits of edge learning.
### Bio-Inspired Learning Efficiency {#sec-edge-intelligence-bioinspired-learning-efficiency-55ad}
@@ -2258,6 +2314,18 @@ Effective systems integration requires adherence to key engineering principles t
This integrated approach transforms on-device learning from a collection of techniques into a coherent systems capability that provides robust personalization within real-world deployment constraints.
::: {.callout-note title="Figure Placeholder: Tiered On-Device Adaptation" collapse="true"}
```{.tikz}
% TODO: A layered pyramid or device-tiering diagram showing the strategy from the Voice Assistant case study.
% Top Tier (Flagship Phones): "LoRA Rank-32 Adapters" + "Full Replay Buffer"
% Middle Tier (Mid-range): "LoRA Rank-16 Adapters" + "Compact Replay"
% Bottom Tier (IoT/Budget): "Bias-Only Updates" + "No/Tiny Buffer"
% Context: Uniform Base Model -> Device-Specific Adaptation Strategy
\node[draw, align=center] {Tiered Adaptation Strategy};
```
**Tiered Adaptation Hierarchy**. A production strategy for managing device heterogeneity. Instead of a single training configuration, the system assigns adaptation strategies based on device capabilities. Flagship devices support high-rank adapters and large replay buffers for maximum personalization. Mid-range devices use efficient low-rank adapters. Budget/IoT devices are restricted to lightweight bias-only updates, ensuring every class of device improves without exceeding its specific hardware constraints.
:::
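The tiering strategy amounts to a capability-to-policy mapping. A sketch with hypothetical thresholds (real tiering criteria depend on the product and hardware fleet):

```python
# Hypothetical policy table matching the tiers described above.
TIER_POLICIES = {
    "flagship":  {"adapter": "lora", "rank": 32, "replay_buffer": 512},
    "mid_range": {"adapter": "lora", "rank": 16, "replay_buffer": 64},
    "budget":    {"adapter": "bias_only", "rank": 0, "replay_buffer": 0},
}

def select_policy(ram_gb, has_npu):
    """Map device capabilities to an adaptation policy (illustrative cutoffs)."""
    if ram_gb >= 8 and has_npu:
        return TIER_POLICIES["flagship"]
    if ram_gb >= 4:
        return TIER_POLICIES["mid_range"]
    return TIER_POLICIES["budget"]

assert select_policy(12, True)["rank"] == 32
assert select_policy(2, False)["adapter"] == "bias_only"
```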
## Persistent Technical and Operational Challenges {#sec-edge-intelligence-persistent-technical-operational-challenges-8c12}
The solution techniques explored above—model adaptation, data efficiency, and federated coordination—address many fundamental constraints of on-device learning but also reveal persistent challenges that emerge from their interaction in real-world deployments. These challenges represent the current frontiers of on-device learning research and highlight areas where the techniques discussed earlier reach their limits or create new operational complexities. Understanding these challenges provides critical context for evaluating when on-device learning approaches are appropriate and where alternative strategies may be necessary.

View File

@@ -206,6 +206,14 @@ While fail-stop failures are relatively straightforward to handle because the co
Byzantine failures are particularly dangerous in distributed training because the standard assumption that workers compute identical gradients for identical data no longer holds. A single Byzantine worker can corrupt the averaged gradient, potentially causing training to diverge or converge to a poor solution.
::: {.callout-note title="Figure Placeholder: Fail-Stop vs Byzantine" collapse="true"}
```{.tikz}
% TODO: Visual comparison. Left: Fail-Stop (Node X's silent). Right: Byzantine (Node lies about gradient).
\node[draw, align=center] {Failure Types\\Silent Death vs Malicious Lies};
```
**Fail-Stop vs. Byzantine Failures**. In the fail-stop model (left), a failed worker simply ceases to send messages, which is easily detected by timeouts. In the Byzantine model (right), a failed worker continues to participate but sends incorrect data (e.g., corrupted gradients), which can poison the global model state unless caught by redundant validation.
:::
Detection of Byzantine failures requires redundant computation. Multiple workers computing gradients for the same data enable comparison of results. Statistical outlier detection can identify workers consistently producing anomalous gradients. These detection mechanisms add computational overhead and may not catch subtle corruption.
Byzantine-resilient distributed training algorithms exist but impose significant overhead. Algorithms such as Krum [@blanchard2017machine] and coordinate-wise trimmed mean [@yin2018byzantine] compute aggregates that are robust to a bounded number of Byzantine workers, but they require more communication and computation than simple averaging. We examine hardware faults and Byzantine failures in greater depth in @sec-robust-ai.[^fn-byzantine-ml]
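Of these, the coordinate-wise trimmed mean is simple enough to sketch directly. The version below follows the general idea of discarding extreme values per coordinate before averaging; it is an illustrative implementation, not the exact algorithm from the cited paper:

```python
def trimmed_mean(gradients, trim):
    """Coordinate-wise trimmed mean: for each dimension, drop the `trim`
    largest and `trim` smallest worker values, then average the rest."""
    n = len(gradients)
    result = []
    for d in range(len(gradients[0])):
        vals = sorted(g[d] for g in gradients)
        kept = vals[trim:n - trim]
        result.append(sum(kept) / len(kept))
    return result

# Four workers; the last one is Byzantine and reports wild gradients.
grads = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0], [100.0, -50.0]]
assert trimmed_mean(grads, trim=1) == [4.0, 2.0]   # outliers discarded per coordinate
```

Note that plain averaging of the same gradients would yield [27.25, -11.0], completely dominated by the single corrupted worker.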
@@ -239,6 +247,14 @@ Defending against correlated failures requires understanding failure domains and
: **Failure Domains in ML Infrastructure**: Understanding failure domain boundaries enables placement of redundant components across independent domains, preventing correlated failures from defeating redundancy strategies. {#tbl-failure-domains}
::: {.callout-note title="Figure Placeholder: Failure Domains Hierarchy" collapse="true"}
```{.tikz}
% TODO: Nested boxes diagram. Rack contains Nodes. Node contains GPUs. Power Domain spans Racks.
\node[draw, align=center] {Failure Domains\\Scope of Impact};
```
**Hierarchy of Failure Domains**. Failure domains are often nested or overlapping. A GPU failure affects one device. A node failure affects 8 GPUs. A rack switch failure affects 32-64 GPUs. A power distribution unit (PDU) failure may affect multiple racks. Effective fault tolerance requires placing replicas across independent domains (e.g., different racks or rows) to survive correlated failures.
:::
### The Bathtub Curve and Hardware Lifecycle {#sec-bathtub-curve}
The failure taxonomy above classifies failure types and domains, answering WHAT KIND of failures occur. Equally important for designing fault tolerance is understanding WHEN in a component's lifetime failures are most likely to occur. Hardware failure rates are not constant over component lifetime. The bathtub curve, a well-established model in reliability engineering, describes how failure rates vary across three distinct phases:
@@ -255,6 +271,14 @@ The failure taxonomy above classifies failure types and domains, answering WHAT
The practical implication for ML systems is that fleet-wide failure rates depend on age distribution. A cluster populated entirely with new GPUs will experience elevated failure rates during the first few weeks, followed by a stable period, then increasing failures as the fleet ages. Mixed-age fleets exhibit more consistent aggregate failure rates because different cohorts are in different lifecycle phases.
::: {.callout-note title="Figure Placeholder: Bathtub Curve" collapse="true"}
```{.tikz}
% TODO: Plot failure rate lambda(t) vs time. High initial, flat middle, rising end.
\node[draw, align=center] {Bathtub Curve\\Failure Rate vs Component Age};
```
**The Bathtub Curve**. Hardware failure rates $\lambda(t)$ vary over time. (1) **Infant Mortality**: High failure rate initially due to manufacturing defects. (2) **Useful Life**: Constant, low failure rate where random failures dominate. (3) **Wear-Out**: Increasing failure rate as components age. Burn-in testing aims to filter out infant mortality failures before deployment.
:::
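One common way to model a bathtub-shaped hazard is as a mixture of Weibull hazards: a decreasing term (shape < 1) for infant mortality, an increasing term (shape > 1) for wear-out, and a constant floor for random failures. The parameters below are illustrative, not fitted to real GPU fleet data:

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate lambda(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_rate(t_days):
    # Infant mortality (decreasing) + wear-out (increasing) + constant floor.
    return (weibull_hazard(t_days, shape=0.5, scale=50.0)
            + weibull_hazard(t_days, shape=5.0, scale=2000.0)
            + 1e-4)

early, mid, late = bathtub_rate(1), bathtub_rate(500), bathtub_rate(3500)
assert early > mid and late > mid   # high at both ends, low in the middle
```

Burn-in testing effectively shifts deployment past the steep early portion of this curve.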
Proactive maintenance strategies aim to replace components approaching wear-out before they fail in production. Predictive analytics using GPU telemetry can identify components likely to fail soon. Temperature trends, error counts, and performance degradation enable scheduled replacement during maintenance windows rather than unplanned outages during training runs.
### Model-Type Diversity in Failure Impact {#sec-failure-impact-diversity}
@@ -565,6 +589,14 @@ The challenge is distinguishing stragglers from failures. Stragglers should trig
### Elastic Training {#sec-elastic-training}
::: {.callout-note title="Figure Placeholder: Elastic Training Flow" collapse="true"}
```{.tikz}
% TODO: Flowchart. Training with N workers -> Failure -> Rescale to N-1 -> Continue.
\node[draw, align=center] {Elastic Training\\Dynamic Resizing Flow};
```
**Elastic Training Recovery**. Unlike static training which aborts on failure, elastic training adapts. When a worker fails, the job pauses, redistributes the dataset and model shards across the remaining $N-1$ workers, and resumes training from the last consistent state. This capability transforms hard failures into temporary throughput degradations.
:::
The checkpoint and restart mechanisms developed above share a common assumption: training operates with a fixed number of workers. When failures occur, we stop, recover state from checkpoint, and restart, ideally with replacement resources. But what if the system could adapt to failures dynamically, continuing training with fewer workers rather than stopping entirely? This alternative approach, **elastic training**, enables dynamic scaling by adding or removing workers without stopping training.
Elastic training provides several advantages. For fault tolerance, failures reduce worker count rather than stopping training. For resource efficiency, training can use variable resource allocations. For preemption handling, systems gracefully handle preemption in shared clusters. For cost optimization, systems scale based on spot instance availability.
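The core rescaling step, redistributing the global batch across surviving workers, can be sketched in a few lines (a simplified view; real elastic frameworks also re-shard optimizer state and rebuild communication groups):

```python
def rebalance(global_batch, workers):
    """Split the global batch across surviving workers; the first `rem`
    workers take one extra sample so no samples are dropped."""
    base, rem = divmod(global_batch, len(workers))
    return {w: base + (1 if i < rem else 0) for i, w in enumerate(workers)}

plan = rebalance(1024, ["w0", "w1", "w2", "w3"])
assert sum(plan.values()) == 1024                    # 256 samples per worker

# Worker w2 fails: elastic training rescales to 3 survivors and continues.
plan = rebalance(1024, ["w0", "w1", "w3"])
assert sum(plan.values()) == 1024
assert max(plan.values()) - min(plan.values()) <= 1  # near-even load
```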
@@ -957,6 +989,14 @@ Circuit breakers[^fn-circuit-breaker] prevent resource exhaustion by failing fas
Circuit breakers operate in three states. In the closed state, normal operation proceeds and requests pass through. In the open state, downstream is failing and requests fail immediately without attempting downstream call. In the half-open state, the breaker tests recovery by allowing limited requests through to test if downstream recovered.
::: {.callout-note title="Figure Placeholder: Circuit Breaker State Machine" collapse="true"}
```{.tikz}
% TODO: State machine diagram. Closed -(failure threshold)-> Open -(timeout)-> Half-Open -> (success/fail) -> Closed/Open.
\node[draw, align=center] {Circuit Breaker\\State Transition Diagram};
```
**Circuit Breaker States**. The circuit breaker protects the system from cascading failure. **Closed**: Normal operation. **Open**: Error threshold exceeded; all requests fail fast to prevent resource exhaustion. **Half-Open**: After a timeout, a limited number of requests are allowed through to probe the dependency's health. Success resets to Closed; failure returns to Open.
:::
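The three-state machine maps naturally onto a small wrapper class. A minimal sketch (production implementations add sliding failure windows, per-error-type policies, and thread safety):

```python
import time

class CircuitBreaker:
    """Closed -> Open after repeated failures; Open -> Half-Open after a
    cooldown; Half-Open -> Closed on success, back to Open on failure."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"    # cooldown elapsed: probe downstream
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0               # any success resets the breaker
        self.state = "closed"
        return result

cb = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
def flaky():
    raise ConnectionError("downstream unavailable")
for _ in range(2):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass
assert cb.state == "open"   # further calls now fail fast without touching downstream
```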
#### Graceful Degradation Implementation {#sec-degradation-implementation}
Implementing graceful degradation requires continuous monitoring of system health. Track request latency percentiles at p50, p95, and p99. Monitor error rates by error type. Measure resource utilization for CPU, GPU, and memory. Watch queue depths and wait times.

View File

@@ -218,18 +218,13 @@ To organize the optimization techniques in this chapter, we introduce the servin
**Platform level**: Optimizations across multiple services and tenants. Resource allocation, multi-tenancy, scheduling, and cluster management operate here. The target metric is overall resource efficiency while meeting diverse SLO requirements.
::: {.callout-note title="Figure Placeholder: The Serving Hierarchy" collapse="true"}
```{.tikz}
% TODO: Pyramid diagram showing the four levels of the serving hierarchy
\node[draw, align=center] {Platform Level (Multi-tenancy)\\Service Level (Load Balancing)\\Replica Level (GPU Utilization)\\Request Level (Batching)};
```
```
Platform Level [Multi-tenancy, Scheduling, Resource Allocation]
Service Level  [Load Balancing, Routing, Autoscaling]
Replica Level  [GPU Utilization, Memory Management]
Request Level  [Batching, Caching, Preprocessing]
```
**The Serving Hierarchy**. Optimization occurs at four distinct levels, each with different objectives and techniques. The Request Level focuses on minimizing per-request latency via batching and caching. The Replica Level maximizes single-instance throughput through kernel optimization and memory management. The Service Level manages distribution across multiple replicas to meet aggregate demand. The Platform Level handles efficient resource sharing across multiple services and tenants.
:::
Each level has distinct optimization levers (@tbl-serving-hierarchy):
@@ -1251,27 +1246,13 @@ After computing local attention, an AllReduce operation (covered in @sec-communi
The communication pattern for tensor-parallel inference follows:
::: {.callout-note title="Figure Placeholder: Tensor Parallelism Inference Flow" collapse="true"}
```{.tikz}
% TODO: Diagram showing column-row partitioning and AllReduce synchronization
\node[draw, align=center] {Tensor Parallelism\\Attention Heads Split -> AllReduce\\FFN Split (Column/Row) -> AllReduce};
```
```
Input activations (replicated on all devices)
Attention heads (computed in parallel, H/P heads per device)
AllReduce (combine attention outputs)
Feed-forward layer 1 (column-parallel)
Feed-forward layer 2 (row-parallel)
AllReduce (combine FF outputs)
Output activations (replicated on all devices)
```
**Tensor Parallelism for Inference**. Computation is distributed across devices by splitting tensor operations. Attention heads are partitioned across GPUs, requiring an AllReduce operation to synchronize results. Feed-forward networks use a column-row splitting strategy that requires only one AllReduce synchronization per block. This approach reduces latency for large models but introduces communication overhead that demands high-bandwidth interconnects like NVLink.
:::
The inference time with tensor parallelism follows @eq-tensor-parallel-time:
@@ -1330,15 +1311,13 @@ Pipeline parallelism distributes layers across devices sequentially, with each d
For inference, pipeline parallelism creates bubbles differently than in training:
::: {.callout-note title="Figure Placeholder: Pipeline Parallelism Bubbles" collapse="true"}
```{.tikz}
% TODO: Diagram comparing single-request latency (sequential) vs pipelined throughput
\node[draw, align=center] {Pipeline Parallelism\\Sequential processing of single request (High Latency)\\Pipelined processing of multiple requests (High Throughput)};
```
```
Device 0 (Layers 1-20):  [Prefill] ─────────────────────────────
                              \
Device 1 (Layers 21-40):       [Prefill] ──────────────────
                                    \
Device 2 (Layers 41-60):             [Prefill] ────────
                                          \
Device 3 (Layers 61-80):                   [Prefill]
```
**Pipeline Parallelism Bubbles**. For a single inference request, pipeline parallelism offers no latency benefit as the request must traverse all stages sequentially (top). However, when processing multiple concurrent requests, pipeline bubble utilization improves significantly (bottom), allowing throughput to scale with the number of stages. This makes pipeline parallelism ideal for high-throughput batch processing but less suitable for latency-critical interactive serving.
:::
For a single request, pipeline parallelism provides no latency benefit: the request must traverse all stages sequentially. The pipeline fill time equals the sequential execution time.
@@ -1386,14 +1365,13 @@ $$\text{Output} = \sum_{i \in \text{top-}k} g_i \cdot \text{Expert}_i(\text{inpu
**Expert parallelism** distributes experts across devices, with each device hosting $E/P$ experts:
::: {.callout-note title="Figure Placeholder: Mixture-of-Experts Routing" collapse="true"}
```{.tikz}
% TODO: Network diagram showing token routing to experts and AllToAll communication
\node[draw, align=center] {MoE Routing\\Token -> Gating -> AllToAll Dispatch -> Experts -> AllToAll Gather};
```
```
Token arrives with gating decision: [Expert 2, Expert 7]
Device 0 (Experts 0-3): Compute Expert 2
Device 1 (Experts 4-7): Compute Expert 7
AllToAll: Gather results back to original device
```
**Mixture-of-Experts (MoE) Routing**. Expert parallelism distributes "experts" across different devices. For each token, a gating mechanism selects the top-k experts. An AllToAll communication step dispatches tokens to the devices hosting their selected experts (1). Experts process the tokens in parallel (2). A second AllToAll step gathers the results back to the original device (3). This pattern enables massive model capacity but introduces all-to-all communication overhead.
:::
The communication pattern differs from tensor parallelism: instead of all-reduce (same data to all devices), expert parallelism uses all-to-all (different data to different devices based on routing).
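The dispatch half of that all-to-all exchange is just a grouping of tokens by destination device. A sketch with a static expert-to-device layout (one selected expert per token for simplicity; top-k routing sends each token to multiple devices):

```python
def dispatch(token_expert_ids, experts_per_device):
    """Group token indices by the device hosting each token's selected
    expert (the scatter half of the AllToAll exchange)."""
    buckets = {}
    for tok, expert in enumerate(token_expert_ids):
        device = expert // experts_per_device
        buckets.setdefault(device, []).append(tok)
    return buckets

# 8 experts spread over 2 devices (4 per device); one expert per token.
routes = [2, 7, 0, 5]              # expert chosen for tokens 0..3 by the gate
assert dispatch(routes, experts_per_device=4) == {0: [0, 2], 1: [1, 3]}
```

The gather step is the inverse mapping, returning expert outputs to each token's original device. Load imbalance arises when the gate routes many tokens to the same device's experts, which is why production MoE systems add capacity limits and auxiliary balancing losses.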
@@ -1470,6 +1448,14 @@ The choice of embedding sharding strategy depends on lookup patterns and communi
: **Embedding Sharding Strategies**: Different strategies trade off lookup locality against load balance. {#tbl-embedding-sharding}
::: {.callout-note title="Figure Placeholder: Embedding Sharding Strategies" collapse="true"}
```{.tikz}
% TODO: Visual comparison of Row-wise, Column-wise, and Hybrid sharding
\node[draw, align=center] {Embedding Sharding\\Row-wise: By User ID (Network Gather)\\Column-wise: By Vector Dim (AllGather)\\Hybrid: Hot/Cold Split};
```
**Embedding Sharding Strategies**. **Row-wise sharding** places complete embedding vectors on specific servers based on entity ID, requiring a network gather for lookup. **Column-wise sharding** splits each vector across all servers, allowing parallel local lookups followed by an AllGather, which is efficient for popular "hot" embeddings. **Hybrid sharding** combines these approaches, using column sharding for hot items and row sharding for the "cold" long tail to balance load and memory.
:::
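Row-wise sharding reduces to a deterministic ID-to-server mapping plus a per-server batched gather. A toy sketch using modular placement (real systems use learned or load-aware placement to spread hot entities):

```python
def row_wise_lookup_plan(entity_ids, num_servers):
    """Row-wise sharding: each server stores complete vectors for a subset
    of IDs, so a batch lookup becomes one gather per involved server."""
    per_server = {}
    for eid in entity_ids:
        per_server.setdefault(eid % num_servers, []).append(eid)
    return per_server

# A batch of 5 lookups fans out to 3 of the 4 embedding servers:
plan = row_wise_lookup_plan([3, 7, 8, 11, 42], num_servers=4)
assert plan == {3: [3, 7, 11], 0: [8], 2: [42]}
```

Column-wise sharding would instead touch every server for every lookup, but each server returns only a slice of the vector, which balances load for hot embeddings at the cost of an AllGather.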
::: {.callout-note title="Embedding Sharding at Scale: Meta Infrastructure"}
Meta's recommendation infrastructure demonstrates embedding sharding at extreme scale:
@@ -2166,15 +2152,13 @@ PagedAttention [@kwon2023vllm], introduced in vLLM, applies virtual memory[^fn-v
The key concepts include page tables that map logical sequence positions to physical memory pages, block size that defines the number of tokens per page (typically 16 tokens), and physical blocks that provide fixed-size memory allocations assignable to any sequence.
::: {.callout-note title="Figure Placeholder: PagedAttention Memory Mapping" collapse="true"}
```{.tikz}
% TODO: Diagram showing Logical KV Cache pages mapping to Non-continuous Physical Blocks
\node[draw, align=center] {PagedAttention\\Logical View: Contiguous Sequence\\Physical View: Scattered Blocks in HBM\\Block Table: The Mapping};
```
```
Sequence A (150 tokens, 10 pages allocated):
  Logical:  [Page 0][Page 1][Page 2]...[Page 9]
  Physical: [Block 5][Block 12][Block 3]...[Block 8]

Sequence B (80 tokens, 5 pages allocated):
  Logical:  [Page 0][Page 1][Page 2][Page 3][Page 4]
  Physical: [Block 1][Block 7][Block 15][Block 2][Block 11]
```
**PagedAttention Memory Mapping**. Conceptually similar to virtual memory in operating systems, PagedAttention decouples the logical view of a sequence's KV cache (contiguous pages) from its physical storage (non-contiguous 16-token blocks). A block table maps logical pages to physical blocks, allowing the system to fill fragmentation gaps with small blocks from any sequence. This eliminates external fragmentation and enables near-100% memory utilization.
:::
PagedAttention provides several benefits. It eliminates internal fragmentation by allocating only the pages needed for actual tokens. It eliminates external fragmentation because any free page can be used by any sequence. It enables dynamic growth so sequences can grow without pre-allocation. It supports memory sharing so common prefixes can share physical pages.
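The block-table mechanism can be sketched as a per-sequence mapping backed by a shared free pool. This is a simplified illustration of the idea, not vLLM's actual implementation (which also reference-counts blocks for prefix sharing):

```python
class BlockTable:
    """Per-sequence map from logical KV-cache pages to physical blocks
    drawn on demand from a shared free pool (16 tokens per block)."""
    BLOCK_TOKENS = 16

    def __init__(self, free_pool):
        self.free_pool = free_pool   # shared list of free physical block ids
        self.table = []              # logical page index -> physical block id

    def ensure_capacity(self, total_tokens):
        pages_needed = -(-total_tokens // self.BLOCK_TOKENS)  # ceil division
        while len(self.table) < pages_needed:
            self.table.append(self.free_pool.pop())

pool = list(range(64))               # 64 physical blocks in HBM (toy scale)
seq_a = BlockTable(pool)
seq_a.ensure_capacity(150)           # ceil(150 / 16) = 10 pages
seq_b = BlockTable(pool)
seq_b.ensure_capacity(80)            # ceil(80 / 16) = 5 pages

assert len(seq_a.table) == 10 and len(seq_b.table) == 5
assert not set(seq_a.table) & set(seq_b.table)   # no block double-assigned
```

Because any free block can back any logical page, sequences grow one block at a time with no pre-allocation and no external fragmentation.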
@@ -2243,23 +2227,13 @@ Many LLM workloads share common prefixes across requests. System prompts like "Y
**Prefix caching** shares KV cache entries across requests with common prefixes:
::: {.callout-note title="Figure Placeholder: Prefix Caching with Shared Blocks" collapse="true"}
```{.tikz}
% TODO: Tree or Block diagram showing multiple requests pointing to the same System Prompt blocks
\node[draw, align=center] {Prefix Caching\\System Prompt Blocks [0-5] (Shared)\\Request A [6-8] (Unique)\\Request B [6-9] (Unique)};
```
```
Request A: [System prompt (500 tokens)] + [User query A (50 tokens)]
Request B: [System prompt (500 tokens)] + [User query B (75 tokens)]
Request C: [System prompt (500 tokens)] + [User query C (30 tokens)]

Without prefix caching:
- 3 × 500 = 1500 tokens of prefill compute
- 3 × 500 tokens of KV cache storage

With prefix caching:
- 500 tokens of prefill compute (cached)
- 500 tokens of KV cache storage (shared)
- 155 tokens of unique prefill compute
- 155 tokens of unique KV cache storage
```
**Prefix Caching via Block Sharing**. PagedAttention enables efficient prefix caching by allowing multiple sequences' block tables to point to the same physical blocks for shared content. In this example, the system prompt is stored in blocks 0-5. Requests A and B map their first six logical pages to these same physical blocks, storing only their unique suffixes in new blocks. This dramatically reduces memory usage and prefill computation for workloads with shared context.
:::
**Implementation with PagedAttention**:
@@ -2375,22 +2349,13 @@ Autoregressive generation is inherently sequential: each token depends on previo
**Scenario**: Generating "The quick brown fox jumps"
::: {.callout-note title="Figure Placeholder: Speculative Decoding Timeline" collapse="true"}
```{.tikz}
% TODO: Sequence/Timeline diagram of Draft generation followed by Parallel Verification
\node[draw, align=center] {Speculative Decoding\\Draft Model: Gen 4 tokens (Fast)\\Target Model: Verify 4 tokens (Parallel)\\Accept/Reject: Rollback to first error};
```
```
Step 1: Draft model generates 4 tokens
  "The" → [quick, brown, fox, jumps]
  Time: 4/300 = 13ms

Step 2: Target model verifies in parallel
  Input: "The quick brown fox jumps"
  Accepts: "quick", "brown" (match)
  Rejects: "fox" → target predicted "lazy"
  Time: 1/30 = 33ms

Step 3: Output "quick brown", continue from "brown"
  Effective tokens: 2 in 46ms = 43 tokens/second

Step 4: Repeat from "brown"
```
**Speculative Decoding Process**. Instead of generating tokens sequentially with the large target model (slow), a small draft model quickly proposes a sequence of $K$ tokens. The target model then verifies all $K$ tokens in a single parallel forward pass (similar to prefill). If the draft tokens match the target's output, they are accepted, effectively generating multiple tokens per target model step. If a mismatch occurs, the sequence is rolled back to the first error.
:::
**Effective speedup**: 43/30 = 1.43x
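The accept/rollback rule reduces to prefix matching against the target's verified outputs. A simplified greedy sketch (this variant also emits the target's correction at the mismatch position, as common implementations do; sampling-based acceptance is more involved):

```python
def speculative_accept(draft_tokens, target_tokens):
    """Greedy verification: keep the longest matching prefix; at the first
    mismatch, substitute the target model's own token and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # target's correction replaces the bad draft token
            break
    return accepted

draft  = ["quick", "brown", "fox", "jumps"]   # proposed by the small draft model
target = ["quick", "brown", "lazy", "dog"]    # verified in one parallel target pass
assert speculative_accept(draft, target) == ["quick", "brown", "lazy"]
```

Every accepted token is guaranteed to match what the target model would have produced on its own, so speculation changes latency but not output quality.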
@@ -3068,13 +3033,13 @@ The bulkhead pattern[^fn-bulkhead] [@nygard2007releaseit] physically isolates te
**Deployment-level bulkheads**: Dedicate replicas to specific tenants or tenant groups.
::: {.callout-note title="Figure Placeholder: Bulkhead Isolation Patterns" collapse="true"}
```{.tikz}
% TODO: Block diagram contrasting Shared Pool vs Directed/Dedicated Pools
\node[draw, align=center] {Bulkhead Pattern\\Gold Tier -> Dedicated Replicas\\Standard Tier -> Shared Replicas w/ Quotas\\Failure in Shared Pool does not affect Gold Tier};
```
```
GPU Cluster (24 GPUs):
  Tenant A (gold tier): GPUs 0-7 (dedicated)
  Tenant B (gold tier): GPUs 8-15 (dedicated)
  Tenants C-Z (shared): GPUs 16-23 (shared pool)
```
**Bulkhead Isolation Patterns**. To prevent cascading failures in multi-tenant systems, bulkheads isolate resources. **Deployment-level bulkheads** (shown) assign dedicated physical replicas to high-priority tenants, ensuring complete isolation. **Request-level bulkheads** enforce strict concurrency limits within shared processes. Like ship compartments, these boundaries ensure that a failure or resource exhaustion in one segment cannot sink the entire platform.
:::
**Pros**: Complete isolation for premium tenants
**Cons**: Lower utilization, more operational overhead
@@ -3318,6 +3283,14 @@ Bringing up a new replica for Llama-70B on H100:
:::
::: {.callout-note title="Figure Placeholder: Cold Start Latency Breakdown" collapse="true"}
```{.tikz}
% TODO: Horizontal stacked bar chart showing time components of cold start
\node[draw, align=center] {Cold Start Timeline\\Provisioning (60s) | Download (180s) | Load (45s) | Warmup (15s)\\Total: >5 mins};
```
**Anatomy of a Cold Start**. Bringing up a new GPU inference replica is a multi-step process taking minutes. While container startup is fast, provisioning the specialized instance and downloading massive model weights (100GB+) dominate the timeline. CUDA context initialization and "warmup" inference passes add further delay. This 5+ minute lag makes purely reactive scaling dangerous for handling sudden traffic spikes.
:::
### Reactive Scaling {#sec-inference-reactive-scaling}
Reactive scaling adjusts capacity based on observed metrics:
@@ -3454,6 +3427,14 @@ A chatbot service shows predictable daily patterns:
:::
::: {.callout-note title="Figure Placeholder: Predictive vs Reactive Scaling" collapse="true"}
```{.tikz}
% TODO: Line chart showing Traffic Curve, Reactive Capacity (lagging step function), and Predictive Capacity (smooth lead)
\node[draw, align=center] {Scaling Strategies\\Traffic Curve (Sine wave)\\Reactive: Steps up AFTER spike (Late)\\Predictive: Ramps up BEFORE spike (On time)};
```
**Predictive vs. Reactive Scaling**. Reactive scaling (dashed line) responds to traffic spikes after they occur, leading to periods of under-provisioning (red zones) where SLOs are violated due to cold start latency. Predictive scaling (solid line) anticipates known traffic patterns (like daily cycles) and begins provisioning capacity *before* the traffic arrives, eliminating the deficit and ensuring consistent performance.
:::
### Warm Pool Management {#sec-inference-warm-pools}
Maintaining a pool of pre-warmed replicas reduces effective cold start time:
@@ -3913,14 +3894,13 @@ Google Search uses ensemble serving to combine multiple specialized models for q
**Architecture overview**:
::: {.callout-note title="Figure Placeholder: Ranking Cascade Funnel" collapse="true"}
```{.tikz}
% TODO: Funnel diagram showing candidate count reduction and model complexity increase
\node[draw, align=center] {Ranking Cascade\\Retrieval (1M Items, Simple) ->\\L1 Ranking (10K Items, Linear) ->\\L2 Ranking (100 items, Small NN) ->\\Final Ranking (10 items, Large Ensemble)};
```
```
Query → Query Understanding → Candidate Retrieval → Ranking Cascade → Results
              │                      │                     │
              ▼                      ▼                     ▼
           BERT QU              Embeddings            L1 → L2 → L3
          (10 models)          (100s shards)     (progressively complex)
```
**Ranking Cascade Architecture**. To optimize latency and cost, search and recommendation systems use a cascade of increasingly complex models. Early stages (Retrieval) use cheap, fast models (embeddings, linear) to filter millions of candidates down to thousands. Later stages use expensive, high-precision models (transformers, ensembles) to rank the remaining few candidates. This funnel structure ensures that heavy computation is spent only on the most promising items.
:::
**Key design decisions**:

View File

@@ -67,6 +67,14 @@ A single NVIDIA DGX H100 system consumes 10.2 kW at peak load. A rack containing
| **Server PSU** | 12V DC | Component-level power |
+-----------------------------+---------------------+-------------------------+
::: {.callout-note title="Figure Placeholder: Power Distribution Hierarchy" collapse="true"}
```{.tikz}
% TODO: Diagram showing power flow from Utility (HV) to Substation to PDU to Server PSU (12V)
\node[draw, align=center] {Power Hierarchy\\Utility -> Substation -> PDU -> Server};
```
**Datacenter Power Distribution**. A hierarchical view of power delivery, illustrating voltage step-downs from utility feeds (high voltage) to server components (12V DC). Redundancy is typically implemented at the UPS and PDU levels (2N) to ensure continuous operation during grid failures.
:::
**Power Usage Effectiveness.** The PUE metric[^fn-pue], developed by The Green Grid consortium in 2007 [@thegreengrid2007pue], quantifies datacenter energy efficiency:
[^fn-pue]: **Power Usage Effectiveness (PUE)**: An industry-standard metric where values closer to 1.0 indicate greater efficiency. A PUE of 2.0 means half the power goes to overhead (cooling, lighting, power distribution), while 1.1 means only 10% goes to overhead. Google's most efficient datacenters achieve PUE of 1.06, while typical enterprise facilities operate at 1.5-2.0.
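The metric itself is a simple ratio. A quick illustration with made-up facility numbers (not vendor-reported figures):

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Illustrative: a facility drawing 13.2 MW total with 12 MW of IT load
# spends 10% of its power on cooling, lighting, and distribution losses.
assert abs(pue(13_200, 12_000) - 1.1) < 1e-9
```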
@@ -373,6 +381,14 @@ InfiniBand[^fn-infiniband] has emerged as the dominant interconnect for ML train
| **XDR (800 Gb/s)** | 100 GB/s | &lt; 0.5 μs | Emerging |
+-----------------------+---------------+-------------+-----------------------+
::: {.callout-note title="Figure Placeholder: InfiniBand vs RoCE Stack" collapse="true"}
```{.tikz}
% TODO: Side-by-side stack comparison. InfiniBand (Physical -> Link -> Network -> Transport -> Verbs) vs RoCE (Ethernet -> IP -> UDP -> IB Transport -> Verbs)
\node[draw, align=center] {Protocol Stacks \\ InfiniBand vs RoCE};
```
**High-Performance Networking Stacks**. Comparison of InfiniBand and RoCE protocol stacks. InfiniBand uses a native lossless fabric, while RoCE encapsulates RDMA traffic within UDP/IP packets over Ethernet. Both expose the same Verbs API to applications, but RoCE relies on Priority Flow Control (PFC) in the Ethernet layer to approximate InfiniBand's lossless guarantees.
:::
RDMA over Converged Ethernet[^fn-roce] provides an alternative that leverages existing Ethernet infrastructure. RoCEv2 operates over UDP/IP, enabling RDMA semantics across routed networks. While RoCE offers lower capital costs and operational familiarity, it requires careful configuration of Priority Flow Control and Explicit Congestion Notification to prevent packet loss. In ML workloads, even small packet loss rates cause significant performance degradation because collective operations must wait for retransmissions.
[^fn-roce]: **RoCE (RDMA over Converged Ethernet)**: A protocol that implements RDMA semantics over Ethernet networks. RoCEv2 uses UDP encapsulation for routability across L3 networks. RoCE reduces infrastructure costs by using commodity Ethernet switches but requires lossless Ethernet configuration (PFC/ECN) to achieve RDMA performance. Cloud providers typically offer RoCE-based networking (AWS EFA, Azure NDR) as a lower-cost alternative to InfiniBand.
@@ -417,6 +433,14 @@ Rail-optimized topologies offer an alternative for workloads dominated by tensor
| **Torus (TPU)** | Dimension-based | O(√N) | Structured comms |
+-----------------------+------------------+--------------------+-------------------+
::: {.callout-note title="Figure Placeholder: Network Topologies" collapse="true"}
```{.tikz}
% TODO: Visual diagrams of Fat-Tree (hierarchical), Ring/Torus (mesh), and Rail-Optimized (dedicated groups)
\node[draw, align=center] {Topology Comparison \\ Fat-Tree vs Rail-Optimized vs Torus};
```
**Network Topologies for ML**. Visualizing three common interconnect architectures. (A) Fat-Tree provides full bisection bandwidth for general-purpose communication. (B) Torus (used in TPUs) connects neighbors in a grid/mesh, optimizing for local patterns. (C) Rail-Optimized designs prioritize dedicated paths between corresponding GPUs across nodes, minimizing switch hops for tensor parallelism.
:::
The distinction between non-blocking and oversubscribed networks carries significant implications for ML workloads. A 2-to-1 oversubscription ratio halves the effective bisection bandwidth, potentially doubling AllReduce time for large collectives. While oversubscription reduces infrastructure costs, the impact on training throughput often negates the savings. Most production ML clusters deploy non-blocking networks for training, reserving oversubscribed designs for serving traffic where request-response patterns tolerate contention.
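As a back-of-envelope illustration of this effect, the sketch below estimates bandwidth-bound ring-AllReduce time under a given oversubscription ratio. The function name and link speeds are illustrative, and the model ignores latency and protocol overhead:

```python
def allreduce_time_s(grad_bytes, n_gpus, link_gbps, oversubscription=1.0):
    """Ring AllReduce moves ~2*(N-1)/N * S bytes per GPU; an
    oversubscribed fabric divides the effective per-GPU bandwidth."""
    effective_bps = link_gbps * 1e9 / 8 / oversubscription
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return bytes_moved / effective_bps

# 10 GB of gradients across 256 GPUs on 400 Gb/s links:
t_full = allreduce_time_s(10e9, 256, 400, oversubscription=1.0)
t_oversub = allreduce_time_s(10e9, 256, 400, oversubscription=2.0)
# 2:1 oversubscription doubles the bandwidth-bound AllReduce time.
```
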
#### Multi-Rack and Multi-Datacenter Training
@@ -543,6 +567,14 @@ The SuperPOD design embodies the principles examined throughout this section: hi
The datacenter infrastructure and high-speed networks discussed in @sec-datacenter-architecture and @sec-networking-ml provide the physical foundation for large-scale ML. However, translating these resources into productive workloads requires sophisticated scheduling systems that balance utilization, fairness, and job completion time. This section examines the scheduling challenges unique to ML workloads and the systems designed to address them.
::: {.callout-note title="Figure Placeholder: Cluster Scheduling Architecture" collapse="true"}
```{.tikz}
% TODO: Architecture diagram showing User Job -> Scheduler (Queue policy) -> Resource Manager -> Node Agents -> GPUs
\node[draw, align=center] {Scheduling Architecture \\ Job Queue -> Scheduler -> Cluster Nodes};
```
**Cluster Resource Management Architecture**. A high-level view of a distributed scheduler. Jobs enter a prioritized queue. The scheduler matches resource requests (GPUs, Memory) against available nodes, enforcing fairness and locality constraints. Node agents (like Kubelet or Slurmd) launch containers and monitor health, reporting status back to the control plane.
:::
ML workloads present scheduling challenges distinct from traditional computing. Training jobs require coordinated access to multiple GPUs, often spanning nodes connected via the InfiniBand fabric discussed in the previous section. Inference workloads demand consistent latency while handling unpredictable traffic patterns. Both compete for the same accelerator resources, creating tension between throughput-oriented batch processing and latency-sensitive serving.
### Why Distributed Scheduling is Hard
@@ -877,6 +909,14 @@ where $r$ is the discount rate reflecting cost of capital. For a 256-GPU cluster
| **Annual Total** | $14,230,000 | $3,106,000 | $3,263,000 | $6,645,000 |
+-------------------------+-------------+------------+------------+------------+
::: {.callout-note title="Figure Placeholder: TCO Breakdown" collapse="true"}
```{.tikz}
% TODO: Pie chart or stacked bar showing cost components over 4 years. Hardware vs Power vs Ops.
\node[draw, align=center] {TCO Components \\ CapEx vs Power vs Ops};
```
**Total Cost of Ownership Breakdown**. Analysis of infrastructure costs over a 4-year lifecycle. While hardware CapEx is the largest initial outlay, operational costs (Power and Staffing) accumulate to exceed hardware costs over the system's life. High utilization is key to amortizing these fixed and ongoing costs.
:::
The NPV at 8% discount rate equals approximately $24.1 million, yielding a 4-year cost per GPU-hour of $4.30 at 70% utilization. This compares favorably to cloud A100 pricing of $3-4/hour only when accounting for the H100's 3x performance advantage, yielding effective cost per computation approximately 40% below cloud alternatives at this utilization level.
Power cost sensitivity analysis reveals the importance of electricity pricing in deployment decisions. A $0.04/kWh difference in electricity rates shifts the 4-year TCO by approximately $2.7 million for a 256-GPU cluster, potentially changing the optimal deployment strategy. Organizations with access to low-cost renewable energy enjoy structural cost advantages that compound over multi-year infrastructure investments.
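The cost arithmetic above can be captured in a small sketch. The helper names are illustrative, and the exact per-GPU-hour figure depends on which cost components enter the NPV:

```python
def npv(cashflows, rate):
    """Net present value of annual cashflows; index 0 is year 0 (undiscounted)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def cost_per_gpu_hour(total_cost, n_gpus, years, utilization):
    """Amortized cost per *delivered* GPU-hour: idle capacity still costs money."""
    delivered_hours = n_gpus * years * 8760 * utilization
    return total_cost / delivered_hours

# Raising utilization from 70% to 90% cuts the effective rate proportionally:
c70 = cost_per_gpu_hour(24.1e6, 256, 4, 0.70)
c90 = cost_per_gpu_hour(24.1e6, 256, 4, 0.90)
```
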

View File

@@ -593,15 +593,15 @@ An e-commerce platform might operate the following models:
The dependency graph reveals operational implications:
```text
User Embedding ─┬──────────────────────────────┐
│ │
├─► Candidate Retrieval ──────►│
│ │
Product Embed. ─┴─► Price Sensitivity ────────►├─► Ranking ─► Diversity ─► Business Rules
```
::: {.callout-note title="Figure Placeholder: E-Commerce Model Dependency Graph" collapse="true"}
```{.tikz}
% TODO: Directed graph showing data flow between models
% Nodes: User Embedding, Product Embedding, Candidate Retrieval, Price Sensitivity, Ranking, Diversity, Business Rules
% Edges showing prediction flow
\node[draw, align=center] {E-Commerce Model Dependency Graph \\ Embeddings -> Retrieval -> Ranking -> Business Logic};
```
**E-Commerce Model Ecosystem**. A complex dependency graph where upstream models (Embeddings) feed into mid-tier models (Retrieval, Price Sensitivity) which feed into final ranking and logic layers. Changes to an upstream model such as "User Embedding" require coordinated updates to all downstream consumers.
:::
Updating User Embedding affects four downstream models. Operational procedures must:
@@ -863,7 +863,7 @@ Maintaining request consistency during deployment transitions requires explicit
**Consistency Models for Deployment**
The choice of consistency model affects both deployment complexity and validity of deployment metrics:
The choice of consistency model affects both deployment complexity and validity of deployment metrics, as shown in @tbl-multi-region-consistency:
+-----------------------+-------------------------------------------+----------------------------------------+-------------------------------+
| **Model** | **Guarantee** | **Use Case** | **Coordination Overhead** |
@@ -922,6 +922,16 @@ Organizations deploying safety-critical models typically implement coordinator r
Shadow deployment runs the new model in parallel with production, receiving the same inputs and logging outputs, but not affecting user-visible results. This provides the highest fidelity testing environment short of actual production exposure, enabling detection of issues that escape offline validation.
::: {.callout-note title="Figure Placeholder: Shadow Deployment Architecture" collapse="true"}
```{.tikz}
% TODO: Architecture diagram for Shadow Deployment
% Components: Request Router, Production Model, Shadow Model, Log Async Service, Comparison Dashboard
% Flow: Request -> Router -> (Split) -> Prod Model (Response to User) & Shadow Model (Log only)
\node[draw, align=center] {Shadow Deployment Architecture \\ Traffic Mirroring and Asynchronous Comparison};
```
**Shadow Deployment Architecture**. Production traffic is mirrored to the shadow model asynchronously. The router returns the production response to the user immediately, while both responses are logged for offline quality comparison and operational validation.
:::
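A minimal sketch of this mirroring pattern, using a thread pool for the asynchronous shadow path (the class and method names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

class ShadowRouter:
    """Serve the production model's response immediately; score the
    shadow model off the request path and log both for comparison."""

    def __init__(self, prod_model, shadow_model):
        self.prod = prod_model
        self.shadow = shadow_model
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.log = []  # (request, prod_output, shadow_output)

    def handle(self, request):
        prod_out = self.prod(request)                      # user-visible path
        self.pool.submit(self._mirror, request, prod_out)  # fire-and-forget
        return prod_out

    def _mirror(self, request, prod_out):
        self.log.append((request, prod_out, self.shadow(request)))
```

In production the log would feed an offline comparison job rather than an in-memory list, but the key property is the same: the user-visible latency never includes the shadow model's inference time.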
**Shadow Deployment Benefits**
Shadow deployment provides four critical validation capabilities:
@@ -1056,6 +1066,17 @@ Interleaving implementation:
This pattern is essential for recommendation systems where detecting small engagement changes quickly enables rapid iteration.
::: {.callout-note title="Figure Placeholder: Interleaving Experiments" collapse="true"}
```{.tikz}
% TODO: Diagram showing Team Draft Interleaving
% Left: Ranking A (List A), Right: Ranking B (List B)
% Center: Interleaved List (A1, B1, A2, B2...)
% Bottom: User Clicks attributed to A or B
\node[draw, align=center] {Interleaving Experiment \\ Blending Rankings for Sensitivity};
```
**Interleaving vs. A/B Testing**. In traditional A/B testing (left), users see only one variant. In interleaving (right), users see a blended list. Clicks on items are attributed to the source ranker, providing a higher-sensitivity signal that controls for user-specific variance.
:::
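Team-draft interleaving, the most common variant, can be sketched as follows. This is a simplified rendering (picking order balanced by contribution count, ties broken by coin flip) rather than a production implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Blend two rankings; `credit` maps each shown item to the
    ranker ('A' or 'B') that contributed it, for click attribution."""
    rng = random.Random(seed)
    shown, credit = [], {}

    def next_unshown(ranking):
        return next((item for item in ranking if item not in credit), None)

    while True:
        a, b = next_unshown(ranking_a), next_unshown(ranking_b)
        if a is None and b is None:
            break
        picks_a = sum(1 for s in credit.values() if s == "A")
        picks_b = sum(1 for s in credit.values() if s == "B")
        # The ranker with fewer contributions picks next (coin flip on ties).
        a_turn = b is None or (a is not None and
                 (picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)))
        item, source = (a, "A") if a_turn else (b, "B")
        credit[item] = source
        shown.append(item)
    return shown, credit
```
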
**A/B Testing Statistical Foundations**
Building on the A/B testing concepts introduced in Volume I (@sec-ml-operations), this section addresses the statistical challenges and infrastructure requirements that emerge when operating experimentation platforms at scale. A/B testing provides rigorous frameworks for comparing model variants, but at scale requires careful attention to statistical power, significance thresholds, and multiple testing correction. Improper statistical practices lead to false positives that waste engineering resources or false negatives that miss genuine improvements.
@@ -1111,7 +1132,7 @@ This means a 64% chance of falsely detecting an improvement. Three correction ap
*Bonferroni correction* adjusts the significance threshold to $\alpha' = \frac{\alpha}{k}$ for $k$ tests. This is conservative but simple. For 20 tests with α=0.05, use α'=0.0025 for each test. This controls the familywise error rate but reduces statistical power.
*Šidák correction* provides a less conservative adjustment:
*Šidák correction* provides a less conservative adjustment, as shown in @eq-sidak-correction:
$$\alpha' = 1 - (1-\alpha)^{1/k}$$ {#eq-sidak-correction}
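Both corrections are one-liners; the sketch below compares the per-test thresholds for the 20-experiment example:

```python
def bonferroni_alpha(alpha, k):
    """Per-test threshold controlling familywise error: alpha / k."""
    return alpha / k

def sidak_alpha(alpha, k):
    """Per-test threshold assuming independent tests: 1 - (1 - alpha)^(1/k)."""
    return 1 - (1 - alpha) ** (1 / k)

bonf = bonferroni_alpha(0.05, 20)   # 0.0025
sidak = sidak_alpha(0.05, 20)       # ~0.00256, slightly less conservative
```
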
@@ -1177,7 +1198,7 @@ Network effects manifest in three primary forms, each requiring different detect
**Quantifying SUTVA Violations**
The severity of network effect bias depends on network structure and outcome correlation. For a social graph with clustering coefficient $C$ (probability that two connected users share a common connection), the variance inflation factor due to network effects follows:
The severity of network effect bias depends on network structure and outcome correlation. For a social graph with clustering coefficient $C$ (probability that two connected users share a common connection), @eq-sutva-vif gives the variance inflation factor due to network effects:
$$VIF \approx 1 + C \times \rho$$ {#eq-sutva-vif}
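A small numeric sketch. The second helper applies the usual consequence that correlated outcomes shrink effective sample size by the inflation factor, which is an assumption layered on the formula above rather than something stated here:

```python
def variance_inflation(clustering_coeff, outcome_corr):
    """VIF ~ 1 + C * rho for network-correlated outcomes."""
    return 1 + clustering_coeff * outcome_corr

def effective_sample_size(n, clustering_coeff, outcome_corr):
    """n correlated users carry roughly n / VIF users' worth of information."""
    return n / variance_inflation(clustering_coeff, outcome_corr)
```
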
@@ -1398,6 +1419,18 @@ Even if 99% of these are deduplicated or auto-resolved, the remaining 144 alerts
The alert fatigue problem demands a fundamentally different approach. The solution is hierarchical monitoring that presents different levels of detail to different audiences and aggregates signals to reduce alert volume while maintaining detection capability.
::: {.callout-note title="Figure Placeholder: Hierarchical Monitoring Pyramid" collapse="true"}
```{.tikz}
% TODO: Pyramid diagram
% Top: Business Metrics (Revenue, Engagement) - Alerts Executives
% Middle: Portfolio Metrics (Domain Health) - Alerts Product Owners
% Base: Model Metrics (Latency, Accuracy) - Alerts Model Owners
% Foundation: Infrastructure (GPU, Network) - Alerts Platform Team
\node[draw, align=center] {Hierarchical Monitoring Pyramid \\ Business -> Portfolio -> Model -> Infrastructure};
```
**Hierarchical Monitoring Architecture**. To prevent alert fatigue, monitoring operates at four abstraction levels. High-level business metrics trigger alarms for broad issues, while lower-level metrics are used primarily for investigation and root cause analysis.
:::
**Level 1: Business Metrics**
The highest monitoring level tracks business outcomes that ML systems affect:
@@ -1877,7 +1910,7 @@ ML workloads present unique cost management challenges that traditional IT FinOp
**Cost Components**
ML platform costs span multiple categories with different optimization strategies:
ML platform costs span multiple categories with different optimization strategies, as detailed in @tbl-ops-scale-cost-breakdown:
+-----------------------+-------------------+------------------------------------+------------------------------------+
| **Cost Category** | **Typical Share** | **Primary Drivers** | **Optimization Lever** |
@@ -1952,7 +1985,7 @@ Several attribution approaches exist:
**Cost Per Inference Analysis**
For serving workloads, cost per inference provides the key unit economic metric:
For serving workloads, cost per inference provides the key unit economic metric. @eq-cost-per-inference defines this:
$$\text{Cost per inference} = \frac{\text{Total serving cost}}{\text{Total inferences served}}$$ {#eq-cost-per-inference}
@@ -1991,7 +2024,7 @@ These controls should inform rather than block. The goal is cost awareness, not
**Cost-Quality Tradeoffs**
Model selection should explicitly consider cost alongside accuracy:
Model selection should explicitly consider cost alongside accuracy. @tbl-ops-scale-cost-quality illustrates these trade-offs:
+------------+--------------+-------------------+---------------------+----------------------------+
| **Model** | **Accuracy** | **Training Cost** | **Serving Cost/1K** | **Value Judgment** |
@@ -2084,6 +2117,16 @@ Training data must use features as they existed at the time of each training exa
[^fn-data-leakage]: **Data Leakage**: A subtle but devastating error where information from the future is inadvertently used to make predictions about the past. In financial models, this might mean using features computed from the full dataset (including future data) to predict historical events. Models with leakage often show spectacular offline performance (sometimes 99%+ accuracy) but fail completely in production where future information is unavailable.
::: {.callout-note title="Figure Placeholder: Point-in-Time Correctness" collapse="true"}
```{.tikz}
% TODO: Timeline showing "Time Travel" prevention
% Timeline with Feature Updates (t1, t3, t5) and Training Events (t2, t4)
% Show Join selecting Feature(t1) for Event(t2), NOT Feature(t3)
\node[draw, align=center] {Point-in-Time Join Logic \\ Retrieving valid historical state};
```
**Point-in-Time Correctness**. Preventing data leakage by joining training events with feature values as they existed *at the event timestamp*, not the current values. This ensures the model learns from the information actually available at inference time.
:::
**The Leakage Problem ("Time Travel")**
This is the most common and devastating bug in ML pipelines. "Time Travel" occurs when a model is trained using data that was not yet available at the moment of prediction.
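The guard against time travel is an "as-of" join. A minimal pure-Python sketch (the function name is illustrative; production systems push this into the feature store or a SQL point-in-time join):

```python
from bisect import bisect_right

def point_in_time_join(events, feature_history):
    """For each (entity, event_ts), return the latest feature value whose
    timestamp is <= event_ts, never a later one. `feature_history` maps
    entity -> list of (ts, value) sorted by ts."""
    rows = []
    for entity, event_ts in events:
        hist = feature_history.get(entity, [])
        ts_list = [ts for ts, _ in hist]
        idx = bisect_right(ts_list, event_ts) - 1  # last ts <= event_ts
        value = hist[idx][1] if idx >= 0 else None
        rows.append((entity, event_ts, value))
    return rows
```

Some pipelines use a strict `<` rather than `<=` at the event timestamp to guard against same-instant leakage; the inclusive variant above matches conventional as-of semantics.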
@@ -2307,19 +2350,16 @@ A centralized ML platform team builds and maintains shared infrastructure while
**Structure**
```text
ML Platform Team (15-30 engineers)
├── Infrastructure: Compute, storage, networking
├── ML Systems: Training pipelines, serving infrastructure
├── Data Platform: Feature store, data pipelines
├── Developer Experience: APIs, SDKs, documentation
└── Reliability: Monitoring, on-call, incident response
Model Teams (5-10 engineers each)
├── Model development and experimentation
├── Model-specific data pipelines
└── Business integration
```
::: {.callout-note title="Figure Placeholder: ML Organization Patterns" collapse="true"}
```{.tikz}
% TODO: Comparative org charts
% 1. Centralized: Platform Team service all Model Teams
% 2. Embedded: Platform Engineers inside Model Teams
% 3. Hybrid: Central Core + Embedded Specialists
\node[draw, align=center] {ML Organization Models \\ Centralized vs. Embedded vs. Hybrid};
```
**Organizational Patterns for ML**. (Left) Centralized model provides consistency but risks bottlenecks. (Center) Embedded model provides velocity but risks fragmentation. (Right) A hybrid model pairs a core platform team with embedded specialists, balancing standardization and responsiveness.
:::
**Advantages**
@@ -2345,23 +2385,6 @@ An alternative places ML infrastructure expertise within model teams, with coord
**Structure**
```text
Model Team A (8-12 engineers)
├── ML Engineers (3-4): Models, experiments
├── Platform Engineer (1): Infrastructure, ops
└── Data Engineers (2-3): Pipelines, features
Model Team B (8-12 engineers)
├── ML Engineers (3-4): Models, experiments
├── Platform Engineer (1): Infrastructure, ops
└── Data Engineers (2-3): Pipelines, features
ML Community of Practice
├── Weekly sync across embedded platform engineers
├── Shared documentation and patterns
└── Coordinated tool selection
```
**Advantages**
*Responsiveness*: Platform expertise is directly available to model teams without cross-team coordination.

View File

@@ -1406,12 +1406,12 @@ class RealTimeFairnessMonitor:
async def _trigger_bias_alert(self, metrics: FairnessMetrics):
"""Trigger alert when bias threshold exceeded"""
alert_message = (
f"BIAS ALERT: Demographic parity difference: "
f"{metrics.demographic_parity_diff:.3f}, "
)
alert_message += (
f"Equalized odds difference: "
f"{metrics.equalized_odds_diff:.3f}"
)
# Log to audit system

View File

@@ -300,6 +300,8 @@ Object storage systems like S3 and GCS are suitable for training data. Since Dec
: Consistency requirements vary by storage tier and access pattern. {#tbl-consistency-requirements}
As @tbl-consistency-requirements shows, each storage tier has distinct consistency requirements based on its access patterns and failure mode consequences.
::: {.callout-note title="CAP Theorem Implications for ML Storage"}
The CAP theorem's implications for ML storage differ from traditional applications. As Stoica et al. observe [@stoica2017berkeley], training storage can sacrifice availability for consistency (a brief storage outage during checkpoint writes is acceptable if it ensures checkpoint correctness), while serving storage might sacrifice consistency for availability (serving stale features is preferable to failing requests entirely). Understanding these tradeoffs enables storage architecture decisions that match the actual requirements of each ML system component.
@@ -595,6 +597,8 @@ sample0002.cls
: Format selection guidelines matched to model type and access characteristics. {#tbl-format-selection}
@tbl-format-selection provides guidance for matching storage formats to model types, taking into account the specific access patterns and data characteristics of each workload.
### Data Loading Pipelines {#sec-data-loading-pipelines}
The data loading pipeline connects storage to accelerators, transforming raw data into training batches. Pipeline design determines whether storage bandwidth is fully utilized and whether GPUs remain fed during training.
@@ -630,6 +634,14 @@ where $N_{prefetch}$ is the number of batches buffered and $T_{batch}$ is the GP
For a 200 ms batch time with 100 ms storage latency, prefetching just 1 batch hides the storage latency. However, variance in storage latency requires larger buffers: if storage latency varies from 50-500 ms, prefetching 3-5 batches ensures GPUs never wait.
::: {.callout-note title="Figure Placeholder: Data Loading Pipeline" collapse="true"}
```{.tikz}
% TODO: Horizontal pipeline diagram showing CPU stages (Read, Decode, Transform) overlapping with GPU Compute.
\node[draw, align=center] {Data Loading Pipeline \\ Prefetch Buffer Hiding Latency};
```
**Hiding Storage Latency with Prefetching**. Without prefetching (top), the GPU sits idle while the CPU loads and transforms the next batch. With pipelining (bottom), the CPU prepares Batch $N+1$ while the GPU processes Batch $N$, ensuring the GPU is never starved of data. The prefetch buffer smooths out I/O latency jitter.
:::
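The prefetch buffer described above can be sketched with a bounded queue and a producer thread (a simplified stand-in for what framework loaders such as `tf.data` or PyTorch's `DataLoader` do internally):

```python
import queue
import threading

def prefetching_loader(batch_source, n_prefetch=3):
    """Keep up to n_prefetch batches buffered ahead of the consumer,
    overlapping storage I/O with (simulated) accelerator compute."""
    buf = queue.Queue(maxsize=n_prefetch)
    sentinel = object()

    def producer():
        for batch in batch_source:
            buf.put(batch)      # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is sentinel:
            return
        yield batch
```

The bounded `maxsize` is what caps memory overhead: the producer stalls once it is `n_prefetch` batches ahead, rather than reading the whole dataset into RAM.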
#### Caching Strategies {#sec-caching-strategies}
Caching can dramatically improve data loading performance when datasets are accessed repeatedly.
@@ -791,6 +803,14 @@ Asynchronous checkpointing hides I/O latency but introduces complexity: memory o
For most training runs, asynchronous checkpointing reduces overhead to near zero. The memory overhead (one additional copy of model state) is typically acceptable on systems with sufficient host memory.
::: {.callout-note title="Figure Placeholder: Async vs Sync Checkpointing" collapse="true"}
```{.tikz}
% TODO: Two timelines comparing Blocking Checkpoint vs Async Checkpoint
\node[draw, align=center] {Checkpointing Strategies \\ Stop-the-World vs Async Copy-on-Write};
```
**Zero-Overhead Checkpointing**. (A) Synchronous checkpointing halts training (`Stop-the-World`), wasting valuable accelerator time on I/O. (B) Asynchronous checkpointing captures a snapshot in CPU memory and writes to storage in the background while training resumes immediately. This overlaps the massive write I/O with useful computation.
:::
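A minimal sketch of the snapshot-then-write pattern: the blocking step is a fast in-memory copy, while the slow serialization happens on a background thread. The class name and JSON serialization are illustrative; real systems snapshot device tensors to pinned host memory and write binary formats:

```python
import copy
import json
import threading

class AsyncCheckpointer:
    """Snapshot state synchronously, persist it asynchronously."""

    def __init__(self):
        self._writer = None

    def save(self, state, path):
        snapshot = copy.deepcopy(state)  # the only step that blocks training
        self.wait()                      # allow one in-flight write at a time
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()

    @staticmethod
    def _write(snapshot, path):
        with open(path, "w") as f:
            json.dump(snapshot, f)

    def wait(self):
        if self._writer is not None:
            self._writer.join()
```

Because `save()` deep-copies before returning, later mutations by the training loop cannot leak into the checkpoint, which is the consistency property asynchronous checkpointing must preserve.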
### Distributed Checkpointing for Sharded Models {#sec-distributed-checkpointing-storage}
When models are sharded across multiple devices using parallelism strategies (examined in @sec-distributed-training), each device holds only a portion of the model state. Checkpointing must coordinate across all devices to produce a consistent, complete checkpoint at the same logical training point.
@@ -892,6 +912,14 @@ where $T_{save}$ is the time to save a checkpoint and $MTBF$ is the mean time be
This formula minimizes expected wasted work, accounting for both checkpoint overhead and work lost to failures. The MTBF[^fn-mtbf] for GPU clusters decreases inversely with cluster size, making optimal checkpoint intervals surprisingly short for large-scale training.
::: {.callout-note title="Figure Placeholder: Young-Daly Tradeoff" collapse="true"}
```{.tikz}
% TODO: U-shaped cost curve. X-axis: Checkpoint Interval. Y-axis: Total Overhead.
\node[draw, align=center] {Young-Daly Optimization \\ Balancing Overhead vs Risk};
```
**The Checkpoint Tradeoff**. Plotting total training time overhead against checkpoint interval. Checkpointing too frequently (left side) incurs high I/O overhead. Checkpointing too rarely (right side) risks losing hours of work when failures occur. The optimal point $T_{opt}$ minimizes the sum of these costs.
:::
[^fn-mtbf]: **MTBF (Mean Time Between Failures)**: A reliability metric representing the average time a system operates before experiencing a failure. For a single GPU, MTBF might be 30,000 hours (about 3.5 years). For a cluster, MTBF decreases inversely with component count: 1,024 GPUs with individual MTBF of 30,000 hours yield a cluster MTBF of roughly 29 hours, meaning failures occur almost daily. This inverse scaling explains why large-scale ML training is fundamentally a distributed systems problem: at 10,000 GPUs, MTBF drops to about 3 hours, making failure handling the dominant engineering challenge rather than an edge case.
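Putting the pieces together, a small sketch using the standard first-order Young-Daly approximation $T_{opt} = \sqrt{2\, T_{save} \cdot MTBF}$ and the cluster figures from the footnote (the 5-minute checkpoint time is illustrative):

```python
import math

def cluster_mtbf_hours(component_mtbf_hours, n_components):
    """Cluster MTBF scales inversely with component count."""
    return component_mtbf_hours / n_components

def young_daly_interval_hours(t_save_hours, mtbf_hours):
    """First-order Young-Daly optimum: sqrt(2 * T_save * MTBF)."""
    return math.sqrt(2 * t_save_hours * mtbf_hours)

# 1,024 GPUs at 30,000 h each -> ~29 h cluster MTBF; with a 5-minute
# checkpoint, the optimal interval is roughly 2.2 hours.
mtbf = cluster_mtbf_hours(30_000, 1_024)
interval = young_daly_interval_hours(5 / 60, mtbf)
```
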
::: {.callout-example title="Checkpoint Interval for Large-Scale Training"}
@@ -1067,6 +1095,14 @@ LEFT JOIN features
AND features.timestamp > labels.event_time - INTERVAL 1 DAY
```
::: {.callout-note title="Figure Placeholder: Point-in-Time Correctness" collapse="true"}
```{.tikz}
% TODO: Timeline showing 'Event Time' and 'Feature Time'. Correct join vs Leakage join.
\node[draw, align=center] {Temporal Joins \\ Preventing Data Leakage};
```
**Point-in-Time Correctness**. To prevent data leakage, features for a training example must be joined based on the timestamp of the event. If an ad impression occurred at $T_{event}$, the model must be trained using features as they existed at $T < T_{event}$. Using features from $T > T_{event}$ (e.g., "user clicked") to predict the click introduces future information, rendering the model useless in production.
:::
#### Online Store {#sec-online-store}
The online store provides low-latency access to the most recent feature values for serving. It trades historical depth for speed.

View File

@@ -336,7 +336,7 @@ Understanding where energy goes in AI systems requires grounding in the physics
#### The CMOS Power Equation {#sec-sustainable-ai-cmos-power-equation}
Every digital circuit consumes power through two fundamental mechanisms. Dynamic power arises from switching transistors between states, while static power results from leakage current that flows even when transistors are nominally off. The total power consumption follows:
Every digital circuit consumes power through two fundamental mechanisms. Dynamic power arises from switching transistors between states, while static power results from leakage current that flows even when transistors are nominally off. The total power consumption follows @eq-cmos-power:
$$P_{total} = P_{dynamic} + P_{static} = \alpha C V^2 f + V I_{leak}$$ {#eq-cmos-power}
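The quadratic voltage term is what makes voltage-frequency scaling so effective. A quick numeric sketch (the component values are illustrative, not taken from a specific chip):

```python
def cmos_power_w(alpha, c_farads, v_volts, f_hz, i_leak_amps):
    """P_total components: dynamic = alpha*C*V^2*f, static = V*I_leak."""
    dynamic = alpha * c_farads * v_volts ** 2 * f_hz
    static = v_volts * i_leak_amps
    return dynamic, static

# Dropping supply voltage 20% cuts dynamic power by ~36% (0.8^2 = 0.64):
d_nominal, _ = cmos_power_w(0.2, 1e-9, 1.0, 2e9, 0.5)
d_scaled, _ = cmos_power_w(0.2, 1e-9, 0.8, 2e9, 0.5)
```
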
@@ -356,13 +356,13 @@ Specialized accelerators improve the activity factor $\alpha$ by designing circu
#### Facility-Level Power Metrics {#sec-sustainable-ai-facility-level-metrics}
Beyond chip-level power, data center infrastructure imposes additional energy overhead that the Power Usage Effectiveness (PUE) metric captures:
Beyond chip-level power, data center infrastructure imposes additional energy overhead that the Power Usage Effectiveness (PUE) metric captures. @eq-pue defines this relationship:
$$PUE = \frac{P_{total\_facility}}{P_{IT\_equipment}}$$ {#eq-pue}
A PUE of 1.0 would indicate perfect efficiency where all energy powers computation, though this is physically impossible since cooling, power distribution, and lighting require nonzero energy. Industry-average data centers operate at PUE of 1.5 to 2.0, meaning that 50% to 100% additional energy beyond computation goes to infrastructure. Leading hyperscale facilities achieve PUE between 1.1 and 1.2 through advanced cooling techniques including free-air cooling in cold climates, liquid cooling for high-density GPU clusters, and optimized power distribution.
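These PUE figures translate directly into facility-level draw; a two-line sketch makes the interpretation concrete:

```python
def facility_power_kw(it_power_kw, pue):
    """Total facility power implied by an IT load and a PUE value."""
    return it_power_kw * pue

def overhead_fraction(pue):
    """Share of facility energy going to cooling, distribution, and lighting."""
    return (pue - 1) / pue

# A 1 MW IT load at PUE 1.5 draws 1.5 MW; one third of that is overhead.
```
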
Water Usage Effectiveness (WUE) captures the water consumption that evaporative cooling and other processes require:
Water Usage Effectiveness (WUE) captures the water consumption that evaporative cooling and other processes require, as expressed in @eq-wue:
$$WUE = \frac{W_{annual\_water\_usage}}{P_{IT\_equipment\_energy}}$$ {#eq-wue}
@@ -391,7 +391,7 @@ The carbon impact of electricity consumption depends critically on the energy ge
: **Carbon Intensity by Energy Source**: Electricity generation carbon intensity varies by more than two orders of magnitude across energy sources. Geographic location of computation can dramatically affect emissions even for identical workloads. {#tbl-carbon-intensity}
These variations create opportunities for carbon-aware computing. Training the same model in Quebec with a hydro-powered grid at approximately 20 gCO2eq/kWh versus Poland with a coal-dominated grid at approximately 700 gCO2eq/kWh produces a 35-fold difference in carbon emissions. Even within a single grid, carbon intensity varies temporally with renewable generation. Midday solar peaks can reduce intensity by 30 to 50 percent compared to evening hours when natural gas peaker plants operate.
@tbl-carbon-intensity illustrates these variations, which create opportunities for carbon-aware computing. Training the same model in Quebec with a hydro-powered grid at approximately 20 gCO2eq/kWh versus Poland with a coal-dominated grid at approximately 700 gCO2eq/kWh produces a 35-fold difference in carbon emissions. Even within a single grid, carbon intensity varies temporally with renewable generation. Midday solar peaks can reduce intensity by 30 to 50 percent compared to evening hours when natural gas peaker plants operate.
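The grid dependence of operational emissions reduces to a single multiplication; the sketch below reproduces the Quebec-versus-Poland comparison (the 1 GWh training energy is illustrative):

```python
def training_emissions_kg(energy_kwh, grid_gco2_per_kwh):
    """Operational CO2-equivalent emissions for a run on a given grid."""
    return energy_kwh * grid_gco2_per_kwh / 1000

quebec = training_emissions_kg(1_000_000, 20)    # hydro-heavy grid: 20 tCO2eq
poland = training_emissions_kg(1_000_000, 700)   # coal-heavy grid: 700 tCO2eq
```
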
### Systematic Energy Metrics {#sec-sustainable-ai-systematic-energy-metrics}
@@ -401,7 +401,7 @@ Quantifying energy efficiency requires systematic metrics that enable comparison
The fundamental metric for computational energy efficiency is energy consumed per operation, typically measured in picojoules. For AI workloads, the most relevant metrics are energy per floating-point operation and energy per multiply-accumulate, where one MAC operation performs both a multiplication and addition, equivalent to two FLOPs.
Hardware architecture determines energy efficiency across orders of magnitude:
Hardware architecture determines energy efficiency across orders of magnitude, as shown in @tbl-energy-per-op:
+------------------------------+-------------------------+--------------------------------+
| **Architecture** | **Energy Efficiency** | **Characteristics** |
@@ -460,7 +460,7 @@ Data movement often dominates energy consumption in modern AI systems. The energ
: **Memory Hierarchy Energy Costs**: Energy per byte increases by orders of magnitude moving down the memory hierarchy. Data movement can easily dominate computation energy. {#tbl-energy-per-byte}
The critical insight is that moving data from DRAM consumes 10 to 100 times more energy than performing arithmetic operations. For a GPU operating at 10 pJ/FLOP, accessing one FP32 operand from DRAM (4 bytes times 100 pJ/byte = 400 pJ) costs 40 times more than the computation itself. This energy gap drives architectural innovations including:
As @tbl-energy-per-byte illustrates, the critical insight is that moving data from DRAM consumes 10 to 100 times more energy than performing arithmetic operations. For a GPU operating at 10 pJ/FLOP, accessing one FP32 operand from DRAM (4 bytes times 100 pJ/byte = 400 pJ) costs 40 times more than the computation itself. This energy gap drives architectural innovations including:
- On-chip memory for data reuse (NVIDIA tensor cores with shared memory)
- Optimized data layouts minimizing DRAM access (Google TPU systolic arrays)
@@ -468,15 +468,15 @@ The critical insight is that moving data from DRAM consumes 10 to 100 times more
#### Arithmetic Intensity and Energy Roofline {#sec-sustainable-ai-arithmetic-intensity-energy}
The balance between computation and data movement determines whether energy consumption is compute-bound or memory-bound. Arithmetic intensity (AI) quantifies this relationship:
The balance between computation and data movement determines whether energy consumption is compute-bound or memory-bound. Arithmetic intensity (AI) quantifies this relationship in @eq-arithmetic-intensity:
$$AI = \frac{\text{Total FLOPs}}{\text{Total Bytes Moved}}$$ {#eq-arithmetic-intensity}
Arithmetic intensity measured in FLOPs per byte determines the dominant energy consumer. The energy roofline model extends traditional performance rooflines to energy analysis:
Arithmetic intensity measured in FLOPs per byte determines the dominant energy consumer. The energy roofline model extends traditional performance rooflines to energy analysis, as captured in @eq-energy-roofline:
$$E_{total} = \max\left(E_{compute}, E_{memory}\right) = \max\left(\text{FLOPs} \times e_{flop}, \text{Bytes} \times e_{byte}\right)$$ {#eq-energy-roofline}
where $e_{flop}$ is energy per FLOP and $e_{byte}$ is energy per byte moved. The crossover arithmetic intensity where compute and memory energy balance is given by @eq-ai-crossover:
$$AI_{crossover} = \frac{e_{byte}}{e_{flop}}$$ {#eq-ai-crossover}
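The crossover relationship above can be sketched in a few lines of Python, using the energy figures quoted earlier in this section (10 pJ/FLOP compute, roughly 100 pJ/byte for DRAM access). The constants and kernel sizes are illustrative, not measurements of any specific device:

```{.python}
def energy_total_pj(flops, bytes_moved, e_flop=10.0, e_byte=100.0):
    """Energy roofline: the larger of compute and memory energy, in picojoules."""
    return max(flops * e_flop, bytes_moved * e_byte)

def ai_crossover(e_flop=10.0, e_byte=100.0):
    """Arithmetic intensity (FLOPs/byte) where compute and memory energy balance."""
    return e_byte / e_flop

crossover = ai_crossover()  # 10 FLOPs/byte for these example constants

# A kernel at AI = 2 FLOPs/byte sits below the crossover, so its energy
# is dominated by data movement, not arithmetic:
e = energy_total_pj(flops=2_000, bytes_moved=1_000)
```

Kernels with arithmetic intensity below the crossover should be optimized for data movement first; above it, for compute efficiency.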
@@ -533,7 +533,7 @@ Quantifying AI system energy consumption requires measurement at multiple levels
Modern processors include dedicated circuitry for power measurement that software can query through manufacturer-provided interfaces. These hardware counters measure actual power draw rather than estimating from activity, providing ground-truth energy consumption data at microsecond resolution.
Intel's Running Average Power Limit (RAPL) interface exposes power measurements for CPU packages, DRAM, and integrated graphics through model-specific registers (MSRs). RAPL reports energy consumption in microjoules with updates every millisecond, enabling fine-grained attribution of energy to specific code regions. @lst-rapl-measurement demonstrates RAPL-based measurement for a training loop:
::: {#lst-rapl-measurement lst-cap="**RAPL Energy Measurement**: Reading Intel RAPL counters to measure CPU and DRAM energy consumption during model training."}
```{.python}
@@ -577,7 +577,7 @@ RAPL measurements exclude discrete GPUs, which require separate monitoring throu
#### GPU Power Monitoring {#sec-sustainable-ai-gpu-power-monitoring}
NVIDIA GPUs expose power measurements through the NVIDIA Management Library (NVML), accessible via the `nvidia-smi` command-line tool or programmatic bindings. GPU power monitoring reports instantaneous power draw, which can vary significantly during computation due to dynamic voltage and frequency scaling. @lst-gpu-power shows how to measure GPU power during inference:
::: {#lst-gpu-power lst-cap="**GPU Power Monitoring**: Using NVIDIA's pynvml library to measure GPU power consumption during inference."}
```{.python}
@@ -648,7 +648,7 @@ Mobile devices provide platform-specific APIs for energy attribution, though wit
- **ARM Streamline**: Provides energy-annotated profiling for Cortex-A and Mali GPU platforms, enabling identification of inefficient kernel implementations.
- **Apple Instruments Energy Log**: Reports thermal state and energy impact scores for iOS applications, though without direct wattage measurements.
These mobile profiling tools integrate with development workflows, enabling iterative optimization of on-device inference energy consumption during model deployment. @tbl-edge-power-monitors summarizes the available instrumentation options.
**Edge Measurement Methodology**
@@ -658,7 +658,7 @@ Edge energy measurement requires careful methodology to produce reproducible res
2. **Warm-up Period**: Execute 100 or more inference iterations before measurement to reach thermal equilibrium, as initial iterations may exhibit different power characteristics due to cache warming and voltage regulator settling.
3. **Duty Cycle Accounting**: Edge devices typically operate with significant idle periods between inferences. Report both peak inference power and average power at realistic duty cycles, as expressed in @eq-edge-duty-cycle:
$$P_{average} = P_{active} \times D + P_{idle} \times (1 - D)$$ {#eq-edge-duty-cycle}
@@ -668,13 +668,13 @@ where $D$ is the duty cycle (fraction of time performing inference).
#### System-Level Energy Profiling {#sec-sustainable-ai-system-profiling}
Comprehensive energy accounting requires combining chip-level measurements with infrastructure overhead. The total energy for an AI workload, given in @eq-total-energy, sums the component energies and scales them by facility overhead:
$$E_{total} = (E_{CPU} + E_{GPU} + E_{memory} + E_{network}) \times PUE$$ {#eq-total-energy}
System-level profilers like Intel VTune, NVIDIA Nsight Systems, and open-source tools such as PowerJoular aggregate measurements across components. For production deployments, smart power distribution units (PDUs) at the rack level provide facility-verified measurements that include cooling overhead.
The relationship between measured component power and total facility energy follows from PUE, as shown in @eq-facility-power:
$$P_{facility} = P_{IT} \times PUE = (P_{servers} + P_{network} + P_{storage}) \times PUE$$ {#eq-facility-power}
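A minimal sketch of this PUE scaling, with hypothetical component wattages (none of these figures come from the text):

```{.python}
def facility_power_w(p_servers, p_network, p_storage, pue):
    """P_facility = (P_servers + P_network + P_storage) x PUE, in watts."""
    return (p_servers + p_network + p_storage) * pue

# 10 kW of measured IT load at PUE 1.2 becomes 12 kW at the facility meter
p = facility_power_w(p_servers=8_000.0, p_network=500.0, p_storage=1_500.0, pue=1.2)
```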
@@ -686,7 +686,7 @@ Translating energy measurements into carbon footprint requires accounting for th
#### Operational Carbon Calculation {#sec-sustainable-ai-operational-carbon}
Operational carbon emissions result from electricity consumption during training and inference, scaled by grid carbon intensity. @eq-operational-carbon expresses this relationship:
$$C_{operational} = E_{total} \times CI_{grid} \times PUE$$ {#eq-operational-carbon}
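This calculation is straightforward to express in code. The sketch below assumes a hypothetical 1,000 MWh training run on a 0.4 kgCO2eq/kWh grid at PUE 1.1; only the formula itself comes from the text:

```{.python}
def operational_carbon_kg(energy_kwh, grid_intensity_kg_per_kwh, pue):
    """C_operational = E_total x CI_grid x PUE (energy in kWh, CI in kgCO2eq/kWh)."""
    return energy_kwh * grid_intensity_kg_per_kwh * pue

# Hypothetical run: 1,000 MWh on a 0.4 kgCO2eq/kWh grid, PUE 1.1
c = operational_carbon_kg(energy_kwh=1_000_000, grid_intensity_kg_per_kwh=0.4, pue=1.1)
# roughly 440 tonnes CO2eq
```

Because the result is linear in grid intensity, relocating the same run to a low-carbon grid scales emissions down proportionally, which is the basis for the geographic comparisons later in this section.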
@@ -724,7 +724,7 @@ The geographic choice alone produces a 21-fold difference in training emissions.
Embodied carbon encompasses emissions from raw material extraction, semiconductor fabrication, assembly, transportation, and end-of-life disposal. For AI hardware, manufacturing emissions are substantial due to the energy-intensive nature of advanced semiconductor processes.
A single NVIDIA H100 GPU embodies approximately 150 to 200 kg CO2eq from manufacturing, including wafer fabrication at advanced process nodes, high-bandwidth memory production, and packaging. Amortizing this embodied carbon over the hardware lifetime provides per-use emissions, as shown in @eq-embodied-daily:
$$C_{embodied,daily} = \frac{C_{manufacturing}}{L_{lifetime} \times 365}$$ {#eq-embodied-daily}
@@ -740,7 +740,7 @@ This embodied contribution of 108 kg represents approximately 2.4% of the operat
#### Lifecycle Carbon Accounting {#sec-sustainable-ai-lifecycle-carbon}
Complete lifecycle assessment combines operational and embodied emissions across all phases. @eq-lifecycle-carbon captures this:
$$C_{lifecycle} = C_{training} + C_{inference} + C_{embodied}$$ {#eq-lifecycle-carbon}
@@ -760,7 +760,7 @@ This lifecycle perspective reveals that optimization efforts should prioritize i
Accurate carbon accounting requires reliable grid intensity data. Real-time carbon intensity varies with generation mix, which changes hourly based on demand, renewable availability, and plant dispatch decisions. Several data sources provide this information:
The US Energy Information Administration (EIA) publishes historical grid emissions factors by region, updated annually. For prospective analysis, these annual averages provide reasonable estimates. ElectricityMap and WattTime provide real-time carbon intensity APIs covering major grids worldwide, enabling carbon-aware scheduling systems. For retrospective analysis of completed training runs, hourly marginal emissions data from these sources enables accurate attribution. @lst-carbon-calculation demonstrates how to compute lifecycle carbon footprint:
::: {#lst-carbon-calculation lst-cap="**Lifecycle Carbon Calculation**: Computing total carbon footprint including operational and embodied emissions."}
```{.python}
@@ -1266,11 +1266,11 @@ ARM-based edge devices operate under fundamentally different power constraints t
: **Edge AI Power Budget Categories**: Edge platforms span five orders of magnitude in power consumption, from sub-milliwatt TinyML systems to automotive compute platforms approaching datacenter power levels. Sustainable deployment requires matching workload requirements to appropriate power tiers. {#tbl-edge-power-budgets}
@tbl-edge-power-budgets summarizes these power budgets, which reflect the physical constraints of battery capacity, thermal dissipation, and deployment environment. TinyML devices operating from coin cells or energy harvesting cannot exceed milliwatt average power. Mobile devices must balance user experience with battery life, limiting sustained AI workloads. Automotive systems face thermal constraints within enclosed vehicle compartments despite having access to vehicle power.
#### TinyML Power State Dynamics {#sec-sustainable-ai-tinyml-power-states}
TinyML efficiency depends heavily on duty cycling, where devices alternate between deep sleep and active inference. The average power consumption follows @eq-tinyml-duty-cycle:
$$P_{average} = P_{active} \times \frac{t_{inference}}{T_{period}} + P_{sleep} \times \frac{T_{period} - t_{inference}}{T_{period}}$$ {#eq-tinyml-duty-cycle}
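The leverage of duty cycling is easiest to see numerically. The sketch below assumes hypothetical but representative values: 50 mW during inference, 10 microwatts in deep sleep, one 10 ms inference per second:

```{.python}
def tinyml_average_power_mw(p_active_mw, p_sleep_mw, t_inference_s, t_period_s):
    """Average power under duty cycling, per @eq-tinyml-duty-cycle (milliwatts)."""
    duty = t_inference_s / t_period_s
    return p_active_mw * duty + p_sleep_mw * (1 - duty)

# 50 mW active for 10 ms each second, 0.01 mW (10 uW) deep sleep otherwise
avg = tinyml_average_power_mw(p_active_mw=50.0, p_sleep_mw=0.01,
                              t_inference_s=0.010, t_period_s=1.0)
# approximately 0.51 mW average: two orders of magnitude below active power
```

Note that once the duty cycle is small, sleep power dominates the budget, which is why sub-microwatt sleep modes matter more than further shaving active power.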
@@ -1335,7 +1335,7 @@ With sufficient optimization, TinyML enables energy-autonomous operation where d
: **Energy Harvesting Power Budgets**: Ambient energy harvesting enables batteryless TinyML deployments when average power consumption remains within harvesting capacity. Solar harvesting provides the highest power density for most deployments. {#tbl-energy-harvesting}
A keyword spotting model optimized to 0.5 mW average power can operate indefinitely on approximately 5 square centimeters of indoor solar harvesting, eliminating battery replacement and associated e-waste for distributed sensor deployments. This perpetual operation model represents the ultimate sustainable edge AI deployment, where operational energy comes entirely from ambient sources.
As shown in @tbl-energy-harvesting, a keyword spotting model optimized to 0.5 mW average power can operate indefinitely on approximately 5 square centimeters of indoor solar harvesting, eliminating battery replacement and associated e-waste for distributed sensor deployments. This perpetual operation model represents the ultimate sustainable edge AI deployment, where operational energy comes entirely from ambient sources.
#### Sustainable Edge Deployment Patterns {#sec-sustainable-ai-edge-deployment-patterns}
@@ -1343,7 +1343,7 @@ Beyond individual device efficiency, architectural patterns determine total syst
**Cascade Inference Architecture**
Deploy a small edge model (under 100 KB) to filter inputs before cloud inference. @eq-cascade-energy expresses the total energy:
$$E_{cascade} = E_{edge} + p_{escalate} \times (E_{transmit} + E_{cloud})$$ {#eq-cascade-energy}
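The savings depend almost entirely on the escalation probability. A minimal sketch with hypothetical per-inference energies (in millijoules):

```{.python}
def cascade_energy_mj(e_edge, e_transmit, e_cloud, p_escalate):
    """E_cascade = E_edge + p_escalate x (E_transmit + E_cloud), in millijoules."""
    return e_edge + p_escalate * (e_transmit + e_cloud)

# If only 5% of inputs escalate past the edge filter, the cascade costs a
# fraction of always transmitting to the cloud (illustrative values):
cascade = cascade_energy_mj(e_edge=0.5, e_transmit=50.0, e_cloud=200.0,
                            p_escalate=0.05)
cloud_only = 50.0 + 200.0  # every input transmitted and processed remotely
# cascade is roughly 13 mJ versus 250 mJ cloud-only in this scenario
```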
@@ -1376,7 +1376,7 @@ This hierarchical approach achieves 15 microwatts average power compared to 10 m
**Federated Learning Energy Analysis**
Training at the edge eliminates data transmission but increases local compute. @eq-federated-energy compares the energy trade-offs:
$$E_{federated} = N \times E_{local\_train} + E_{aggregation}$$
$$E_{centralized} = N \times E_{transmit} + E_{cloud\_train}$$ {#eq-federated-energy}
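These two totals can be compared directly. The sketch below uses hypothetical per-client energies (in joules) to show the break-even logic; the crossover depends on whether transmitting raw data costs more than training locally:

```{.python}
def federated_energy_j(n_clients, e_local_train, e_aggregation):
    """E_federated = N x E_local_train + E_aggregation, in joules."""
    return n_clients * e_local_train + e_aggregation

def centralized_energy_j(n_clients, e_transmit, e_cloud_train):
    """E_centralized = N x E_transmit + E_cloud_train, in joules."""
    return n_clients * e_transmit + e_cloud_train

# Hypothetical scenario: 10,000 clients where raw-data upload (40 J) costs
# twice as much as a local training round (20 J)
fed = federated_energy_j(10_000, e_local_train=20.0, e_aggregation=5_000.0)
cen = centralized_energy_j(10_000, e_transmit=40.0, e_cloud_train=50_000.0)
# federated wins here: 205 kJ versus 450 kJ
```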
@@ -1887,7 +1887,7 @@ These optimization techniques represent strategies for sustainable AI developmen
**TinyML Optimization Stack**
TinyML deployments face unique constraints beyond datacenter optimization: models must fit in kilobytes of SRAM, execute with microsecond latency, and consume milliwatts of power. Standard optimization techniques like INT8 quantization (4x memory reduction, 8-16x energy savings) and structured pruning (2-10x improvements at 90% sparsity) provide the foundation for microcontroller deployment. However, achieving sustainable operation on energy-harvesting devices requires pushing optimization to extremes. This section examines techniques that enable truly autonomous TinyML systems operating on harvested energy budgets of 10-100 microwatts, as summarized in @tbl-tinyml-optimization.
+----------------------------+----------------------+----------------------+----------------------+
| **Technique** | **Typical Accuracy** | **Memory Reduction** | **Energy Reduction** |
@@ -1979,11 +1979,11 @@ For sub-watt TinyML deployments, MLPerf Tiny provides benchmarks specifically de
: **MLPerf Tiny Benchmark Suite**: Standardized benchmarks for TinyML systems measure accuracy, latency, and energy consumption on microcontroller-class hardware. Reference model sizes indicate minimum viable deployments; optimized implementations often achieve 2-10x better energy efficiency. {#tbl-mlperf-tiny}
@tbl-mlperf-tiny lists the benchmark tasks and their typical energy requirements. The MLPerf Tiny measurement methodology requires external power monitors (such as the instruments described in @sec-sustainable-ai-edge-mobile-energy) and specifies warm-up periods, measurement windows, and statistical reporting requirements to ensure reproducible results across submissions.
**Energy Delay Product**
Beyond simple energy metrics, the Energy Delay Product (EDP) balances energy consumption against latency. @eq-energy-delay-product defines this metric:
$$EDP = E \times T = P \times T^2$$ {#eq-energy-delay-product}
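Because latency enters EDP quadratically, a configuration that halves power but doubles latency still comes out behind. A short sketch with illustrative values:

```{.python}
def energy_delay_product(power_w, latency_s):
    """EDP = E x T = P x T^2, since E = P x T. Units: joule-seconds."""
    return power_w * latency_s ** 2

# Illustrative comparison: a 3x-lower-power but 2x-slower configuration
# loses on EDP because the delay term is squared
fast = energy_delay_product(power_w=30.0, latency_s=0.010)
slow = energy_delay_product(power_w=10.0, latency_s=0.020)
# fast = 3.0e-3 J*s, slow = 4.0e-3 J*s
```

This is why EDP, rather than energy alone, is the preferred figure of merit when latency matters to the application.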