Standardizes citation formatting

Streamlines in-text references by removing redundant author mentions and consolidating to Quarto's native `@` syntax. Improves consistency and readability of academic citations.
This commit is contained in:
Vijay Janapa Reddi
2026-03-04 10:00:41 -05:00
parent 3097fe9931
commit 4bd062c545
7 changed files with 17 additions and 17 deletions

View File

@@ -640,7 +640,7 @@ Managing evolution requires architectural discipline: cohort-based monitoring fo
### Code and Architecture Debt {#sec-ml-operations-code-architecture-debt-9140}
- Data dependencies and system evolution create debt through implicit coupling. ML systems also accumulate code-level debt patterns that differ from traditional software. Sculley et al. [@sculley2015hidden] identify several that deserve explicit attention.
+ Data dependencies and system evolution create debt through implicit coupling. ML systems also accumulate code-level debt patterns that differ from traditional software. @sculley2015hidden identify several that deserve explicit attention.
*Glue code*\index{Glue Code!integration overhead} dominates ML codebases: systems often require substantial integration code to connect general-purpose ML packages to specific data pipelines and serving systems, with the glue constituting up to 95% of the codebase while the actual ML code represents only 5%. This glue creates tight coupling between package APIs and the surrounding system, meaning that when packages update their interfaces, all glue code must be rewritten. Mitigation requires wrapping ML packages in stable internal APIs and treating external dependencies as substitutable components.
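The mitigation above (wrapping ML packages in stable internal APIs) can be sketched as a thin adapter: glue code targets an internal interface rather than a vendor package's API, so a package swap touches only the adapter. A minimal sketch; the `Predictor` protocol and adapter names are illustrative, not from the text:

```python
from typing import Any, Protocol


class Predictor(Protocol):
    """Stable internal API; the rest of the system depends only on this."""

    def predict(self, features: list[float]) -> float: ...


class SklearnStyleAdapter:
    """Glue code confined to one adapter around a vendor estimator."""

    def __init__(self, model: Any) -> None:
        self._model = model

    def predict(self, features: list[float]) -> float:
        # adapt the vendor's batch-in/batch-out API to the internal contract
        return float(self._model.predict([features])[0])
```

When the vendor package changes its interface, only this adapter is rewritten; callers of `Predictor` are untouched.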
@@ -1147,7 +1147,7 @@ TensorFlow Extended (TFX) emerged from Google's internal ML infrastructure, prod
**Impact**: Open-sourced in 2019, TFX patterns influenced Kubeflow Pipelines, MLflow, and Vertex AI Pipelines. The "transform once, serve everywhere" pattern became industry standard for eliminating training-serving skew.
- *Reference: Baylor et al. [@baylor2017tfx]*
+ *Reference: @baylor2017tfx*
:::
@@ -2219,7 +2219,7 @@ Netflix's monitoring system illustrates these alerting principles at extreme sca
**Results**: The multi-layer approach substantially improved pre-impact detection rates and reduced mean time to detection, while adaptive thresholds kept false positive rates low enough to avoid alert fatigue.
- *Reference: Steck et al. [@steck2021netflix]*
+ *Reference: @steck2021netflix*
:::
@@ -2848,7 +2848,7 @@ The ML Test Score\index{ML Test Score!production readiness} [@breck2020ml] provi
| | Training and serving features are not skewed | Training-serving skew detection |
| | Model staleness triggers retraining | Automated retraining pipelines |
- : **ML Test Score Checklist.** A practical rubric for assessing ML system production readiness. Each test scores 0 (not implemented), 0.5 (partially implemented), or 1 (fully automated). Systems scoring below 5 require significant investment before production deployment; scores above 10 indicate mature operational practices. Based on Breck et al. [@breck2020ml]. {#tbl-ml-test-score}
+ : **ML Test Score Checklist.** A practical rubric for assessing ML system production readiness. Each test scores 0 (not implemented), 0.5 (partially implemented), or 1 (fully automated). Systems scoring below 5 require significant investment before production deployment; scores above 10 indicate mature operational practices. Based on @breck2020ml. {#tbl-ml-test-score}
- **0--5**: High-risk deployment. Critical gaps in reproducibility, monitoring, or validation. Expect frequent incidents and difficulty debugging production issues.
- **5--10**: Developing practices. Basic automation exists but gaps remain. Suitable for low-stakes internal applications with active engineering support.
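The rubric's aggregation is simple enough to sketch directly: sum the per-test scores of 0, 0.5, or 1 and map the total to the readiness bands described above. A toy helper under those assumptions (names are illustrative):

```python
def ml_test_score(test_results: dict[str, float]) -> tuple[float, str]:
    """Sum per-test scores (each 0, 0.5, or 1) and map to a readiness band."""
    if any(v not in (0, 0.5, 1) for v in test_results.values()):
        raise ValueError("each test scores 0, 0.5, or 1")
    total = sum(test_results.values())
    if total < 5:
        band = "high-risk deployment"
    elif total <= 10:
        band = "developing practices"
    else:
        band = "mature operational practices"
    return total, band
```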

View File

@@ -2760,7 +2760,7 @@ Self-attention learns dynamic activation patterns across the input sequence. Unl
\index{Layer Normalization!Transformer}
The Transformer architecture applies this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. Examine the full architecture in @fig-transformer and trace the data flow: input tokens enter at the bottom, pass through repeated blocks of attention and feed-forward layers (each wrapped with residual connections and normalization), and emerge as contextualized representations — all positions processed in parallel rather than sequentially. Transformers have demonstrated significant effectiveness across a wide range of tasks, from natural language processing to computer vision, transforming deep learning architectures across domains.
- ::: {#fig-transformer fig-env="figure" fig-pos="htb" fig-cap="**Transformer Architecture (Encoder-Decoder)**: Complete architecture from Vaswani et al. The encoder (left, repeated $N$ times) consists of multi-head attention followed by feed-forward layers, each with residual connections (arrows bypassing blocks) and layer normalization. The decoder (right) adds masked attention to prevent attending to future tokens during autoregressive generation. Positional encodings\index{Positional Encoding} (sine waves) [@su2024roformer] inject sequence order information absent from the permutation-invariant attention operation. This design enables training parallelism across all positions while the decoder maintains autoregressive causality during inference. Source: Vaswani et al. [@vaswani2017attention]." fig-alt="Encoder-decoder architecture. Encoder: multi-head attention, add-norm, feed-forward, add-norm, repeated Nx. Decoder adds masked attention. Positional encoding sine waves at inputs. Skip connections bypass sublayers. Linear and softmax at top."}
+ ::: {#fig-transformer fig-env="figure" fig-pos="htb" fig-cap="**Transformer Architecture (Encoder-Decoder)**: Complete architecture. The encoder (left, repeated $N$ times) consists of multi-head attention followed by feed-forward layers, each with residual connections (arrows bypassing blocks) and layer normalization. The decoder (right) adds masked attention to prevent attending to future tokens during autoregressive generation. Positional encodings\index{Positional Encoding} (sine waves) [@su2024roformer] inject sequence order information absent from the permutation-invariant attention operation. This design enables training parallelism across all positions while the decoder maintains autoregressive causality during inference. Source: @vaswani2017attention." fig-alt="Encoder-decoder architecture. Encoder: multi-head attention, add-norm, feed-forward, add-norm, repeated Nx. Decoder adds masked attention. Positional encoding sine waves at inputs. Skip connections bypass sublayers. Linear and softmax at top."}
```{.tikz}
\scalebox{0.7}{

View File

@@ -1314,7 +1314,7 @@ This rule follows from the observation that with batch size $B$, the expected gr
::: {.callout-war-story title="Linear Scaling Warmup"}
- Goyal et al. [@goyal2017accurate] demonstrated that linear scaling without warmup causes training instability for large batches. Their warmup schedule increases the learning rate linearly from $\eta_{\text{base}}$ to $k \cdot \eta_{\text{base}}$ over the first $W$ iterations:
+ @goyal2017accurate demonstrated that linear scaling without warmup causes training instability for large batches. Their warmup schedule increases the learning rate linearly from $\eta_{\text{base}}$ to $k \cdot \eta_{\text{base}}$ over the first $W$ iterations:
$$
\eta_t = \eta_{\text{base}} + \frac{t}{W}(k \cdot \eta_{\text{base}} - \eta_{\text{base}}) \quad \text{for } t < W
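The schedule above translates into a one-line rate computation. A toy helper, not Goyal et al.'s implementation:

```python
def warmup_lr(t: int, base_lr: float, k: float, warmup_iters: int) -> float:
    """Linear warmup from base_lr to k * base_lr over the first warmup_iters steps."""
    if t < warmup_iters:
        return base_lr + (t / warmup_iters) * (k * base_lr - base_lr)
    return k * base_lr  # after warmup, hold the scaled rate
```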

View File

@@ -2696,7 +2696,7 @@ Implementing elastic training requires adapting several training components:
Batch size adjustment is the first concern: with fewer workers, each worker must process more samples to maintain the global batch size, or the global batch size must be reduced. Reducing global batch size may require learning rate adjustment.
- The relationship between batch size and optimal learning rate has been studied extensively. Goyal et al. [@goyal2017accurate] demonstrated that a linear scaling rule works well in practice: when scaling the batch size by factor $k$, scale the learning rate also by factor $k$. @eq-lr-scaling expresses an alternative square root scaling law that provides more conservative adjustment:
+ The relationship between batch size and optimal learning rate has been studied extensively. @goyal2017accurate demonstrated that a linear scaling rule works well in practice: when scaling the batch size by factor $k$, scale the learning rate also by factor $k$. @eq-lr-scaling expresses an alternative square root scaling law that provides more conservative adjustment:
$$ \eta_{new} = \eta_{base} \times \sqrt{\frac{N_{new}}{N_{base}}} $$ {#eq-lr-scaling}
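Both rules reduce to a one-line adjustment; a sketch comparing the linear rule with the square-root variant of @eq-lr-scaling (the function name is illustrative):

```python
import math


def scale_lr(base_lr: float, n_base: int, n_new: int, rule: str = "sqrt") -> float:
    """Rescale the learning rate when the batch size grows by n_new / n_base."""
    ratio = n_new / n_base
    if rule == "linear":  # linear scaling rule: eta scaled by the same factor k
        return base_lr * ratio
    return base_lr * math.sqrt(ratio)  # more conservative square-root scaling
```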

View File

@@ -921,7 +921,7 @@ The mathematical implications of elastic scaling interact with the optimization
The first approach, constant batch size scaling, adjusts $B_{per\_worker}$ inversely with $N_{workers}$ to maintain $B_{effective}$ at its original value. When workers are added, each worker processes fewer samples per step; when workers are removed, each processes more. This approach preserves the optimization dynamics exactly (the gradient noise level is unchanged), but changes per-worker compute and memory requirements. If $B_{per\_worker}$ becomes too small after aggressive scale-up, per-GPU utilization drops because the batch cannot saturate the GPU's compute units. If $B_{per\_worker}$ becomes too large after scale-down, it may exceed per-GPU memory capacity.
- The alternative, adaptive batch size scaling, keeps $B_{per\_worker}$ constant and adjusts the learning rate using scaling rules like the linear scaling rule of Goyal et al. [@goyal2017accurate], which increases the learning rate proportionally with batch size. This approach accepts changed optimization dynamics (larger batch sizes produce lower-variance gradient estimates, which may require different learning rate schedules) in exchange for simpler worker management (each worker's compute and memory requirements remain constant regardless of group size). For many practical workloads, adaptive batch size with linear learning rate scaling produces equivalent convergence outcomes, but the interaction between batch size scaling and other training hyperparameters (warmup schedule, weight decay, momentum) requires careful tuning.
+ The alternative, adaptive batch size scaling, keeps $B_{per\_worker}$ constant and adjusts the learning rate using scaling rules like the linear scaling rule of @goyal2017accurate, which increases the learning rate proportionally with batch size. This approach accepts changed optimization dynamics (larger batch sizes produce lower-variance gradient estimates, which may require different learning rate schedules) in exchange for simpler worker management (each worker's compute and memory requirements remain constant regardless of group size). For many practical workloads, adaptive batch size with linear learning rate scaling produces equivalent convergence outcomes, but the interaction between batch size scaling and other training hyperparameters (warmup schedule, weight decay, momentum) requires careful tuning.
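The two strategies can be sketched side by side as a toy rebalancer under the assumptions above (names and the integer-division simplification are illustrative):

```python
def rebalance(global_batch: int, per_worker_batch: int, base_lr: float,
              n_new: int, strategy: str = "constant") -> tuple[int, int, float]:
    """Return (per-worker batch, effective batch, learning rate) after scaling to n_new workers."""
    if strategy == "constant":
        # keep the effective batch (and gradient noise level) fixed; per-worker load changes
        b = global_batch // n_new
        return b, b * n_new, base_lr
    # adaptive: keep per-worker batch fixed; linear-scale the LR with the effective batch
    new_global = per_worker_batch * n_new
    return per_worker_batch, new_global, base_lr * new_global / global_batch
```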
### Framework Support {#sec-fleet-orchestration-elastic-frameworks}

View File

@@ -620,7 +620,7 @@ The constraint is memory: each additional request in the batch requires its own
The continuous batching systems introduced in @sec-inference-scale manage batch size dynamically, adding and removing requests as they complete. Performance engineering's role is to maximize the effective batch size by minimizing the per-request memory footprint, primarily through KV cache compression and weight quantization.
- A critical enabler for large batch sizes is **paged KV cache management**[^fn-paged-attention-os], introduced by vLLM (Kwon et al., 2023). Traditional KV cache implementations pre-allocate contiguous memory for each request's maximum possible sequence length.
+ A critical enabler for large batch sizes is **paged KV cache management**[^fn-paged-attention-os], introduced by vLLM [@kwon2023vllm]. Traditional KV cache implementations pre-allocate contiguous memory for each request's maximum possible sequence length.
[^fn-paged-attention-os]: **Paged Attention**: Named by direct analogy to OS virtual memory paging, where the OS maps non-contiguous physical pages to contiguous virtual addresses. The insight, presented at SOSP 2023, was that the same mechanism eliminates internal fragmentation in KV caches, recovering the 60--80% of GPU memory wasted by worst-case pre-allocation. This single abstraction transformed LLM serving economics by enabling 2--4$\times$ larger batch sizes without any change to model weights or precision. \index{Paged Attention!memory management}
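The paging mechanism can be illustrated with a toy allocator: physical pages are handed out on demand and a per-request block table maps token blocks to them, so no request reserves worst-case contiguous memory. This is a sketch of the idea, not vLLM's implementation:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: block tables map token blocks to physical pages."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free = list(range(num_pages))  # physical pages, shared across requests
        self.tables = {}                    # request id -> (page list, token count)

    def append(self, req_id: str) -> None:
        """Reserve cache space for one more token, allocating a page only when needed."""
        pages, n = self.tables.get(req_id, ([], 0))
        if n % self.page_size == 0:         # current page full (or no page yet)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            pages = pages + [self.free.pop()]
        self.tables[req_id] = (pages, n + 1)

    def release(self, req_id: str) -> None:
        """Return a finished request's pages to the shared pool."""
        pages, _ = self.tables.pop(req_id)
        self.free.extend(pages)
```

Internal fragmentation is bounded by one partially filled page per request, instead of the worst-case pre-allocation the text describes.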
@@ -828,7 +828,7 @@ class FlashAttentionSavings:
flash_savings_factor_str = f"{savings_ratio:.0f}"
```
- FlashAttention, introduced by Dao et al. (2022), reformulates attention as a **tiled** computation. Instead of materializing the full $N \times N$ attention matrix, it processes $Q$, $K$, and $V$ in small blocks that fit in on-chip SRAM. The algorithm loads a block of $Q$ rows and iterates over blocks of $K$ and $V$ columns, computing partial attention scores and maintaining running statistics (online softmax) to produce the exact result without ever storing the full attention matrix in HBM.
+ FlashAttention [@dao2022flashattention] reformulates attention as a **tiled** computation. Instead of materializing the full $N \times N$ attention matrix, it processes $Q$, $K$, and $V$ in small blocks that fit in on-chip SRAM. The algorithm loads a block of $Q$ rows and iterates over blocks of $K$ and $V$ columns, computing partial attention scores and maintaining running statistics (online softmax) to produce the exact result without ever storing the full attention matrix in HBM.
The HBM traffic reduction is dramatic. For a sequence length of `{python} FlashAttentionSavings.head_n_str`, 32 heads, and head dimension `{python} FlashAttentionSavings.head_d_str` in FP16, the na\"ive attention reads and writes approximately `{python} FlashAttentionSavings.naive_mb_str` MB of attention matrices through HBM. FlashAttention reads $Q$, $K$, $V$ and writes $O$ once each, totaling approximately `{python} FlashAttentionSavings.flash_mb_str` MB. This is a `{python} FlashAttentionSavings.savings_str`$\times$ reduction in HBM traffic, translating directly into a proportional speedup for this memory-bound operation.
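The traffic comparison follows from counting bytes moved through HBM. A back-of-envelope helper with illustrative constant factors and assumed example dimensions (the text's actual values are computed inline; exact pass counts depend on implementation details):

```python
def attention_hbm_bytes(seq_len: int, n_heads: int, head_dim: int,
                        dtype_bytes: int = 2) -> tuple[int, int]:
    """Rough HBM traffic: naive materializes N x N scores per head; flash streams Q, K, V, O."""
    # naive: write scores, read for softmax, write probs, read for the PV matmul (~4 passes)
    naive = 4 * n_heads * seq_len * seq_len * dtype_bytes
    # flash: read Q, K, V and write O once each (4 tensors of shape N x head_dim per head)
    flash = 4 * n_heads * seq_len * head_dim * dtype_bytes
    return naive, flash
```

Under these assumed constants the ratio reduces to `seq_len / head_dim`, which is why longer sequences make the savings more dramatic.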
@@ -857,7 +857,7 @@ FlashAttention processes the computation in tiles of size $B_r \times B_c$ (typi
FlashAttention-2 (Dao, 2023) further optimizes the algorithm for modern GPU architectures by restructuring the parallelism pattern. The original FlashAttention parallelizes over batch and head dimensions, meaning each thread block handles one (batch, head) pair and iterates over the full sequence. FlashAttention-2 additionally parallelizes over the sequence dimension of the query matrix, distributing work across thread blocks more efficiently and achieving better occupancy on GPUs with many streaming multiprocessors. It also reduces the number of non-GEMM FLOPs by restructuring the rescaling operations and exploiting the asymmetry between the Q loop (outer) and K/V loop (inner).
- FlashAttention-3 (Dao et al., 2024) targets the H100's new hardware features: FP8 Tensor Cores and the Tensor Memory Accelerator (TMA). By computing attention in FP8 with selective FP16 accumulation, FlashAttention-3 achieves near-peak FP8 utilization for the attention operation, further closing the gap between achieved and theoretical performance.
+ FlashAttention-3 targets the H100's new hardware features: FP8 Tensor Cores and the Tensor Memory Accelerator (TMA). By computing attention in FP8 with selective FP16 accumulation, FlashAttention-3 achieves near-peak FP8 utilization for the attention operation, further closing the gap between achieved and theoretical performance.
::: {.callout-war-story title="The FlashAttention Breakthrough"}
@@ -958,7 +958,7 @@ If moving a 2-byte FP16 weight from memory to compute takes 100 nanoseconds, how
### Block-wise Quantization {#sec-performance-engineering-block-quant}
- Post-training quantization to INT8 or INT4 delivers even greater bandwidth savings for inference, but LLMs present a unique challenge: **outlier features**[^fn-outlier-features]. Dettmers et al. (2022) discovered that large language models develop a small number of hidden dimensions (typically fewer than 1% of all dimensions) with activation magnitudes 10--100$\times$ larger than the rest.
+ Post-training quantization to INT8 or INT4 delivers even greater bandwidth savings for inference, but LLMs present a unique challenge: **outlier features**[^fn-outlier-features]. @dettmers2022llm discovered that large language models develop a small number of hidden dimensions (typically fewer than 1% of all dimensions) with activation magnitudes 10--100$\times$ larger than the rest.
[^fn-outlier-features]: **Outlier Features**: Large-scale transformers develop emergent "outlier" dimensions with activation magnitudes up to 100$\times$ larger than typical values. While these outliers constitute less than 0.1% of all features, clipping them during INT8 quantization destroys the model's reasoning capabilities. This physical property of large models is the reason Post-Training Quantization (PTQ) requires "outlier-aware" strategies like LLM.int8() or AWQ. \index{Outlier Features!quantization challenge}
Applying uniform per-tensor INT8 quantization either clips these outliers, destroying the information they carry, or expands the quantization range to accommodate them, wasting precision on the majority of near-zero values.
@@ -975,11 +975,11 @@ Post-training quantization to INT8 or INT4 delivers even greater bandwidth savin
LLM.int8() solves this by decomposing each matrix multiplication into two parts: a small set of outlier dimensions processed in FP16, and the remaining dimensions processed in INT8. The system identifies outlier dimensions at runtime (those exceeding a magnitude threshold, typically 6.0), routes them to an FP16 GEMM, and routes the remaining dimensions to an INT8 GEMM. The results are combined to produce the final output. This achieves nearly lossless INT8 inference for models that would otherwise degrade substantially under uniform quantization.
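A numerical sketch of the decomposition, assuming symmetric per-tensor scales and dense NumPy matmuls standing in for real FP16/INT8 kernels:

```python
import numpy as np


def int8_decomposed_matmul(x, w, threshold=6.0):
    """Mixed-precision matmul in the spirit of LLM.int8(): outlier dims stay high precision."""
    outlier = np.abs(x).max(axis=0) >= threshold       # runtime outlier detection
    y = np.zeros((x.shape[0], w.shape[1]))
    if outlier.any():                                  # high-precision path for outlier dims
        y += x[:, outlier] @ w[outlier, :]
    keep = ~outlier
    if keep.any():                                     # symmetric INT8 path for the rest
        xs, ws = x[:, keep], w[keep, :]
        sx = np.abs(xs).max() / 127 + 1e-12
        sw = np.abs(ws).max() / 127 + 1e-12
        xq = np.round(xs / sx).astype(np.int8)
        wq = np.round(ws / sw).astype(np.int8)
        y += (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    return y
```

Because the handful of outlier dimensions never pass through the INT8 path, the result stays close to the full-precision product even when one channel is far outside the quantization range.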
- GPTQ (Frantar et al., 2023) takes a different approach: weight-only quantization using second-order information. Instead of quantizing each weight independently, GPTQ processes weights column by column, using the Hessian of the layer's loss surface to determine which quantization errors matter most and redistributing those errors across unquantized columns. This produces INT4 weight representations with minimal accuracy loss, even for models with severe outlier features. The key insight is that quantization error in one weight can be compensated by adjusting correlated weights.
+ GPTQ [@frantar2023gptq] takes a different approach: weight-only quantization using second-order information. Instead of quantizing each weight independently, GPTQ processes weights column by column, using the Hessian of the layer's loss surface to determine which quantization errors matter most and redistributing those errors across unquantized columns. This produces INT4 weight representations with minimal accuracy loss, even for models with severe outlier features. The key insight is that quantization error in one weight can be compensated by adjusting correlated weights.
- AWQ (Activation-Aware Weight Quantization, Lin et al., 2024) observes that not all weights are equally important: weights connected to high-activation channels contribute disproportionately to model output. AWQ identifies these salient weights by analyzing activation magnitudes across a calibration dataset, then applies per-channel scaling to protect them before uniform group quantization. This achieves INT4 weight quantization with quality comparable to GPTQ but with 10--100$\times$ faster quantization time (minutes instead of hours), since it avoids the expensive Hessian computation.
+ AWQ [Activation-Aware Weight Quantization; @lin2024awq] observes that not all weights are equally important: weights connected to high-activation channels contribute disproportionately to model output. AWQ identifies these salient weights by analyzing activation magnitudes across a calibration dataset, then applies per-channel scaling to protect them before uniform group quantization. This achieves INT4 weight quantization with quality comparable to GPTQ but with 10--100$\times$ faster quantization time (minutes instead of hours), since it avoids the expensive Hessian computation.
- SmoothQuant (Xiao et al., 2023) takes yet another approach to the outlier problem. Rather than handling outliers at runtime (LLM.int8()) or through weight optimization (GPTQ, AWQ), SmoothQuant smooths the activation distribution *before* quantization by migrating the quantization difficulty from activations to weights. The key observation is that activation outliers are channel-specific: certain hidden dimensions consistently produce large values across all tokens. SmoothQuant applies a per-channel scaling transformation that divides the activation by a smoothing factor and multiplies the corresponding weight by the same factor. This mathematically equivalent transformation reduces activation outlier magnitudes at the cost of slightly increasing weight magnitudes, making both tensors more amenable to uniform INT8 quantization. The result is efficient W8A8 (weight-8-bit, activation-8-bit) quantization that exploits INT8 Tensor Cores for both bandwidth and compute benefits.
+ SmoothQuant [@xiao2023smoothquant] takes yet another approach to the outlier problem. Rather than handling outliers at runtime (LLM.int8()) or through weight optimization (GPTQ, AWQ), SmoothQuant smooths the activation distribution *before* quantization by migrating the quantization difficulty from activations to weights. The key observation is that activation outliers are channel-specific: certain hidden dimensions consistently produce large values across all tokens. SmoothQuant applies a per-channel scaling transformation that divides the activation by a smoothing factor and multiplies the corresponding weight by the same factor. This mathematically equivalent transformation reduces activation outlier magnitudes at the cost of slightly increasing weight magnitudes, making both tensors more amenable to uniform INT8 quantization. The result is efficient W8A8 (weight-8-bit, activation-8-bit) quantization that exploits INT8 Tensor Cores for both bandwidth and compute benefits.
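The scale migration is an exact reparameterization applied before any quantization happens. A sketch with the commonly used balanced exponent $\alpha = 0.5$ (the epsilon guards are implementation details added here):

```python
import numpy as np


def smooth(x, w, alpha=0.5):
    """SmoothQuant-style scale migration: (x / s) @ (s[:, None] * w) == x @ w exactly."""
    act_max = np.abs(x).max(axis=0)      # per-channel activation magnitude
    wgt_max = np.abs(w).max(axis=1)      # matching per-channel weight magnitude
    s = (act_max ** alpha) / (wgt_max ** (1 - alpha) + 1e-12)
    s = np.maximum(s, 1e-6)              # avoid dividing by ~0 for dead channels
    return x / s, w * s[:, None]
```

The transformed pair produces the same product but with the activation outlier magnitudes pushed down, which is what makes uniform W8A8 quantization viable afterward.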
These four approaches, LLM.int8(), GPTQ, AWQ, and SmoothQuant, represent a progression in the sophistication of quantization techniques for LLMs. LLM.int8() handles outliers at runtime with mixed-precision decomposition but limits compression to INT8. GPTQ uses second-order information for aggressive INT4 weight compression but requires hours of calibration per model. AWQ achieves similar INT4 quality with minutes of calibration by focusing on activation-aware scaling. SmoothQuant enables W8A8 quantization by preprocessing the weight-activation pairs. In practice, AWQ has become the default choice for weight-only quantization in production LLM deployment, while SmoothQuant is preferred when both weight and activation quantization are needed for compute-bound workloads.

View File

@@ -438,7 +438,7 @@ class SilentErrorProbability:
```
The scale of modern GPU clusters transforms these per-device error rates into near-certainty at the system level. @fig-silent-error-probability illustrates this compounding effect: for a cluster of $N$ devices each with per-device silent data corruption probability $p$ per hour, the probability of at least one SDC event is $P(\geq 1) = 1 - (1 - p)^N$. At the rates reported by Meta, which found SDC rates "orders of magnitude higher than soft-error predictions" across hundreds of thousands of machines [@dixit2021silent], silent errors become effectively certain at cluster scales beyond `{python} SilentErrorProbability.n_gpus_certain_str` devices.
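The compounding formula reduces to one line of code; with an assumed per-device rate of $10^{-4}$ per hour, a sketch shows at least one SDC event per hour is already ~63% likely at ten thousand devices:

```python
def sdc_probability(p_per_device: float, n_devices: int) -> float:
    """P(at least one silent data corruption per hour) = 1 - (1 - p)^N."""
    return 1.0 - (1.0 - p_per_device) ** n_devices
```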
- ::: {#fig-silent-error-probability fig-env="figure" fig-pos="htb" fig-cap="**Silent Error Probability at Scale**. Probability of at least one silent data corruption event per hour as a function of cluster size, for three per-device error rates. At the rates reported by Meta (Dixit et al., 2021), which are orders of magnitude above traditional soft-error models, silent errors become effectively certain at cluster scales beyond a few thousand devices." fig-alt="Semilog plot showing three S-curves for per-device SDC rates of 1e-3, 1e-4, and 1e-5. All curves reach probability 1.0 as cluster size grows to 100000 GPUs."}
+ ::: {#fig-silent-error-probability fig-env="figure" fig-pos="htb" fig-cap="**Silent Error Probability at Scale**. Probability of at least one silent data corruption event per hour as a function of cluster size, for three per-device error rates. At the rates reported by Meta, which are orders of magnitude above traditional soft-error models, silent errors become effectively certain at cluster scales beyond a few thousand devices." fig-alt="Semilog plot showing three S-curves for per-device SDC rates of 1e-3, 1e-4, and 1e-5. All curves reach probability 1.0 as cluster size grows to 100000 GPUs."}
```{python}
#| echo: false
@@ -1407,7 +1407,7 @@ These vulnerabilities highlight the urgent need for defense strategies examined
Data poisoning presents a critical challenge to the integrity and reliability of machine learning systems. Unlike adversarial attacks, which perturb inputs at inference time, poisoning corrupts the training data itself---contaminating the model's learned mapping before deployment begins. This distinction is analogous to fooling a trained student during an exam versus giving a student wrong information while they are learning. Both cause incorrect answers, but poisoning is far harder to detect because the model has genuinely learned wrong patterns. As ML systems increasingly ingest data from automated pipelines, web scraping, and crowdsourced annotation, understanding how poisoning occurs and propagates through the system is essential for developing effective defenses.
- First formalized by Biggio et al.[^fn-data-poisoning-attack], poisoning attacks alter existing training samples, introduce malicious examples, or interfere with the data collection pipeline (@fig-dirty-label-example) [@biggio2012poisoning; @shan2023prompt]. The consequences are especially severe in high-stakes domains like healthcare, where even small disruptions to training data can lead to dangerous misdiagnoses [@marulli2022sensitivity].
+ First formalized by @biggio2012poisoning[^fn-data-poisoning-attack], poisoning attacks alter existing training samples, introduce malicious examples, or interfere with the data collection pipeline (@fig-dirty-label-example) [see also @shan2023prompt]. The consequences are especially severe in high-stakes domains like healthcare, where even small disruptions to training data can lead to dangerous misdiagnoses [@marulli2022sensitivity].
[^fn-data-poisoning-attack]: **Data Poisoning**: First formalized by Biggio et al. (2012), poisoning attacks inject malicious samples into training data to corrupt the learning process itself. Unlike adversarial examples that target inference (and can be mitigated at serving time), poisoning embeds vulnerabilities into model weights during training---making detection require auditing billions of training samples rather than filtering individual inference requests. At web-scale data collection, even 0.01% poisoned data can shift decision boundaries. \index{Data Poisoning!training attack}