mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
footnotes: ADD pass — 15 new footnotes across 9 Vol1 chapters
model_serving:
- fn-queuing-divergence: M/M/1 (1-ρ)^-1 math — 70% rule is mathematical, not heuristic
- fn-jevons-paradox: 1865 coal efficiency → inference demand paradox
- fn-speculative-decoding: k·α throughput math, parallel verification mechanism

training:
- fn-backprop-provenance: Linnainmaa 1970 / Werbos 1974 — 12-year adoption lag
- fn-saddle-points: overparameterized landscape geometry — saddle points > local minima
- fn-ridge-point-precision: precision shift moves the ridge point (FP32→BF16 doubles it)

hw_acceleration:
- fn-tensor-core-alignment: 8/16 multiple requirement, 8-16x fallback penalty

frameworks:
- fn-bf16-design: Google Brain 2018 origin, loss-scaling elimination via exponent match

model_compression:
- fn-sparsity-vectorization: SIMD lane waste mechanism, 90% threshold explained

nn_architectures:
- fn-kv-cache-depth: 14 GB weights + 1.07 GB/user math, memory-not-quality constraint

nn_computation:
- fn-batch-norm-cost: sync barrier, small-batch sensitivity, LayerNorm substitution
- fn-algorithm-hardware-lag: Werbos 1974→1986 lag; Bahdanau 2014→Transformer 2017

introduction:
- fn-ai-winters-systems: Lighthill Report + Lisp Machine collapse as systems failures

data_selection:
- fn-labeling-economics: $1k-$3k vs $75k-$150k clinical labeling cost arithmetic
- fn-chinchilla-ratio: D/N diagnostic (GPT-3 at 1.7, LLaMA-2 70B at 28, optimal ~20)
@@ -441,9 +441,7 @@ plt.show()
We can formalize this as the ICR:

$$\text{ICR} = \frac{\Delta \text{Model Performance}}{\Delta \text{FLOPs}}$$

### The ICR Frontier: When Data Becomes a Tax {#sec-data-selection-icr-frontier}

@@ -451,7 +449,7 @@ The Information-Compute Ratio is not constant; it follows a law of diminishing r

Mathematically, let $I(D)$ be the information content of a dataset of size $D$. In a redundant dataset, $I(D)$ often scales logarithmically ($\log D$) while the compute cost $C(D)$ scales linearly ($O \cdot D$). The resulting ICR follows @eq-icr-decay:

$$\text{ICR}(D) = \frac{\frac{d}{dD} I(D)}{\frac{d}{dD} C(D)} \approx \frac{1/D}{O} = \frac{1}{O \cdot D}$$ {#eq-icr-decay}

This decay creates what we call **The Data Wall**\index{Data Wall!zero learning signal}. Beyond the frontier, adding more data yields near-zero learning but still costs linear compute. In this regime, data is no longer an asset; it is a **Data Tax**\index{Data Tax!redundant compute cost} that inflates the $O$ term of the **Iron Law** without improving the accuracy numerator of the **RoC** (Return on Compute, see @sec-introduction-roc-invariant). A systems engineer's goal is to keep the system operating at the "Knee" of the ICR curve, where the learning signal per FLOP is maximized. This motivates the static and dynamic selection techniques that follow.
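The decay in @eq-icr-decay can be sketched numerically. This is a minimal illustration, not a measurement: `O`, the per-example FLOP cost, is a placeholder value chosen only to show the shape of the curve.

```python
# Sketch of ICR decay for a redundant dataset: I(D) ~ log D, C(D) = O * D.
# O is an assumed per-example FLOP cost, for illustration only.
O = 1e9  # FLOPs per training example (hypothetical)

def icr(D, cost_per_example=O):
    """Marginal information per marginal FLOP: (1/D) / O = 1 / (O * D)."""
    return 1.0 / (cost_per_example * D)

# Doubling the dataset halves the marginal return on compute:
for D in [1e4, 2e4, 4e4, 8e4]:
    print(f"D = {D:>8.0f}  ICR = {icr(D):.3e}")
```

The halving pattern in the printout is the Data Wall in miniature: each doubling of $D$ buys half the learning signal per FLOP of the previous doubling.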
@@ -1343,7 +1341,9 @@ Random sampling will miss these rare failures. Instead, the Wake Vision team use
### Semi-Supervised Learning: Using Unlabeled Data {#sec-data-selection-semisupervised-learning-using-unlabeled-data-51fc}

\index{Semi-supervised Learning!definition}
Consider a medical imaging dataset: a hospital has 50,000 chest X-rays, but only 500 have been reviewed and labeled[^fn-labeling-economics] by radiologists---a labeling rate of 1%. Training a supervised model on 500 examples yields poor accuracy, but the structural patterns in the remaining 49,500 unlabeled images contain information about what healthy and abnormal lungs look like. Semi-supervised learning exploits this abundant unlabeled data to improve the model trained on the scarce labeled examples.

[^fn-labeling-economics]: **Clinical Labeling Economics**: A radiologist reviews approximately 50--80 chest X-rays per hour; labeling 500 scans from a pool of 50,000 requires 7--10 hours of specialist time at \$150--300 per hour --- a \$1,000--3,000 investment. Full supervised labeling of all 50,000 would cost \$75,000--150,000 and require months of specialist availability. The 1% labeling threshold is not a pedagogical convenience but a reflection of healthcare economics: semi-supervised learning is not optional in clinical ML, it is the only approach that fits within clinical research budgets. This cost structure generalizes --- any domain requiring credentialed specialists (legal, financial, scientific) faces the same arithmetic. \index{Data Labeling!clinical economics}

Active learning optimizes which samples to label but still requires human annotation for every selected example. **Semi-supervised learning** takes a more aggressive approach: rather than asking *which* samples to label, it asks whether we can extract learning signal from unlabeled data directly. It uses a small set of labeled examples to guide learning on a much larger unlabeled pool, typically achieving 80–95% of fully supervised accuracy with only 10–20% of the labels.
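The labeling-cost arithmetic in the footnote above can be sketched directly. The throughput and billing ranges are the footnote's; the helper function itself is a hypothetical illustration, not a costing tool.

```python
# Sketch of the clinical labeling cost arithmetic (ranges from the footnote).
SCANS_PER_HOUR = (50, 80)   # radiologist review throughput
RATE_PER_HOUR = (150, 300)  # USD per hour of specialist time

def labeling_cost_usd(n_scans):
    """Return (low, high) USD cost to have a radiologist label n_scans."""
    hours_low = n_scans / SCANS_PER_HOUR[1]   # fastest reviewer
    hours_high = n_scans / SCANS_PER_HOUR[0]  # slowest reviewer
    return hours_low * RATE_PER_HOUR[0], hours_high * RATE_PER_HOUR[1]

low, high = labeling_cost_usd(500)
print(f"500 scans: ${low:,.0f} -- ${high:,.0f}")  # roughly the $1,000--3,000 range
```

Scaling the same arithmetic to all 50,000 scans multiplies the cost by two orders of magnitude, which is the economic wall that makes the 1% labeling rate the realistic operating point.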
@@ -3273,7 +3273,9 @@ The frontier provides a practical diagnostic framework:
#### The Chinchilla Rule of Thumb {.unnumbered}

For compute-optimal training[^fn-chinchilla-ratio], the number of training tokens should scale roughly as $D_{opt} \propto C^{0.5}$. Doubling your compute budget means you should increase data by about 40%, not 100%. This explains why the Data Wall is so constraining: as compute grows exponentially, the demand for quality data grows with its square root, but even that slower growth outpaces the supply of high-quality human-generated content.

[^fn-chinchilla-ratio]: **Chinchilla Ratio Diagnostic**: The Chinchilla scaling law provides a practical data-starvation diagnostic: the $D/N$ ratio (training tokens per model parameter). A ratio below 10 indicates severe data starvation; around 20 is compute-optimal; above 40 yields diminishing returns. For reference, GPT-3 was trained at $D/N \approx 1.7$ (175B parameters, 300B tokens) --- chronically undertrained. LLaMA-2 70B at $D/N \approx 28$ is near-optimal. This single ratio is the fastest diagnostic for determining whether a training run is data-limited (add more tokens) or compute-limited (train a smaller model longer) before committing to an expensive run. \index{Chinchilla!D/N ratio diagnostic}
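The footnote's $D/N$ diagnostic is simple enough to sketch as a helper. The thresholds are the footnote's; the 2-trillion-token figure used to reproduce LLaMA-2 70B's ratio is supplied here (it is the model's published training-token count).

```python
# Sketch of the Chinchilla D/N data-starvation diagnostic (thresholds from
# the footnote: <10 starved, ~20 optimal, >40 diminishing returns).
def dn_ratio(tokens, params):
    return tokens / params

def diagnose(ratio):
    if ratio < 10:
        return "data-starved: add tokens"
    if ratio <= 40:
        return "near compute-optimal (~20 is the sweet spot)"
    return "diminishing returns: train a smaller model longer"

gpt3 = dn_ratio(300e9, 175e9)       # ~1.7: chronically undertrained
llama2_70b = dn_ratio(2e12, 70e9)   # ~28.6: near-optimal
print(f"GPT-3 D/N = {gpt3:.1f}: {diagnose(gpt3)}")
print(f"LLaMA-2 70B D/N = {llama2_70b:.1f}: {diagnose(llama2_70b)}")
```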
#### Applying the Diagnostic {.unnumbered}
@@ -2088,7 +2088,9 @@ Mixed precision exploits a hardware asymmetry to improve two Iron Law terms simu
Frameworks exploit this through automatic mixed-precision APIs that select reduced precision for compute-intensive operations while maintaining FP32 where numerical stability demands it. Inside these APIs, frameworks automatically apply precision rules: matrix multiplications and convolutions use FP16 for bandwidth efficiency, while numerically sensitive operations like softmax and layer normalization remain in FP32. This selective precision maintains accuracy while achieving speedups on modern GPUs with specialized hardware units. Because FP16 has a narrower dynamic range than FP32, gradients can underflow to zero during backpropagation. Loss scaling addresses this by multiplying the loss by a large factor before the backward pass, then dividing gradients by the same factor afterward.

Frameworks also support multiple precision formats including FP16, BF16[^fn-bf16-design], and TF32, each with different trade-offs between range and precision. BF16 maintains FP32's dynamic range, simplifying training by eliminating most gradient underflow issues and removing the need for loss scaling entirely. @sec-model-training examines the mechanics of mixed-precision training in detail, including loss scaling algorithms, memory savings analysis, and numerical stability considerations. @lst-autocast-usage demonstrates PyTorch's mixed precision API: the `autocast` context manager automatically selects FP16 for compute-intensive operations while `GradScaler` prevents gradient underflow by dynamically scaling loss values.

[^fn-bf16-design]: **BFloat16 Design Rationale**: Developed by Google Brain circa 2018 specifically for TPU training stability, BF16 preserves FP32's 8-bit exponent range while halving memory footprint — an explicit trade-off of mantissa precision (7 bits vs. FP16's 10) for dynamic range. The critical consequence is loss scaling elimination: FP16's 5-bit exponent causes gradient underflow for values below $6 \times 10^{-5}$, requiring manual loss scaling to keep gradients in range. BF16's FP32-matched exponent makes this entire class of training instability impossible, which is why BF16 and FP16 are not interchangeable: BF16 is preferred when training stability matters; FP16 is preferred when numerical precision matters more than gradient stability. \index{BFloat16!design rationale}
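The underflow mechanism that motivates loss scaling (and that BF16's wider exponent sidesteps) can be demonstrated directly with NumPy's `float16`; the gradient magnitude and scale factor below are illustrative values, not framework defaults.

```python
import numpy as np

# FP16 underflows small magnitudes that FP32 (and BF16) represent easily.
tiny_grad = 1e-8                    # an illustrative late-training gradient
print(np.float16(tiny_grad))        # underflows to 0.0: the update is silently lost

# Loss scaling: multiply the loss (hence all gradients) by S before the
# backward pass, then unscale the FP32 master gradient afterward.
S = 1024.0
scaled = np.float16(tiny_grad * S)  # now representable in FP16
recovered = float(scaled) / S       # unscale in FP32; close to the true gradient
print(scaled, recovered)
```

The same experiment in BF16 needs no scale factor at all, because BF16 shares FP32's exponent range; that is exactly the design trade described in the footnote above.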
::: {#lst-autocast-usage lst-cap="**Mixed-Precision API**: Modern frameworks provide automatic mixed-precision support through context managers that handle precision selection and numerical stability."}
```{.python}
@@ -1294,7 +1294,9 @@ The listing above shows a CUDA[^fn-cuda-ecosystem] kernel where SIMT execution a
\index{Tensor Cores!definition}
Consider a single transformer attention head computing the $Q \times K^T$ product for a 2048-token sequence with 64-dimensional embeddings. This operation requires multiplying a $2048 \times 64$ matrix by a $64 \times 2048$ matrix — roughly 268 million multiply-accumulate operations, or 537 million floating-point operations. On a scalar processor executing one operation per cycle at 2 GHz, this single attention head would take 268 milliseconds. A GPU's SIMT execution reduces this to roughly 34 milliseconds through thread-level parallelism. But a tensor core, processing entire $16 \times 16$ matrix tiles per instruction, completes the same operation in under 0.5 milliseconds — a 500$\times$ improvement over scalar execution. This dramatic speedup arises not from faster clock speeds but from a fundamentally different approach to organizing computation around matrix blocks rather than individual elements.

While SIMD and SIMT units provide efficient execution of vector operations, neural networks rely heavily on matrix computations\index{Matrix Operations!neural network workloads} that require specialized execution units for structured multi-dimensional processing. The energy economics of matrix operations drive this specialization: traditional scalar processing can require multiple off-chip memory accesses per operation, while tensor cores\index{Tensor Cores!energy efficiency} amortize data movement across entire matrix blocks. Tensor processing units extend SIMD and SIMT principles by enabling efficient matrix operations through dedicated hardware blocks (**tensor cores**) that execute matrix multiplications and accumulations on matrix tiles[^fn-tensor-core-alignment]. In many cases, this shifts the dominant cost from off-chip data movement toward on-chip reuse and arithmetic, depending on the kernel mix and memory behavior.

[^fn-tensor-core-alignment]: **Tensor Core Dimension Alignment**: NVIDIA Tensor Cores require matrix dimensions that are multiples of 8 (FP16) or 16 (BF16/INT8) to engage; non-aligned dimensions force scalar fallback to CUDA cores, reducing effective throughput by 8--16$\times$. This is why model architects pad embedding dimensions to the nearest multiple of 64 and why batch-size-1 inference frequently fails to engage Tensor Cores — the alignment failure, not compute intensity, is the binding constraint. A layer with 512 output features runs 8$\times$ faster than one with 500 features at identical FLOP count, making dimension alignment a first-class performance design decision. \index{Tensor Cores!dimension alignment}

Tensor cores[^fn-tensor-core-origin] provide an example of this approach. @lst-tensor_core_op exposes matrix computation capabilities through specialized instructions that use dedicated hardware blocks.
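The attention-head arithmetic above is worth checking by hand. This sketch reproduces the operation counts and the scalar-processor timing; the 2 GHz one-op-per-cycle machine is the passage's idealized baseline, not a real device.

```python
# Operation counts for one attention head: Q (2048 x 64) times K^T (64 x 2048).
N, d = 2048, 64
macs = N * N * d      # multiply-accumulates: 268,435,456
flops = 2 * macs      # counting multiply and add separately: ~537 million

def runtime_ms(ops, ops_per_second):
    """Idealized runtime for a device sustaining ops_per_second."""
    return 1e3 * ops / ops_per_second

scalar_ms = runtime_ms(flops, 2e9)  # 1 op/cycle at 2 GHz -> ~268 ms
print(f"MACs = {macs:,}, FLOPs = {flops:,}, scalar time = {scalar_ms:.0f} ms")
```

The SIMT (~34 ms) and tensor-core (<0.5 ms) figures in the text correspond to the same `flops` total divided by much higher sustained rates; the operation count never changes, only how it is scheduled onto hardware.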
@@ -270,7 +270,9 @@ AI's evolution reveals a progression of bottlenecks, each overcome by systems in
[^fn-eliza-brittleness]: **ELIZA**: A 1966 natural language program that ran on 256 KB mainframes using pattern-matching rules with no learned state — its brittleness was a direct systems consequence of zero memory across turns. Every new input variation required a new hand-written rule, making maintenance cost grow faster than capability and foreshadowing the knowledge bottleneck that killed expert systems a decade later. \index{ELIZA!brittleness}

The timeline below reveals a recurring pattern: periods of intense optimism followed by "AI winters"[^fn-ai-winters-systems] when funding collapsed, each triggered by systems limitations that algorithms alone could not overcome. @fig-ai-timeline captures this boom-and-bust rhythm across seven decades: notice how each winter arrives precisely when the dominant paradigm hits its systems ceiling, and each resurgence follows a breakthrough in engineering infrastructure rather than in algorithms alone. Each era represents a paradigm shift attempting to overcome the limitations of the previous approach.

[^fn-ai-winters-systems]: **AI Winters as Systems Failures**: The first AI winter (1974--1980) was precipitated by the 1973 Lighthill Report, which concluded that AI had failed to deliver on its promises --- but the underlying cause was a mismatch between algorithm ambition and available compute. The second winter (1987--1993) was triggered by the collapse of the Lisp Machine market when cheaper general-purpose workstations undercut specialized AI hardware. Both winters are explicitly systems failures, not algorithmic dead ends: the algorithms were mathematically sound but required hardware that did not yet exist. The same compute-constraint pattern drove neural network research underground from 1969 (Minsky's perceptron critique) to 1986 (backpropagation revival on faster hardware). \index{AI Winters!systems failure}

::: {#fig-ai-timeline fig-env="figure" fig-pos="t!" fig-cap="**AI Development Timeline.** A chronological curve traces AI research activity from the 1950s to the 2020s, with gray bands marking the two AI Winter periods (1974 to 1980, 1987 to 1993). Callout boxes highlight key milestones including the Turing Test [@turing1950computing], the Dartmouth conference [@mccarthy1956dartmouth], the Perceptron, ELIZA, Deep Blue, and GPT-3." fig-alt="Timeline from 1950 to 2020 with red line showing AI publication frequency. Gray bands mark two AI Winters (1974-1980, 1987-1993). Callout boxes mark milestones: Turing 1950, Dartmouth 1956, Perceptron 1957, ELIZA 1966, Deep Blue 1997, GPT-3 2020."}
```{.tikz}
@@ -240,7 +240,9 @@ bf_collapse_latency_s_str = BlackFridayCalc.bf_collapse_latency_s_str
Trace the curve in @fig-tail-latency-explosion and notice how latency remains manageable until utilization crosses roughly 70%, then explodes—this is *why* production systems must run at relatively low utilization (40–60%) to guarantee stable tail latency\index{Tail Latency!utilization threshold} (p99). For a mathematical treatment of long-tailed distributions and why P99 latency becomes the *median* user experience at scale, see @sec-data-foundations-distributions-long-tail-901f. The curve is a simple queueing approximation intended for intuition rather than a specific workload.

Beyond the technical limits of latency, the economics of serving have undergone a radical transformation. As models become more efficient and hardware becomes more specialized, the cost of "intelligence" is collapsing[^fn-jevons-paradox]. To grasp the speed of this collapse, examine the log-scale price trajectory in @fig-intelligence-deflation, which tracks public API list prices as a market proxy.

[^fn-jevons-paradox]: **Jevons Paradox**: William Stanley Jevons observed in 1865 that efficiency improvements in coal-powered steam engines *increased* total coal consumption by making steam power economically viable for applications previously too costly. The same dynamic governs AI inference: each 10$\times$ cost reduction opens application classes that were economically infeasible at the previous price point, expanding aggregate demand by more than the efficiency gain. This is why cheaper inference reliably increases, not decreases, total GPU fleet demand — efficiency and demand are complements in AI, not substitutes. \index{Jevons Paradox!inference demand}

::: {#fig-intelligence-deflation fig-env="figure" fig-pos="htb" fig-cap="**Intelligence Deflation**: Cost per 1M output tokens (USD) over time (Log Scale). Prices are based on public API list prices (2020–2025) and are intended as a market trend indicator, not a controlled comparison. The cost of token generation has collapsed by multiple orders of magnitude, transforming the economics of automated AI workflows." fig-alt="Line plot showing token pricing collapsing from \$20/M tokens in 2020 to <\$0.10/M tokens in 2025. Log scale highlights the deflationary trend with models from OpenAI, Anthropic, Google, and DeepSeek."}
```{python}
@@ -1739,7 +1741,9 @@ $$W = \frac{1}{\mu - \lambda} = \frac{\text{service time}}{1 - \rho}$$ {#eq-mm1-
where $\mu$ is the service rate (requests per second the server can handle), and $\rho = \lambda/\mu$ is the utilization\index{Utilization!latency relationship}\index{M/M/1 Queue!wait time formula} (fraction of time the server is busy).

This equation reveals why serving systems exhibit nonlinear behavior: small increases in load near capacity cause disproportionate latency increases[^fn-queuing-divergence]. @tbl-utilization-latency quantifies this relationship, showing how average time in system grows rapidly as utilization approaches 100%.

[^fn-queuing-divergence]: **Super-Linear Latency Divergence**: The 70% utilization threshold follows directly from M/M/1 queuing theory: mean response time $E[T] = \frac{1/\mu}{1-\rho}$, where $\rho = \lambda/\mu$ is utilization. The $(1-\rho)^{-1}$ term diverges as $\rho \to 1$: at $\rho = 0.7$, mean response time is already $3.3\times$ the base service time; at $\rho = 0.9$ it is $10\times$. This is not a conservative heuristic but a mathematical inevitability — there is no "stretching" from 80% to 90% utilization without disproportionate tail latency growth. \index{Queuing Theory!utilization divergence}

The M/M/1 model assumes exponentially distributed service times, but ML inference typically has near-constant service time for fixed batch sizes, making the M/D/1\index{M/D/1 Queue!deterministic service} (deterministic service) model more accurate in practice. We use M/M/1 here because it yields closed-form solutions and produces conservative estimates. For M/D/1 queues, average wait time is approximately half of M/M/1 at the same utilization, which matters for capacity planning: M/M/1 analysis will slightly over-provision, erring on the side of meeting SLOs rather than violating them.[^fn-kendall-notation-serving]
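The M/M/1 divergence described above takes only a few lines to verify. The 50 ms service time is an illustrative assumption; only the $(1-\rho)^{-1}$ multiplier matters.

```python
# Mean time in system for an M/M/1 queue: W = (1/mu) / (1 - rho).
def mm1_mean_time(service_time_s, rho):
    """Average time in system at utilization rho (stable only for rho < 1)."""
    if not 0 <= rho < 1:
        raise ValueError("queue is unstable at rho >= 1")
    return service_time_s / (1.0 - rho)

service = 0.050  # assumed 50 ms per request
for rho in (0.5, 0.7, 0.9, 0.95):
    w = mm1_mean_time(service, rho)
    print(f"rho={rho:.2f}  W={1e3 * w:6.1f} ms  ({w / service:4.1f}x service time)")
```

The printout reproduces the footnote's numbers: 3.3x at 70% utilization, 10x at 90%. For the M/D/1 refinement mentioned above, the *queueing* component of the wait is roughly half of M/M/1's at the same utilization, so this model errs on the conservative side.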
@@ -3235,7 +3239,9 @@ The memory pressure from KV caches can be further mitigated through architectura
When the aggregate KV cache exceeds GPU VRAM, systems can employ **KV Cache Offloading**\index{KV Cache!offloading}. This strategy spills inactive or low-priority context windows to host CPU RAM or NVMe SSD, freeing VRAM for active generation. While retrieving offloaded context introduces a latency "tax" due to PCIe bandwidth limits (@sec-model-serving-model-swapping-host-memory-c54f), it prevents Out-of-Memory (OOM) failures and enables handling much larger context windows than the hardware could otherwise support.

Advanced techniques including speculative decoding[^fn-speculative-decoding] and distributed parallelism are covered in specialized treatments of large-scale systems.

[^fn-speculative-decoding]: **Speculative Decoding**: A small "draft" model generates $k$ candidate tokens autoregressively; the large target model then verifies all $k$ in a single parallel forward pass. When the draft model's proposals are accepted at rate $\alpha$, effective throughput scales as $k \cdot \alpha$ — but verification is parallel, so wall-clock cost is approximately one large-model step regardless of $k$. At $\alpha = 0.8$ with $k = 4$, speculative decoding delivers roughly 3.2$\times$ throughput improvement over sequential decoding without modifying the target model. This breaks the serial autoregressive bottleneck at the runtime layer, not the architecture layer. \index{Speculative Decoding!throughput}

The computational intensity of managing KV caches across concurrent requests raises a broader question: *what* is the energy cost of each token generated? Unlike classification models where energy per inference is constant, LLM energy consumption scales with response length—every generated token requires reading the entire model from memory. Quantifying *the carbon cost of a chat* translates these hardware demands into energy and carbon metrics that make the environmental impact concrete.
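The footnote's $k \cdot \alpha$ throughput model can be sketched as a first-order estimator. The `draft_cost` parameter is an added assumption here (the footnote's figure treats drafting as free); it shows how a non-negligible draft model erodes the idealized speedup.

```python
# First-order throughput model of speculative decoding (footnote's k*alpha
# simplification, plus an optional draft-cost term as an assumption).
def spec_speedup(k, alpha, draft_cost=0.0):
    """Accepted tokens per unit of target-model time, vs. sequential decoding.

    draft_cost: per-token draft time as a fraction of one target step.
    The footnote's k*alpha figure corresponds to draft_cost = 0.
    """
    accepted_per_round = k * alpha         # footnote's simplified acceptance model
    time_per_round = 1.0 + k * draft_cost  # one parallel target step + k draft steps
    return accepted_per_round / time_per_round

print(spec_speedup(4, 0.8))        # footnote's 3.2x, free-draft idealization
print(spec_speedup(4, 0.8, 0.05))  # with a draft 5% the target's cost per token
```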
@@ -2892,7 +2892,9 @@ Inference is autoregressive (generating one token at a time) and typically *memo
3. Read/Write the **KV Cache**\index{KV Cache}.

\index{KV Cache!linear growth}
The **KV Cache**[^fn-kv-cache-depth] grows linearly with sequence length ($O(N \cdot d)$, distinct from the $O(N^2)$ attention score matrix during training), storing the Key and Value vectors for all previous tokens to avoid recomputing them. For long contexts, this cache becomes massive (e.g., 100+ GB), forcing the system to fetch the full cache from HBM for every generated token. As the GPT-2 Lighthouse quantified above, the arithmetic intensity drops to $\approx 1$ Op/Byte, explaining why serving LLMs requires massive HBM bandwidth (e.g., H100's 3 TB/s) rather than raw FLOPS.

[^fn-kv-cache-depth]: **KV Cache Memory Scaling**: For a 7B-parameter Transformer in FP16, model weights consume ~14 GB. A single concurrent request's KV cache requires: 32 layers × 2 (K,V) × 32 heads × 2048 tokens × 128-dim head × 2 bytes ≈ 1.07 GB. At 8 concurrent users, KV cache alone (~8.6 GB) rivals the model weights — and grows linearly with both context length and concurrent users. Scaling serving throughput therefore requires grouped-query attention (fewer KV heads), shorter context windows, or KV offloading strategies. This is a memory systems constraint, not a model quality trade-off: the model is identical regardless of which memory strategy is chosen. \index{KV Cache!memory scaling}

This implementation reveals three key computational characteristics. Self-attention enables parallel processing across all positions in the sequence, mapping efficiently to modern hardware during training. However, the quadratic complexity creates a training bottleneck for long sequences. And the autoregressive nature of inference creates a bandwidth bottleneck, making memory speed—not compute speed—the primary determinant of generation latency.
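The KV-cache arithmetic in the footnote is easy to reproduce as a small calculator; the 7B-class configuration (32 layers, 32 heads of dimension 128, FP16, 2048-token context) is the footnote's example.

```python
# KV cache size: layers x 2 (K and V) x heads x seq_len x head_dim x bytes.
def kv_cache_bytes(layers, n_heads, head_dim, seq_len, bytes_per_el=2):
    return layers * 2 * n_heads * seq_len * head_dim * bytes_per_el

weights_gb = 7e9 * 2 / 1e9                       # 7B params in FP16: 14 GB
per_user = kv_cache_bytes(32, 32, 128, 2048)     # ~1.07 GB per concurrent request
print(f"weights: {weights_gb:.0f} GB")
print(f"KV cache per user: {per_user / 1e9:.2f} GB")
print(f"KV cache for 8 users: {8 * per_user / 1e9:.1f} GB")  # rivals the weights
```

Note how the two levers in the footnote map directly onto arguments: grouped-query attention shrinks `n_heads` on the K/V side, and shorter contexts shrink `seq_len`; both scale the cache linearly.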
@@ -620,7 +620,9 @@ Deep learning evolved to meet these challenges through concurrent advances in ha
|
||||
\index{Hinton, Geoffrey!backpropagation}
|
||||
The **backpropagation**\index{Backpropagation!historical development} algorithm, first applied to neural networks by Paul Werbos in his 1974 PhD thesis and building on Seppo Linnainmaa's 1970 work on automatic differentiation, was popularized by Rumelhart, Hinton, and Williams in 1986 [@rumelhart1986learning][^fn-backprop-credit-assignment]. Their publication demonstrated the algorithm's practical effectiveness and brought it to widespread attention in the machine learning community, triggering renewed interest in neural networks. The systems-level implementation of this algorithm is detailed in @sec-model-training. Despite this breakthrough, the computational demands far exceeded available hardware capabilities. Training even modest networks could take weeks, making experimentation and practical applications challenging. This mismatch between algorithmic requirements and hardware capabilities contributed to a period of reduced interest in neural networks.
|
||||
|
||||
This historical trajectory offers an important lesson in systems engineering: a groundbreaking algorithm is only as powerful as the hardware available to execute it. The decades-long gap between the mathematical formulation of backpropagation and its widespread adoption was not a failure of theory, but a latency in infrastructure. It teaches us that efficient ML systems engineering is not just about designing for the best math, but co-designing for the available silicon. The eventual deep learning revolution was sparked not by a new mathematical discovery alone, but by the convergence of data availability, algorithmic maturity, and the parallel processing power of GPUs.
|
||||
This historical trajectory offers an important lesson in systems engineering: a groundbreaking algorithm is only as powerful as the hardware available to execute it. The decades-long gap between the mathematical formulation of backpropagation[^fn-algorithm-hardware-lag] and its widespread adoption was not a failure of theory, but a latency in infrastructure. It teaches us that efficient ML systems engineering is not just about designing for the best math, but co-designing for the available silicon. The eventual deep learning revolution was sparked not by a new mathematical discovery alone, but by the convergence of data availability, algorithmic maturity, and the parallel processing power of GPUs.
|
||||
|
||||
[^fn-algorithm-hardware-lag]: **Algorithm-Hardware Adoption Lag**: Backpropagation was mathematically complete by 1974 (Werbos) but not widely adopted until 1986 — a 12-year gap explained by insufficient compute: training a meaningful network required hardware that did not exist. The pattern recurs: attention mechanisms were formalized in 2014 (Bahdanau) but required TPU-scale infrastructure (2017) before Transformers became practical. The implication is that apparently "failed" algorithms may simply be hardware-premature. An engineer evaluating today's computationally intractable techniques should ask not "does this work?" but "what hardware would make this work?" \index{Algorithm Development!hardware latency}
|
||||
|
||||
[^fn-backprop-credit-assignment]: **Backpropagation**: Short for "backward propagation of errors," the algorithm solves the credit assignment problem---determining which of millions of weights caused a given error---using the chain rule. Werbos applied it to neural networks in 1974, but the 1986 Rumelhart, Hinton, and Williams publication demonstrated practical effectiveness. The systems cost: backprop requires storing all forward-pass activations, roughly doubling memory consumption compared to inference alone. \index{Backpropagation!memory cost}
@@ -1317,7 +1319,9 @@ $$ \text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text
ReLU's characteristic shape—a straight line for positive inputs and zero for negative inputs—provides three advantages that explain its dominance. First, gradient flow remains intact: for positive inputs, ReLU's gradient is exactly 1, allowing gradients to propagate unchanged through many layers and preventing the vanishing gradient problem that plagues sigmoid and tanh in deep architectures. Second, ReLU introduces natural *sparsity* by zeroing all negative activations. Typically, about 50% of neurons in a ReLU network output zero for any given input, reducing overfitting and improving interpretability. Third, computational efficiency\index{Activation Function!ReLU!computational efficiency} improves dramatically: unlike sigmoid and tanh, which require expensive exponential calculations, ReLU is computed with a single comparison—`output = (input > 0) ? input : 0`—translating to faster execution and lower energy consumption, particularly important on resource-constrained devices.
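The comparison-only computation described above can be sketched in a few lines of NumPy; for zero-mean pre-activations roughly half the outputs land at exactly zero (a minimal illustration, not tied to any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)         # zero-mean pre-activations

relu = np.maximum(x, 0.0)               # one comparison per element, no exponentials
grad = (x > 0).astype(np.float64)       # gradient is exactly 1 where the unit is active

sparsity = float(np.mean(relu == 0.0))  # fraction of units switched off
print(f"activation sparsity: {sparsity:.2f}")
```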
\index{Batch Normalization!dead neuron mitigation}
ReLU is not without drawbacks. The **dying ReLU problem**\index{ReLU!dying ReLU problem}—neurons that permanently output zero and cease learning—occurs when neurons become stuck in the inactive state. If a neuron's weights evolve during training such that the pre-activation $z = \mathbf{w}^T\mathbf{x} + b$ is consistently negative across all training examples, the neuron outputs zero for every input. Since ReLU's gradient is also zero for negative inputs, no gradient flows back through this neuron during backpropagation: the weights cannot update, and the neuron remains dead. This can happen with large learning rates that push weights into unfavorable regions. From a systems perspective, dead neurons represent wasted capacity—parameters that consume memory and compute during inference but contribute nothing to the output. In extreme cases, 10–40% of a network's neurons can die during training, effectively reducing model capacity without reducing resource consumption. Careful initialization [@he2015delving], moderate learning rates, and architectural choices (leaky ReLU variants or batch normalization [@ioffe2015batch]) help mitigate this issue.
ReLU is not without drawbacks. The **dying ReLU problem**\index{ReLU!dying ReLU problem}—neurons that permanently output zero and cease learning—occurs when neurons become stuck in the inactive state. If a neuron's weights evolve during training such that the pre-activation $z = \mathbf{w}^T\mathbf{x} + b$ is consistently negative across all training examples, the neuron outputs zero for every input. Since ReLU's gradient is also zero for negative inputs, no gradient flows back through this neuron during backpropagation: the weights cannot update, and the neuron remains dead. This can happen with large learning rates that push weights into unfavorable regions. From a systems perspective, dead neurons represent wasted capacity—parameters that consume memory and compute during inference but contribute nothing to the output. In extreme cases, 10–40% of a network's neurons can die during training, effectively reducing model capacity without reducing resource consumption. Careful initialization [@he2015delving], moderate learning rates, and architectural choices (leaky ReLU variants or batch normalization[^fn-batch-norm-cost] [@ioffe2015batch]) help mitigate this issue.
[^fn-batch-norm-cost]: **Batch Normalization Systems Cost**: BatchNorm adds two learned parameters per feature (scale γ and shift β), a synchronization barrier during training (requiring all-reduce across the batch dimension), and diverges in computational graph structure between training (live mean/variance from the batch) and inference (frozen running statistics). The critical failure mode is small-batch sensitivity: batch sizes below 8–16 produce noisy mean/variance estimates that degrade accuracy by 3–8%, forcing larger batches and more memory. This coupling between a regularization technique and hardware batch-size constraints is why LayerNorm replaced BatchNorm in Transformers — LayerNorm normalizes across features, not batch, making its statistics independent of batch size. \index{Batch Normalization!systems cost}
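A minimal NumPy sketch of the train/inference divergence and the small-batch noise described in the footnote (illustrative only; real frameworks fuse these operations and all-reduce the batch statistics across devices):

```python
import numpy as np

rng = np.random.default_rng(1)

def batchnorm_train(x, running, momentum=0.9, eps=1e-5):
    """Training mode: normalize with live batch statistics and update running
    stats (in a data-parallel job this mean/var is an all-reduce sync barrier)."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return (x - mu) / np.sqrt(var + eps)

def batchnorm_eval(x, running, eps=1e-5):
    """Inference mode: frozen running statistics, so the graph differs from training."""
    return (x - running["mean"]) / np.sqrt(running["var"] + eps)

features = 8
running = {"mean": np.zeros(features), "var": np.ones(features)}
y = batchnorm_train(rng.standard_normal((256, features)), running)

# Small-batch sensitivity: the std of the batch-mean estimate scales as 1/sqrt(B)
noise = {B: float(np.std([rng.standard_normal((B, features)).mean(axis=0)
                          for _ in range(300)]))
         for B in (4, 256)}
print(noise)
```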
##### Softmax {#sec-neural-computation-softmax-ebe5}
@@ -5577,7 +5577,9 @@ Sparsity can emerge naturally during training, often as a result of regularizati
Sparsity in neural networks falls into two broad categories: unstructured sparsity and structured sparsity.
\index{Sparsity!unstructured}
Unstructured sparsity occurs when individual weights are set to zero without any specific pattern, typically through magnitude-based pruning. While highly flexible, unstructured sparsity is less efficient on hardware because it lacks a predictable structure. Exploiting it requires specialized hardware or software optimizations.
Unstructured sparsity occurs when individual weights are set to zero without any specific pattern, typically through magnitude-based pruning. While highly flexible, unstructured sparsity is less efficient on hardware because it lacks a predictable structure[^fn-sparsity-vectorization]. Exploiting it requires specialized hardware or software optimizations.
[^fn-sparsity-vectorization]: **Unstructured Sparsity and SIMD Waste**: Modern CPUs and GPUs process data in vector registers 8--32 elements wide. Unstructured sparsity scatters non-zero elements randomly through memory, so loading a 16-element vector register may yield only 1--2 non-zeros — wasting 14--15 compute lanes while still paying the full memory access cost. The processor cannot skip zero elements without first loading them. Below ~90% sparsity, this SIMD lane waste dominates any arithmetic savings, explaining why structured sparsity (which packs non-zeros contiguously) achieves speedups at 50% sparsity while unstructured sparsity requires 90%+ to break even. \index{Sparsity!SIMD vectorization}
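The break-even claim can be made concrete with a back-of-the-envelope model (an illustrative assumption, not a benchmark): treat runtime as proportional to vector registers loaded, and allow a register to be skipped only when every one of its lanes is zero:

```python
LANES = 16  # FP32 elements per vector register, e.g. AVX-512

def skippable_fraction(s, lanes=LANES):
    # Probability a register holds only zeros (non-zeros scattered at random),
    # i.e. the only registers whose load can be skipped entirely.
    return s ** lanes

def unstructured_speedup(s, lanes=LANES):
    # Memory-bound model: runtime proportional to registers actually loaded;
    # non-skippable registers cost full price even if most lanes hold zeros.
    return 1.0 / (1.0 - skippable_fraction(s, lanes))

def structured_speedup(s):
    # Pruned rows/channels are never loaded at all.
    return 1.0 / (1.0 - s)

for s in (0.5, 0.9, 0.99):
    print(f"sparsity {s:.2f}: unstructured ~{unstructured_speedup(s):.2f}x, "
          f"structured ~{structured_speedup(s):.1f}x")
```

Under this model 50% unstructured sparsity yields essentially no speedup, while the same sparsity in structured form halves the work.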
\index{Sparsity!structured}
Structured sparsity involves removing entire components of the network, such as filters, neurons, or channels. Because these removals produce predictable memory access patterns, structured sparsity is more efficient on hardware accelerators like GPUs or TPUs. It is the preferred approach when deployment requires predictable computational resource usage.
@@ -639,7 +639,9 @@ Four dimensions structure this cost analysis. First, FLOP counts of matrix opera
### Neural Network Computation {#sec-model-training-neural-network-computation-5660}
\index{Backpropagation!historical introduction}\index{BLAS!matrix computation foundation}
Neural network training consists of repeated matrix operations and nonlinear transformations. These operations are conceptually simple but create the system-level challenges that dominate modern training infrastructure. The introduction of backpropagation by @rumelhart1986learning and the development of efficient matrix computation libraries such as BLAS[^fn-blas-training] [@dongarra1988extended] laid the groundwork for modern training architectures.
Neural network training consists of repeated matrix operations and nonlinear transformations. These operations are conceptually simple but create the system-level challenges that dominate modern training infrastructure. The introduction of backpropagation[^fn-backprop-provenance] by @rumelhart1986learning and the development of efficient matrix computation libraries such as BLAS[^fn-blas-training] [@dongarra1988extended] laid the groundwork for modern training architectures.
[^fn-backprop-provenance]: **Backpropagation Provenance**: The algorithm was independently derived by Linnainmaa (1970) for automatic differentiation of computer programs and by Werbos (1974) in a Harvard PhD thesis on economic modeling --- over a decade before Rumelhart, Hinton, and Williams popularized it for neural networks in 1986. This 12-year latency between derivation and adoption recurs in ML systems history: attention mechanisms were formalized in 2014 (Bahdanau) but required TPU-scale hardware before the Transformer became the dominant architecture in 2017. Every modern framework's `backward()` call implements Linnainmaa's reverse-mode AD, not the textbook chain rule --- the difference is that reverse topological traversal of the graph enables parallel gradient computation across independent subgraphs. \index{Backpropagation!provenance}
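A minimal sketch of reverse-mode AD in the style Linnainmaa described: gradients accumulate by walking the computation graph in reverse topological order, which is what a framework's `backward()` call does under the hood (a toy scalar version, not any framework's actual implementation):

```python
class Value:
    """Minimal scalar reverse-mode AD node."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data, self.grad = data, 0.0
        self.parents, self.grad_fns = parents, grad_fns

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Build reverse topological order, then push gradients back through it.
        order, seen = [], set()
        def topo(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, fn in zip(v.parents, v.grad_fns):
                p.grad += fn(v.grad)  # accumulate, since a node may feed many consumers

x, w = Value(3.0), Value(-2.0)
loss = (x * w + w) * w   # algebraically w^2 * (x + 1) = 16
loss.backward()
print(x.grad, w.grad)    # analytic: dL/dx = w^2 = 4, dL/dw = 2w(x+1) = -16
```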
[^fn-blas-training]: **BLAS (Basic Linear Algebra Subprograms)**: First standardized by Lawson, Hanson, Kincaid, and Krogh in 1979 (Level 1) and later extended, BLAS defines three levels: Level 1 (vector-vector, $O(n)$ work), Level 2 (matrix-vector, $O(n^2)$), and Level 3 (matrix-matrix, $O(n^3)$ work on $O(n^2)$ data). Training is dominated by Level 3 operations precisely because their high arithmetic intensity---$O(n)$ FLOPs per byte---saturates hardware compute units rather than starving on memory bandwidth. cuBLAS and oneDNN implement these as the kernel layer beneath every framework's matrix multiplication. \index{BLAS!training compute hierarchy}
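The intensity gap between the three BLAS levels is simple arithmetic (FP32 elements; Level 3 assumes each operand is moved once, i.e. perfect cache re-use):

```python
n, b = 4096, 4  # square matrix dimension, bytes per FP32 element

def intensity(flops, bytes_moved):
    return flops / bytes_moved

# Level 1: axpy (y = a*x + y): 2n FLOPs, 3n elements moved (read x, read+write y)
l1 = intensity(2 * n, 3 * n * b)
# Level 2: gemv (y = A @ x): 2n^2 FLOPs, roughly n^2 + 2n elements moved
l2 = intensity(2 * n**2, (n**2 + 2 * n) * b)
# Level 3: gemm (C = A @ B): 2n^3 FLOPs, 3n^2 elements moved
l3 = intensity(2 * n**3, 3 * n**2 * b)

print(f"L1 ~{l1:.2f}, L2 ~{l2:.2f}, L3 ~{l3:.0f} FLOPs/byte")
```

Only Level 3 intensity grows with $n$ (here $n/6 \approx 683$ FLOPs/byte), which is why GEMM can saturate compute units while the other levels stay memory-bound.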
@@ -1068,7 +1070,9 @@ This establishes a central theme in training systems: the hardware-software trad
#### Adaptive and Momentum-Based Optimizers {#sec-model-training-adaptive-momentumbased-optimizers-f079}
\index{Optimizer!momentum-based methods}\index{Optimizer!adaptive learning rate}SGD computes correct gradients but struggles with ill-conditioned loss landscapes where some dimensions are steep (requiring small steps) while others are shallow (benefiting from large steps). A single learning rate[^fn-learning-rate-training] either oscillates dangerously in steep dimensions or moves glacially in shallow ones. Each subsequent optimizer we examine solves a specific limitation of its predecessors: momentum smooths oscillations by averaging gradient history, RMSprop adapts step sizes per parameter, and Adam combines both strategies. Understanding this progression clarifies why Adam became the default choice for transformer training while revealing the system costs, specifically memory and computation, that each refinement introduces [@kingma2014adam].
\index{Optimizer!momentum-based methods}\index{Optimizer!adaptive learning rate}SGD computes correct gradients but struggles with ill-conditioned loss landscapes[^fn-saddle-points] where some dimensions are steep (requiring small steps) while others are shallow (benefiting from large steps). A single learning rate[^fn-learning-rate-training] either oscillates dangerously in steep dimensions or moves glacially in shallow ones. Each subsequent optimizer we examine solves a specific limitation of its predecessors: momentum smooths oscillations by averaging gradient history, RMSprop adapts step sizes per parameter, and Adam combines both strategies. Understanding this progression clarifies why Adam became the default choice for transformer training while revealing the system costs, specifically memory and computation, that each refinement introduces [@kingma2014adam].
[^fn-saddle-points]: **Loss Landscape Geometry**: The "local minima" framing of neural network optimization is misleading at scale. For overparameterized networks (parameters >> training samples), the dominant challenge is saddle points --- critical points where the gradient is zero but the Hessian has both positive and negative eigenvalues. In high-dimensional spaces, almost all local minima have approximately equivalent loss values, so avoiding bad minima is less important than maintaining gradient signal through saddle regions. This is why batch size, learning rate schedule, and normalization choices matter more than optimizer type for training stability: they govern how aggressively the optimizer escapes saddle points, not how carefully it descends to a minimum. \index{Loss Landscape!saddle points}
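A two-variable toy illustrates the mechanism (an illustrative example, not a real loss landscape): for f(x, y) = x^2 - y^2, gradient descent started exactly on the ridge never receives signal along the negative-curvature direction, while any tiny perturbation escapes:

```python
import numpy as np

# f(x, y) = x^2 - y^2: gradient is zero at the origin, but the Hessian
# eigenvalues are +2 (along x) and -2 (along y), so the origin is a saddle.
def grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

def descend(p, lr=0.1, steps=100):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

stuck = descend(np.array([0.5, 0.0]))     # no gradient signal along y: parks at the saddle
escaped = descend(np.array([0.5, 1e-6]))  # noise grows along the negative-curvature axis
print(stuck, escaped)
```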
[^fn-learning-rate-training]: **Learning Rate ($\eta$)**: The single most consequential hyperparameter---it controls step size along the gradient direction. Too large and the optimizer overshoots minima; too small and training stalls for days. Modern practice replaces fixed rates with schedules (warmup + cosine decay), and the linear scaling rule requires $\eta$ to increase proportionally with batch size. Learning rate also interacts with numerical precision: FP16's limited mantissa constrains the range of effective rates, creating a hidden coupling between hardware choice and convergence. \index{Learning Rate!precision interaction}
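A sketch of the schedule the footnote describes, combining the linear scaling rule with linear warmup and cosine decay (all hyperparameter values are illustrative assumptions):

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, base_batch=256,
                batch=2048, warmup_steps=500):
    """Warmup + cosine decay, with the linear scaling rule applied to base_lr."""
    peak = base_lr * batch / base_batch               # linear scaling rule: 8x here
    if step < warmup_steps:                           # linear warmup from 0
        return peak * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak * (1 + math.cos(math.pi * t))   # cosine decay to 0

print(lr_schedule(500, 10_000))   # peak rate, reached at the end of warmup
```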
@@ -1620,7 +1624,9 @@ $$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Moved}}
$$
Operations with high arithmetic intensity are compute-bound: their performance is limited by the processor's computational throughput. Operations with low arithmetic intensity are memory-bound: they spend more time moving data than computing. For the formal definition of the Roofline Model and how to compute a hardware's ridge point, see @sec-machine-foundations-roofline-model-2529.
Operations with high arithmetic intensity are compute-bound: their performance is limited by the processor's computational throughput. Operations with low arithmetic intensity are memory-bound: they spend more time moving data than computing. For the formal definition of the Roofline Model and how to compute a hardware's ridge point[^fn-ridge-point-precision], see @sec-machine-foundations-roofline-model-2529.
[^fn-ridge-point-precision]: **Ridge Point and Precision**: The roofline ridge point --- the arithmetic intensity threshold separating memory-bound from compute-bound operations --- shifts with numerical precision. On an A100, dense TF32 Tensor Cores deliver roughly 156 TFLOPS against ~2 TB/s HBM bandwidth, giving a ridge point of ~78 FLOPs/byte. FP16/BF16 doubles throughput to ~312 TFLOPS at the same bandwidth, raising the ridge point to ~156 FLOPs/byte. Switching to lower precision therefore reclassifies operations: kernels that were comfortably compute-bound can become memory-bound --- a transformation that can change which optimization technique yields returns, making precision selection inseparable from roofline analysis. \index{Ridge Point!precision dependence}
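The arithmetic is a one-liner; the figures below are rough published A100 dense-throughput numbers used as assumptions (exact values vary by SKU), and the point is how the classification of a fixed-intensity kernel flips with precision:

```python
HBM_BANDWIDTH = 2.0e12              # bytes/s, approximate

def ridge_point(peak_flops, bandwidth=HBM_BANDWIDTH):
    # Arithmetic intensity (FLOPs/byte) at which compute and memory balance.
    return peak_flops / bandwidth

tf32_ridge = ridge_point(156e12)    # ~78 FLOPs/byte
bf16_ridge = ridge_point(312e12)    # ~156 FLOPs/byte: halve precision, double the ridge

# A kernel at ~120 FLOPs/byte is compute-bound under TF32 but memory-bound under BF16.
intensity = 120.0
print(intensity > tf32_ridge, intensity > bf16_ridge)  # prints: True False
```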
Consider @tbl-training-arithmetic-intensity: dense matrix multiplication achieves O(n) FLOP/byte (compute-bound), while activation functions operate at just 0.25 FLOP/byte (memory-bound), explaining why optimization strategies must differ between these operation types.