footnotes: Vol1 targeted enrich/add/remove pass from quality audit

Restoration pass (selective, based on Three-Job Rule audit):
- introduction: restore fn-eliza-brittleness, fn-dartmouth-systems, fn-bobrow-student
- data_engineering: restore fn-soc-always-on (always-on island architecture)
- benchmarking: restore fn-glue-saturation (Goodhart's Law arc, 1-year saturation)

Group A surgical edits:
- nn_computation: remove fn-overfitting (Context=1, Tether=1 — only confirmed failure)
- training: strip dead etymology from fn-convergence-training, fn-hyperparameter-training
- model_serving: enrich fn-onnx-runtime-serving with 5–15% TensorRT throughput figure

Group B new footnotes:
- nn_computation: add fn-alexnet-gpu-split (GTX 580 3 GB ceiling → model parallelism lineage)
- responsible_engr: add fn-zillow-dam (D·A·M decomposition of $304M failure)
This commit is contained in:
Vijay Janapa Reddi
2026-02-24 08:40:01 -05:00
parent 446d848fa8
commit 704f7555fe
7 changed files with 60 additions and 49 deletions

View File

@@ -1384,10 +1384,12 @@ A task definition is only as good as the data used to evaluate it. Standardized
\index{CIFAR-10!classification reference dataset}
\index{SQuAD!reading comprehension dataset}
\index{GLUE!language understanding benchmark}
In computer vision, ImageNet [@imagenet_website] [@deng2009imagenet], COCO [@lin2014microsoft], and CIFAR-10 [@cifar10_website] [@krizhevsky2009learning] serve as reference standards; in natural language processing, SQuAD [@squad_website][^fn-squad] [@rajpurkar2016squad], GLUE[^fn-glue-saturation] [@wang2018glue], and WikiText [@wikitext_website] [@merity2016pointer] fulfill similar roles, each encompassing a range of complexities and edge cases.
[^fn-squad]: **SQuAD (Stanford Question Answering Dataset)**: Introduced in 2016 with 100,000+ question-answer pairs from Wikipedia. AI systems exceeded the 86.8% human F1 baseline by 2018, but this "superhuman" result illustrates a benchmarking failure mode: the task's extractive format (answers are text spans within the passage) makes it easier than open-ended question answering, inflating perceived capability relative to production NLP systems. \index{SQuAD!saturation}
[^fn-glue-saturation]: **GLUE**: GLUE's saturation arc is the canonical benchmark obsolescence case study. Introduced in 2018 with a human baseline of 87.1%, BERT [@devlin2019bert] reached 80.2% within months and models exceeded the human baseline by mid-2019 — less than one year after launch. This is Goodhart's Law in action: once GLUE became a target, it ceased to be a good measure, as models learned to exploit dataset artifacts rather than develop genuine language understanding. The pattern forced the creation of SuperGLUE and now BIG-bench, each requiring progressively harder tasks. \index{GLUE!benchmark saturation}
Dataset selection shapes everything downstream. In the audio anomaly detection example (@fig-benchmark-components), the dataset must include representative waveform samples of normal operation alongside comprehensive examples of anomalous conditions; domain-specific collections like ToyADMOS[^fn-toyadmos] [@koizumi2019toyadmos] for industrial manufacturing and Google Speech Commands for general sound recognition address these requirements. Effective benchmark datasets must balance two competing demands: accurately representing real-world challenges while maintaining sufficient complexity to differentiate model performance. Simplified datasets like ToyADMOS are valuable for methodological development but may not capture the full complexity of production environments.
[^fn-toyadmos]: **ToyADMOS**: Developed by NTT Communications in 2019 for acoustic anomaly detection, containing audio recordings from toy car and conveyor belt operations (1,000+ normal, 300+ anomalous samples per machine type). The "toy" prefix is intentional: the controlled environment enables reproducible benchmarking but creates a domain gap -- models achieving 95%+ AUC on ToyADMOS may drop to 70--80% on factory floors with background noise, vibration, and sensor degradation. \index{ToyADMOS!domain gap}

View File

@@ -387,7 +387,7 @@ $$ \text{Data Selection Gain} \propto \frac{\text{Information Entropy}}{\text{Da
# │ Exports: req_bw_gbs_str, disk_bw_mbs_str, feeding_tax_pct_str, img_per_sec_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import (
A100_FLOPS_FP16_TENSOR, RESNET50_FLOPs,
IMAGE_DIM_RESNET, IMAGE_CHANNELS_RGB, BYTES_FP32,
TFLOPs, second, GB, MB, MILLION
)
@@ -403,21 +403,21 @@ class FeedingProblem:
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
gpu_flops = A100_FLOPS_FP16_TENSOR
model_flops = RESNET50_FLOPs
# Image: 224x224x3 @ FP32
img_size_bytes = IMAGE_DIM_RESNET * IMAGE_DIM_RESNET * IMAGE_CHANNELS_RGB * BYTES_FP32.m_as('B')
# Standard Cloud Disk (e.g. AWS gp3 baseline)
disk_bw_mbs = 250.0
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# Throughput (Images/sec) = GPU_Peak / Model_FLOPs
img_per_sec = (gpu_flops / model_flops).to_base_units().m_as('count/second')
# Required Bandwidth (Bytes/sec) = img_per_sec * img_size_bytes
req_bw_bytes_sec = img_per_sec * img_size_bytes
req_bw_gbs = req_bw_bytes_sec / (1 * GB).m_as('B')
# Efficiency (eta) = Disk_BW / (Required_BW in MB/s)
eta = min(disk_bw_mbs / (req_bw_bytes_sec / (1 * MB).m_as('B')), 1.0)
feeding_tax_pct = (1.0 - eta) * 100
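The hunks above show only fragments of the `FeedingProblem` class, which depends on `mlsys.constants`. A minimal self-contained sketch of the same arithmetic, with representative constants standing in for the library values (the A100 peak and per-image ResNet-50 FLOP count here are assumptions, not the book's exact figures), makes the feeding tax concrete:

```python
# Self-contained sketch of the feeding-problem arithmetic.
# Constants are representative assumptions, not mlsys.constants values.
A100_FLOPS_FP16_TENSOR = 312e12   # peak FP16 tensor throughput, FLOP/s (spec sheet)
RESNET50_FLOPS = 4.1e9            # approx. FLOPs per forward pass (assumed)
IMG_BYTES = 224 * 224 * 3 * 4     # one 224x224x3 image at FP32
DISK_BW_MBS = 250.0               # standard cloud disk, MB/s

# Throughput the GPU could sustain if never starved
img_per_sec = A100_FLOPS_FP16_TENSOR / RESNET50_FLOPS
# Bandwidth the input pipeline must deliver to keep up
req_bw_mbs = img_per_sec * IMG_BYTES / 1e6
# Fraction of demand the disk actually meets, and the resulting idle tax
eta = min(DISK_BW_MBS / req_bw_mbs, 1.0)
feeding_tax_pct = (1.0 - eta) * 100

print(f"{img_per_sec:,.0f} img/s needs {req_bw_mbs:,.0f} MB/s; "
      f"feeding tax ~{feeding_tax_pct:.1f}%")
```

Under these assumed numbers the required bandwidth exceeds the disk by two orders of magnitude, so nearly all GPU cycles are lost to waiting, which is the point of the worked example.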
@@ -1068,7 +1068,9 @@ Operational metrics further track response time (keyword utterance to system res
Stakeholder priorities create additional tension. Device manufacturers prioritize low power consumption, software developers emphasize ease of integration, and end users demand accuracy and responsiveness. Balancing these competing requirements shapes system architecture decisions throughout development.
Embedded device constraints impose hard boundaries on these architectural choices. Memory limitations require extremely lightweight models, often in the tens-of-kilobytes range, to fit in the always-on island of the SoC[^fn-soc-always-on]; this constraint covers only model weights, and preprocessing code must also fit within tight memory bounds. Limited computational capabilities (often a few hundred MHz of clock speed) demand aggressive model optimization. Most embedded devices run on batteries, so KWS systems target sub-milliwatt power consumption during continuous listening. Devices must also function across diverse deployment scenarios ranging from quiet bedrooms to noisy industrial settings.
[^fn-soc-always-on]: **SoC Always-On Island**: Modern System-on-Chip designs partition power domains so a low-power "always-on" island (typically achieving sub-milliwatt draw) monitors for wake triggers while the main processor sleeps. The critical constraint is that this island must hold both the model weights *and* the audio preprocessing code within its dedicated SRAM — a split budget that forces KWS architectures to optimize for total footprint, not just parameter count. \index{SoC!always-on island}
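As a concrete illustration of the split budget the footnote describes, a feasibility check is a one-liner; every figure below is hypothetical, chosen only to show that weights and preprocessing compete for the same SRAM:

```python
# Hypothetical always-on island budget check. All numbers are illustrative
# assumptions, not figures from the chapter.
SRAM_BUDGET_KB = 320          # dedicated always-on island SRAM (assumed)
model_weights_kb = 180        # quantized KWS model weights (assumed)
preprocessing_kb = 96         # MFCC code + audio ring buffer (assumed)

# The island must hold BOTH, so the budget applies to the total footprint
footprint_kb = model_weights_kb + preprocessing_kb
fits = footprint_kb <= SRAM_BUDGET_KB
headroom_kb = SRAM_BUDGET_KB - footprint_kb
print(f"footprint {footprint_kb} KB, fits={fits}, headroom {headroom_kb} KB")
```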
Data quality and diversity ultimately determine whether these constraints can be met. The dataset must capture demographic diversity (speakers with various accents, ages, and genders) to ensure broad recognition. Keyword variations require attention since people pronounce wake words differently, and background noise diversity proves essential for training models that perform across real-world scenarios from quiet environments to noisy conditions. Once a prototype system is developed, iterative feedback and refinement keep the system aligned with objectives as deployment scenarios evolve, requiring testing in real-world conditions and systematic refinement based on observed failure patterns.

View File

@@ -264,10 +264,12 @@ This data-centric paradigm requires rethinking the entire computing stack. The s
## AI Paradigm Evolution {#sec-introduction-ai-paradigm-evolution-ae2b}
AI's evolution reveals a progression of bottlenecks, each overcome by systems innovations that expanded what was computationally possible. The field's origin is often traced to Alan Turing's[^fn-turing-outputs] 1950 paper "Computing Machinery and Intelligence" [@turing1950computing], which posed the foundational question: *Can machines think?* Early systems that attempted to answer this question, such as the Perceptron (1957) and ELIZA[^fn-eliza-brittleness] (1966), were limited by manual logic and the constraints of mainframes, resulting in brittleness. Subsequent eras were limited by manual knowledge entry, creating scalability issues. Modern systems face a different bottleneck: computational throughput.
[^fn-turing-outputs]: **Alan Turing**: His 1950 "Imitation Game" reframed intelligence as an output-measurement problem: judge a system by what it *does*, not by what it *is*. This engineering-first stance persists in every ML systems metric we use today—accuracy, latency, throughput, and FLOPS-per-watt are all output measurements—and explains why the Iron Law decomposes performance into observable, measurable terms rather than internal architectural properties. \index{Turing, Alan!Imitation Game}
[^fn-eliza-brittleness]: **ELIZA**: A 1966 natural language program that ran on 256 KB mainframes using pattern-matching rules with no learned state — its brittleness was a direct systems consequence of zero memory across turns. Every new input variation required a new hand-written rule, making maintenance cost grow faster than capability and foreshadowing the knowledge bottleneck that killed expert systems a decade later. \index{ELIZA!brittleness}
The timeline below reveals a recurring pattern: periods of intense optimism followed by "AI winters" when funding collapsed, each triggered by systems limitations that algorithms alone could not overcome. @fig-ai-timeline captures this boom-and-bust rhythm across seven decades: notice how each winter arrives precisely when the dominant paradigm hits its systems ceiling, and each resurgence follows a breakthrough in engineering infrastructure rather than in algorithms alone. Each era represents a paradigm shift attempting to overcome the limitations of the previous approach.
::: {#fig-ai-timeline fig-env="figure" fig-pos="t!" fig-cap="**AI Development Timeline.** A chronological curve traces AI research activity from the 1950s to the 2020s, with gray bands marking the two AI Winter periods (1974 to 1980, 1987 to 1993). Callout boxes highlight key milestones including the Turing Test [@turing1950computing], the Dartmouth conference [@mccarthy1956dartmouth], the Perceptron, ELIZA, Deep Blue, and GPT-3." fig-alt="Timeline from 1950 to 2020 with red line showing AI publication frequency. Gray bands mark two AI Winters (1974-1980, 1987-1993). Callout boxes mark milestones: Turing 1950, Dartmouth 1956, Perceptron 1957, ELIZA 1966, Deep Blue 1997, GPT-3 2020."}
@@ -453,7 +455,11 @@ Before machine learning existed as a discipline, engineers attempted to build in
#### Symbolic AI Era: The Logic Bottleneck {#sec-introduction-symbolic-ai-era-logic-bottleneck-a250}
The first era of AI engineering (1950s--1970s) attempted to reduce intelligence to **Symbolic AI**\index{Symbolic AI} manipulation. Researchers at the 1956 **Dartmouth Conference**\index{Dartmouth Conference}[^fn-dartmouth-systems] [@mccarthy1956dartmouth] hypothesized that if they could formalize the rules of logic, machines could "think." Even then, some saw a different path: Arthur Samuel at IBM demonstrated in 1959 that a checkers program could improve through self-play, coining the very term "**machine learning**\index{Machine Learning}." But the dominant paradigm remained symbolic. Daniel Bobrow's *STUDENT*[^fn-bobrow-student] system [@bobrow1964student] (1964) exemplifies this approach.
[^fn-dartmouth-systems]: **Dartmouth Conference (1956)**: The workshop where John McCarthy coined "artificial intelligence" — but its participants focused almost entirely on algorithmic logic while ignoring the physical constraints of storage and compute. The same compute-agnostic assumption — that a better algorithm could always overcome a hardware limit — is precisely what this book exists to correct: every chapter that follows argues that systems constraints are first-class design variables, not afterthoughts. \index{Dartmouth Conference!systems oversight}
[^fn-bobrow-student]: **STUDENT**: Daniel Bobrow's 1964 MIT system exposed the core failure mode of symbolic AI — complexity grows faster than capability. Every new problem type required new hand-written parsing rules, so the system's maintenance burden scaled superlinearly with coverage. Data-driven approaches break this trap by learning the mapping from examples rather than encoding it as rules, which is why the shift to statistical ML in the 1980s--90s was fundamentally a scaling breakthrough, not merely an accuracy improvement. \index{STUDENT!symbolic AI failure mode}
::: {.callout-example title="STUDENT (1964)"}
```{.text}

View File

@@ -1119,7 +1119,7 @@ A typical serving request for our ResNet-50 classifier shows the following laten
| **Phase** | **Operation** | **Time** | **Percentage** |
|:-------------------|:---------------------------|:---------------------------|:--------------------------|
| **Preprocessing** | JPEG decode | `{python} l_jpeg_str` | `{python} p_jpeg_str` |
| **Preprocessing** | Resize to $224\times224$ | `{python} l_resize_str` | `{python} p_resize_str` |
| **Preprocessing** | Normalize (mean/std) | `{python} l_norm_str` | `{python} p_norm_str` |
| **Data Transfer** | CPU→GPU copy | `{python} l_transfer_str` | `{python} p_transfer_str` |
| **Inference** | **ResNet-50 forward pass** | **`{python} l_inf_str`** | **`{python} p_inf_str`** |
@@ -1386,8 +1386,8 @@ ridge_point_str = ResolutionBottleneckCalc.ridge_point_str
The resulting shift from compute-bound to memory-bound operation is evident in @tbl-resolution-bottleneck:
| **Resolution** | **Activation Size** | **Arith. Intensity** | **Bottleneck** |
|:--------------------|----------------------------:|---------------------------------:|:---------------|
| **$224\times224$** | `{python} act_224_mb_str`MB | `{python} ai_224_str` FLOPs/byte | Compute |
| **$384\times384$** | `{python} act_384_mb_str`MB | `{python} ai_384_str` FLOPs/byte | Transitional |
| **$512\times512$** | `{python} act_512_mb_str`MB | `{python} ai_512_str` FLOPs/byte | Memory BW |
@@ -1677,7 +1677,7 @@ class BatchingTax:
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
lambda_qps = 500.0
# Inference times (ms)
t_inf_b1 = 2.0
t_inf_b32 = 15.0
@@ -1686,12 +1686,12 @@ class BatchingTax:
# Batch 1
w_form_b1 = (1-1) / (2 * lambda_qps) * 1000 # 0ms
lat_b1 = w_form_b1 + t_inf_b1
# Batch 32
# Formation Delay ~ (B-1) / (2 * lambda)
w_form_b32 = (32-1) / (2 * lambda_qps) * 1000 # ~31ms
lat_b32 = w_form_b32 + t_inf_b32
penalty_ratio = lat_b32 / lat_b1
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
@@ -1717,6 +1717,7 @@ penalty_ratio_str = BatchingTax.penalty_ratio_str
While Little's Law relates queue depth to throughput, it does not account for the **Batching Tax**\index{Batching Tax!latency penalty}—the deliberate delay introduced to maximize hardware utilization. In the tradition of quantitative systems, we analyze this as a **Queuing Delay** problem.
When an inference server batches requests, it introduces two distinct sources of latency:
1. **Batch Formation Delay ($W_{form}$)**: The time the first request in a batch waits for the last request to arrive.
2. **Inference Inflation ($W_{inf}$)**: The increase in execution time when the GPU processes $B$ samples instead of 1.
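The two delay sources can be combined in a few lines, mirroring the `BatchingTax` parameters from the hunk above (500 QPS arrival rate, 2 ms inference at batch 1, 15 ms at batch 32):

```python
# Sketch of the batching-tax arithmetic with the chapter's illustrative inputs.
lambda_qps = 500.0                        # request arrival rate
t_inf_b1, t_inf_b32 = 2.0, 15.0           # inference time at B=1 and B=32, ms

def batch_latency_ms(batch_size, t_inf_ms, qps):
    # Mean formation delay: the first request waits ~(B-1)/(2*lambda)
    # seconds for the batch to fill (converted to ms here)
    w_form_ms = (batch_size - 1) / (2 * qps) * 1000
    return w_form_ms + t_inf_ms

lat_b1 = batch_latency_ms(1, t_inf_b1, lambda_qps)      # 0 + 2 = 2 ms
lat_b32 = batch_latency_ms(32, t_inf_b32, lambda_qps)   # ~31 + 15 = ~46 ms
penalty_ratio = lat_b32 / lat_b1                         # ~23x
print(f"B=1: {lat_b1:.1f} ms, B=32: {lat_b32:.1f} ms, penalty {penalty_ratio:.0f}x")
```

The formation delay dominates: batching buys throughput at roughly a 23$\times$ latency penalty under these inputs.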
@@ -3346,7 +3347,7 @@ TorchScript and TensorFlow SavedModel formats enable ahead-of-time compilation a
ONNX Runtime[^fn-onnx-runtime-serving]\index{ONNX Runtime!cross-platform inference} provides a hardware-agnostic optimization layer [@onnxruntime2024]. Models export to ONNX format, then ONNX Runtime applies graph optimizations and selects execution providers for the target hardware. This enables single-format deployment across CPUs, GPUs, and specialized accelerators.
[^fn-onnx-runtime-serving]: **ONNX Runtime**: Microsoft's inference engine (released December 2018) acts as a hardware abstraction layer: the same ONNX model runs on CPUs, NVIDIA GPUs, AMD GPUs, or custom accelerators through pluggable "execution providers." ONNX Runtime applies framework-agnostic graph optimizations---constant folding, redundant node elimination, operator fusion---that benefit all targets. This cross-platform capability avoids maintaining separate optimization pipelines per hardware target, accepting a 5--15% throughput loss versus TensorRT for vision models, offset by the ability to retarget the same `.onnx` artifact across CPU/GPU/NPU without recompilation---a flexibility premium that matters most in heterogeneous device fleets where recompiling per-target is measured in engineer-days. \index{ONNX Runtime!cross-platform serving}
#### Specialized Inference Engines {#sec-model-serving-specialized-inference-engines-475f}
@@ -3681,10 +3682,10 @@ To guide optimization efforts, @tbl-optimization-impact summarizes the key techn
| **Technique** | **Target Metric** | **Typical Gain** | **Implement. Cost** | **Best For** |
|:----------------------|:---------------------|-----------------:|:--------------------|:-------------------------|
| **Operator Fusion** | Latency & Throughput | 2--5$\times$ | Medium (Compiler) | Memory-bound layers |
| **INT8 Quantization** | Throughput | 3--4$\times$ | High (Calibration) | Inference-heavy nodes |
| **Graph Compilation** | Latency | 1.5--3$\times$ | Low (One-line) | Static graph models |
| **Zero-Copy Loading** | Startup Time | 10--50$\times$ | Low (File format) | Autoscaling / Cold Start |
| **CPU Pinning** | Tail Latency (P99) | 20--50% reduction | Low (Config) | Latency-critical apps |
: **Node-Level Optimization Impact**: A decision matrix for selecting optimization techniques. High-impact techniques like quantization often carry higher implementation costs (calibration data requirements), while architectural changes like zero-copy loading offer dramatic gains for specific metrics (startup time) with low effort. {#tbl-optimization-impact}
@@ -3881,7 +3882,6 @@ The linear growth of the KV cache with sequence length forces a hard trade-off:
#### Workload Profile {#sec-model-serving-workload-profile-a380}
* **Model**: Llama-3-8B (quantized to 4-bit AWQ\index{AWQ!4-bit quantization}; see @sec-model-compression for quantization techniques).
* **Hardware**: 1$\times$ NVIDIA H100 SXM5 GPU (`{python} h100_mem` GB HBM3, `{python} h100_bw_tbs` TB/s bandwidth).
* **Request Characteristics**: 1,000-token input prompt (Prefill), 256-token generated response (Decode).
* **Target SLOs**: TTFT $<$ 200 ms, TPOT $<$ 20 ms.
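Before the analysis, it helps to size the KV cache for this workload. A back-of-envelope sketch, using Llama-3-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and assuming an FP16 cache even though the weights are quantized to 4-bit:

```python
# Back-of-envelope KV-cache sizing for the workload profile above.
# Llama-3-8B config is public; FP16 cache precision is an assumption.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                      # FP16 KV cache (assumed)

# Factor of 2: one K and one V vector per layer per token
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

tokens = 1000 + 256                     # prompt (prefill) + generated (decode)
kv_mb_per_request = kv_bytes_per_token * tokens / 1e6
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token -> "
      f"{kv_mb_per_request:.0f} MB per request")
```

At roughly 128 KiB per token, each in-flight request holds on the order of 165 MB of HBM in cache alone, which is why sequence length caps concurrency.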

View File

@@ -421,7 +421,9 @@ This gradual layering of patterns reveals *why* neural network *depth* matters.
\index{ImageNet!competition progress}
\index{AlexNet!ImageNet breakthrough}
\index{ResNet!human-level performance}
This architecture exhibits predictable scaling\index{Scalability!deep learning}: unlike traditional approaches where performance plateaus, deep learning models continue improving with additional data (recognizing more variations) and computation (discovering subtler patterns). This scalability drove dramatic performance gains. In the ImageNet competition, traditional methods achieved approximately 25.8% top-5 error in 2011. AlexNet[^fn-alexnet-gpu-split] reduced this to 15.3% in 2012. By 2015, ResNet achieved 3.6% top-5 error, surpassing estimated human performance of approximately 5.1%.
[^fn-alexnet-gpu-split]: **AlexNet's Two-GPU Split**: Krizhevsky's team split AlexNet across two NVIDIA GTX 580s not by architectural preference but by physical constraint — each card had only 3 GB of VRAM, and the full model required more memory than a single card could provide. This forced the first production instance of model parallelism: half the feature maps on each GPU, with cross-GPU communication only at specific layers. The workaround that felt like a hack in 2012 became the template for model parallelism at scale — every modern pipeline-parallel strategy traces its lineage to this 3 GB ceiling. \index{AlexNet!model parallelism}
@fig-double-descent previews this scaling behavior through three distinct regimes. The underlying mechanisms (training error, overfitting, gradient-based learning) are developed in subsequent sections; here we establish the shape of the phenomenon. The *Classical Regime* is where traditional statistical intuitions hold, the *Interpolation Threshold* is where the model perfectly fits training data, and the *Modern Regime* is where massive overparameterization paradoxically improves generalization. The axes are normalized to emphasize shape rather than a specific dataset.
@@ -940,9 +942,7 @@ Line/.style={line width=1.0pt,black!50,text=black},
The data revolution transformed what was possible with neural networks. The rise of the internet and digital devices created vast new sources of training data: image sharing platforms provided millions of labeled images, digital text collections enabled language processing at scale, and sensor networks generated continuous streams of real-world data. This abundance provided the raw material neural networks needed to learn complex patterns effectively.
Algorithmic innovations made it possible to use this data effectively. New methods for initializing networks and controlling learning rates made training more stable. Techniques for preventing overfitting allowed models to generalize better to new data. Researchers discovered that neural network performance scaled predictably with model size, computation, and data quantity, leading to increasingly ambitious architectures.
These algorithmic advances created demand for more powerful computing infrastructure, which evolved in response. On the hardware side, GPUs provided the parallel processing capabilities needed for efficient neural network computation, and specialized AI accelerators like TPUs[^fn-tpu-tensor-hardware] [@jouppi2023tpu] pushed performance further. High-bandwidth memory systems and fast interconnects addressed data movement challenges. Equally important were software advances: frameworks and libraries that simplified building and training networks, distributed computing systems that enabled training at scale, and tools for optimizing model deployment.
@@ -1367,11 +1367,11 @@ class ActivationLogic:
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ReLU is a comparator + mux
relu_transistors = 50
# Sigmoid/Tanh require exp() -> iterative Taylor or Lookup + Interpolation
# High-precision floating point exponential unit
sigmoid_transistors = 2500
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
ratio = sigmoid_transistors / relu_transistors
@@ -1392,9 +1392,9 @@ activation_ratio_str = ActivationLogic.activation_ratio_str
### The Transistor Tax: Logic Unit Cost {#sec-neural-computation-transistor-tax}
This hardware dominance is not just about speed; it is about silicon area. In computer architecture, we measure the **Logic Unit Cost**\index{Logic Unit Cost!activation functions} in terms of transistor count and energy per operation.
A ReLU unit is computationally trivial: it consists of a single comparator and a multiplexer, requiring approximately **`{python} relu_transistor_str` transistors**. In contrast, a Sigmoid or Tanh unit requires computing an exponential—a complex transcendental function that hardware must approximate using lookup tables or iterative Taylor expansions. A high-precision exponential unit can consume over **`{python} sigmoid_transistor_str` transistors**.
We call this disparity **The Transistor Tax**\index{Transistor Tax!activation functions}: selecting Sigmoid over ReLU increases the silicon "price" of an activation by over **`{python} activation_ratio_str`$\times$**. For a systems engineer, this means ReLU is not just a mathematical preference; it is a density optimization that allows hardware architects to pack orders of magnitude more neurons into the same power and area budget. This physical efficiency is the primary reason the deep learning era shifted away from the "biologically plausible" Sigmoid toward the "silicon-efficient" ReLU.
@@ -2381,12 +2381,12 @@ bp_out_kb_str = MnistTrainingMemoryCalc.bp_out_kb_str
#### Step 1: Model Parameters
| **Layer** | **Weights** | **Biases** | **Total Parameters** |
|:--------------------|------------------------------------------------:|------------------------------:|--------------------------------------------:|
| **Input→Hidden1** | $784\times128$ = `{python} MNISTMemory.w1_str` | `{python} MNISTMemory.b1_str` | `{python} MNISTMemory.p1_str` |
| **Hidden1→Hidden2** | $128\times64$ = `{python} MNISTMemory.w2_str` | `{python} MNISTMemory.b2_str` | `{python} MNISTMemory.p2_str` |
| **Hidden2→Output** | $64\times10$ = `{python} MNISTMemory.w3_str` | `{python} MNISTMemory.b3_str` | `{python} MNISTMemory.p3_str` |
| **Total** | | | **`{python} MNISTMemory.total_params_str`** |
**Parameter memory**: `{python} MNISTMemory.total_params_str`$\times$ 4 bytes = **`{python} param_kib_str` KB**
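The table values can be re-derived in a few lines of plain Python, a sketch standing in for the templated `{python}` expressions (FP32 weights, as in the worked example):

```python
# Recompute the parameter table for the 784-128-64-10 MLP.
layer_dims = [(784, 128), (128, 64), (64, 10)]

# Each layer: (fan_in x fan_out) weights plus fan_out biases
params_per_layer = [fan_in * fan_out + fan_out for fan_in, fan_out in layer_dims]
total_params = sum(params_per_layer)      # 100480 + 8256 + 650 = 109386

param_bytes = total_params * 4            # FP32: 4 bytes per parameter
print(f"{total_params:,} params -> {param_bytes / 1024:.1f} KB")
```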
@@ -2394,10 +2394,10 @@ bp_out_kb_str = MnistTrainingMemoryCalc.bp_out_kb_str
| **Layer** | **Activation Shape** | **Values** | **Memory** |
|:------------|---------------------:|-------------------------------:|------------------------------------:|
| **Input** | $32\times784$ | `{python} act_in_count_str` | `{python} act_in_kib_str` KB |
| **Hidden1** | $32\times128$ | `{python} act_h1_count_str` | `{python} act_h1_kib_str` KB |
| **Hidden2** | $32\times64$ | `{python} act_h2_count_str` | `{python} act_h2_kib_str` KB |
| **Output** | $32\times10$ | `{python} act_out_count_str` | `{python} act_out_kib_str` KB |
| **Total** | | `{python} total_act_count_str` | **`{python} total_act_kib_str` KB** |
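The activation accounting follows the same pattern: each layer stores one value per (batch element, unit), again at 4 bytes under the FP32 assumption:

```python
batch = 32
widths = [784, 128, 64, 10]  # input, hidden1, hidden2, output
total_values = sum(batch * w for w in widths)
act_kb = total_values * 4 / 1024  # FP32: 4 bytes per activation value
print(total_values)       # 31552 activation values
print(round(act_kb, 2))   # ~123.25 KB
```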
#### Step 3: Training-Only Memory
@@ -2932,15 +2932,15 @@ double_ratio_exact_str = MnistFlopsCalc.double_ratio_exact_str
**Solution**:
| **Layer** | **Operation** | **Dimensions** | **Ops** |
|:------------|:---------------|-----------------------------------------:|-----------------------------------------------------------:|
| **Layer 1** | MatMul | ($32\times784$)$\times$ ($784\times128$) | 2$\times$ 32$\times$ $784\times128$ = `{python} l1_mm_str` |
| **Layer 1** | Bias + ReLU | $32\times128$ | $2\times4,096$ = `{python} l1_bias_str` |
| **Layer 2** | MatMul | ($32\times128$)$\times$ ($128\times64$) | 2$\times$ 32$\times$ $128\times64$ = `{python} l2_mm_str` |
| **Layer 2** | Bias + ReLU | $32\times64$ | $2\times2,048$ = `{python} l2_bias_str` |
| **Layer 3** | MatMul | ($32\times64$)$\times$ ($64\times10$) | 2$\times$ 32$\times$ $64\times10$ = `{python} l3_mm_str` |
| **Layer 3** | Bias + Softmax | $32\times10$ | ~`{python} l3_bias_str` (simplified) |
| **Total** | | | **~`{python} MNISTMemory.total_mops_str` MOps** |
**Per-image cost**: `{python} MNISTMemory.total_mops_str` MOps ÷ 32 = **~`{python} MNISTMemory.per_image_kops_str` KOps per image**
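The operation counts in the table can likewise be verified programmatically. This sketch uses the standard 2-ops-per-multiply-accumulate convention and, like the table, simplifies the output layer's softmax to the same 2-ops-per-output estimate as a bias-plus-ReLU:

```python
batch = 32
dims = [(784, 128), (128, 64), (64, 10)]
total_ops = 0
for n_in, n_out in dims:
    total_ops += 2 * batch * n_in * n_out  # MatMul: multiply + add per MAC
    total_ops += 2 * batch * n_out         # bias + activation (softmax simplified)
print(total_ops)           # 7000704 ops, i.e. ~7 MOps per batch
print(total_ops // batch)  # 218772 ops, i.e. ~219 KOps per image
```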
@@ -3944,7 +3944,7 @@ The same neural network computation that required industrial-scale infrastructur
|:----------------------|----------------------------------:|----------------------------------------------:|-----------------------:|
| **Hardware cost** | ~\$50,000 (Sun-4 workstation) | ~\$50 (Raspberry Pi 5) | 1,000$\times$ |
| **Inference latency** | ~100 ms/digit | ~0.1 ms/digit | 1,000$\times$ |
| **Power consumption** | 50--100 W | 5 W | 10--20$\times$ |
| **Training time** | 3 days | ~30 seconds | 8,640$\times$ |
| **Model storage** | ~`{python} lenet_1_mem_kb_str` KB | ~`{python} lenet_1_mem_kb_str` KB (unchanged) | 1$\times$ (same model) |
| **Energy/inference** | ~10 J | ~0.5 mJ | 20,000$\times$ |


@@ -938,7 +938,9 @@ Deployment is the point of no return.
### Monitoring and Incident Response {#sec-responsible-engineering-monitoring-incident-response-54f4}
When Zillow's algorithmic home-buying program lost USD 304 million[^fn-zillow-dam] in a single quarter, partly due to model prediction errors that went undetected until financial losses accumulated, the failure was not in the model itself but in the monitoring infrastructure surrounding it. Planning for system failures before they occur is a core responsible-engineering practice.\index{Incident Response!ML system failures}\index{Monitoring!responsible operations} Building on the incident severity classification and response framework from @sec-ml-operations-incident-response-ml-systems-c637, @tbl-incident-response extends the general framework with fairness-specific detection and response criteria, structuring preparation into five components with both requirements and pre-deployment verification criteria.
[^fn-zillow-dam]: **Zillow's D·A·M Failure**: Zillow's $304M write-down in 2021 was not a model accuracy failure---the Zestimate algorithm's published MAE was within normal ranges. It was a **D**ata failure: the training distribution (historical listings) diverged from the deployment distribution (pandemic-era price volatility) faster than the monitoring system detected. The **A**lgorithm was optimized for price prediction, not for predicting its own prediction confidence under distribution shift. The **M**achine (the iBuying automation pipeline) had no circuit breaker---it committed capital at full automation rates while the model's reliability was silently degrading. Each axis of failure was individually detectable; the systems failure was the absence of cross-axis monitoring. \index{Zillow!distribution shift failure}
| **Component** | **Requirements** | **Pre-Deployment Verification** |
|:------------------|:------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|


@@ -1145,10 +1145,9 @@ The system implications of Adam are more substantial than previous methods. The
#### Optimization Algorithm System Implications {#sec-model-training-optimization-algorithm-system-implications-f9f2}
The choice of optimization algorithm creates specific patterns of computation and memory access that influence training efficiency. Memory requirements increase progressively from SGD (1$\times$ model size) through Momentum (2$\times$) to Adam (3$\times$), as quantified in @tbl-optimizer-properties. These memory costs must be balanced against convergence[^fn-convergence-training] benefits. While Adam often requires fewer iterations to reach convergence, its per-iteration memory and computation overhead may impact training speed on memory-constrained systems. The concrete scale of these *GPT-2 optimizer memory requirements* illustrates just how significant this overhead becomes for large models.
[^fn-convergence-training]: **Convergence**: Training converges when the loss stops decreasing meaningfully, typically after `{python} TrainingScenarios.sgd_iterations_min_str`--`{python} TrainingScenarios.sgd_iterations_max_str` iterations for large models. The systems consequence: faster convergence (fewer iterations) directly reduces wall-clock time and cost, but the optimizer that converges fastest (Adam) requires 3$\times$ the memory of the cheapest alternative (SGD)---a trade-off between time and memory that shapes every training budget. \index{Convergence!time-memory trade-off}
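The GPT-2-scale overhead can be sketched concretely. Assuming roughly 1.5 billion parameters stored at FP32 (a simplification; real training additionally holds gradients and activations):

```python
params = 1.5e9                       # assumed GPT-2-scale parameter count
weights_gb = params * 4 / 1e9        # 6 GB of FP32 weights
sgd_gb = weights_gb * 1              # SGD: weights only (1x model size)
momentum_gb = weights_gb * 2         # + one velocity buffer (2x)
adam_gb = weights_gb * 3             # + first and second moment buffers (3x)
print(sgd_gb, momentum_gb, adam_gb)  # 6.0 12.0 18.0
```

Switching from SGD to Adam thus adds 12 GB of optimizer state before a single activation is stored, which is exactly the time-versus-memory trade-off the table quantifies.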
| **Property** | **SGD** | **Momentum** | **RMSprop** | **Adam** |
|:-------------------------|:-----------|:---------------|:------------------|:------------------------------------|
@@ -2641,7 +2640,7 @@ plt.show()
```
:::
[^fn-hyperparameter-training]: **Hyperparameter**: While weights are learned during training, hyperparameters (learning rate, batch size, layer count) are set before training and control the learning process itself. Each hyperparameter choice has direct systems consequences: batch size determines memory footprint, learning rate interacts with numerical precision, and layer count multiplies activation storage. Tuning them typically requires multiple full training runs, multiplying total compute cost. \index{Hyperparameter!systems impact}
Beyond the convergence effects, batch size interacts with distributed training strategies: larger batches reduce the frequency of gradient synchronization across devices (fewer optimizer steps per epoch), but each synchronization transfers more data. In distributed settings, batch size often determines the degree of data parallelism, impacting how gradient computations and parameter updates are distributed. Gradient accumulation (@sec-model-training-gradient-accumulation-checkpointing-0c47) decouples the effective batch size from memory constraints, enabling optimal batch sizes without requiring the memory to hold all samples simultaneously.
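Why gradient accumulation preserves correctness can be shown in a few lines of NumPy (an illustrative linear-regression example, not framework code): for equal-sized micro-batches, the mean of the micro-batch gradients equals the full-batch gradient, so memory capacity no longer dictates the effective batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 8)), rng.normal(size=32)
w = rng.normal(size=8)

def grad(Xb, yb, w):
    # Gradient of mean squared error: d/dw mean((Xb @ w - yb)**2)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                                    # one 32-sample batch
micro = [grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8)]
accumulated = np.mean(micro, axis=0)                    # four 8-sample micro-batches
assert np.allclose(full, accumulated)
```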