mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
footnotes: enrich pass for ml_ops (8) and model_compression (8)
ml_ops rewrites:

- fn-telemetry-mlops: drop etymology, add distribution-shift detection consequence
- fn-model-registry-ops: reframe to failure mode prevented (shadow deployment, 30-90min rollback)
- fn-entropy-model-decay: add empirical λ ranges by domain + infrastructure cadence consequence
- fn-staging-validation-ops: sharpen ML-vs-conventional distinction (probabilistic vs. deterministic)
- fn-shadow-deploy-ml: replace body-restatement with cost/benefit threshold + asymmetric risk framing
- fn-drift-types-ops: redirect to detection lag asymmetry (feature drift vs. concept drift)
- fn-drift-covariate-shift: drop etymology, focus on Shimodaira support-assumption failure mode
- fn-ray-distributed-ml: sharpen tether to training-serving skew via silent format-translation bugs

model_compression rewrites:

- fn-pruning-lecun-1989: anchor on memory efficiency first, Hessian as mechanism
- fn-heuristic-pruning: quantify the trap (90%+ from early layers, bottlenecks preserved)
- fn-kl-divergence-distillation: add asymmetric KL consequence (calibration transfer)
- fn-nas-hardware-aware: add FLOPs-vs-latency divergence (3-5x for same FLOP count)
- fn-nas-reinforcement-learning: explain inner-loop cost mechanism (12,800-22,400 candidates)
- fn-nas-evolutionary: add mechanism bridge + weight-sharing necessity consequence
- fn-quantization-shannon: quantify tolerance (INT8 <1%, INT4 1-3%, per-model validation)
- fn-ste-gradient-trick: explain zero-gradient mechanism, STE identity substitution error
@@ -91,7 +91,7 @@ Operationalizing machine learning requires coordinating three distinct system bo
 The infrastructure components, production operations, and maturity frameworks that follow address these three interfaces systematically.

 :::

-[^fn-telemetry-mlops]: **Telemetry** (from Greek *tele*, "far," and *metron*, "measure"): The signal layer that flows through the three critical interfaces (Data-Model, Model-Infrastructure, Production-Monitoring) to close the operational feedback loop. In ML systems, telemetry must capture statistical signals (feature distributions, prediction confidence, drift indicators) alongside infrastructure signals because the former detect the silent accuracy degradation that latency and uptime metrics cannot. \index{Telemetry!ML observability}
+[^fn-telemetry-mlops]: **Telemetry**: The only feedback path that makes model degradation visible before it becomes a business failure. Unlike traditional software, where crashes and error codes surface problems immediately, ML systems degrade silently -- distribution shift can go undetected for weeks or months without statistical telemetry (feature distributions, prediction confidence, drift indicators). By that point the model has been making degraded predictions at full automation rate, accumulating compounding errors in downstream systems that no infrastructure metric would have flagged. \index{Telemetry!ML observability}

 The telemetry[^fn-telemetry-mlops] flowing through these interfaces provides the data needed for informed operational decisions. With this operational scope in view, we begin by formalizing the discipline itself: what distinguishes MLOps from traditional DevOps, what foundational principles govern all operational decisions, and what debt patterns accumulate when those principles are ignored.

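As a minimal sketch of the statistical telemetry the footnote describes (the function and field names here are illustrative, not from the chapter), a per-window snapshot might summarize feature distributions and prediction confidence alongside the usual infrastructure metrics:

```python
import statistics

def telemetry_snapshot(features, confidences, low_conf_threshold=0.6):
    """One telemetry window: statistical signals emitted alongside
    infrastructure metrics (latency, uptime). Names are illustrative."""
    return {
        "feature_mean": statistics.fmean(features),
        "feature_stdev": statistics.pstdev(features),
        "confidence_median": statistics.median(confidences),
        # Fraction of predictions below a confidence floor: a cheap,
        # label-free early-warning proxy for distribution shift.
        "low_confidence_rate": sum(c < low_conf_threshold for c in confidences)
        / len(confidences),
    }

snap = telemetry_snapshot([0.2, 0.5, 0.9], [0.95, 0.55, 0.80])
```

A rising `low_confidence_rate` with stable latency and uptime is exactly the situation infrastructure-only monitoring misses.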
@@ -888,7 +888,7 @@ def validate_no_skew(
 Data versioning allows teams to snapshot datasets at specific points in time and associate them with particular model runs, including both raw data and processed artifacts. Model versioning registers trained models as immutable artifacts\index{Immutable Artifacts!model registration} alongside metadata such as training parameters, evaluation metrics, and environment specifications. Model registries[^fn-model-registry-ops]\index{Model Registry!version promotion} provide structured interfaces for promoting, deploying, and rolling back model versions, with some supporting lineage visualization tracing the full dependency graph from raw data to deployed prediction.

-[^fn-model-registry-ops]: **Model Registry**: A registry provides the structured interface for promotion and rollback by treating models not as files, but as versioned API objects with queryable metadata and state (e.g., staging, production). This replaces brittle deployment scripts with an API-driven workflow that prevents accidental deployments, enables rollbacks in seconds, and provides the dependency graph for the full lineage tracing mentioned. \index{Model Registry!version management}
+[^fn-model-registry-ops]: **Model Registry**: Prevents "shadow deployment" -- the failure mode where the undocumented production model diverges from the trained artifact through different preprocessing, stale serialization formats, or manual hotfixes applied directly to the serving endpoint. Without a registry enforcing versioned, immutable artifacts with queryable metadata and state, rollbacks require locating the correct weights from an ad-hoc artifact store, which under incident conditions takes 30--90 minutes instead of seconds. \index{Model Registry!version management}

 These complementary practices form the lineage layer of an ML system. This layer enables introspection, experimentation, and governance. When a deployed model underperforms, lineage tools help teams answer questions such as:

@@ -1319,7 +1319,7 @@ The choice among these strategies depends on domain characteristics: scheduled r
 The decision to retrain a model\index{Retraining Economics!cost-benefit optimization} is not a matter of intuition but an engineering optimization that balances the cost of **System Entropy**[^fn-entropy-model-decay] (accuracy decay) against the cost of **Infrastructure** (retraining expense). We can think of model accuracy as a decaying quantity, analogous to radioactive decay, with a measurable rate of decline. In production, a model behaves like a radioactive isotope: it has a measurable **Half-Life**[^fn-half-life-model] after which its predictive value becomes toxic to the business.

-[^fn-entropy-model-decay]: **System Entropy** (from thermodynamics, via information theory): In ML operations, system entropy quantifies the rate at which model accuracy decays as the gap between training and production distributions widens. The metaphor is precise: like heat dissipating into the environment, prediction quality disperses irreversibly without active intervention. The retraining economics that follow formalize this decay rate as a measurable half-life, making the cost of inaction calculable. \index{Entropy!model decay}
+[^fn-entropy-model-decay]: **System Entropy**: The decay rate $\lambda$ varies by orders of magnitude across domains. Fast-moving domains (social media recommendations, financial fraud) exhibit half-lives of days to weeks; slower domains (medical imaging, industrial inspection) decay over months to years. This range determines the minimum infrastructure investment: a model with a 3-day half-life requires continuous training infrastructure, not a scheduled batch job, while a model with a 6-month half-life can retrain weekly at a fraction of the cost. \index{Entropy!model decay}

 [^fn-half-life-model]: **Half-Life** (from nuclear physics, where it measures the time for half of a radioactive sample to decay): The metaphor is mathematically precise, not merely suggestive -- the accuracy decay model that follows uses the same exponential function $e^{-\lambda t}$ that governs radioactive decay. The key insight for ML operations is that half-life is a *measurable property* of a deployed model, determinable from historical accuracy data, transforming "when should we retrain?" from a judgment call into a calculation. \index{Half-Life!model decay}

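The decay model can be made concrete in a few lines (a sketch with illustrative numbers; `decay_rate` and `half_life` are hypothetical helpers, assuming the exponential form $e^{-\lambda t}$ from the footnote):

```python
import math

def decay_rate(acc_start, acc_now, days):
    """Estimate lambda from two accuracy measurements,
    assuming acc(t) = acc_start * exp(-lambda * t)."""
    return -math.log(acc_now / acc_start) / days

def half_life(lam):
    """Time for half the model's predictive value to decay: ln(2) / lambda."""
    return math.log(2) / lam

# Illustrative: accuracy fell from 0.95 to 0.90 over 14 days in production.
lam = decay_rate(0.95, 0.90, 14)
t_half = half_life(lam)
```

With the half-life in hand, "when should we retrain?" becomes a comparison between the accumulated cost of degraded predictions over that horizon and the cost of one retraining run.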
@@ -1599,7 +1599,7 @@ Production deployment requires frameworks that handle model packaging, versionin
 Before full-scale rollout, teams deploy updated models to staging or QA environments[^fn-staging-validation-ops] to rigorously test performance.

-[^fn-staging-validation-ops]: **Staging Validation**: Staging environments test system performance by creating a full replica to surface integration bugs that unit tests miss. For ML, this validation is critical for ensuring statistical equivalence, as subtle differences in data processing libraries can cause silent prediction drift even when functional tests pass. A rollout is typically blocked if key prediction statistics diverge by more than 1-5% between the staging and production model. \index{Staging!ML validation}
+[^fn-staging-validation-ops]: **Staging Validation**: The key difference from conventional software staging: conventional staging validates deterministic correctness (does the code produce the right output?), while ML staging validates probabilistic adequacy (is the model's accuracy distribution acceptable given current data?). This makes ML staging fundamentally harder -- a model can pass all unit tests and still fail in production because the test data does not reflect the deployment distribution, making statistical equivalence checks (typically blocking rollout if prediction statistics diverge by more than 1--5%) the only reliable gate. \index{Staging!ML validation}

 Techniques such as shadow deployments[^fn-shadow-deploy-ml]\index{Shadow Deployment!production validation}, canary testing\index{Canary Deployment!risk-controlled rollout}[^fn-canary-deploy-ml], and blue-green deployment\index{Blue-Green Deployment!zero-downtime updates} validate new models incrementally. These controlled deployment strategies enable safe model validation in production. Robust rollback procedures are essential to handle unexpected issues, reverting systems to the previous stable model version to ensure minimal disruption.

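A statistical-equivalence gate of the kind the footnote describes can be sketched as follows (a minimal sketch assuming a 5% relative-divergence threshold and two illustrative statistics, mean score and positive rate; the helper name is hypothetical):

```python
def equivalence_gate(prod_preds, staging_preds, max_rel_divergence=0.05):
    """Block rollout if key prediction statistics diverge by more than
    `max_rel_divergence` between production and staging models."""
    def rel_diff(a, b):
        return abs(a - b) / max(abs(a), 1e-9)

    prod_mean = sum(prod_preds) / len(prod_preds)
    stag_mean = sum(staging_preds) / len(staging_preds)
    prod_pos = sum(p > 0.5 for p in prod_preds) / len(prod_preds)
    stag_pos = sum(p > 0.5 for p in staging_preds) / len(staging_preds)

    divergences = {
        "mean_score": rel_diff(prod_mean, stag_mean),
        "positive_rate": rel_diff(prod_pos, stag_pos),
    }
    passed = all(d <= max_rel_divergence for d in divergences.values())
    return passed, divergences
```

Real gates compare full distributions rather than two scalars, but the shape is the same: both models score the same replayed traffic, and the rollout proceeds only if the summary statistics agree within the threshold.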
@@ -1613,7 +1613,7 @@ Techniques such as shadow deployments[^fn-shadow-deploy-ml]\index{Shadow Deploym
 **The Systems Lesson**: Deployment is a systems problem, not just a code problem. Configuration drift and partial rollouts are catastrophic failure modes in automated systems [@sec2013knight].

 :::

-[^fn-shadow-deploy-ml]: **Shadow Deployment**: This technique validates a new model by running it in parallel with the current production model, processing a copy of live traffic without serving its predictions to users. It directly addresses the gap between offline metrics and production performance by safely testing the new model on real-world data patterns and latency constraints. The primary trade-off is cost, as this approach effectively doubles the required serving compute for the duration of the validation. \index{Shadow Deployment!production validation}
+[^fn-shadow-deploy-ml]: **Shadow Deployment**: Economically justified when the cost of a bad rollout (user-facing errors $\times$ user count $\times$ business impact per error) exceeds the cost of running shadow infrastructure, typically 10--20% compute overhead for duplicating inference without serving results. Below this threshold, canary deployment is preferred because it validates with real traffic at lower cost. The key insight is that shadow deployment's value is asymmetric -- it eliminates a catastrophic tail risk, not average-case error, making it essential for high-stakes models where a single bad rollout can cause irreversible damage. \index{Shadow Deployment!production validation}

 [^fn-canary-deploy-ml]: **Canary Deployment** (after the 19th-century practice of lowering caged canaries into coal mines to detect toxic gas before it harmed miners): Routes 1--5% of live traffic to a candidate model, using it as a sentinel for production health. The ML-specific challenge is that a "failure" is statistical degradation, not a deterministic crash: detecting a 2% accuracy difference with 95% confidence requires thousands of inferences, creating a tension between decision speed and statistical power that determines minimum canary duration. \index{Canary Deployment!statistical validation}

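The canary-duration tension can be quantified with the standard two-proportion sample-size approximation (a sketch; `canary_sample_size` is a hypothetical helper, and the z-values are the conventional defaults for 95% confidence and 80% power):

```python
import math

def canary_sample_size(p_prod, p_canary, z_alpha=1.96, z_beta=0.84):
    """Per-arm inferences needed to detect the accuracy gap
    p_prod - p_canary (two-proportion z-test approximation)."""
    var = p_prod * (1 - p_prod) + p_canary * (1 - p_canary)
    effect = (p_prod - p_canary) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * var / effect)

# Detecting a 2-point accuracy drop (90% -> 88%) needs a few thousand
# inferences per arm, which at 1-5% canary traffic sets the minimum duration.
n_per_arm = canary_sample_size(0.90, 0.88)
```

Halving the detectable gap quadruples the required sample, which is why small accuracy regressions take disproportionately long canary windows to catch.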
@@ -1718,7 +1718,7 @@ To maintain lineage and auditability, teams track model artifacts, including scr
 These tools and practices, along with distributed orchestration frameworks like Ray[^fn-ray-distributed-ml], enable teams to deploy ML models resiliently, ensuring smooth transitions between versions, maintaining production stability, and optimizing performance across diverse use cases.

-[^fn-ray-distributed-ml]: **Ray**: A distributed computing framework from UC Berkeley that treats training, tuning, and serving as tasks within a single, unified scheduler. This design directly mitigates a primary source of production instability—the code and configuration "drift" that occurs when translating a model from a training framework to a separate serving framework. By managing the entire inference graph in one system, its shared-memory object store eliminates the serialization overhead of inter-container network calls, directly reducing the data-movement term that dominates latency in pipelined serving deployments. \index{Ray!distributed ML}
+[^fn-ray-distributed-ml]: **Ray**: A distributed computing framework from UC Berkeley (2018) that treats training, tuning, and serving as tasks within a single unified scheduler. The consequence of fragmented training/serving infrastructure is not just latency -- it is that bugs introduced during format translation (preprocessing logic, normalization constants, tokenizer versions) are silent until they cause prediction errors in production. This is the training-serving skew failure mode, and unified scheduling eliminates it by keeping the entire pipeline under one code path and one shared-memory object store. \index{Ray!distributed ML}

 #### Model Format Optimization {#sec-ml-operations-model-format-optimization-c9d6}

@@ -2102,11 +2102,11 @@ days_needed_low_str = DriftDetectionDelay.days_needed_low_str
 Production ML systems face two distinct forms of model drift[^fn-drift-types-ops] that monitoring must distinguish. *Concept drift*\index{Concept Drift!changing relationships}[^fn-covid-concept-drift] occurs when the underlying relationship between features and targets evolves: the function $P(Y|X)$ changes even though the inputs look similar. During the COVID-19 pandemic, for example, purchasing behavior shifted dramatically, invalidating many previously accurate recommendation models. *Data drift*[^fn-drift-covariate-shift], by contrast, refers to shifts in the input distribution $P(X)$ itself. In applications such as self-driving cars, this may result from seasonal changes in weather, lighting, or road conditions, all of which alter the model's inputs without changing the underlying physics of driving.

-[^fn-drift-types-ops]: **Drift in Practice**: Data drift shifts $P(X)$ (e.g., webcam purchases surging in 2020) while concept drift shifts $P(Y|X)$ (e.g., users changing click behavior for the same content). The distinction determines the monitoring strategy: data drift is detectable without ground truth labels through distributional tests, whereas concept drift requires labeled samples or proxy metrics, making it slower and more expensive to detect. See @sec-data-engineering for the formal treatment. \index{Drift!operational distinction}
+[^fn-drift-types-ops]: **Drift Detection Lag**: Feature drift (covariate shift on $P(X)$) is detectable immediately from input distributions -- no labels needed. Concept drift ($P(Y|X)$ changing) is invisible until ground truth arrives, which in high-stakes domains (medical diagnosis, fraud detection, legal decisions) can take days, weeks, or months. This asymmetry means the most dangerous drift is also the slowest to detect, requiring proxy metrics (prediction confidence distributions, output entropy) as imperfect early warning systems that trade false alarm rate for detection speed. \index{Drift!operational distinction}

 [^fn-covid-concept-drift]: **COVID-19 ML Impact**: COVID-era behavior changes provide a canonical example of abrupt concept drift: demand patterns and user behavior shifted faster than any retraining pipeline could respond. Many recommendation and pricing systems required emergency manual intervention because their scheduled retraining cadences assumed gradual drift, not discontinuous distribution shifts, exposing a gap in cost-aware automation planning. \index{COVID-19!concept drift}

-[^fn-drift-covariate-shift]: **Drift** (from meteorology, where it describes gradual movement off course due to external forces): The statistical concept of "covariate shift" was formalized by Shimodaira in 2000, but "drift" gained adoption in ML operations because it captures the essential operational insight: production data moves away from training distributions continuously and silently, making time-to-detection the critical operational metric. \index{Drift!covariate shift}
+[^fn-drift-covariate-shift]: **Covariate Shift**: Shimodaira's importance weighting correction (2000) assumes the support of the training distribution covers the deployment distribution -- that every deployment input *could* have appeared in training, just with different probability. When deployment contains genuinely out-of-distribution inputs (new product categories, new demographics, adversarial inputs), the correction fails entirely and the model produces confidently wrong outputs with no warning signal, making support coverage the hidden assumption that determines whether drift correction or full retraining is required. \index{Drift!covariate shift}

 Both forms of drift motivate a formal definition:

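The label-free side of this asymmetry, detecting shifts in $P(X)$ from input distributions alone, is often implemented with the Population Stability Index (PSI). A minimal pure-Python sketch (the binning scheme and the 0.1/0.25 thresholds are one common industry convention, not the chapter's prescribed method):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between training-time and live samples of
    one feature. Rule of thumb: <0.1 stable, 0.1-0.25 moderate, >0.25 major drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        n = len(xs)
        return [max(c / n, 1e-4) for c in counts]  # floor avoids log(0)

    p, q = hist(expected), hist(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because PSI needs only the inputs, it fires as soon as $P(X)$ moves; concept drift in $P(Y|X)$ produces no PSI signal at all, which is exactly why it must wait on labels or proxy metrics.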
@@ -684,7 +684,7 @@ Consider a MobileNet trained for image classification on a wearable health monit
 Pruning[^fn-pruning-lecun-1989] directly addresses memory efficiency constraints by eliminating redundant parameters. Because neural networks carry far more weights than any single task demands (as established above), we can remove a significant fraction without substantial performance degradation. The central questions are *what* to prune (individual weights versus entire structures), *how* to decide what is expendable (magnitude, gradients, or activations), and *when* to prune (after training, during training, or even at initialization). As we will explore in @sec-hardware-acceleration, specialized hardware can further exploit the resulting sparse structures.

-[^fn-pruning-lecun-1989]: **Optimal Brain Damage (LeCun, 1989)**: Yann LeCun, John Denker, and Sara Solla at Bell Labs introduced pruning as a principled technique by using second-derivative (Hessian) information to identify which weights could be removed with minimal impact on the loss function. The method achieved 4$\times$ parameter reduction in a handwriting recognizer without accuracy loss. Hessian computation costs $O(n^2)$ for $n$ parameters, which is why magnitude-based pruning --- despite its theoretical inferiority --- became the practical standard at modern scale. \index{Pruning!Optimal Brain Damage}
+[^fn-pruning-lecun-1989]: **Optimal Brain Damage (LeCun, 1989)**: LeCun, Denker, and Solla at Bell Labs achieved 4$\times$ parameter reduction --- and proportional memory savings --- in a handwriting recognizer by using second-derivative (Hessian) information to identify weights whose memory cost exceeded their accuracy contribution. The Hessian measures how much the loss increases when a weight is zeroed, directly ranking weights by their information-per-byte efficiency. However, Hessian computation costs $O(n^2)$ for $n$ parameters, which is why magnitude-based pruning --- despite its theoretical inferiority --- became the practical standard at modern scale, where computing the Hessian itself would exceed the memory budget of the model it aims to compress. \index{Pruning!Optimal Brain Damage}

 ::: {.callout-definition title="Pruning"}

@@ -705,7 +705,7 @@ $$
 \index{NP-hard!pruning optimization}
 where $\|\hat{W}\|_0$ is the **L0-norm** (the count of non-zero parameters). Since minimizing the L0-norm is NP-hard, we use heuristics[^fn-heuristic-pruning] like **magnitude-based pruning**. @lst-pruning_example demonstrates this approach, removing weights with small absolute values to transform a dense weight matrix into the sparse representation visualized in @fig-sparse-matrix.

-[^fn-heuristic-pruning]: **Heuristic**: From Greek *heuriskein* (to discover), the same root as Archimedes' "eureka." In pruning, the dominant heuristic --- larger magnitude means more important --- works well empirically but creates a systems trap: counterexamples exist where small weights in early layers carry disproportionate influence, and pruning them collapses accuracy. This is why iterative prune-retrain cycles outperform one-shot approaches: each cycle lets the network redistribute importance before the next cut. \index{Heuristic!pruning}
+[^fn-heuristic-pruning]: **Heuristic**: From Greek *heuriskein* (to discover), the same root as Archimedes' "eureka." In pruning, the dominant heuristic --- larger magnitude means more important --- works well empirically but creates a systems trap: magnitude-based pruning applied globally removes 90%+ of parameters from overparameterized early layers while leaving critical bottleneck layers largely intact, giving the appearance of aggressive compression while preserving most of the compute and memory cost in the layers that matter. This is why iterative prune-retrain cycles with per-layer budgets outperform naive global magnitude pruning: each cycle lets the network redistribute importance before the next cut. \index{Heuristic!pruning}

 \index{Pruning!binary mask}
 \index{Hadamard Product!pruning mask}

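The per-layer-budget remedy can be sketched directly (an illustrative helper, not the chapter's listing): each layer loses the same fraction of its smallest-magnitude weights, so no layer is stripped bare while another is left dense.

```python
import numpy as np

def prune_per_layer(layers, sparsity=0.5):
    """Magnitude pruning with a per-layer budget: zero out the `sparsity`
    fraction of smallest-magnitude weights in each layer independently,
    rather than ranking all weights in one global pool."""
    pruned = []
    for W in layers:
        k = int(W.size * sparsity)
        # k-th smallest magnitude becomes the keep threshold for this layer.
        thresh = np.sort(np.abs(W), axis=None)[k]
        mask = (np.abs(W) >= thresh).astype(W.dtype)
        pruned.append(W * mask)
    return pruned
```

A global variant would pool `np.abs` values across all layers before choosing one threshold, which is exactly how the footnote's trap arises: the threshold lands inside the overparameterized layers and barely touches the bottlenecks.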
@@ -2234,7 +2234,7 @@ anchor=north]{Teacher model};
 1. **Distillation Loss**\index{Knowledge Distillation!distillation loss}: Typically the Kullback-Leibler (KL) divergence\index{Kullback-Leibler divergence}[^fn-kl-divergence-distillation] between the teacher's softened output distribution and the student's distribution.
 2. **Student Loss**\index{Knowledge Distillation!student loss}: Standard cross-entropy loss against the ground-truth hard labels.

-[^fn-kl-divergence-distillation]: **Kullback-Leibler (KL) Divergence**: Introduced by Kullback and Leibler at the NSA in 1951 for cryptanalysis, KL(P||Q) quantifies the extra bits needed to encode samples from distribution P using a code optimized for Q. In distillation, KL divergence is the natural loss because it measures exactly how much of the teacher's probability structure the student fails to capture. Its asymmetry (KL(teacher||student) $\neq$ KL(student||teacher)) creates a design choice: the standard direction penalizes the student for missing modes the teacher covers, preventing overconfident compression. \index{Kullback-Leibler Divergence!distillation loss}
+[^fn-kl-divergence-distillation]: **Kullback-Leibler (KL) Divergence**: Introduced by Kullback and Leibler at the NSA in 1951 for cryptanalysis, KL(P||Q) quantifies the extra bits needed to encode samples from distribution P using a code optimized for Q. The key asymmetric consequence: KL(teacher||student) penalizes the student heavily for assigning zero probability to teacher-probable outputs, forcing the student to maintain broad coverage of the teacher's distribution --- including low-probability "soft labels" that carry the teacher's learned uncertainty. This is why distillation transfers *calibration* as well as accuracy, while standard cross-entropy training against hard labels produces poorly calibrated models that are overconfident on ambiguous inputs. \index{Kullback-Leibler Divergence!distillation loss}

 ##### Distillation Mathematics {#sec-model-compression-distillation-mathematics-4af6}

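A minimal sketch of the distillation loss in the KL(teacher||student) direction, with temperature-softened distributions and the $T^2$ gradient-scale factor from Hinton et al. (2015); the function names are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T spreads probability mass."""
    z = np.exp((logits - logits.max()) / T)
    return z / z.sum()

def distillation_kl(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 so its
    gradient magnitude matches the hard-label cross-entropy term."""
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)
    return T * T * np.sum(p * np.log(p / q))
```

Because `p` weights every term, classes the teacher considers even slightly probable contribute to the loss whenever the student zeroes them out, which is the coverage-forcing behavior the footnote describes.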
@@ -2678,7 +2678,7 @@ Pruning, knowledge distillation, and other techniques explored in previous secti
 The three-stage feedback loop in @fig-nas-flow captures the essence of how NAS works. NAS[^fn-nas-hardware-aware] operates through three interconnected stages: defining the search space (architectural components and constraints), applying search strategies (reinforcement learning [@zoph2017neural], evolutionary algorithms, or gradient-based methods) to explore candidate architectures, and evaluating performance to ensure discovered designs satisfy accuracy and efficiency objectives. The key insight is that this feedback loop allows the search to learn from each evaluation, progressively focusing on promising regions of the architecture space. This automation enables the discovery of novel architectures that often match or surpass human-designed models while requiring substantially less expert effort.

-[^fn-nas-hardware-aware]: **Hardware-Aware NAS**: The three-stage loop becomes most powerful when the evaluation stage measures *actual* hardware latency rather than proxy metrics like FLOPs. MnasNet (Google, 2019) fed measured on-device latency back into the search, discovering architectures 1.8$\times$ faster than MobileNetV2 at higher accuracy. The insight: optimal architectures differ across mobile CPUs, GPUs, and TPUs because each platform has different memory hierarchies, so a FLOPs-optimal architecture can be latency-suboptimal on specific hardware. \index{NAS!hardware-aware}
+[^fn-nas-hardware-aware]: **Hardware-Aware NAS**: The critical systems decision in the three-stage loop is what the evaluation stage measures. Hardware-unaware NAS optimizes for FLOPs (a proxy), while hardware-aware NAS optimizes for actual latency on the target device --- and the two can diverge by 3--5$\times$ for the same FLOP count due to memory access patterns and operator support. MnasNet (Google, 2019) fed measured on-device latency back into the search, discovering architectures 1.8$\times$ faster than MobileNetV2 at higher accuracy, because depthwise separable convolutions that look efficient in FLOPs are memory-bandwidth-bound on mobile CPUs. \index{NAS!hardware-aware}

 ::: {#fig-nas-flow fig-env="figure" fig-pos="htb" fig-cap="**Neural Architecture Search Flow**: Three components form a feedback loop: a Search Space defines candidate operations, a Search Strategy selects architectures, and a Performance Estimation Strategy evaluates each candidate. The strategy iterates by feeding performance estimates back into the search until convergence." fig-alt="Three-box flowchart showing NAS process. Search Space box feeds into Search Strategy box, which exchanges Architecture and Performance estimate with Performance Estimation Strategy box in a feedback loop."}
 ```{.tikz}
@@ -2824,9 +2824,9 @@ learning model architecture parameters and weights together};
 The effectiveness of NAS depends on three design decisions: what architectures to search over (the search space), how to explore that space efficiently (the search strategy[^fn-nas-reinforcement-learning][^fn-nas-evolutionary]), and how to evaluate each candidate's fitness for deployment. The following subsections formalize each decision, beginning with the optimization problem that NAS must solve.

-[^fn-nas-reinforcement-learning]: **Reinforcement Learning NAS**: Uses an RL controller network to generate architectures, with accuracy as the reward signal. Google's NASNet controller required 22,400 GPU-days (800 GPUs for 28 days) but discovered architectures achieving 82.7% ImageNet accuracy, 28% better than human-designed ResNet at similar FLOP budgets. The search cost itself is a systems constraint: at roughly \$50,000--\$100,000 in 2017 cloud prices, NAS was initially accessible only to well-funded labs. \index{NAS!reinforcement learning cost}
+[^fn-nas-reinforcement-learning]: **Reinforcement Learning NAS**: An RL controller network generates architecture descriptions, each candidate is trained to convergence, and the resulting validation accuracy serves as the reward signal. The expense comes from this inner loop: each reward evaluation requires training a full candidate network, and Zoph and Le (2017) evaluated 12,800--22,400 candidates --- totaling 22,400 GPU-days (800 GPUs for 28 days) at roughly \$50,000--\$100,000 in 2017 cloud prices. This cost made NAS initially accessible only to well-funded labs and drove the development of weight-sharing methods that amortize training across candidates. \index{NAS!reinforcement learning cost}

-[^fn-nas-evolutionary]: **Evolutionary NAS**: Treats architectures as genomes evolved through mutation (adding/removing layers) and crossover (combining parents). AmoebaNet required 3,150 GPU-days to reach 83.9% ImageNet accuracy, and regularized evolution outperformed RL-based NAS in head-to-head comparisons. Modern weight-sharing approaches reduce search cost by 1,000$\times$, transforming NAS from a datacenter-scale experiment into a practitioner-accessible tool. \index{NAS!evolutionary search}
+[^fn-nas-evolutionary]: **Evolutionary NAS**: Maintains a population of architecture candidates, selects parents by validation accuracy, and generates offspring by mutation (adding/removing layers, changing filter sizes) and crossover (combining parent subgraphs). The key efficiency insight: good components --- skip connections, depthwise separable convolutions --- are preserved and recombined across generations rather than rediscovered from scratch, making evolutionary search more sample-efficient than random search. AmoebaNet required 3,150 GPU-days to reach 83.9% ImageNet accuracy, outperforming RL-based NAS in head-to-head comparisons, but the thousands of candidate evaluations remain impractical without weight-sharing or proxy tasks to reduce the inner-loop cost. \index{NAS!evolutionary search}

 #### The NAS Optimization Problem {#sec-model-compression-nas-optimization-problem-7f8e}

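The inner-loop arithmetic can be checked in a few lines (a sketch; the ~42 GPU-hours per candidate is an assumed figure chosen to show how 12,800 candidate trainings reach datacenter scale):

```python
def nas_search_cost(n_candidates, gpu_hours_per_candidate, n_gpus):
    """Inner-loop cost of vanilla RL NAS, where every reward evaluation
    trains one candidate to convergence.
    Returns (total GPU-days, wall-clock days on n_gpus)."""
    total_gpu_hours = n_candidates * gpu_hours_per_candidate
    gpu_days = total_gpu_hours / 24
    return gpu_days, gpu_days / n_gpus

# Assumed ~42 GPU-hours per CIFAR-scale candidate (illustrative):
gpu_days, wall_days = nas_search_cost(12_800, 42, n_gpus=800)
```

Weight-sharing methods attack the `gpu_hours_per_candidate` term: by amortizing one supernet training across all candidates, the per-candidate evaluation drops from hours of training to a forward-pass-scale measurement.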
@@ -2943,7 +2943,7 @@ A `{python} llm_7b_str` billion parameter language model stored in FP16 consumes
 \index{Shannon, Claude!quantization theory}
 Quantization[^fn-quantization-shannon] affects every neural network weight and activation stored at some numerical precision: FP32 (32 bits), FP16 (16 bits), INT8 (8 bits), or lower.

-[^fn-quantization-shannon]: **Quantization**: Rooted in Shannon's theory of representing continuous signals with discrete values, reducing FP32 to INT8 collapses over 4 billion representable values to just 256. The reason this applies to *every* weight and activation without catastrophic accuracy loss is that trained neural networks concentrate their information in relative magnitudes, not absolute precision --- a property that makes them uniquely tolerant of quantization error compared to other numerical software. \index{Quantization!etymology}
+[^fn-quantization-shannon]: **Quantization**: Rooted in Shannon's theory of representing continuous signals with discrete values, reducing FP32 to INT8 collapses over 4 billion representable values to just 256. Neural networks tolerate this because trained weights concentrate information in relative magnitudes, not absolute precision: INT8 typically causes <1% accuracy drop, while INT4 causes 1--3% for standard tasks. However, this tolerance is not universal --- models trained with large learning rates or sparse activations are less tolerant, and vision tasks tolerate quantization better than language generation. The systems consequence: quantization viability must be validated per-model and per-task, not assumed from aggregate benchmarks. \index{Quantization!etymology}

 This choice directly impacts three system properties. Memory shrinks because an INT8 model is 4$\times$ smaller than FP32, enabling deployment on devices that could never hold the full-precision weights. Bandwidth demand drops proportionally: loading INT8 weights requires 4$\times$ less memory traffic, directly accelerating the bandwidth-bound inference that dominates LLM generation. Compute cost falls as well, since INT8 arithmetic is faster and cheaper than FP32 on most hardware with dedicated low-precision units [@gupta2015deep; @wang2019benchmarking].

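Symmetric per-tensor INT8 quantization, one common scheme among several, can be sketched in a few lines (scale choice and clipping are simplified relative to production kernels):

```python
import numpy as np

def quantize_int8(w):
    """Map the symmetric range [-max|w|, +max|w|] onto integers [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
# Worst-case rounding error is half a quantization step (scale / 2).
max_err = float(np.abs(w - dequantize(q, scale)).max())
```

The per-model validation the footnote calls for amounts to running exactly this round trip on real weights and activations, then measuring task accuracy rather than trusting the error bound alone.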
@@ -4650,7 +4650,7 @@ where $q$ represents the simulated quantized value, $x$ denotes the full-precisi
 \index{Bengio, Yoshua!straight-through estimator}
 Although the forward pass utilizes quantized values, gradient calculations during backpropagation remain in full precision. This is accomplished by the Straight-Through Estimator (STE)\index{Straight-Through Estimator (STE)}[^fn-ste-gradient-trick], which approximates the gradient of the quantized function by treating the rounding operation as if it had a derivative of one. In effect, the STE pretends quantization is the identity function during backpropagation, allowing gradients to flow unchanged through otherwise non-differentiable operations. This approach prevents the gradient from being obstructed by the non-differentiable quantization operation, thereby allowing effective model training [@bengio2013estimating].

-[^fn-ste-gradient-trick]: **Straight-Through Estimator (STE)**: Proposed by Bengio et al. (2013), the STE is mathematically unjustified --- rounding has true gradient zero almost everywhere --- yet it works because neural network loss landscapes are smooth enough that the identity approximation produces useful update directions. The systems consequence: without this trick, quantization-aware training would be impossible, and every quantized deployment would rely on post-training methods with their larger accuracy gaps. \index{STE!quantization-aware training}
+[^fn-ste-gradient-trick]: **Straight-Through Estimator (STE)**: Proposed by Bengio et al. (2013), the STE substitutes the identity function for the true gradient of rounding, which is zero almost everywhere (rounding is piecewise constant). The substitution ignores quantization boundaries --- a weight at 0.499 and one at 0.501 receive identical gradients despite rounding to opposite values --- so updates for weights near boundaries carry systematic error. QAT compensates by letting the model adapt to these systematic gradient errors during training, which is why QAT recovers accuracy that post-training quantization cannot. \index{STE!quantization-aware training}

 Integrating quantization effects during training enables the model to learn weight and activation distributions that minimize numerical precision loss. The resulting model, when deployed using true low-precision arithmetic (e.g., INT8 inference), maintains significantly higher accuracy than one that is quantized post hoc [@krishnamoorthi2018quantizing].

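The STE's forward/backward asymmetry reduces to a few lines (a conceptual sketch, not a framework implementation; autodiff frameworks express the same idea with a custom-gradient hook):

```python
import numpy as np

def fake_quant_forward(w, scale):
    """Forward pass of quantization-aware training: weights are snapped
    to the quantization grid before computing activations."""
    return np.round(w / scale) * scale

def ste_backward(grad_output):
    """Backward pass: rounding's true derivative is zero almost everywhere,
    so the STE substitutes the identity and passes the gradient through."""
    return grad_output
```

The full-precision "shadow" weights are what the optimizer actually updates; only the forward pass sees their quantized copies, which is how the model learns distributions that survive rounding.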