footnotes: Group C citation integrity pass (Vol1)

- model_compression/fn-int8-energy-deployment: add [@horowitz2014computing] for 200× DRAM/MAC energy claim
- ml_ops/fn-ray-distributed-ml: replace unverifiable "10x" with mechanism-based framing (serialization overhead removal)
- ml_ops/fn-youtube-feedback-loop: replace unverifiable "2 years" with qualitative multi-year framing
- hw_acceleration/fn-hbm-bandwidth-cost: replace unverifiable "50% of BOM" with qualitative "dominant cost component"
Vijay Janapa Reddi
2026-02-24 08:43:34 -05:00
parent 704f7555fe
commit 43b8f35f85
3 changed files with 4 additions and 4 deletions

View File

@@ -716,7 +716,7 @@ Neural networks are characterized by three unique properties that drive this shi
The primary engineering challenge is no longer "how fast can we calculate?" but "how close can we keep the data to the calculation?" In modern accelerators, accessing data from external memory (DRAM)\index{DRAM!energy cost} can consume 100$\times$ more energy than the actual arithmetic operation. This disparity is precisely why the accelerator architecture in @fig-accelerator-anatomy prioritizes high-bandwidth memory (HBM)\index{HBM!High Bandwidth Memory}[^fn-hbm-bandwidth-cost] and large on-chip scratchpads\index{Scratchpad Memory!accelerator design} over simply adding more compute units.
-[^fn-hbm-bandwidth-cost]: **HBM (High Bandwidth Memory)**: Achieves 2--3 TB/s bandwidth through 3D die stacking with thousands of through-silicon vias (TSVs), compared to 500--700 GB/s for GDDR6X. This 3--5$\times$ bandwidth advantage transforms memory-bound ML workloads toward compute-bound performance, which is why every datacenter AI accelerator (H100, A100, TPUv4) uses HBM. The trade-off is cost: HBM accounts for roughly 50% of an accelerator's bill of materials, limiting it to datacenter hardware where the bandwidth-per-dollar justifies the premium over consumer-grade GDDR. \index{HBM!bandwidth-cost trade-off}
+[^fn-hbm-bandwidth-cost]: **HBM (High Bandwidth Memory)**: Achieves 2--3 TB/s bandwidth through 3D die stacking with thousands of through-silicon vias (TSVs), compared to 500--700 GB/s for GDDR6X. This 3--5$\times$ bandwidth advantage transforms memory-bound ML workloads toward compute-bound performance, which is why every datacenter AI accelerator (H100, A100, TPUv4) uses HBM. The trade-off is cost: HBM is a dominant cost component in datacenter AI accelerators, limiting it to applications where the bandwidth-per-dollar justifies the substantial premium over consumer-grade GDDR. \index{HBM!bandwidth-cost trade-off}
To see how accelerators address this integration bottleneck in practice, examine the architectural blueprint in @fig-accelerator-anatomy. Notice how every design decision, from the processing element grid to the multi-level cache hierarchy, targets data movement reduction rather than raw compute multiplication.
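The bandwidth figures in the footnote above imply a roofline-style trade-off that is easy to check by hand. A minimal sketch, using the footnote's stated bandwidth ranges; the 400 TFLOP/s peak-compute figure is an assumed placeholder for a generic datacenter accelerator, not a specific chip:

```python
# Roofline back-of-envelope: how memory bandwidth sets the arithmetic
# intensity (FLOPs per byte) a kernel needs before compute, not memory,
# becomes the bottleneck. Bandwidth figures come from the footnote; the
# peak-compute number is an assumed placeholder.

PEAK_FLOPS = 400e12   # assumed peak throughput, FLOP/s (illustrative)
HBM_BW     = 2.5e12   # ~2-3 TB/s HBM (footnote), bytes/s
GDDR_BW    = 0.6e12   # ~500-700 GB/s GDDR6X (footnote), bytes/s

def ridge_point(peak_flops, bandwidth_bps):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being
    memory-bound: below this value, bandwidth caps throughput."""
    return peak_flops / bandwidth_bps

hbm_ridge  = ridge_point(PEAK_FLOPS, HBM_BW)    # 160 FLOP/byte
gddr_ridge = ridge_point(PEAK_FLOPS, GDDR_BW)   # ~667 FLOP/byte

# HBM's 3-5x bandwidth advantage lowers the compute-bound bar by the
# same factor, which is the footnote's "memory-bound toward
# compute-bound" claim in quantitative form.
print(f"HBM ridge:  {hbm_ridge:.0f} FLOP/byte")
print(f"GDDR ridge: {gddr_ridge:.0f} FLOP/byte")
print(f"ratio: {gddr_ridge / hbm_ridge:.1f}x")
```

The same calculation explains why the footnote frames HBM's premium in bandwidth-per-dollar terms: workloads whose arithmetic intensity sits between the two ridge points only reach peak compute on the HBM part.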

View File

@@ -663,7 +663,7 @@ The debt patterns described above are not theoretical constructs. They have play
\index{Feedback Loops!YouTube case study}YouTube's recommendation engine has faced repeated criticism for promoting sensational or polarizing content[^fn-youtube-feedback-loop]. Much of this stems from feedback loop debt: recommendations influence user behavior, which in turn becomes training data. Over time, this led to unintended content amplification. Mitigating this required substantial architectural overhauls, including cohort-based evaluation, delayed labeling, and more explicit disentanglement between engagement metrics and ranking logic.
-[^fn-youtube-feedback-loop]: **YouTube Recommendation System**: The feedback loop debt originated from the system's primary goal to maximize watch time, which mistook high engagement with sensational content for a signal of quality. This architectural flaw amplified such content because the model was optimizing a proxy metric (engagement) instead of a true objective like user satisfaction. Fixing this required over 2 years of engineering work to disentangle ranking logic from raw engagement metrics. \index{YouTube!feedback loop debt}
+[^fn-youtube-feedback-loop]: **YouTube Recommendation System**: The feedback loop debt originated from the system's primary goal to maximize watch time, which mistook high engagement with sensational content for a signal of quality. This architectural flaw amplified such content because the model was optimizing a proxy metric (engagement) instead of a true objective like user satisfaction. Disentangling ranking logic from raw engagement metrics required substantial multi-year engineering effort, a corrective cost that could not begin until the proxy-metric problem was diagnosed, not just observed. \index{YouTube!feedback loop debt}
#### Zillow: Correction Cascade Failure {#sec-ml-operations-zillow-correction-cascade-failure-3dd8}
@@ -1724,7 +1724,7 @@ To maintain lineage and auditability, teams track model artifacts, including scr
These tools and practices, along with distributed orchestration frameworks like Ray[^fn-ray-distributed-ml], enable teams to deploy ML models resiliently, ensuring smooth transitions between versions, maintaining production stability, and optimizing performance across diverse use cases.
-[^fn-ray-distributed-ml]: **Ray**: A distributed computing framework from UC Berkeley that treats training, tuning, and serving as tasks within a single, unified scheduler. This design directly mitigates a primary source of production instability—the code and configuration "drift" that occurs when translating a model from a training framework to a separate serving framework. By managing the entire inference graph in one system, its shared-memory object store can reduce the latency of data handoffs between pipeline stages (e.g., pre-processing and model execution) by over 10x compared to multi-container deployments relying on network calls. \index{Ray!distributed ML}
+[^fn-ray-distributed-ml]: **Ray**: A distributed computing framework from UC Berkeley that treats training, tuning, and serving as tasks within a single, unified scheduler. This design directly mitigates a primary source of production instability—the code and configuration "drift" that occurs when translating a model from a training framework to a separate serving framework. By managing the entire inference graph in one system, its shared-memory object store eliminates the serialization overhead of inter-container network calls, directly reducing the data-movement term that dominates latency in pipelined serving deployments. \index{Ray!distributed ML}
#### Model Format Optimization {#sec-ml-operations-model-format-optimization-c9d6}
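The serialization-overhead mechanism the revised footnote appeals to can be made concrete with the standard library alone. This is a generic sketch of the two handoff patterns (a pickled, network-style copy vs. a named shared-memory mapping), not Ray's actual implementation, which generalizes the second pattern in its object store:

```python
# Two ways to hand a large tensor-like payload between pipeline stages:
#  (a) serialize + copy, as a network call between containers would, vs.
#  (b) place it once in shared memory and let readers attach by name,
#      the pattern that avoids per-hop serialization entirely.
import pickle
from multiprocessing import shared_memory

payload = bytes(16 * 1024 * 1024)      # 16 MiB stand-in for a tensor

# (a) Network-style handoff: every hop pays a full serialize + copy.
blob = pickle.dumps(payload)           # extra full copy, plus framing bytes
restored = pickle.loads(blob)          # and another full copy on the receiver

# (b) Shared-memory handoff: one write into the store, readers map it.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload                     # single write
reader = shared_memory.SharedMemory(name=shm.name)   # attach by name, no copy
header = bytes(reader.buf[:8])         # reader touches only what it needs

reader.close()
shm.close()
shm.unlink()

print(len(blob) - len(payload), "bytes of pickle framing overhead")
```

In pattern (a) the payload is materialized three times (sender buffer, wire blob, receiver buffer); in pattern (b) it exists once, which is the data-movement term the footnote says dominates pipelined serving latency.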

View File

@@ -3135,7 +3135,7 @@ In FP32, even the compact DS-CNN architecture consumes 4$\times$ more memory ban
Beyond direct compute savings, reducing numerical precision has a significant impact on memory energy consumption, which often dominates total system power. Lower-precision representations reduce data storage requirements and memory bandwidth usage, leading to fewer and more efficient memory accesses. Accessing memory, particularly off-chip DRAM, is far more energy-intensive than performing arithmetic operations: DRAM accesses require orders of magnitude more energy (1.3-2.6 nJ) compared to cache accesses (e.g., 10 pJ for an 8 KB L1 cache access). An instruction's total energy can therefore be dominated by memory access patterns rather than computation[^fn-int8-energy-deployment].
-[^fn-int8-energy-deployment]: **INT8 Energy Impact**: The energy dominance of memory access is extreme: a single 64-bit DRAM read costs roughly 200$\times$ the energy of an INT8 multiply-accumulate. Quantizing from FP32 to INT8 attacks this disparity on both fronts --- 4$\times$ fewer bytes moved *and* cheaper arithmetic per operation. Concretely, quantizing MobileNetV2 from FP32 to INT8 reduces total energy per inference by 6.6$\times$, with most savings coming from reduced memory traffic rather than cheaper arithmetic. \index{Quantization!energy impact}
+[^fn-int8-energy-deployment]: **INT8 Energy Impact**: The energy dominance of memory access is extreme: a single 64-bit DRAM read costs roughly 200$\times$ the energy of an INT8 multiply-accumulate [@horowitz2014computing]. Quantizing from FP32 to INT8 attacks this disparity on both fronts --- 4$\times$ fewer bytes moved *and* cheaper arithmetic per operation. Concretely, quantizing MobileNetV2 from FP32 to INT8 reduces total energy per inference by 6.6$\times$, with most savings coming from reduced memory traffic rather than cheaper arithmetic. \index{Quantization!energy impact}
Reducing numerical precision thus improves efficiency on two fronts: faster computation and less data movement. This dual benefit is especially valuable for hardware accelerators and edge devices, where memory bandwidth and power efficiency are binding constraints.
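The footnote's two-front argument can be checked with simple arithmetic. The sketch below uses the text's figures (a ~2 nJ 64-bit DRAM read and the ~200$\times$ DRAM/MAC disparity); the 20$\times$ FP32-vs-INT8 MAC cost ratio and the one-million-weight layer are assumed placeholders, so the resulting ratio illustrates the structure of the savings rather than reproducing the 6.6$\times$ MobileNetV2 measurement:

```python
# Back-of-envelope energy model for one layer's weight traffic plus MACs,
# showing why most INT8 savings come from memory, not arithmetic.
# DRAM figure and the 200x disparity come from the text; the FP32/INT8
# MAC ratio and layer size are assumed placeholders.

DRAM_READ_NJ = 2.0                  # per 64-bit (8-byte) DRAM read
INT8_MAC_NJ  = DRAM_READ_NJ / 200   # the ~200x disparity cited in the text
FP32_MAC_NJ  = 20 * INT8_MAC_NJ     # assumed ~20x MAC-cost gap (placeholder)

def layer_energy_nj(n_weights, bytes_per_weight, mac_nj):
    """Split a layer's energy into DRAM traffic and arithmetic terms."""
    dram_reads = n_weights * bytes_per_weight / 8   # 8 bytes per read
    mem = dram_reads * DRAM_READ_NJ
    compute = n_weights * mac_nj                    # one MAC per weight
    return mem, compute

mem32, cmp32 = layer_energy_nj(1_000_000, 4, FP32_MAC_NJ)  # FP32: 4 B/weight
mem8,  cmp8  = layer_energy_nj(1_000_000, 1, INT8_MAC_NJ)  # INT8: 1 B/weight

savings = (mem32 + cmp32) / (mem8 + cmp8)
print(f"FP32: mem={mem32:.0f} nJ, compute={cmp32:.0f} nJ")
print(f"INT8: mem={mem8:.0f} nJ, compute={cmp8:.0f} nJ")
print(f"total ratio: {savings:.1f}x; INT8 memory share: "
      f"{mem8 / (mem8 + cmp8):.0%}")
```

The memory term shrinks exactly 4$\times$ (bytes moved), and because it dwarfs the compute term at both precisions, it accounts for nearly all of the total savings, which is the footnote's point about where the energy goes.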