diff --git a/book/quarto/contents/vol1/backmatter/appendix_machine.qmd b/book/quarto/contents/vol1/backmatter/appendix_machine.qmd
index 3a8cb8e62..6b18f99d4 100644
--- a/book/quarto/contents/vol1/backmatter/appendix_machine.qmd
+++ b/book/quarto/contents/vol1/backmatter/appendix_machine.qmd
@@ -182,7 +182,7 @@ These relationships are governed by physics or arithmetic—they will still be t
 
 #### Energy Hierarchy {.unnumbered}
 
-@tbl-energy-ratios-ref quantifies the energy cost of data movement versus computation—the fundamental reason why arithmetic intensity dominates ML performance optimization.
+@tbl-energy-ratios-ref quantifies the energy cost of data movement versus computation—the fundamental reason why arithmetic intensity dominates ML performance optimization.[^fn-horowitz-energy]
 
 | Relationship | Ratio | Why It's Stable |
 |:-----------------------------|-------------------------------------:|:--------------------------------------|
 | FP32 vs. FP16 energy | ~`{python} fp32_vs_fp16`$\times$ | Halving bits roughly halves energy |
 | L1 SRAM vs. register | ~`{python} l1_vs_reg`$\times$ | Distance to ALU |
 
-: **The Energy Wall.** Moving data costs ~580$\times$ more energy than computing on it. This ratio is physics, not engineering.[^fn-horowitz-energy] {#tbl-energy-ratios-ref}
+: **The Energy Wall.** Moving data costs ~580$\times$ more energy than computing on it. This ratio is physics, not engineering. {#tbl-energy-ratios-ref}
 
 [^fn-horowitz-energy]: Energy numbers from Horowitz's classic "Computing's Energy Problem" (ISSCC 2014, 45nm process). While absolute values scale with process node, the *ratios* between memory access and compute remain remarkably stable because wire capacitance (distance) dominates.
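The ratios in the hunk above can be sanity-checked against Horowitz's published 45nm figures. The constants below are rounded, illustrative assumptions — not the `fp32_vs_fp16` / `l1_vs_reg` variables the book computes — so treat this as a sketch of the arithmetic only:

```python
# Illustrative per-operation energies at 45 nm, in picojoules, after
# Horowitz, "Computing's Energy Problem" (ISSCC 2014). Rounded values
# assumed here for demonstration, not the book's computed variables.
DRAM_READ_PJ = 1300.0  # off-chip DRAM access
FP32_MUL_PJ = 3.7      # 32-bit floating-point multiply
FP32_ADD_PJ = 0.9      # 32-bit floating-point add
FP16_ADD_PJ = 0.4      # 16-bit floating-point add

dram_vs_flop = DRAM_READ_PJ / FP32_MUL_PJ  # O(100x): the "energy wall"
fp32_vs_fp16 = FP32_ADD_PJ / FP16_ADD_PJ   # ~2x: halving bits roughly halves energy

print(f"DRAM access vs. FP32 multiply: ~{dram_vs_flop:.0f}x")
print(f"FP32 add vs. FP16 add:         ~{fp32_vs_fp16:.1f}x")
```

Whatever exact constants one assumes, the point of the caption survives: the memory-access term is hundreds of times larger than the compute term, and the gap comes from wire distance, not circuit design.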
@@ -210,7 +210,7 @@ These relationships are governed by physics or arithmetic—they will still be t
 
 #### Scaling Laws {.unnumbered}
 
-@tbl-scaling-rules-ref collects the arithmetic relationships that govern memory and compute requirements for training and inference.
+@tbl-scaling-rules-ref collects the arithmetic relationships that govern memory and compute requirements for training and inference.[^fn-training-memory]
 
 | Rule | Formula | Example |
 |:------------------------------|:---------------------------------------|:----------------------------------------|
 | Training FLOPs | ~6$\times$ parameters$\times$ tokens | 7B on 1T tokens → $4 \times 10^{22}$ FLOPs |
 | Datacenter vs. edge compute | ~`{python} dc_mobile_ratio`$\times$ | Compute per watt$\times$ power budget |
 
-: **Scaling Rules.** These are arithmetic, not hardware-specific. Training memory includes FP16 weights (2B), FP32 master weights (4B), and Adam optimizer states (8B for momentum + variance).[^fn-training-memory] {#tbl-scaling-rules-ref}
+: **Scaling Rules.** These are arithmetic, not hardware-specific. Training memory includes FP16 weights (2B), FP16 gradients (2B), FP32 master weights (4B), and Adam optimizer states (8B for momentum + variance). {#tbl-scaling-rules-ref}
 
 [^fn-training-memory]: The 16 bytes/parameter rule assumes mixed-precision training with Adam. ZeRO optimization can reduce per-GPU memory by sharding optimizer states across GPUs, but the total memory across all GPUs remains ~16$\times$ parameters.
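The scaling rules this hunk documents are plain arithmetic and easy to verify directly; a minimal check using the table's own example numbers:

```python
# Sanity-check the scaling rules above with plain integer arithmetic.
params = 7 * 10**9  # 7B-parameter model
tokens = 10**12     # 1T training tokens

# Training FLOPs ~ 6 x parameters x tokens
# (~2 FLOPs/param/token forward, ~4 backward)
train_flops = 6 * params * tokens  # 4.2e22, as in the table

# Mixed-precision Adam: ~16 bytes/parameter of training state
# (2B FP16 weights + 2B FP16 gradients + 4B FP32 master + 8B Adam m/v)
bytes_per_param = 2 + 2 + 4 + 8
train_mem_gb = bytes_per_param * params / 10**9

print(f"{train_flops:.1e} FLOPs, ~{train_mem_gb:.0f} GB of training state")
```

The byte breakdown in the comment is the standard mixed-precision accounting the footnote's 16 bytes/parameter rule refers to; for a 7B model it implies roughly 112 GB of training state before activations.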
diff --git a/book/quarto/contents/vol2/backmatter/references.bib b/book/quarto/contents/vol2/backmatter/references.bib
index 86b902546..502f26750 100644
--- a/book/quarto/contents/vol2/backmatter/references.bib
+++ b/book/quarto/contents/vol2/backmatter/references.bib
@@ -7536,6 +7536,21 @@ archiveprefix = {arXiv},
 }
 
+@article{rumelhart1986learning,
+  title = {Learning representations by back-propagating errors},
+  author = {Rumelhart, David E. and Hinton, Geoffrey E. and Williams, Ronald J.},
+  journal = {Nature},
+  publisher = {Springer Science and Business Media LLC},
+  volume = {323},
+  number = {6088},
+  pages = {533--536},
+  doi = {10.1038/323533a0},
+  issn = {0028-0836,1476-4687},
+  url = {https://doi.org/10.1038/323533a0},
+  source = {Crossref},
+  date = {1986-10},
+}
+
 @article{ryan2000self,
   title = {
     Self-determination theory and the facilitation of intrinsic motivation, social development, and
diff --git a/book/quarto/contents/vol2/fault_tolerance/fault_tolerance.qmd b/book/quarto/contents/vol2/fault_tolerance/fault_tolerance.qmd
index ec0e4f5b1..0d3ac05f9 100644
--- a/book/quarto/contents/vol2/fault_tolerance/fault_tolerance.qmd
+++ b/book/quarto/contents/vol2/fault_tolerance/fault_tolerance.qmd
@@ -1178,7 +1178,7 @@ A common example of a transient fault is a bit flip in the main memory. If an im
 
 These general impacts become particularly pronounced in ML systems, where transient faults can have significant implications during the training phase [@he2023understanding]. ML training involves iterative computations and updates to model parameters based on large datasets. If a transient fault occurs in the memory storing the model weights or gradients[^fn-gradients], it can lead to incorrect updates and compromise the convergence and accuracy of the training process. For example, a bit flip in the weight matrix of a neural network can cause the model to learn incorrect patterns or associations, leading to degraded performance [@wan2021analyzing]. Transient faults in the data pipeline, such as corruption of training samples or labels, can also introduce noise and affect the quality of the learned model.
 
-[^fn-gradients]: **Gradients**: In ML training, gradients are partial derivatives of the loss function with respect to model parameters, computed via backpropagation. Introduced by Rumelhart, Hinton, and Williams in 1986, gradients indicate how to adjust each weight to minimize prediction error. Modern models compute billions of gradient values per training step; a single corrupted gradient can propagate incorrect updates across thousands of parameters, potentially causing training divergence or converging to suboptimal solutions.
+[^fn-gradients]: **Gradients**: In ML training, gradients are partial derivatives of the loss function with respect to model parameters, computed via backpropagation. Introduced by @rumelhart1986learning, gradients indicate how to adjust each weight to minimize prediction error. Modern models compute billions of gradient values per training step; a single corrupted gradient can propagate incorrect updates across thousands of parameters, potentially causing training divergence or converging to suboptimal solutions.
 
 During the inference phase, transient faults can impact the reliability and trustworthiness of ML predictions. If a transient fault occurs in the memory storing the trained model parameters or during the computation of inference results, it can lead to incorrect or inconsistent predictions. For instance, a bit flip in the activation values of a neural network can alter the final classification or regression output [@mahmoud2020pytorchfi]. In safety-critical applications[^fn-safety-critical], these faults can have severe consequences, resulting in incorrect decisions or actions that may compromise safety or lead to system failures [@li2017understanding; @jha2019ml].
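The bit-flip scenarios this hunk describes are easy to reproduce in software; a toy single-event-upset model (the helper name and values are illustrative, not from the chapter's cited tooling):

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 float32 encoding flipped."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y

w = 0.15  # a weight (or gradient) held in memory
low_mantissa = flip_bit(w, 3)   # bits 0-22 (mantissa): small numeric noise
exponent = flip_bit(w, 27)      # bits 23-30 (exponent): orders-of-magnitude change

print(f"original:                {w}")
print(f"mantissa bit 3 flipped:  {low_mantissa!r}")
print(f"exponent bit 27 flipped: {exponent!r}")
```

This is why the impact of a transient fault depends heavily on *which* bit flips: low mantissa flips often pass unnoticed as noise, while exponent or sign flips can corrupt a weight or gradient badly enough to derail training or inference in a single step — the kind of position-targeted injection that tools like the cited PyTorchFI perform systematically.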