Fix image reference and pre-commit auto-fixes

- Rename _regression_testing.png to regression_testing.png for fault_tolerance.qmd
- Collapse extra blank lines (security_privacy, fault_tolerance)
- Prettify pipe tables (appendix_machine)
Vijay Janapa Reddi
2026-03-02 17:21:56 -05:00
parent 5ec92f5e6a
commit 38ec2d66fb
4 changed files with 11 additions and 32 deletions


@@ -922,17 +922,17 @@ node[align=center,left]{Larger Capacity\\ Lower Cost}(L);
The memory hierarchy is the fundamental physical constraint of machine learning systems. @tbl-physical-hierarchy-ref consolidates the physical properties—latency, bandwidth, and energy—across the entire stack.
| **Layer** | **Technology** | **Latency** | **Bandwidth** | **Energy (per 32b)** |
|:---------------------|:---------------|-----------------------------------------------:|--------------------------------------------------:|---------------------------------------------:|
| **Registers** | Flip-Flops | ~0.3 ns | — | `{python} AppendixMachineSetup.e_reg` pJ |
| **L1 Cache** | SRAM | ~`{python} AppendixMachineSetup.l1_ns` ns | — | `{python} AppendixMachineSetup.e_l1` pJ |
| **L2 Cache** | SRAM | ~`{python} AppendixMachineSetup.l2_ns` ns | — | `{python} AppendixMachineSetup.e_l2` pJ |
| **Memory (Local)** | HBM3 | ~`{python} AppendixMachineSetup.hbm_ns` ns | `{python} AppendixMachineSetup.bw_hbm_str` GB/s | `{python} AppendixMachineSetup.e_dram` pJ |
| **Interconnect** | NVLink 4.0 | ~`{python} AppendixMachineSetup.nvlink_ns` ns | `{python} AppendixMachineSetup.bw_nvlink_str` GB/s| ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Host Link** | PCIe Gen5 | ~`{python} AppendixMachineSetup.pcie_ns` ns | `{python} AppendixMachineSetup.bw_pcie_str` GB/s | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **System RAM** | DDR5 | ~100 ns | `{python} AppendixMachineSetup.bw_dram_str` GB/s | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Network (Fabric)** | InfiniBand NDR | ~`{python} AppendixMachineSetup.ib_ns` ns | `{python} AppendixMachineSetup.bw_net_str` GB/s | `{python} AppendixMachineSetup.e_net` pJ |
| **Storage (Local)** | NVMe SSD | ~`{python} AppendixMachineSetup.ssd_ns` ns | `{python} AppendixMachineSetup.bw_ssd_str` GB/s | `{python} AppendixMachineSetup.e_ssd` pJ |
| **Layer** | **Technology** | **Latency** | **Bandwidth** | **Energy (per 32b)** |
|:---------------------|:---------------|----------------------------------------------:|---------------------------------------------------:|-------------------------------------------:|
| **Registers** | Flip-Flops | ~0.3 ns | — | `{python} AppendixMachineSetup.e_reg` pJ |
| **L1 Cache** | SRAM | ~`{python} AppendixMachineSetup.l1_ns` ns | — | `{python} AppendixMachineSetup.e_l1` pJ |
| **L2 Cache** | SRAM | ~`{python} AppendixMachineSetup.l2_ns` ns | — | `{python} AppendixMachineSetup.e_l2` pJ |
| **Memory (Local)** | HBM3 | ~`{python} AppendixMachineSetup.hbm_ns` ns | `{python} AppendixMachineSetup.bw_hbm_str` GB/s | `{python} AppendixMachineSetup.e_dram` pJ |
| **Interconnect** | NVLink 4.0 | ~`{python} AppendixMachineSetup.nvlink_ns` ns | `{python} AppendixMachineSetup.bw_nvlink_str` GB/s | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Host Link** | PCIe Gen5 | ~`{python} AppendixMachineSetup.pcie_ns` ns | `{python} AppendixMachineSetup.bw_pcie_str` GB/s | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **System RAM** | DDR5 | ~100 ns | `{python} AppendixMachineSetup.bw_dram_str` GB/s | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Network (Fabric)** | InfiniBand NDR | ~`{python} AppendixMachineSetup.ib_ns` ns | `{python} AppendixMachineSetup.bw_net_str` GB/s | `{python} AppendixMachineSetup.e_net` pJ |
| **Storage (Local)** | NVMe SSD | ~`{python} AppendixMachineSetup.ssd_ns` ns | `{python} AppendixMachineSetup.bw_ssd_str` GB/s | `{python} AppendixMachineSetup.e_ssd` pJ |
: **Physical Properties of the Memory Hierarchy (c. 2024)**: Consolidating latency, bandwidth, and energy across the memory hierarchy. The hierarchy spans five orders of magnitude in latency and six orders of magnitude in energy per access. For the ML engineer, this table defines the "Silicon Contract": every optimization that moves data one layer higher in the hierarchy delivers an order-of-magnitude dividend in performance. {#tbl-physical-hierarchy-ref}
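The orders-of-magnitude spread in the table can be made concrete with a short sketch. The energy values below are illustrative placeholders, not the `AppendixMachineSetup` constants:

```python
# Illustrative energy accounting for the memory hierarchy.
# Values are representative placeholders (pJ per 32-bit access),
# not the AppendixMachineSetup constants used in the table above.
ENERGY_PJ = {
    "register": 0.1,
    "l1_cache": 1.0,
    "dram_hbm": 30.0,
    "network": 5000.0,
}

def energy_ratio(src: str, dst: str) -> float:
    """How many times more energy one access at `src` costs vs `dst`."""
    return ENERGY_PJ[src] / ENERGY_PJ[dst]

# A single DRAM access costs orders of magnitude more than a register read,
# which is why operand reuse and cache blocking dominate kernel optimization.
print(f"DRAM vs register: {energy_ratio('dram_hbm', 'register'):.0f}x")
```

Each layer crossed multiplies the energy bill, which is the quantitative content of the "Silicon Contract" above.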


@@ -666,7 +666,6 @@ yshift=2mm,fill=none,fit=(CPU)(B1),line width=2.5pt](BB1){};
\end{tikzpicture}
```
:::
Real-world evidence of SDC in production systems confirms these risks. @fig-sdc-jeffdean shows corrupted data blocks accumulating in a shuffle and merge database at Google, where even a small fraction of corrupted blocks can cascade into significant data quality degradation.
@@ -805,7 +804,6 @@ keep name/.style={prefix after command={\pgfextra{\let\fixname\tikzlastnode}}},
\end{tikzpicture}
```
:::
::: {.callout-checkpoint title="Detecting Silent Corruption"}
@@ -831,7 +829,6 @@ Byzantine failures are particularly dangerous in distributed training because th
::: {#fig-failure-types fig-env="figure" fig-pos="htb" fig-cap="**Fail-Stop vs. Byzantine Failures**. In the fail-stop model (left), a failed worker simply ceases to send messages, which is easily detected by timeouts. In the Byzantine model (right), a failed worker continues to participate but sends incorrect data (e.g., corrupted gradients reported as valid), which can poison the global model state if not detected by redundant validation." fig-alt="Side-by-side diagrams. Left: fail-stop failure with worker W2 silent, dashed timeout arrow to coordinator. Right: Byzantine failure with W2 sending incorrect gradient 9.9 while W1 sends valid 0.5, resulting in poisoned update warning."}
![](images/svg/failure-types.svg)
:::
Detection of Byzantine failures requires redundant computation. Multiple workers computing gradients for the same data enable comparison of results. Statistical outlier detection can identify workers consistently producing anomalous gradients. These detection mechanisms add computational overhead and may not catch subtle corruption.
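The statistical outlier detection described above can be sketched with a robust z-score over reported gradient norms; the threshold and the example norms are hypothetical:

```python
import statistics

def flag_byzantine(grad_norms, z_thresh=3.0):
    """Flag workers whose gradient norms deviate strongly from the fleet.

    Uses a robust z-score (median / MAD) so that a single Byzantine
    worker cannot shift the baseline it is compared against.
    """
    med = statistics.median(grad_norms)
    mad = statistics.median(abs(g - med) for g in grad_norms) or 1e-12
    return [i for i, g in enumerate(grad_norms)
            if abs(g - med) / (1.4826 * mad) > z_thresh]

# Worker 2 reports a wildly inflated gradient norm, e.g. from corrupted data.
norms = [0.51, 0.49, 9.9, 0.50, 0.52]
print(flag_byzantine(norms))  # flags index 2
```

The median/MAD baseline is the key design choice: a mean/standard-deviation z-score would itself be skewed by the corrupted value it is trying to detect.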
@@ -879,7 +876,6 @@ Verify your understanding of how failure domains nest and their operational impa
::: {#fig-failure-domains fig-env="figure" fig-pos="htb" fig-cap="**Hierarchy of Failure Domains**. Failure domains are often nested or overlapping. A GPU failure affects one device. A node failure affects 8 GPUs. A rack switch failure affects 32-64 GPUs. A power distribution unit (PDU) failure may affect multiple racks. Effective fault tolerance requires placing replicas across independent domains (e.g., different racks or rows) to survive correlated failures." fig-alt="Nested rectangles showing failure domain hierarchy. Region contains Zone A and Zone B. Each zone contains a rack with switch and PDU. Each rack contains a node with OS and PCIe. Each node contains multiple GPUs. Annotation explains containment."}
![](images/svg/failure-domains.svg)
:::
### The Bathtub Curve and Hardware Lifecycle {#sec-fault-tolerance-reliability-reliability-bathtub-curve-hardware-lifecycle-7d8a}
@@ -901,7 +897,6 @@ The practical implication for ML systems is that fleet-wide failure rates depend
::: {#fig-bathtub-curve fig-env="figure" fig-pos="htb" fig-cap="**The Bathtub Curve**. Hardware failure rates $\lambda(t)$ vary over time. (1) **Infant Mortality**: High failure rate initially due to manufacturing defects. (2) **Useful Life**: Constant, low failure rate where random failures dominate. (3) **Wear-Out**: Increasing failure rate as components age. Burn-in testing aims to filter out infant mortality failures before deployment." fig-alt="Line graph of failure rate versus component age showing bathtub shape. Three phases: infant mortality with high decreasing rate, useful life with constant low rate, and wear-out with increasing rate. Vertical dashed line marks burn-in period."}
![](images/svg/bathtub-curve.svg)
:::
Proactive maintenance strategies aim to replace components approaching wear-out before they fail in production. Predictive analytics using GPU telemetry can identify components likely to fail soon. Temperature trends, error counts, and performance degradation enable scheduled replacement during maintenance windows rather than unplanned outages during training runs.
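The three bathtub phases can be illustrated with the Weibull hazard function, a standard reliability model; the shape and scale parameters below are arbitrary examples, not fleet measurements:

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate: lambda(t) = (k/eta) * (t/eta)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# shape k < 1: decreasing hazard (infant mortality)
# shape k = 1: constant hazard (useful life, random failures)
# shape k > 1: increasing hazard (wear-out)
for k, label in [(0.5, "infant mortality"), (1.0, "useful life"), (3.0, "wear-out")]:
    early = weibull_hazard(100, k, 1000)
    late = weibull_hazard(5000, k, 1000)
    trend = "decreasing" if late < early else ("constant" if late == early else "increasing")
    print(f"{label:16s} (k={k}): hazard is {trend}")
```

Superimposing these three regimes over a component's life produces the bathtub shape; burn-in testing amounts to operating in the k < 1 regime before deployment.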
@@ -1038,7 +1033,6 @@ cell/.style={draw=BrownLine,line width=0.5pt, minimum size=\cellsize,
\end{tikzpicture}}
```
:::
Transient faults encompass several distinct categories: Single Event Upsets (SEUs) from cosmic rays and ionizing radiation, voltage fluctuations [@reddi2013resilient] from power supply instability, electromagnetic interference (EMI), electrostatic discharge (ESD), crosstalk, ground bounce, timing violations, and soft errors in combinational logic [@mukherjee2005soft].
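A single event upset can be simulated by flipping one bit of a float32 value; this minimal sketch shows why a transient fault's impact depends entirely on which bit it strikes:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Simulate a single-event upset: flip one bit of a float32 value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

w = 0.125
print(flip_bit(w, 0))   # low mantissa bit: negligible change
print(flip_bit(w, 30))  # high exponent bit: catastrophic magnitude change
```

A flip in a low mantissa bit perturbs a weight by one part in millions, while a flip in a high exponent bit changes its magnitude by tens of orders, which is why SEU consequences in ML workloads range from invisible to run-destroying.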
@@ -1165,7 +1159,6 @@ node[below,pos=0.91](ULDD4){SAO \textcolor{red}{SA1}}(DD4);
\end{tikzpicture}
```
:::
For ML systems, permanent faults during training cause gradient calculation errors and parameter corruption that persist until hardware replacement, requiring more sophisticated recovery strategies than transient faults demand [@he2023understanding]. Permanent faults in storage can compromise entire training datasets or saved models [@zhang2018analyzing]. Mitigating permanent faults requires integrated fault-tolerant design combining hardware redundancy and error-correcting codes [@kim2015bamboo] with checkpoint and restart mechanisms[^fn-checkpoint-restart-training] [@egwutuoha2013survey]. The **Young-Daly formula**, derived in @sec-data-storage, balances checkpoint overhead against lost computation; the key insight is that increasing $\text{MTBF}$ through hardware hardening yields diminishing returns due to the square-root relationship, so systems must balance investment in hardware reliability against investment in fast checkpointing infrastructure.
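The square-root relationship can be seen in the first-order Young-Daly approximation $\tau_{\text{opt}} = \sqrt{2\, T_{\text{save}} \cdot \text{MTBF}}$; the checkpoint time and MTBF values below are illustrative, not measurements:

```python
import math

def young_daly_interval(t_save_s: float, mtbf_s: float) -> float:
    """First-order Young-Daly optimal checkpoint interval:
    tau_opt = sqrt(2 * T_save * MTBF)."""
    return math.sqrt(2 * t_save_s * mtbf_s)

t_save = 60.0  # hypothetical 60 s checkpoint write
for mtbf_h in (6, 24, 96):  # quadrupling MTBF at each step
    tau = young_daly_interval(t_save, mtbf_h * 3600)
    print(f"MTBF {mtbf_h:3d} h -> checkpoint every {tau / 60:.0f} min")
```

Quadrupling MTBF only doubles the optimal interval, which is the diminishing-returns argument above in numerical form: past a point, money spent hardening hardware buys less than money spent making checkpoints fast.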
@@ -1299,7 +1292,6 @@ font=\usefont{T1}{phv}{m}{n}\small,bluegraph](PBE){Parity bit examples};
\end{tikzpicture}}
```
:::
[^fn-hamming-ecc-origin]: **Hamming Codes (1950)**: Richard Hamming invented error-correcting codes at Bell Labs after repeated frustration with relay computer failures corrupting weekend batch jobs. His SECDED (single-error-correcting, double-error-detecting) scheme uses parity bits at power-of-2 positions to locate errors with $O(\log n)$ overhead. Every modern ECC DRAM module descends from this design, protecting the terabytes of model weights and optimizer state in ML training from the soft errors that would otherwise accumulate silently. \index{Hamming Code!ECC origin}
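The parity-bits-at-power-of-2-positions scheme can be sketched with plain Hamming(7,4) single-error correction (the footnote's SECDED variant adds one further overall parity bit for double-error detection, omitted here for brevity):

```python
def hamming74_encode(d):
    """Encode 4 data bits; parity bits sit at positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword positions 1..7

def hamming74_correct(c):
    """Recompute parities; the syndrome is the 1-based error position."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return c

code = hamming74_encode([1, 0, 1, 1])
code[5] ^= 1  # inject a single bit flip anywhere in the codeword
assert hamming74_correct(code) == hamming74_encode([1, 0, 1, 1])
```

Because each parity bit covers the positions whose binary index contains its power of two, the recomputed parities spell out the error position directly, giving the $O(\log n)$ overhead the footnote mentions.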
@@ -1826,7 +1818,6 @@ yshift=-6mm,fill=cyan!10,fit=(PERSON2)(DISPLAY3),line width=0.75pt](BB2){};
\end{tikzpicture}
```
:::
## Fault Injection Tools and Frameworks {#sec-ft-fault-injection-tools-frameworks}
@@ -1922,7 +1913,6 @@ anchor=north]{\textbf{System-level masking effect analysis}};
\end{tikzpicture}
```
:::
### Hardware-Based Fault Injection {#sec-ft-hardwarebased-fault-injection}
@@ -2164,7 +2154,6 @@ When assumptions are violated, the optimal interval may shift significantly. As
::: {#fig-checkpoint-recovery-timeline fig-env="figure" fig-pos="htb" fig-cap="**Checkpoint-Recovery Timeline**. A training run proceeds through alternating phases of computation (green) and checkpoint writes (blue). When a failure occurs (red lightning bolt), all work since the last completed checkpoint is lost (hatched gray). Recovery involves job restart overhead, checkpoint loading, and pipeline warmup before productive training resumes. The total cost of a failure includes both the lost work and the recovery latency." fig-alt="Horizontal Gantt chart showing training phases in green, checkpoint writes in blue, a failure point with red marker, gray hatched lost work region, and orange recovery phase before training resumes."}
![](images/svg/checkpoint-recovery-timeline.svg)
:::
The timeline in @fig-checkpoint-recovery-timeline reveals why $T_{\text{restart}}$ matters as much as $T_{\text{save}}$: the total failure cost is the sum of lost work (bounded by $\tau_{\text{opt}}$) and recovery time, which includes job scheduling, checkpoint loading, and pipeline warmup. Production systems where $T_{\text{restart}}$ exceeds $T_{\text{save}}$ by 3--5$\times$ should use the modified formula that accounts for both terms.
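A simplified illustration of why recovery latency matters (this is an assumed first-order accounting, not the modified formula itself): the average per-failure cost is roughly half a checkpoint interval of lost work plus the full recovery latency.

```python
import math

def expected_failure_cost(tau_s: float, t_restart_s: float) -> float:
    """Average cost of one failure: half a checkpoint interval of lost
    work plus the full recovery latency (restart + load + warmup)."""
    return tau_s / 2 + t_restart_s

# Young-Daly interval for a hypothetical 60 s save and 24 h MTBF.
tau = math.sqrt(2 * 60 * 24 * 3600)
for t_restart in (60, 300):  # restart at 1x vs 5x the save time
    cost = expected_failure_cost(tau, t_restart)
    print(f"T_restart {t_restart:3d} s -> {cost / 60:.1f} min lost per failure")
```

When $T_{\text{restart}}$ grows to several multiples of $T_{\text{save}}$, it becomes a fixed tax on every failure that checkpointing more often cannot reduce, so it must enter the interval optimization as its own term.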
@@ -2638,7 +2627,6 @@ Suppose a 1,024-GPU training job loses an 8-GPU node to a hardware fault, but th
::: {#fig-elastic-flow fig-env="figure" fig-pos="htb" fig-cap="**Elastic Training Recovery**. Unlike static training which aborts on failure, elastic training adapts. When a worker fails, the job pauses, redistributes the dataset and model shards across the remaining $N-1$ workers, and resumes training from the last consistent state. This capability transforms hard failures into temporary throughput degradations." fig-alt="Flowchart with 5 steps and decision diamond. Training on N GPUs flows to monitor alert diamond. On failure: pause training, rescale batch and learning rate to N-1 GPUs, resume training. Loop returns to monitoring. Annotation highlights key step."}
![](images/svg/elastic-flow.svg)
:::
Elastic training provides several advantages: for fault tolerance, failures reduce the worker count rather than stopping training; for resource efficiency, jobs can run on variable resource allocations; for preemption handling, shared-cluster preemptions are absorbed gracefully; and for cost optimization, capacity can scale with spot instance availability.
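The "rescale batch and learning rate" step in @fig-elastic-flow can be sketched as follows; the linear learning-rate scaling rule and the example sizes are assumptions for illustration, not the only valid policy:

```python
def rescale(global_batch: int, lr: float, n_new: int):
    """Redistribute a fixed global batch over the surviving workers and
    scale the learning rate linearly with any residual batch change
    (assumes the linear scaling rule; other schedules are possible)."""
    per_worker = max(1, round(global_batch / n_new))
    new_global = per_worker * n_new          # actual achievable batch
    new_lr = lr * new_global / global_batch  # linear scaling rule
    return per_worker, new_global, new_lr

# A 1,024-GPU job loses an 8-GPU node and continues on 1,016 GPUs.
print(rescale(global_batch=4096, lr=0.1, n_new=1016))
```

Integer per-worker batch sizes mean the global batch cannot always be preserved exactly; the learning-rate adjustment compensates for the residual difference so optimization dynamics stay approximately unchanged.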
@@ -2836,7 +2824,6 @@ In **active-active replication** (@fig-serving-redundancy, left), all replicas a
::: {#fig-serving-redundancy fig-env="figure" fig-pos="htb" fig-cap="**Serving Redundancy Strategies**. Comparison of Active-Active vs. Active-Passive replication. Active-Active (left) distributes load across all replicas, maximizing utilization but requiring capacity headroom to absorb failures. Active-Passive (right) keeps a standby replica idle and synchronized via heartbeat, simplifying failover logic at the cost of idle resource utilization." fig-alt="Two diagrams comparing replication strategies. Left: active-active with load balancer sending 50% to each of two green replicas. Right: active-passive with load balancer sending 100% to primary while dashed standby receives heartbeat sync."}
![](images/svg/serving-redundancy.svg)
:::
Load is distributed across all replicas, so the failure of one increases load on those that remain. This approach maximizes resource utilization but requires enough spare capacity in the surviving replicas to absorb the redistributed load.
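The capacity-headroom requirement can be quantified: after $f$ of $N$ active-active replicas fail, each survivor carries $N/(N-f)$ of its nominal load. A minimal sketch:

```python
def required_headroom(n_replicas: int, failures_tolerated: int = 1) -> float:
    """Per-replica load multiplier after `failures_tolerated` replicas fail:
    each survivor absorbs N / (N - f) of its nominal load."""
    survivors = n_replicas - failures_tolerated
    return n_replicas / survivors

# With 2 replicas, each survivor must handle 2x nominal load (100% headroom);
# with 10 replicas, only ~1.11x (11% headroom): larger fleets amortize failures.
for n in (2, 4, 10):
    print(f"N={n:2d}: {required_headroom(n):.2f}x per-replica capacity needed")
```

This is why active-active deployments with few replicas must be heavily over-provisioned, while large fleets can run closer to their capacity limits.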


@@ -840,7 +840,6 @@ node[left=5pt,pos=0.85,black,fill=magenta!20,circle,inner sep=1pt]{2}(EKV4.north
\end{tikzpicture}
```
:::
### Insufficient Isolation: Jeep Cherokee Hack {#sec-security-privacy-insufficient-isolation-jeep-cherokee-hack-6a7c}
@@ -1068,7 +1067,6 @@ anchor=north]{Lifecycle};
\end{tikzpicture}}
```
:::
Machine learning models are not solely passive victims of attack; in some cases, they can be employed as components of an attack strategy. Pretrained models, particularly large generative or discriminative networks, may be adapted to automate tasks such as adversarial example generation, phishing content synthesis[^fn-phishing-ai], or protocol subversion. Open-source or publicly accessible models can be fine-tuned for malicious purposes, including impersonation, surveillance, or reverse-engineering of secure systems.
@@ -1150,7 +1148,6 @@ anchor=north]{\textbf{Exact Model Theft}};
\end{tikzpicture}
```
:::
::: {.callout-war-story title="The BERT Model Extraction"}
@@ -1652,7 +1649,6 @@ Box4/.style={Box, draw=OrangeLine,fill=OrangeL!50,text width=43mm}
\end{tikzpicture}}
```
:::
With the attack taxonomy established, the following *knowledge check* tests your ability to distinguish between *model attack* types.
@@ -1878,7 +1874,6 @@ Ultimately, supply chain risks must be treated as a first-class concern in ML sy
::: {#fig-hw-supply-chain fig-env="figure" fig-pos="htb" fig-cap="**Hardware Supply Chain Attack Surface**. Vulnerabilities exist at every stage of the hardware lifecycle. Unlike software, which can be patched remotely, hardware compromises often require physical replacement. Attackers can introduce design flaws, insert Trojan circuits during fabrication, substitute inferior components during assembly, or tamper with devices during distribution." fig-alt="Linear flowchart with five stages: Design, Fabrication, Assembly, Distribution, Deployment. Attack vectors labeled above each stage show vulnerabilities at each point in the hardware lifecycle."}
![](images/svg/hw-supply-chain.svg)
:::
### Case Study: Supermicro Controversy {#sec-security-privacy-case-study-supermicro-controversy-72b7}
@@ -2070,7 +2065,6 @@ A structured framework for layered defense in ML systems progresses from data-ce
::: {#fig-defense-stack fig-env="figure" fig-pos="htb" fig-cap="**Layered Defense Stack**: Machine learning systems require multi-faceted security strategies that progress from foundational hardware protections to data-centric privacy techniques, building trust across all layers. This architecture integrates safeguards at the data, model, runtime, and infrastructure levels to mitigate threats and ensure robust deployment in production environments." fig-alt="Four-layer defense stack. Bottom to top: hardware security (TEEs, HSMs, PUFs), system security (integrity, monitoring), model security (encryption), data privacy (differential privacy, federated learning)."}
![](images/svg/defense-stack.svg)
:::
### Privacy-Preserving Data Techniques {#sec-security-privacy-privacypreserving-data-techniques-64f8}
@@ -2603,7 +2597,6 @@ channelcolor=red!79!black!90, Linewidth=1.0pt}};
\end{tikzpicture}
```
:::
This architecture underpins the secure deployment of machine learning applications on consumer devices. For example, Apple's Face ID system uses a secure enclave to perform facial recognition entirely within a hardware-isolated environment. The face embedding model is executed inside the enclave, and biometric templates are stored in secure nonvolatile memory accessible only via the enclave's I²C interface. During authentication, input data from the infrared camera is processed locally, and no facial features or predictions ever leave the secure region. Even if the application processor or operating system is compromised, the enclave prevents access to sensitive model inputs, parameters, and outputs—ensuring that biometric identity remains protected end to end.
@@ -2680,7 +2673,6 @@ Box6/.style={Box,draw=OrangeLine,fill=OrangeL!50},
\end{tikzpicture}
```
:::
A well-known real-world implementation of Secure Boot appears in Apple's Face ID system, which uses advanced machine learning for facial recognition. For Face ID to operate securely, the entire device stack, from the initial power-on to the execution of the model, must be verifiably trusted.