Renames Volume II parts and refines content for clarity

Renames Volume II parts from V-VIII to I-IV, updating all corresponding references in the about section, volume introduction, and individual part principle files.

Refines various textual elements across the book for improved conciseness and readability. Cleans up markdown formatting, including removal of unnecessary horizontal rules and empty code blocks. Adjusts footnote placement for better consistency.

Adds new reliability calculation parameters and corrects a tikz diagram rendering issue.
Vijay Janapa Reddi
2026-02-27 18:00:41 -05:00
parent b256ce1296
commit 30f4cb1453
13 changed files with 72 additions and 68 deletions

View File

@@ -34,10 +34,10 @@ This textbook is organized into two volumes following the **Hennessy & Patterson
**Volume II: Machine Learning Systems at Scale** extends these foundations into production-scale systems:
- **The Fleet** (Part V): Master the physical infrastructure, from datacenter architecture and accelerators to high-performance networking and storage.
- **Distributed ML** (Part VI): Address the algorithms of scale, including distributed training paradigms, collective communication, and fault tolerance.
- **Deployment at Scale** (Part VII): Navigate the shift to production serving, edge intelligence, and operational lifecycles for massive fleets.
- **The Responsible Fleet** (Part VIII): Explore security, privacy, robustness, and the governance frameworks required for global-scale AI.
- **The Fleet** (Part I): Master the physical infrastructure, from datacenter architecture and accelerators to high-performance networking and storage.
- **Distributed ML** (Part II): Address the algorithms of scale, including distributed training paradigms, collective communication, and fault tolerance.
- **Deployment at Scale** (Part III): Navigate the shift to production serving, edge intelligence, and operational lifecycles for massive fleets.
- **The Responsible Fleet** (Part IV): Explore security, privacy, robustness, and the governance frameworks required for global-scale AI.
**Laboratory exercises** complement the core content, allowing you to apply concepts with hands-on experience across multiple embedded platforms. Throughout both volumes, **quizzes** provide quick self-checks to reinforce understanding at key learning milestones.
@@ -58,7 +58,7 @@ This textbook assumes the following background:
**Volume II Additional Prerequisites**:
- Volume II assumes completion of Volume I or equivalent background in single-machine ML systems.
- Familiarity with distributed systems concepts (networking, parallelism) is helpful for Part V onward.
- Familiarity with distributed systems concepts (networking, parallelism) is helpful for Part I onward.
### Why ML Systems Requires a New Approach {#sec-book-why-ml-systems-different}
@@ -172,13 +172,13 @@ This work takes you from understanding ML systems conceptually to building and d
1. **Part V: The Fleet**
1. **Part I: The Fleet**
*Build the physical computer.* Master the physical layer of ML systems: datacenter architecture, specialized accelerators (TPUs, H100s), high-bandwidth networking (NVLink, InfiniBand), and massive-scale storage.
2. **Part VI: Distributed ML**
2. **Part II: Distributed ML**
*Master the algorithms of scale.* Coordinate computation across thousands of devices. Understand distributed training strategies (3D parallelism), collective communication primitives, and fault tolerance mechanisms.
3. **Part VII: Deployment at Scale**
3. **Part III: Deployment at Scale**
*Serve the world.* Navigate the shift from single-node serving to global-scale fleets. Deploy intelligence to the edge, manage production lifecycles, and optimize for both throughput and tail latency.
4. **Part VIII: The Responsible Fleet**
4. **Part IV: The Responsible Fleet**
*Shape the future.* Explore security, privacy, robustness, and the governance frameworks required to ensure that global-scale AI systems are safe, fair, and sustainable.
**Hands-On Learning:**
@@ -205,22 +205,22 @@ Building an accurate model represents only part of the challenge. This Part addr
**Volume II: Machine Learning Systems at Scale**
**Part V: The Fleet** asks: *What physical infrastructure powers distributed AI?*
**Part I: The Fleet** asks: *What physical infrastructure powers distributed AI?*
Distributed algorithms need physical hardware. This Part builds the "supercomputer" required for modern ML, examining the datacenter architecture, specialized accelerators (TPUs, H100s, B200s), high-bandwidth interconnects, and massive-scale storage systems that fuel the training fleet.
**Part VI: Distributed ML** asks: *How do we coordinate computation across thousands of devices?*
**Part II: Distributed ML** asks: *How do we coordinate computation across thousands of devices?*
Scaling requires fundamentally different algorithms. This Part establishes the foundations of distributed training strategies (3D parallelism), the collective communication primitives that synchronize them, and the fault tolerance mechanisms that keep them running despite routine hardware failures.
**Part VII: Deployment at Scale** asks: *How do we serve predictions to millions of users globally?*
**Part III: Deployment at Scale** asks: *How do we serve predictions to millions of users globally?*
Training is only half the battle. This Part addresses the challenges of serving models at scale, from low-latency datacenter inference using paged attention and continuous batching to resource-constrained edge intelligence, and the MLOps practices needed to manage massive production fleets.
**Part VIII: The Responsible Fleet** asks: *How do we ensure global-scale AI systems are secure, robust, and sustainable?*
**Part IV: The Responsible Fleet** asks: *How do we ensure global-scale AI systems are secure, robust, and sustainable?*
Scale amplifies risks. This Part tackles the critical requirements of production at the "fleet" level: protecting privacy and security, ensuring robustness against distribution shifts, and governing AI to ensure beneficial societal impact.
### Suggested Reading Paths {#sec-book-suggested-reading-paths-10b8}
- **Beginners**: Start with Volume I, Parts I and II to build conceptual understanding and workflow skills. Add laboratory exercises for hands-on experience before advancing to optimization topics.
- **Practitioners**: Complete Volume I for core competencies, then focus on Volume II Parts V through VII for distributed systems and production deployment insights relevant to industry work.
- **Practitioners**: Complete Volume I for core competencies, then focus on Volume II Parts I through III for distributed systems and production deployment insights relevant to industry work.
- **Researchers**: Work through Volume I for foundations, then explore Volume II for advanced topics including distributed training, robust systems, and emerging frontiers.
- **Graduate Courses**: Volume I serves as a complete semester-length course. Volume II provides material for an advanced seminar or second course.
- **Hands-On Learners**: Combine reading from either volume with the laboratory exercises across Arduino, Seeed, Grove Vision, and Raspberry Pi platforms.

View File

@@ -2591,8 +2591,6 @@ The principles and frameworks established in this introduction provide the conce
This book makes a stronger claim: ML systems engineering is not merely a collection of best practices but a *distinct engineering discipline* with its own governing laws. The Iron Law decomposes every inference into data movement, computation, and overhead, three terms rooted in silicon physics that no software optimization can repeal. The Conservation of Complexity guarantees that cost is relocated, never eliminated. The Statistical Drift Invariant ensures that every deployed model decays at a rate determined by the distance between its training distribution and the live world. The Memory Wall sets bandwidth ceilings that faster arithmetic cannot overcome. These are not guidelines; they are physical constraints as permanent as Ohm's law or the speed of light. The twelve invariants developed across the chapters that follow constitute the field's first principled vocabulary, a shared analytical language for reasoning about ML systems from physics rather than from intuition.
::: {.callout-chapter-connection title="From Vision to Architecture"}
```{python}
#| label: deployment-archetypes
#| echo: false
@@ -2671,8 +2669,6 @@ tiny_power_str = DeploymentSystems.tiny_power_str
mem_span_str = DeploymentSystems.mem_span_str
compute_span_str = DeploymentSystems.compute_span_str
```
```
::: {.callout-perspective #perspective-deployment-archetypes title="The ML Systems Landscape: Four Archetypes"}
@@ -2689,6 +2685,8 @@ The machine learning systems landscape spans nine orders of magnitude in computa
:::
::: {.callout-chapter-connection title="From Vision to Architecture"}
Where should an ML model actually run? The answer is not "wherever is most convenient." Physical laws dictate what is possible.
The speed of light makes distant cloud servers useless for emergency braking. Thermodynamics prevents datacenter-class models from running in your pocket. Memory physics creates bandwidth ceilings that faster chips cannot overcome. @sec-ml-systems introduces the four deployment paradigms (Cloud, Edge, Mobile, and TinyML) that span nine orders of magnitude in power and memory, explaining why each exists and how to choose among them.

View File

@@ -799,8 +799,6 @@ This is not a failure of engineering---it is the physics of fleet-scale computat
:::
---
## Thermal and Power Physics {#sec-fleet-foundations-thermal-power}
Compute performance ultimately converts to heat, and heat must be removed. At fleet scale, thermal and power constraints are not secondary concerns---they determine where you can build, how densely you can pack accelerators, and what your operating costs will be. A cluster that is architecturally sound but thermally infeasible cannot be built.

View File

@@ -91,6 +91,15 @@ class ReliabilityFoundations:
t_detect = HEARTBEAT_TIMEOUT_S
t_reschedule = RESCHEDULE_TIME_S
ckpt_write_bw = CHECKPOINT_WRITE_BW_GBS
model_sizes_params = [7e9, 13e9, 70e9, 175e9, 1e12]  # parameter counts for each model scale
model_labels = ["7B", "13B", "70B", "175B", "1T"]  # display labels matching model_sizes_params
overhead_ckpt = 0.05  # checkpoint overhead budget (fraction of compute time)
overhead_failure = 0.05  # failure-recovery overhead budget (fraction of compute time)
avail_replicas = [1, 2, 3, 4, 5]  # replica counts for the availability sweep
avail_single = 0.99  # availability of a single replica
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# Step 1: Level 4: Ratios & Rework (MTBF Cascade)
@@ -162,6 +171,7 @@ tau_opt_min_str = fmt(R.tau_opt_min, precision=1)
ckpt_175b_gb_str = fmt(R.ckpt_175b_gb, precision=0)
ckpt_write_time_str = fmt(R.ckpt_write_time_s, precision=1)
t_recovery_str = fmt(R.t_recovery_total_s, precision=1)
```
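The derived strings above (`tau_opt_min_str`, `ckpt_write_time_str`, `t_recovery_str`) point at the classic checkpoint-interval tradeoff. A minimal sketch of that calculation, assuming Young's approximation and illustrative constants (the class above defines the values actually used):

```python
import math

# Illustrative constants (assumed; not the chapter's exact values)
mtbf_cluster_h = 4.0   # cluster-level MTBF in hours
ckpt_gb = 2100         # ~175B params x 12 bytes/param (weights + optimizer state)
write_bw_gbs = 50      # aggregate checkpoint write bandwidth, GB/s

ckpt_write_s = ckpt_gb / write_bw_gbs  # time to persist one checkpoint
# Young's approximation: optimal interval ~ sqrt(2 * checkpoint_cost * MTBF)
tau_opt_s = math.sqrt(2 * ckpt_write_s * mtbf_cluster_h * 3600)
print(f"write: {ckpt_write_s:.0f} s, optimal interval: {tau_opt_s / 60:.1f} min")
```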
## Failure Probability at Scale {#sec-reliability-foundations-failure-probability}
@@ -312,7 +322,7 @@ dur_labels = ["1 Day", "1 Week", "30 Days"]
fp_data = {}
for n_gpus in R.cluster_sizes:
row = []
for dur_h in R.job_durations_hours:
for dur_h in [24, 168, 720]:  # hours: 1 day, 1 week, 30 days
p = R.p_failure(n_gpus, dur_h)
if p > 0.999:
row.append("> 99.9%")
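# A plausible closed form for R.p_failure, assuming independent exponential
# failure times per device (illustrative; the class definition governs):
#
#   p_failure(n_gpus, dur_h) = 1 - exp(-n_gpus * dur_h / MTBF_device_h)
#
# With a 50,000 h device MTBF, 10,000 GPUs over 720 h gives 1 - exp(-144),
# which lands in the "> 99.9%" bucket above.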

View File

@@ -273,9 +273,6 @@ nvlink_bw_gbs = f"{NVLINK_H100_BW.m_as(GB/second):.0f}"
pcie5_bw_gbs = f"{PCIE_GEN5_BW.m_as(GB/second):.0f}"
```
# (Calculations handled in StorageEconomics class above)
```
## The Fuel Line {#sec-storage-fuel-line}
The previous two chapters built the engine and wired it together. @sec-compute-infrastructure established that a modern accelerator node packs eight GPUs delivering petaFLOPS of aggregate compute, and @sec-network-fabrics showed how InfiniBand fabrics connect thousands of such nodes at hundreds of gigabits per second. An engine without fuel is expensive sculpture. The fleet also needs *fuel*: training data, model weights, optimizer state, and intermediate checkpoints. This chapter asks how to deliver that fuel fast enough that 1,000 accelerators never starve.
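To make "never starve" concrete, a back-of-envelope sketch with assumed per-accelerator throughput and sample sizes (not the chapter's worked example):

```python
# Aggregate read bandwidth needed to keep the fleet fed (assumed values)
n_gpus = 1_000
samples_per_gpu_s = 500        # per-accelerator training throughput
bytes_per_sample = 600 * 1024  # ~600 KB per preprocessed sample

agg_gbs = n_gpus * samples_per_gpu_s * bytes_per_sample / 1e9
print(f"required aggregate read bandwidth: {agg_gbs:.0f} GB/s")
```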

View File

@@ -60,10 +60,10 @@ This book makes that engineering visible. It gives it vocabulary, principles, an
This book extends the foundations into production-scale systems through four parts that follow the **Fleet Stack** from bottom to top:
- **Part V: The Fleet** — Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- **Part VI: Distributed ML** — Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- **Part VII: Deployment at Scale** — Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- **Part VIII: The Responsible Fleet** — Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.
- **Part I: The Fleet** — Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- **Part II: Distributed ML** — Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- **Part III: Deployment at Scale** — Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- **Part IV: The Responsible Fleet** — Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.
## Prerequisites {.unnumbered}

View File

@@ -22,9 +22,7 @@ engine: jupyter
_Why do the engineering principles that work on single machines break down at production scale?_
Machine learning at scale has a **physics of its own**. In a single node, performance is governed by the memory wall; in a distributed cluster, it is governed by the **Bisection Bandwidth Wall**[^fn-bisection-bandwidth-etymology]. Data must move not just through local hierarchies, but across optical fabrics governed by the speed of light between racks. Hardware failures transition from rare exceptions to routine statistical certainties. *When* a single-GPU training job fails, it is an inconvenience; *when* one node in a 10,000-GPU cluster fails, it can stall the entire "Machine Learning Fleet." This discontinuity explains why mastery of single-machine ML is no longer sufficient for production. Scale is not more of the same—it is fundamentally different engineering terrain requiring different principles, different architectures, and different ways of thinking about what makes systems work. At the same time, large-scale systems have a societal property that small models do not: their impact is amplified by the billions of users they serve. *When* a local model exhibits bias, the harm is contained; *when* a foundation model exhibits bias, it propagates that harm through the fabric of digital life. What is needed is a discipline grounded in the **physics of distribution**, where decisions at the algorithmic level must account for network topology, fault tolerance, and the security of the global "Control Plane."
[^fn-bisection-bandwidth-etymology]: **Bisection Bandwidth** (from graph theory): The minimum aggregate bandwidth of all links that, if cut, would partition the network into two equal-sized sets of nodes. In distributed ML, this "worst-case" cut determines the cluster's synchronization bottleneck: an AllReduce synchronization can move no faster than the bisection bandwidth, regardless of how many GPUs are added. \index{Bisection Bandwidth!etymology}
Machine learning at scale has a **physics of its own**. In a single node, performance is governed by the memory wall; in a distributed cluster, it is governed by the **Bisection Bandwidth Wall**. Data must move not just through local hierarchies, but across optical fabrics governed by the speed of light between racks. Hardware failures transition from rare exceptions to routine statistical certainties. *When* a single-GPU training job fails, it is an inconvenience; *when* one node in a 10,000-GPU cluster fails, it can stall the entire "Machine Learning Fleet." This discontinuity explains why mastery of single-machine ML is no longer sufficient for production. Scale is not more of the same—it is fundamentally different engineering terrain requiring different principles, different architectures, and different ways of thinking about what makes systems work. At the same time, large-scale systems have a societal property that small models do not: their impact is amplified by the billions of users they serve. *When* a local model exhibits bias, the harm is contained; *when* a foundation model exhibits bias, it propagates that harm through the fabric of digital life. What is needed is a discipline grounded in the **physics of distribution**, where decisions at the algorithmic level must account for network topology, fault tolerance, and the security of the global "Control Plane."
::: {.content-visible when-format="pdf"}
\newpage
@@ -329,7 +327,9 @@ The transition from single-machine to distributed training introduces qualitativ
:::
As systems scale beyond a single node, a fundamental physical constraint emerges: the *bisection bandwidth wall*, which limits how fast data can cross the network midpoint. This constraint explains why networking, not just compute, often determines model throughput.
As systems scale beyond a single node, a fundamental physical constraint emerges: the *bisection bandwidth wall*[^fn-bisection-bandwidth-etymology], which limits how fast data can cross the network midpoint. This constraint explains why networking, not just compute, often determines model throughput.
[^fn-bisection-bandwidth-etymology]: **Bisection Bandwidth** (from graph theory): The minimum aggregate bandwidth of all links that, if cut, would partition the network into two equal-sized sets of nodes. In distributed ML, this "worst-case" cut determines the cluster's synchronization bottleneck: an AllReduce synchronization can move no faster than the bisection bandwidth, regardless of how many GPUs are added. \index{Bisection Bandwidth!etymology}
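To see why this bound bites, a minimal sketch with an assumed gradient size and fabric bandwidth:

```python
# Rough lower bound on AllReduce time from bisection bandwidth (assumed values)
grad_gb = 2 * 70e9 / 1e9  # 70B parameters in bf16 -> 140 GB of gradients
bisection_gbs = 400       # assumed bisection bandwidth of the fabric, GB/s

# At least half the reduced data must cross the network midpoint, so:
t_min_s = (grad_gb / 2) / bisection_gbs
print(f"AllReduce lower bound: {t_min_s:.2f} s per synchronization")
```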
On a single GPU, training proceeds deterministically: the same code, data, and random seed produce identical results. At the scale of thousands of GPUs, new phenomena emerge. Network partitions can split clusters into groups that train independently, causing model divergence. Stragglers (workers that process data slower than peers due to hardware variation or thermal throttling) can bottleneck entire training runs. Hardware failures that occur once per machine-year become daily events when operating 10,000 machines[^fn-failure-rates-fleet]. Systems must checkpoint frequently enough that losing a day's progress becomes acceptable rather than catastrophic.
@@ -1470,27 +1470,27 @@ To bridge abstract principles and concrete engineering, this volume employs a lo
This textbook organizes around the **Fleet Stack**\index{Fleet Stack!textbook structure}, progressing from the physical substrate through the logic of distribution to societal governance. Each Part addresses a fundamental "Scale Impediment" that prevents a single-machine solution from working at production scale.
### Part V: The Fleet (Core Infrastructure)
### Part I: The Fleet (Core Infrastructure)
**The Impediment**: *Physical Limits*. No single server has enough memory, power, or cooling to train a frontier model.
* **Compute Infrastructure (@sec-compute-infrastructure)**: Building the engine. Mastering the physics of high-density silicon, liquid cooling, and megawatt-scale power ramp rates.
* **Network Fabrics (@sec-network-fabrics)**: The transmission. Connecting thousands of accelerators through a high-bandwidth "Gradient Bus" that acts as the cluster-scale system bus.
* **Scalable Data Storage (@sec-data-storage)**: The fuel line. Architecting storage hierarchies that can feed terabytes of data to hungry accelerators without stalling the math.
### Part VI: Distributed ML (The Logic of Scale)
### Part II: Distributed ML (The Logic of Scale)
**The Impediment**: *The Coordination Tax*. Splitting math across machines creates synchronization bottlenecks and frequent failures.
* **Distributed Training (@sec-distributed-training-systems)**: Splitting the math. Strategies for partitioning 100--trillion-parameter models across thousands of GPUs.
* **Collective Communication (@sec-collective-communication)**: The traffic control. Implementing the coordination algorithms (AllReduce, AllGather) that bind independent nodes into a coherent computer.
* **Fault Tolerance (@sec-fault-tolerance-reliability)**: The immune system. Engineering for a regime where hardware fails every few hours, making recovery speed more important than uptime.
* **Fleet Orchestration (@sec-fleet-orchestration)**: The resource negotiator. Managing multi-tenant clusters where "Gang Scheduling" is required to prevent deadlocks. Unified frameworks like **Ray**\index{Ray} [@moritz2018ray] enable flexible scaling of these distributed workloads across diverse hardware resources.
### Part VII: Deployment at Scale (The Serving Pipeline)
### Part III: Deployment at Scale (The Serving Pipeline)
**The Impediment**: *Operational Economics*. Inference costs eventually dwarf training costs, requiring a fundamental shift from throughput to latency optimization.
* **Inference at Scale (@sec-inference-scale)**: The interface. Serving models to millions of users simultaneously while managing the "KV Cache Wall."
* **Performance Engineering (@sec-performance-engineering)**: The efficiency frontier. Closing the gap between hardware peak and actual throughput through kernel fusion and compilation. Innovations like **FlashAttention**\index{FlashAttention} [@dao2022flashattention] demonstrate how I/O-aware kernels can improve performance 24$\times$ on memory-bound workloads.
* **Edge Intelligence (@sec-edge-intelligence)**: The frontier. Moving intelligence from the datacenter to the user's device, constrained by milliwatt power budgets.
* **Operations at Scale (@sec-ops-scale)**: The control plane. Monitoring the fleet's health, drift, and performance across global deployments.
### Part VIII: The Responsible Fleet (The Governance Layer)
### Part IV: The Responsible Fleet (The Governance Layer)
**The Impediment**: *Societal Impact*. At global scale, technical bugs become societal hazards, requiring governance as a first-class engineering invariant.
* **Security & Privacy (@sec-security-privacy)**: The armor. Defending the fleet against adversaries who seek to poison data or extract proprietary weights.
* **Robustness (@sec-robust-ai)**: The resilience. Ensuring models survive the chaotic, non-I.I.D. reality of the open world.

View File

@@ -43,7 +43,7 @@ start_chapter("vol2:ops_scale")
\mlfleetstack{25}{35}{100}{40}
\end{marginfigure}
_Why do the practices that work for managing one model collapse when organizations deploy hundreds?_
_The practices that work for managing one model collapse when organizations deploy hundreds._
One model is a project. A hundred models is a system of systems, where interactions, dependencies, and failures cascade in ways that per-model practices cannot anticipate or contain. A data pipeline change affects twelve models built by four teams, but no single team owns the impact assessment. A deployment failure requires coordinating rollbacks across interconnected services. Monitoring dashboards multiply until alert fatigue makes them useless. The practices that let a single team manage a single model—manual deployment, ad-hoc monitoring, spreadsheet tracking—become organizational liabilities at scale. MLOps at scale is the recognition that model management must become infrastructure: shared platforms with consistent APIs, automated pipelines that enforce quality gates, monitoring systems that aggregate signals across the fleet, and governance frameworks that track dependencies between artifacts nobody remembers creating. Without this infrastructure, organizations drown in operational complexity while their ML investments depreciate.
@@ -537,21 +537,21 @@ Recommendation systems operate at the opposite end of the operational spectrum.
In response to these dynamics, operational patterns for recommendation systems emphasize continuous training pipelines that produce daily or weekly model updates, interleaving experiments that compare multiple model variants on the same requests, rapid iteration cycles where changes can reach production within hours, and statistically rigorous A/B testing infrastructure. A key metric that captures this operational urgency is *feature freshness latency*, which measures how quickly user actions propagate into the model's predictions.
::: {.callout-example title="Feature Freshness Latency"}
**The Problem**: A user clicks a "Basketball" video. How long until their feed shows more basketball content? This delay is the **Feature Freshness Latency**.
The problem: A user clicks a "Basketball" video. The time required for their feed to show more basketball content is the **Feature Freshness Latency**.
**Formula**:
The formula is:
$$ L_{\text{freshness}} = T_{\text{available}} - T_{\text{event}} $$
**Scenario**:
Consider the scenario:
* **Batch Pipeline (Daily)**: Events are aggregated at midnight.
* Batch Pipeline (Daily): Events are aggregated at midnight.
* $L_{\text{freshness}} \approx 12\text{--}24 \text{ hours}$.
* **Impact**: User leaves session before recommendations update.
* **Streaming Pipeline (Real-time)**: Events flow through Kafka/Flink to Feature Store.
* Streaming Pipeline (Real-time): Events flow through Kafka/Flink to Feature Store.
* $L_{\text{freshness}} \approx 1\text{--}5 \text{ seconds}$.
* **Impact**: Next page load reflects the interest.
**Conclusion**: For session-based recommendations, moving from Batch ($L \approx 24h$) to Streaming ($L \approx 5s$) often yields a **10--20% lift in engagement**, justifying the increased infrastructure cost.
Therefore, for session-based recommendations, moving from Batch ($L \approx 24h$) to Streaming ($L \approx 5s$) often yields a **10--20% lift in engagement**, justifying the increased infrastructure cost; a minimal sketch of the latency arithmetic follows this callout.
:::
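A minimal sketch of the callout's latency arithmetic, using hypothetical timestamps:

```python
from datetime import datetime, timedelta

def freshness_latency(t_event: datetime, t_available: datetime) -> timedelta:
    """L_freshness = T_available - T_event."""
    return t_available - t_event

click = datetime(2026, 2, 27, 14, 0, 0)  # user clicks a "Basketball" video
batch = freshness_latency(click, datetime(2026, 2, 28, 0, 0, 0))  # midnight batch run
stream = freshness_latency(click, click + timedelta(seconds=3))   # streaming pipeline
print(f"batch: {batch}, streaming: {stream}")
```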
The key insight is that recommendation operations is fundamentally about ensemble management. A single recommendation request might invoke 10 to 50 distinct models, each requiring its own update cadence while maintaining coherent behavior as a system.
@@ -684,7 +684,7 @@ The decision to establish a platform team typically occurs when organizations re
Platform operations provide superlinear returns on investment. As model count grows, the value of shared infrastructure increases faster than its cost, creating increasingly favorable economics for platform investments. Organizations that delay platform investment accumulate operational debt that becomes progressively more expensive to address.
:::
While the economic justification for platform operations becomes clear at scale, the technical implementation begins with a deceptively difficult problem: how do we prevent independent models from colliding? Multi-model management requires untangling the hidden dependencies that emerge when hundreds of models share the same data, infrastructure, and user experiences.
While the economic justification for platform operations becomes clear at scale, the technical implementation begins with a deceptively difficult problem: preventing independent models from colliding. Multi-model management requires untangling the hidden dependencies that emerge when hundreds of models share the same data, infrastructure, and user experiences.
## Multi-Model Management {#sec-ml-operations-scale-multimodel-management-d8ac}
@@ -1267,7 +1267,7 @@ Canary deployment[^fn-canary-deploy] routes a small percentage of traffic to the
Typical progression: 1% → 5% → 25% → 50% → 100%
The key question is: how long should each stage last? @eq-canary-duration relates stage duration to sample requirements, request rate, and traffic percentage, enabling precise calculation of minimum canary durations for statistical validity:
Determining the duration of each stage is critical. @eq-canary-duration relates stage duration to sample requirements, request rate, and traffic percentage, enabling precise calculation of minimum canary durations for statistical validity:
$$t_{stage} = \frac{n_{samples\_needed}}{r_{requests} \times p_{stage}}$$ {#eq-canary-duration}
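Applying @eq-canary-duration with assumed sample and traffic numbers shows how quickly early stages become the bottleneck:

```python
def canary_stage_hours(n_samples_needed: int, requests_per_hour: float, p_stage: float) -> float:
    """t_stage = n_samples_needed / (r_requests * p_stage), per @eq-canary-duration."""
    return n_samples_needed / (requests_per_hour * p_stage)

# Assumed: 10,000 samples for statistical power at 100,000 requests/hour
for p in (0.01, 0.05, 0.25, 0.50, 1.00):
    print(f"{p:>4.0%} traffic -> {canary_stage_hours(10_000, 100_000, p):6.2f} h")
```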
@@ -1954,7 +1954,7 @@ A mature CI/CD pipeline ensures that only healthy, verified models reach product
Successful navigation through CI/CD pipelines marks the beginning, not the end, of operational responsibility. Models that pass validation gates and survive canary deployment enter a production environment where gradual degradation, data drift, and emergent interactions can erode performance over weeks or months. The staged rollout strategies and rollback triggers examined in CI/CD detect acute failures during deployment; monitoring systems must detect chronic degradation during operation. At platform scale, where hundreds of models operate simultaneously, this monitoring challenge transforms fundamentally.
@sec-compute-infrastructure established infrastructure monitoring at the hardware and network level, tracking GPU cluster health, network throughput, and power consumption. ML operations monitoring must additionally capture model quality, data drift, and business impact. Monitoring machine learning systems at scale presents challenges fundamentally different from monitoring individual models. When an organization operates hundreds of models, the naive approach of applying single-model monitoring practices to each model independently leads to alert fatigue, missed correlations, and operational chaos. This section develops monitoring strategies appropriate for enterprise-scale ML platforms.
@sec-compute-infrastructure established infrastructure monitoring at the hardware and network level, tracking GPU cluster health, network throughput, and power consumption. ML operations monitoring must additionally capture model quality, data drift, and business impact. Monitoring machine learning systems at scale presents challenges fundamentally different from monitoring individual models. When an organization operates hundreds of models, the naive approach of applying single-model monitoring practices to each model independently leads to alert fatigue, missed correlations, and operational chaos. Monitoring strategies appropriate for enterprise-scale ML platforms require hierarchical aggregation and systemic governance.
### The Alert Fatigue Problem {#sec-ml-operations-scale-alert-fatigue-problem-038c}
@@ -2006,7 +2006,7 @@ daily_alerts_fat_str = FalseAlarmTax.daily_alerts_str
```
::: {.callout-notebook title="The False Alarm Tax"}
**Problem**: You monitor **`{python} n_models_fat` models**. Each model has **`{python} n_metrics_fat` metrics** (latency, accuracy, drift, etc.). You set alert thresholds at **`{python} specificity_fat`** (99.7% specificity). How many false alarms does the on-call engineer wake up to per week?
The problem: Consider a system monitoring **`{python} n_models_fat` models**, each with **`{python} n_metrics_fat` metrics** (latency, accuracy, drift, etc.), and alert thresholds set at **`{python} specificity_fat`** (99.7% specificity). Even at this specificity, the on-call engineer faces a steady stream of false alarms every week.
**The Math**:
@@ -2015,7 +2015,7 @@ daily_alerts_fat_str = FalseAlarmTax.daily_alerts_str
3. **Checks per Day**: Assume checks every 5 minutes (**288 checks/day**).
4. **Daily False Alarms**: $1,000 \times 288 \times 0.003 \approx \mathbf{`{python} daily_alerts_fat_str` alerts/day}$.
**The Systems Conclusion**: Even with "high precision" (3--sigma) alerts, scale kills you. You cannot alert on raw metrics. You *must* use **hierarchical aggregation** (e.g., "Cluster Health" instead of "Node Health") to survive the false alarm tax.
The systems conclusion is that even with "high precision" (3--sigma) alerts, scale overwhelms the on-call rotation: alerting on raw metrics is not feasible. One *must* use **hierarchical aggregation** (e.g., "Cluster Health" instead of "Node Health") to survive the false alarm tax; the sketch below reproduces the arithmetic.
:::
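The callout's arithmetic as a sketch, with the assumed rates spelled out (the 100-model, 10-metric split is illustrative):

```python
# False-alarm tax at fleet scale (rates assumed per the callout above)
n_streams = 100 * 10           # 100 models x 10 metrics = 1,000 metric streams
alpha = 0.003                  # 3-sigma false positive rate per check
checks_per_day = 24 * 60 // 5  # one check every 5 minutes = 288

daily_false_alarms = n_streams * checks_per_day * alpha
p_any_per_sweep = 1 - (1 - alpha) ** n_streams  # P(>= 1 false alert per sweep)
print(f"~{daily_false_alarms:.0f} false alarms/day; P(any per sweep) = {p_any_per_sweep:.0%}")
```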
@eq-false-alert-rate reveals the mathematical inevitability of alert fatigue at scale: for a single metric with false positive rate $\alpha$, the probability of at least one false alert grows exponentially with the number of independent tests $N$:
@@ -2327,7 +2327,7 @@ Interactive analysis tools for incident response:
### Cost Monitoring and Anomaly Detection {#sec-ml-operations-scale-cost-monitoring-anomaly-detection-1976}
Infrastructure costs represent one of the largest operational expenditures for ML platforms, yet cost monitoring often receives less attention than performance or quality metrics. At scale, undetected cost anomalies can accumulate millions of dollars in unexpected charges before manual review catches them. This section develops a quantitative framework for cost anomaly detection that balances sensitivity against false positive rates.
Infrastructure costs represent one of the largest operational expenditures for ML platforms, yet cost monitoring often receives less attention than performance or quality metrics. At scale, undetected cost anomalies can accumulate millions of dollars in unexpected charges before manual review catches them. A quantitative framework for cost anomaly detection balances sensitivity against false positive rates.
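A minimal sketch of such a detector, assuming a simple z-score rule over daily spend (hypothetical numbers):

```python
import statistics

def is_cost_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend when it sits more than z_threshold sigmas from the
    historical mean; raising the threshold trades sensitivity for false positives."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(today - mu) > z_threshold * sigma

daily_spend_usd = [10_200, 9_800, 10_500, 10_100, 9_900, 10_300, 10_000]
print(is_cost_anomaly(daily_spend_usd, 14_750))  # True: far outside 3 sigma
```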
#### Cost Anomaly Detection Metrics
@@ -2517,7 +2517,7 @@ Complete ML platforms provide end-to-end capabilities:
Platforms at this level include Vertex AI, SageMaker, and internal platforms at major technology companies such as TFX[^fn-tfx-production] at Google [@baylor2017tfx] and MLflow [@zaharia2018accelerating]. Model teams interact through high-level APIs while the platform manages all operational concerns.
[^fn-tfx-production]: **TensorFlow Extended (TFX)**: Google's production ML platform, open-sourced in 2019 after years of internal use. TFX's key architectural insight is that data validation (via TFDV) gates the pipeline before training begins, catching schema violations and distribution drift that would otherwise produce silently degraded models. This "fail before you train" philosophy prevents the most expensive class of ML waste: GPU-hours spent training on corrupted data. \index{TFX!data validation}
[^fn-tfx-production]: **TensorFlow Extended (TFX)**: Google's production ML platform, open-sourced in 2019 after years of internal use. TFX's key architectural insight is that data validation (via TFDV) gates the pipeline before training begins, catching schema violations and distribution drift that would otherwise produce silently degraded models. This "fail before training" philosophy prevents the most expensive class of ML waste: GPU-hours spent training on corrupted data. \index{TFX!data validation}
### Self-Service Model Deployment {#sec-ml-operations-scale-selfservice-model-deployment-4753}
@@ -2650,7 +2650,7 @@ Recommendations:
### Advanced Fleet Metrics: ML Productivity Goodput (MPG) {#sec-ml-operations-scale-advanced-fleet-metrics-ml-productivity-goodput-mpg-fabf}
While utilization metrics capture resource busyness, they often fail to reflect true engineering value. A GPU spinning at 100% utilization on a hyperparameter tuning job that eventually fails due to a configuration error is "efficient" in terms of hardware but "wasteful" in terms of productivity. To address this, we introduce **ML Productivity Goodput (MPG)**, a comprehensive metric for fleet efficiency [@kumar2024machine].
While utilization metrics capture resource busyness, they often fail to reflect true engineering value. A GPU spinning at 100% utilization on a hyperparameter tuning job that eventually fails due to a configuration error is "efficient" in terms of hardware but "wasteful" in terms of productivity. To address this, **ML Productivity Goodput (MPG)** provides a comprehensive metric for fleet efficiency [@kumar2024machine].
This metric establishes what we term the **Iron Law of Machine Learning Fleet Efficiency**. Just as the classic Iron Law of Processor Performance [@hennessy2011computer] decomposes CPU execution time, this formulation decomposes ML fleet efficiency into three orthogonal components:
@@ -2878,13 +2878,13 @@ Fine-Tuning injects knowledge into the *model weights* (or adapter weights) duri
Use **RAG** when:
1. **Knowledge Volatility is High**: Facts change hourly/daily (e.g., stock prices, news). Retraining is too slow.
2. **Attribution is Critical**: You need citations for every claim.
2. **Attribution is Critical**: Citations are needed for every claim.
3. **Scale is Small**: The context window cost is manageable.
Use **Fine-Tuning** when:
1. **Domain *Language* is Specialized**: The model needs to learn a new syntax (e.g., medical coding, proprietary SQL dialect), not just facts.
2. **Latency is Critical**: You cannot afford the latency of processing 10k context tokens per query.
2. **Latency is Critical**: The latency of processing 10k context tokens per query is unaffordable.
3. **Style/Voice Consistency**: The model must adhere to strict brand guidelines.
Most mature platforms converge on a hybrid of RAG and Fine-Tuning: Fine-Tuning teaches the model *how* to use the retrieved tools and documents efficiently, while RAG provides the *facts*.
@@ -3782,7 +3782,7 @@ Spotify's ML platform, largely built on top of Kubeflow, is designed to solve th
A central component of their architecture is the **"Paved Road"** for model orchestration. Spotify uses a centralized Model Registry that enforces strict lineage tracking. Every model artifact in production is cryptographically linked to the specific dataset snapshot, code commit, and hyperparameter set used to create it. This rigorous tracking is essential for debugging "silent regressions." If a user complains about repetitive recommendations, an engineer can trace the exact lineage of the responsible model to find that a specific data pipeline upstream was delayed. This lineage system also enables automated "canary deployments." When a new model version is registered, the platform automatically deploys it to a small subset of users (e.g., 1%) and compares business metrics (streams per session) against the control group. If the metrics drop, the rollback is automatic, preventing bad models from ever reaching the full user base.
Latency constraints at Spotify are non-negotiable; playback must feel instantaneous. This forces a design tradeoff where complex inference is often pre-computed. For the "Discover Weekly" playlist, the platform runs massive batch inference jobs on weekends, storing the results in a low-latency key-value store. However, for the "Home" screen, which must react to the song you just listened to, they employ a hybrid approach. User embeddings are updated in near real-time using a streaming pipeline, but the heavy item-item similarity matrices are computed offline. This split architecture allows them to achieve sub-100 ms latency for the Home screen while still incorporating the user's immediate history. By optimizing the "Time to Interactivity," Spotify's platform proves that the best ML infrastructure is invisible---delivering complex personalization so fast that it feels like a static page load.
Latency constraints at Spotify are non-negotiable; playback must feel instantaneous. This forces a design tradeoff where complex inference is often pre-computed. For the "Discover Weekly" playlist, the platform runs massive batch inference jobs on weekends, storing the results in a low-latency key-value store. However, for the "Home" screen, which must react to the song the user just listened to, they employ a hybrid approach. User embeddings are updated in near real-time using a streaming pipeline, but the heavy item-item similarity matrices are computed offline. This split architecture allows them to achieve sub-100 ms latency for the Home screen while still incorporating the user's immediate history. By optimizing the "Time to Interactivity," Spotify's platform proves that the best ML infrastructure is invisible---delivering complex personalization so fast that it feels like a static page load.
At Spotify, non-negotiable latency constraints force a design tradeoff where complex inference is pre-computed, demonstrating how strict operational requirements shape system architecture. Yet even the most mature platforms, from Uber's Michelangelo to Spotify's orchestrators, face catastrophic failures that require rigorous, systematic incident response.
@@ -4105,12 +4105,12 @@ ML Operations at scale is the "nervous system" of the Machine Learning Fleet. Th
The transition from managing a single model to operating an organizational platform represents a qualitative shift in complexity. We established that per-model operational practices do not compose; instead, they create combinatorial debt that can only be resolved through platform abstractions like centralized registries, ensemble-aware CI/CD, and hierarchical monitoring.
We examined how operational cadences must match model risk profiles, from the staged, weeks-long rollouts of LLMs to the seconds-fast rollbacks of adversarial fraud detection. The TCO framework ($\text{TCO}_{\text{ML}} = C_{\text{train}} + C_{\text{infer}} + C_{\text{data}} + C_{\text{iter}}$) provides quantitative foundations for strategic investment decisions, revealing how cost structure shifts from iteration-dominated (early stage) to inference-dominated (production scale) and how optimization priorities must evolve accordingly. Finally, we extended the MLOps vision to the edge, addressing the "Fleet Version Skew" and "Hardware-in-the-Loop" validation requirements essential for managing intelligence on millions of heterogeneous devices.
Operational cadences must match model risk profiles, from the staged, weeks-long rollouts of LLMs to the seconds-fast rollbacks of adversarial fraud detection. The TCO framework ($\text{TCO}_{\text{ML}} = C_{\text{train}} + C_{\text{infer}} + C_{\text{data}} + C_{\text{iter}}$) provides quantitative foundations for strategic investment decisions, revealing how cost structure shifts from iteration-dominated (early stage) to inference-dominated (production scale) and how optimization priorities must evolve accordingly. Finally, the MLOps vision extends to the edge, addressing the "Fleet Version Skew" and "Hardware-in-the-Loop" validation requirements essential for managing intelligence on millions of heterogeneous devices.
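A toy instantiation of the TCO framework with hypothetical monthly costs shows how the dominant term flips between stages:

```python
# TCO_ML = C_train + C_infer + C_data + C_iter (hypothetical monthly USD)
stages = {
    "early stage": dict(train=20_000, infer=2_000, data=5_000, iter=60_000),
    "production":  dict(train=20_000, infer=400_000, data=30_000, iter=10_000),
}
for name, c in stages.items():
    total = sum(c.values())
    dominant = max(c, key=c.get)
    print(f"{name}: ${total:,} total; dominated by C_{dominant} ({c[dominant] / total:.0%})")
```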
::: {.callout-takeaways title="Platforms, Not Pipelines"}
* **Platform ROI is Superlinear**: The value of shared ML infrastructure grows faster than the model count. Organizations that defer platform investment until "at scale" often find themselves paralyzed by accumulated operational debt.
* **TCO Drives Strategy**: ML systems exhibit a four-component cost structure (training, inference, data, iteration) that evolves with scale. Early-stage systems are iteration-dominated (optimize for velocity); production systems are inference-dominated (optimize for efficiency). Match optimization investments to your current cost structure using breakeven analysis.
* **TCO Drives Strategy**: ML systems exhibit a four-component cost structure (training, inference, data, iteration) that evolves with scale. Early-stage systems are iteration-dominated (optimize for velocity); production systems are inference-dominated (optimize for efficiency). Match optimization investments to the current cost structure using breakeven analysis.
* **Management of the Edge Fleet**: MLOps extends beyond the datacenter. Managing edge intelligence requires handling extreme version skew (weeks-long rollouts) and Hardware-in-the-Loop (HIL) CI/CD to ensure models do not crash on diverse NPU/DSP architectures.
* **Ensembles are the Unit of Management**: In production (especially recommendation), the atomic unit is rarely a single model but an ensemble of 10-50 components. Management must be "dependency-aware" to prevent upstream updates from breaking downstream consumers.
* **Aggregation over Enumeration**: Monitoring 100 models with independent alerts is mathematically guaranteed to cause alert fatigue. Effective platforms use hierarchical monitoring and fleet-wide anomaly detection to find the "signal" across the portfolio.
@@ -4126,7 +4126,7 @@ The practitioner who internalizes this lesson gains a strategic advantage. Under
We have built the operational machinery for managing ML systems at scale, from platform economics and fleet monitoring to edge deployment and FinOps governance. Yet a fleet that is globally distributed and autonomously adapting also presents an expansive attack surface.
In @sec-security-privacy, we shift from *how to manage* the fleet to *how to defend* it. We examine the adversarial threats unique to ML systems, including data poisoning, model extraction, and membership inference, alongside the differential privacy frameworks that govern how we can safely handle the data fueling our fleet.
The next chapter addresses how to defend the fleet, examining adversarial threats unique to ML systems, including data poisoning, model extraction, and membership inference, alongside the differential privacy frameworks that govern how we can safely handle the data fueling our fleet.
:::

View File

@@ -1,6 +1,6 @@
# Part VII: Deployment at Scale {.unnumbered}
# Part III: Deployment at Scale {.unnumbered}
Parts V and VI built the Fleet and trained the model. Now the system must "earn its keep." Part VII shifts the focus from the single, massive training run to the **Global Inference Fleet**: the thousands of geographically distributed servers and billions of edge devices that serve model outputs to users in real-time. This transition from training to production fundamentally changes the engineering constraints: from maximizing throughput on a static dataset to minimizing latency on a dynamic stream of requests, all within the **physics of global serving**.
Parts I and II built the Fleet and trained the model. Now the system must "earn its keep." Part III shifts the focus from the single, massive training run to the **Global Inference Fleet**: the thousands of geographically distributed servers and billions of edge devices that serve model outputs to users in real-time. This transition from training to production fundamentally changes the engineering constraints: from maximizing throughput on a static dataset to minimizing latency on a dynamic stream of requests, all within the **physics of global serving**.
In this regime, engineering becomes a negotiation with the user's experience. We can optimize for high batch throughput in the datacenter, but we risk violating the user's latency budget. We can deploy a model to the edge to eliminate network latency, but we are then constrained by the tight memory and power envelopes of mobile and IoT hardware. These principles govern the performance engineering and operational economics of deploying intelligence at scale.

View File

@@ -1,6 +1,6 @@
# Part VI: Distributed ML {.unnumbered}
# Part II: Distributed ML {.unnumbered}
Coordinating the **Fleet** requires solving problems that a single machine does not. If Part V built the physical substrate, Part VI establishes the **Logic of Distribution**: the algorithms, protocols, and coordination strategies that transform a collection of independent accelerators into a singular, unified machine. This transition is governed by the **physics of distributed learning**—the first-principles reasoning that explains why adding more GPUs does not always yield linear speedups, and why communication, not computation, is often the hidden bottleneck.
Coordinating the **Fleet** requires solving problems that a single machine does not. If Part I built the physical substrate, Part II establishes the **Logic of Distribution**: the algorithms, protocols, and coordination strategies that transform a collection of independent accelerators into a singular, unified machine. This transition is governed by the **physics of distributed learning**—the first-principles reasoning that explains why adding more GPUs does not always yield linear speedups, and why communication, not computation, is often the hidden bottleneck.
At this scale, the engineering challenge is a trade-off between parallelization and coordination. We can partition a model to reduce per-device memory pressure, but we necessarily increase the communication volume required to keep the weights synchronized. We can scale to thousands of nodes to reduce training time, but we simultaneously decrease the Mean Time Between Failures (MTBF), making fault tolerance a mandatory system component rather than an operational luxury. These principles quantify the "Communication Tax" and the "Reliability Tax" of scale.
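The "Reliability Tax" can be stated in one line: assuming independent node failures, cluster MTBF shrinks inversely with node count. A sketch with an assumed node MTBF:

```python
# Cluster MTBF under independent node failures (assumed node MTBF)
mtbf_node_h = 50_000
for n_nodes in (100, 1_000, 10_000):
    print(f"{n_nodes:>6} nodes -> cluster MTBF ~ {mtbf_node_h / n_nodes:8.1f} h")
```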

View File

@@ -1,6 +1,6 @@
# Part V: The Fleet {.unnumbered}
# Part I: The Fleet {.unnumbered}
The **Machine Learning Fleet** is the warehouse-scale computer where the network is the bus, power density is the speed limit, and failure is a statistical certainty. Continuing the curriculum's focus on the **physics of AI engineering**, Part V builds the **physics of scale**: the silicon, the wires, the cooling systems, and the storage hierarchies that make distributed ML possible. We shift from the single accelerator to the datacenter-scale machine, where the engineering challenge is no longer just "how do I compute?" but "how do I move energy and information at a scale that challenges the limits of the physical infrastructure?"
The **Machine Learning Fleet** is the warehouse-scale computer where the network is the bus, power density is the speed limit, and failure is a statistical certainty. Continuing the curriculum's focus on the **physics of AI engineering**, Part I builds the **physics of scale**: the silicon, the wires, the cooling systems, and the storage hierarchies that make distributed ML possible. We shift from the single accelerator to the datacenter-scale machine, where the engineering challenge is no longer just "how do I compute?" but "how do I move energy and information at a scale that challenges the limits of the physical infrastructure?"
This transition requires a fundamental shift in perspective. At scale, the individual GPU is merely a component in a larger, tightly coupled system. The principles of the Fleet are not best practices for cluster management; they are the physical invariants that dictate what kind of models can be trained and how they can be served. From the thermodynamic limits of heat dissipation to the bisection bandwidth of the network fabric, these constraints define the boundaries of the "Fleet Stack."

View File

@@ -1,6 +1,6 @@
# Part VIII: The Responsible Fleet {.unnumbered}
# Part IV: The Responsible Fleet {.unnumbered}
Parts V through VII built the Machine Learning Fleet, coordinated its training, and optimized its deployment. Part VIII confronts **The Hardened System**: the final, essential layer of engineering that determines whether the fleet serves its users safely or harms them. This final part establishes the **Governance of the Fleet**—where security, privacy, robustness, and sustainability are not optional "features," but fundamental physical and mathematical constraints in the **physics of responsible engineering**.
Parts I through III built the Machine Learning Fleet, coordinated its training, and optimized its deployment. Part IV confronts **The Hardened System**: the final, essential layer of engineering that determines whether the fleet serves its users safely or harms them. This final part establishes the **Governance of the Fleet**—where security, privacy, robustness, and sustainability are not optional "features," but fundamental physical and mathematical constraints in the **physics of responsible engineering**.
In this concluding part, we confront the most challenging engineering domain: the sociotechnical feedback loop. We cannot simply "fix" bias or "secure" a model with a single algorithm; instead, we must engineer the monitoring, verification, and governance systems that surround the fleet. A system that ignores these constraints will fail, not just ethically, but operationally: through regulatory shutdown, security breach, or environmental exhaustion. These principles define the boundaries of responsible engineering at scale.
@@ -38,7 +38,7 @@ $$ P_{t+1}(X) = g(P_t(X), f_t(X)) $$
**The Implication**: Systems require **Closed-Loop Governance**. A model that maximizes accuracy on static test data can still destroy its own ecosystem. Reliability requires modeling the *feedback loop*, not just the *feed-forward inference*.
:::
## Part VIII Roadmap: Governance and Responsibility {#sec-principles-responsible-part-viii-roadmap-governance-responsibility-e8f1}
## Part IV Roadmap: Governance and Responsibility {#sec-principles-responsible-part-viii-roadmap-governance-responsibility-e8f1}
This final part secures the Fleet and governs its impact on the world:

View File

@@ -1998,6 +1998,7 @@ Imagine a hedge fund training a sentiment analysis model on financial tweets. If
::: {#fig-adversarial-attack-injection fig-env="figure" fig-pos="htb" fig-cap="**Data Poisoning Attack**: Adversaries inject malicious data into the training set to manipulate model behavior, potentially causing misclassification or performance degradation during deployment. This attack emphasizes the vulnerability of machine learning systems to compromised data integrity and the need for robust data validation techniques. *Source: [li](https://www.mdpi.com/2227-7390/12/2/247)*" fig-alt="Matrix showing user-item ratings with attacker injecting red malicious rows. Defender analyzes and cleans data before training. Poisoned model cube shows compromised output."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\usetikzlibrary{calc,positioning}
\colorlet{Green}{GreenL}
\colorlet{Red}{RedL}
\colorlet{Blue}{BlueL}
@@ -2006,7 +2007,7 @@ Imagine a hedge fund training a sentiment analysis model on financial tweets. If
doubleL/.style={-{Triangle[length=0.44cm,width=0.62cm]},line width=1.0pt,text=black},
Line/.style={line width=1.0pt,VioletLine,text=black},
DLine/.style={dashed,line width=1.0pt,VioletLine,text=black},
cell/.style={draw=BrownLine, fill=GreenL, minimum width=#1, minimum height=#2, line width=0.75pt},
cell/.style 2 args={draw=BrownLine, fill=GreenL, minimum width=#1, minimum height=#2, line width=0.75pt},
face/.style={fill=OrangeL},
head/.style={rounded corners=1.5mm, fill=#1},
eye/.style={circle, inner sep=0pt, minimum width=2pt, minimum height=2pt, fill=#1},
@@ -2029,7 +2030,7 @@ Imagine a hedge fund training a sentiment analysis model on financial tweets. If
\foreach \x in {1,...,\columns}{
\foreach \y in {1,...,\rows}{
\node[cell=\cellsize,\cellheight] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
\node[cell={\cellsize}{\cellheight}] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
}
}
@@ -2112,7 +2113,7 @@ Imagine a hedge fund training a sentiment analysis model on financial tweets. If
\foreach \x in {1,...,\columns}{
\foreach \y in {1,...,\rows}{
\node[cell=\cellsize,\cellheight] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
\node[cell={\cellsize}{\cellheight}] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
}
}
\end{scope}