mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
Reorganizes Introduction chapter content and prose
Moves the 'Scaling the Machine: From Node to Fleet' section to a more logical position within the chapter, following the discussion on defining ML systems. Refines various sentences for improved clarity, conciseness, and a more formal, impersonal tone. Adds an introductory sentence to better outline the chapter's structure and movements.
This commit is contained in:
@@ -66,6 +66,8 @@ Machine learning systems have a *physics*. Data must move through memory hierarc
:::
This chapter lays the foundation in three movements: what ML systems are, what makes them different from traditional software, and how to organize the engineering effort.
```{python}
#| echo: false
#| label: ai-moment-stats
@@ -123,82 +125,12 @@ class AIMomentStats:

## AI Moment {#sec-introduction-ai-moment-37f1}

Artificial intelligence has moved from research laboratories to the fabric of daily life. Consider asking one's phone a question: an AI system converts speech to text, interprets intent, and generates a response. Scrolling through social media, AI systems decide which posts appear and in what order. Applying for a loan, AI systems assess creditworthiness. Driving a modern car, AI systems monitor lane position, detect pedestrians, and adjust cruise control. In each case, the system is not merely retrieving information but making decisions under uncertainty, often controlling physical outcomes that affect safety, finances, or access to opportunity. These are not future possibilities; they are present realities affecting billions of people daily.
*What* makes building these systems an engineering challenge distinct from traditional software? The answer lies in a **Dual Mandate**\index{Dual Mandate}. Every ML system must simultaneously manage statistical uncertainty, because the model's predictions are probabilistic, and physical constraints, because executing those predictions requires moving terabytes of data and performing quintillions of arithmetic operations, often within milliseconds. The difference becomes clearest at failure boundaries. When a traditional program crashes, an engineer traces the bug to specific lines of code. When an ML system's accuracy drops by five percentage points, there may be no bug to find: the code executes correctly, but the learned behavior has changed\index{Silent Degradation}. Concretely, a code bug causes a crash (a loud failure), whereas a data bug causes a wrong prediction (a silent failure). The training data may have shifted. The hardware may have run out of memory mid-training. The model may not have converged. Debugging, testing, and architectural design all change when a system's behavior is defined by data rather than by code.
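The failure boundary can be sketched in a few lines. The following is an illustrative toy (the functions, words, and weights are hypothetical, not from any production system): a code bug halts execution at a traceable line, while a data bug returns a confident wrong answer.

```python
# Software 1.0 failure is loud: execution stops at a traceable line.
def average_price(prices):
    return sum(prices) / len(prices)

try:
    average_price([])               # the bug surfaces immediately...
except ZeroDivisionError:
    loud_failure = True             # ...with a stack trace pointing at code
assert loud_failure

# Software 2.0 failure is silent: the code runs; the learned behavior is stale.
def spam_score(email_words, learned_weights):
    # Words unseen during training get weight 0, so novel spam vocabulary
    # scores low and slips through without raising any exception.
    return sum(learned_weights.get(w, 0.0) for w in email_words)

weights = {"lottery": 2.0, "winner": 1.5}                  # fit on old spam
assert spam_score(["lottery", "winner"], weights) == 3.5   # caught
assert spam_score(["crypto", "airdrop"], weights) == 0.0   # silently missed
```

The last assertion is the dangerous one: nothing crashed and no log line fired, yet the filter's effective accuracy on current traffic has dropped.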
This dual mandate is visible in every large-scale AI deployment. ChatGPT coordinates thousands of GPUs[^fn-gpu-parallel] across data centers, executing trillions of operations per query while managing memory, network bandwidth, and thermal constraints. Tesla's collision avoidance relies on dozens of neural networks processing data from cameras, radar, and ultrasonic sensors simultaneously, fusing their outputs into a control decision within milliseconds. Google processes `{python} AIMomentStats.google_search_b_str` billion searches per day, each one triggering multiple AI systems for ranking, knowledge extraction, and spell-checking, all while meeting strict latency targets on globally distributed infrastructure. These systems do not merely run algorithms. They orchestrate data, computation, and hardware under tight physical constraints to deliver statistically reliable results at scale.
This textbook teaches the engineering principles for building, optimizing, and deploying these systems. At the core of our approach is a simple observation: every ML system is a three-way interaction between the *Algorithm* (what the system is learning), the *Data* (what it is learning from), and the *Machine* (the physical hardware executing the computation). These three elements, which we formalize as the **Data · Algorithm · Machine (D·A·M) taxonomy**\index{D·A·M taxonomy}, are inseparable. Compressing a model to fit on a mobile device changes its accuracy. Doubling the training data demands more compute and storage. Switching from a CPU to an accelerator reshapes which algorithms are practical. Understanding ML systems engineering means learning to reason about all three simultaneously.
Before we can build that engineering framework, we need precise definitions and a shared analytical vocabulary. This chapter lays the foundation for the entire book in three movements. First, we establish *what* machine learning systems are: we distinguish artificial intelligence as a long-term research vision from machine learning as the engineering methodology we use today, trace the paradigm shifts that brought the field from rule-based expert systems to data-driven deep learning, and examine the **Bitter Lesson**, the empirical finding that general methods using computation ultimately outperform hand-engineered approaches. Second, we establish *what makes ML systems different*: we define ML systems precisely, analyze how they diverge from traditional software in testing, debugging, deployment, and maintenance, and develop the **Iron Law of ML Systems**, a quantitative framework that decomposes performance into data movement, computation, and overhead. Third, we establish *how to organize the engineering effort*: we define ML systems engineering as a discipline, trace the system lifecycle from conception through deployment, examine deployment case studies at both extremes (datacenter and microcontroller), and develop the **Five-Pillar Framework** that structures the rest of this book.
Machine learning represents a specific approach to artificial intelligence: rather than programming explicit rules, engineers design systems that learn patterns from data. However, this simple description conceals a deep reconception of what software *is*. Understanding the nature of that shift, and *why* it demands entirely new engineering practices, is where we begin.

## Data-Centric Paradigm Shift {#sec-introduction-datacentric-paradigm-shift-4eca}

The shift from rule-based to data-driven systems constitutes a deep reconception of computing. Andrej Karpathy[^fn-karpathy-sw2] formalized this distinction as the shift from **Software 1.0** to **Software 2.0**\index{Software 2.0} [@karpathy2017software], a framing that captures *why* ML systems require entirely new engineering approaches. @tbl-software-1-vs-2 summarizes this paradigm shift.
@@ -224,7 +156,7 @@ Google researchers quantified these consequences in a landmark 2015 paper.

**The Insight**: They demonstrated that in mature ML systems, the *ML Code* (the model itself) is only a tiny fraction ($\approx 5\%$) of the total code base. The rest is *Glue Code*: data collection, verification, feature extraction, resource management, monitoring, and serving infrastructure.
**The Systems Lesson**: "Machine Learning" is easy; **"Machine Learning Systems"** are hard. The friction in deployment rarely comes from the matrix multiplication (the 5%); it comes from the interface between that math and the messy reality of the other 95%. Optimizing only the model optimizes the smallest part of the problem.
:::
The critical implication: *Data is Source Code* (Principle \ref{pri-data-as-code}).\index{Software 2.0} In traditional software, a programmer writes explicit logic (`if x > 0 then y`). In machine learning, the programmer writes the *optimization meta-logic* (the training algorithm), but the actual operational logic is "compiled" from the training dataset through stochastic gradient descent[^fn-sgd-sampling] and related optimization methods. The dataset serves as source code, the training pipeline as compiler, and the model weights[^fn-model-weights-inference] as binary executable.
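The compilation metaphor can be made literal in a toy setting. A minimal sketch, assuming nothing beyond plain Python (the `compile_from_data` name is invented for illustration): the same training loop produces different "binaries" from different datasets.

```python
# Hypothetical illustration: the "program" is compiled from data, not written.
# Fit y = w*x by gradient descent; the dataset determines the logic (w).
def compile_from_data(dataset, lr=0.05, steps=500):
    w = 0.0
    for _ in range(steps):
        # Gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        w -= lr * grad
    return w  # the "binary": a learned weight, not hand-written logic

doubler_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # "source code" v1
tripler_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # "source code" v2

# Same "compiler", different data, different program behavior:
assert abs(compile_from_data(doubler_data) - 2.0) < 1e-3
assert abs(compile_from_data(tripler_data) - 3.0) < 1e-3
```

Editing the dataset, not the function, is what changes the program's behavior, which is exactly why data versioning and data testing take on the role that code review plays in traditional software.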
@@ -1138,7 +1070,7 @@ The progression through four paradigms reveals a consistent pattern: each era's
Richard Sutton's[^fn-sutton-bitter] 2019 essay "The Bitter Lesson"\index{Bitter Lesson, The} formalizes the historical pattern we just traced [@sutton2019bitter]. Looking back at seven decades of research, Sutton observed that general methods which use increasing computation consistently outperform approaches that encode human expertise. He writes: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
[^fn-sutton-bitter]: **Richard Sutton**: A reinforcement learning pioneer whose 2019 essay crystallized the pattern traced in the preceding sections: from symbolic AI through expert systems to deep learning, general methods using computation consistently outperformed hand-engineered expertise. The lesson is "bitter" because it implies that domain-specific logic is a depreciating asset, while the durable advantage belongs to systems engineering that can absorb the billion-fold increase in raw compute since the 1970s. \index{Sutton, Richard!Bitter Lesson}
The shift from expert systems to statistical learning to deep learning has dramatically improved performance on representative tasks, with each transition enabled by increased computational scale rather than cleverer encoding of human knowledge.
@@ -1219,7 +1151,7 @@ The implication is that realizing the Bitter Lesson's promise requires expertise

Sutton's bitter lesson explains the motivation for ML systems engineering. If AI progress depends on our ability to scale computation effectively, then understanding *how* to build, deploy, and maintain these computational systems is essential for AI practitioners. Yet this understanding demands more than familiarity with any single technical domain. Computer Science advances ML algorithms, and Electrical Engineering develops specialized AI hardware, but neither discipline alone provides the engineering principles needed to deploy, optimize, and sustain ML systems at scale. The convergence of data management, algorithmic design, and infrastructure optimization into a single engineering challenge has given rise to a new discipline, one we define formally later in this chapter and develop across the entire book.
The Bitter Lesson tells us *why* scale matters. The natural next question is what kind of systems make that scale practical. A precise characterization begins with a concrete example.
## Defining ML Systems {#sec-introduction-defining-ml-systems-d4af}
@@ -1260,7 +1192,7 @@ class EmailScale:

gmail_emails_t_str = fmt(gmail_emails_t_value, precision=0)
```
Consider the spam filter protecting a typical inbox. Every day, it processes millions of emails, deciding in milliseconds which messages deserve attention and which should be quarantined. Gmail alone processes approximately `{python} EmailScale.gmail_emails_t_str` trillion emails annually, with spam comprising roughly 50% of all email traffic [@statista2024email]. Production spam filters typically target accuracy above 99.9% while processing each email in under 50 ms to avoid noticeable delays.
This deceptively simple task reveals *what* distinguishes machine learning systems from traditional software. The challenge begins with data: the filter trains on millions of labeled examples, constantly adapting as spammers evolve their tactics. Traditional software would require programmers to encode rules for every spam pattern manually, but the ML approach learns patterns automatically from data, adapting to new spam techniques without programmer intervention.
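A drastically simplified sketch shows how the filter's behavior follows its data rather than its code. All names and word lists below are invented for illustration; real filters use far richer features and calibrated probabilistic models:

```python
from collections import Counter

# Illustrative toy: the filter's behavior is defined by its training data,
# so adapting to new spam tactics means retraining, not rewriting code.
def train(labeled_emails):
    spam_words, ham_words = Counter(), Counter()
    for words, is_spam_label in labeled_emails:
        (spam_words if is_spam_label else ham_words).update(words)
    return spam_words, ham_words

def is_spam(words, model, threshold=0):
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in words)
    return score > threshold

v1 = train([(["win", "prize"], True), (["meeting", "notes"], False)])
assert is_spam(["win", "prize"], v1)
assert not is_spam(["crypto", "airdrop"], v1)   # new tactic slips through

# Spammers evolve; the fix is new labeled data, not new logic:
v2 = train([(["win", "prize"], True), (["crypto", "airdrop"], True),
            (["meeting", "notes"], False)])
assert is_spam(["crypto", "airdrop"], v2)
```

The code of `is_spam` never changes between `v1` and `v2`; only the data does, which is the essence of the adaptation described above.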
@@ -1471,6 +1403,70 @@ $$ \text{Cost} \propto \frac{\text{Model Size} \times \text{Dataset Size}}{\text
Systems engineering is the art of balancing this equation. As a rough illustration: a 10% gain in hardware efficiency allows for a 10% larger dataset, which might yield a 1% gain in accuracy. The engineer's job is to determine if that trade-off is economically viable.
:::
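The callout's arithmetic can be checked directly. A minimal sketch in Python, treating the proportionality as an equality in relative units (all numbers illustrative):

```python
# Back-of-envelope sketch of the cost balance (illustrative units only):
# Cost is proportional to (model size x dataset size) / hardware efficiency.
def relative_cost(model, data, efficiency):
    return model * data / efficiency

baseline = relative_cost(model=1.0, data=1.0, efficiency=1.0)

# A 10% efficiency gain funds a 10% larger dataset at the same cost:
upgraded = relative_cost(model=1.0, data=1.10, efficiency=1.10)
assert abs(upgraded - baseline) < 1e-9
```

Whether the trade is worth making depends on the return, for example a hypothetical +1% accuracy from the larger dataset versus the engineering cost of the 10% speedup.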
## Scaling the Machine: From Node to Fleet {#sec-introduction-scaling-regimes}
\index{Single-Node Stack!scaling regimes}\index{Distributed Fleet!scaling regimes}The "Machine" axis of the AI Triad operates across two distinct scaling regimes, each governed by its own physical bottlenecks. Engineering a robust ML system requires mastering the first before attempting to scale to the second.

@fig-system-scaling-regimes visualizes this transition. In the **Single-Node** regime (the focus of this textbook), we optimize for 1–8 GPUs connected by shared memory and high-speed intra-node interconnects like **NVLink**. Here, the binding constraint is the **Memory Wall**—the rate at which we can move data from local HBM to compute units. As applications grow beyond the capacity of a single machine, we enter the **Distributed Fleet** regime: thousands of nodes coordinated across a high-speed switch fabric. There, the bottleneck shifts to the **Bisection Bandwidth Wall**, where network congestion and message-passing latency dominate. The engineering principles covered in these chapters—maximizing hardware utilization and mitigating data movement costs on a single node—are the essential prerequisites for eventually engineering the fleet.
::: {#fig-system-scaling-regimes fig-env="figure" fig-pos="htb" fig-cap="**The Scaling Regimes of ML Systems**: Machine learning engineering is partitioned into two distinct physical regimes. Single-node systems are limited by local memory bandwidth (**Memory Wall**), while distributed fleets are limited by network communication (**Bisection Bandwidth Wall**). Mastery of intra-node data movement is the prerequisite for distributed scaling." fig-alt="Diagram comparing Single-Node Stack (App, Framework, OS, HW) to Distributed Fleet (Governance, Serving, Distribution, Infra)."}
```{.tikz}
\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}\footnotesize, x=1cm, y=1.1cm]
\tikzset{
stack/.style={rectangle, rounded corners=2pt, draw, align=center, minimum width=3.5cm, minimum height=0.8cm, line width=0.8pt},
label/.style={font=\usefont{T1}{phv}{b}{n}\small, align=center},
sublabel/.style={font=\usefont{T1}{phv}{m}{n}\scriptsize, color=black!70, align=center},
bottleneck/.style={rectangle, rounded corners=2pt, draw, dashed, align=center, minimum width=3.5cm, minimum height=0.6cm, line width=0.6pt, fill=RedLine!5},
arrow/.style={-latex, line width=1.5pt, draw=GreenLine}
}

% Single Node Stack
\node[label] at (0, 4.5) {Single-Node Stack};
\node[sublabel] at (0, 4.1) {1–8 GPUs, Shared Memory};

\node[stack, fill=RedLine!15, draw=RedLine] (app1) at (0, 3.2) {Application};
\node[sublabel] at (0, 3.2) {\\[0.4em]Training Loop / Inference};

\node[stack, fill=OrangeLine!15, draw=OrangeLine] (fw1) at (0, 2.2) {ML Framework};
\node[sublabel] at (0, 2.2) {\\[0.4em]PyTorch / JAX / Kernels};

\node[stack, fill=GreenLine!15, draw=GreenLine] (os1) at (0, 1.2) {Operating System};
\node[sublabel] at (0, 1.2) {\\[0.4em]CUDA / PCIe DMA};

\node[stack, fill=BlueLine!15, draw=BlueLine] (hw1) at (0, 0.2) {Hardware};
\node[sublabel] at (0, 0.2) {\\[0.4em]HBM / NVLink (900 GB/s)};

\node[bottleneck, draw=RedLine] (bn1) at (0, -0.8) {\textbf{Bottleneck: Memory Wall}};

% Scaling Arrow
\draw[arrow] (2, 1.7) -- (3, 1.7) node[midway, above, color=GreenLine, font=\usefont{T1}{phv}{b}{n}] {Scaling};

% Distributed Fleet Stack
\node[label] at (5, 4.5) {Distributed Fleet Stack};
\node[sublabel] at (5, 4.1) {1,000–100,000+ GPUs};

\node[stack, fill=VioletLine!15, draw=VioletLine] (app2) at (5, 3.2) {Governance};
\node[sublabel] at (5, 3.2) {\\[0.4em]Responsible AI / Security};

\node[stack, fill=OrangeLine!15, draw=OrangeLine] (fw2) at (5, 2.2) {Serving / Ops};
\node[sublabel] at (5, 2.2) {\\[0.4em]Orchestration / CI/CD};

\node[stack, fill=GreenLine!15, draw=GreenLine] (os2) at (5, 1.2) {Distribution};
\node[sublabel] at (5, 1.2) {\\[0.4em]NCCL / RDMA / Comms};

\node[stack, fill=BlueLine!15, draw=BlueLine] (hw2) at (5, 0.2) {Infrastructure};
\node[sublabel] at (5, 0.2) {\\[0.4em]Fabric / RDMA (InfiniBand)};

\node[bottleneck, draw=RedLine] (bn2) at (5, -0.8) {\textbf{Bottleneck: Network Wall}};

\end{tikzpicture}
```
:::
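The figure's two walls differ by more than an order of magnitude in a simple back-of-envelope comparison. The NVLink rate below comes from the figure; the 400 Gb/s InfiniBand rate is an assumed nominal link speed, and both ignore protocol overheads:

```python
# Back-of-envelope comparison of the two walls (nominal peak bandwidths).
GiB = 1024**3
nvlink_bw  = 900e9        # bytes/s, intra-node (NVLink, per the figure)
network_bw = 400e9 / 8    # bytes/s, inter-node (assumed 400 Gb/s link)

tensor_bytes = 1 * GiB    # e.g., one shard of activations or gradients
t_node = tensor_bytes / nvlink_bw    # ~1.2 ms within the node
t_net  = tensor_bytes / network_bw   # ~21 ms across the fabric

# The slowdown is simply the bandwidth ratio (900 / 50 = 18x here):
assert abs(t_net / t_node - 18.0) < 1e-6
print(f"intra-node: {t_node*1e3:.1f} ms, inter-node: {t_net*1e3:.1f} ms")
```

Under these assumptions, the same tensor takes roughly 18 times longer to cross the fabric than to cross the node, which is why the binding constraint shifts once a workload spans machines.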
[^fn-gpu-parallel]: **GPU (Graphics Processing Unit)**: Originally designed for rendering video game graphics, a workload requiring thousands of simple, parallel pixel calculations. This hardware-algorithm alignment proved decisive for neural networks, where the same massively parallel arithmetic structure maps directly onto matrix multiplication, making GPUs the primary physical enabler of modern training scale (see @sec-hardware-acceleration). \index{GPU!etymology}

[^fn-neural-network-origin]: **Neural Network**: A differentiable function approximator whose compute and memory demands scale with its learned parameter count. Running *dozens* of them simultaneously, as Tesla's perception stack does, multiplies the memory footprint and scheduling complexity on fixed vehicle hardware, making the system's bottleneck not any single model's accuracy but the aggregate resource budget required to execute all models within a shared latency window. \index{Neural Network!etymology}

The D·A·M taxonomy tells us what an ML system is made of. Understanding the *components* of a system is not the same as understanding *how those components interact under stress*. Traditional software systems share the same basic ingredients (data, logic, infrastructure) yet fail in completely different ways. The distinctive failure mode of ML systems, silent degradation rather than explicit crashes, is what makes them genuinely new from an engineering standpoint.
## ML vs. Traditional Software {#sec-introduction-ml-vs-traditional-software-e19a}
@@ -1679,7 +1675,7 @@ class GPT3Training:
**The Systems Insight**: If we improve software efficiency (η) from `{python} GPT3Training.eta_base_pct_str`% to `{python} GPT3Training.eta_opt_pct_str`% through kernel fusion and better scheduling, training time drops to **`{python} GPT3Training.days_optimized_str` days**, saving nearly `{python} GPT3Training.days_saved_str` days of expensive compute time.
:::
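The chapter computes its exact figures inline; the shape of the calculation can be sketched with assumed round numbers. Every value below (FLOP budget, cluster size, per-GPU peak, efficiency levels) is an illustrative assumption, not the callout's computed value:

```python
# Hedged sketch: wall-clock training time scales inversely with achieved
# efficiency eta. All constants are assumed round numbers for illustration.
total_flops = 3.14e23     # assumed GPT-3-scale training budget
n_gpus      = 1024        # assumed cluster size
peak_flops  = 312e12      # assumed per-GPU peak (A100-class, FP16)

def training_days(efficiency):
    seconds = total_flops / (n_gpus * peak_flops * efficiency)
    return seconds / 86400

base, optimized = training_days(0.20), training_days(0.40)
assert optimized < base
print(f"eta=20%: {base:.0f} days -> eta=40%: {optimized:.0f} days")
```

Under these assumptions, doubling achieved efficiency halves the wall-clock time (from about 57 to about 28 days here), without buying a single additional GPU.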
The equation is dimensionally consistent: each term resolves to seconds. One cannot add FLOPs to Bytes any more than one can add meters to kilograms; the **Iron Law** adds Time to Time to Time. @sec-machine-foundations-dimensional-analysis-76b3 provides a formal dimensional analysis verifying this consistency and demonstrates how unit tracking prevents common modeling errors.
The **Iron Law** governs *time*, but time is not the only constraint. For mobile devices, edge systems, and large-scale training clusters, *energy* often matters more than raw speed.
@@ -2414,7 +2410,7 @@ Part III addresses optimization for production deployment. @sec-data-selection i

Part IV ensures optimized systems operate reliably in production. @sec-model-serving covers infrastructure for delivering predictions with low latency. @sec-ml-operations encompasses practices from monitoring and deployment to incident response. @sec-responsible-engineering addresses ethical considerations and governance. @sec-conclusion synthesizes the complete methodology and prepares the reader for the transition from single-node mastery to fleet-scale orchestration.
For detailed guidance on reading paths, learning outcomes, prerequisites, and how to get the most from this textbook, refer to the [About](../../frontmatter/about/about.qmd) section.
Before moving forward, we examine the assumptions that trip up practitioners new to ML systems. The frameworks above provide the right mental models, but only if we also shed the wrong ones carried over from adjacent fields. Every discipline accumulates intuitions that work within its boundaries but fail when applied elsewhere. ML systems engineering is particularly vulnerable to such imported assumptions because it draws from software engineering, statistics, and hardware design simultaneously, each of which cultivates subtly different intuitions about how systems should behave.
@@ -2679,7 +2675,7 @@ The machine learning systems landscape spans nine orders of magnitude in computa

::: {.callout-chapter-connection title="From Vision to Architecture"}

Where should an ML model actually run? The answer is not "wherever is most convenient." Physical laws dictate what is possible.
The speed of light makes distant cloud servers useless for emergency braking. Thermodynamics prevents datacenter-class models from running on a mobile device. Memory physics creates bandwidth ceilings that faster chips cannot overcome. @sec-ml-systems introduces the four deployment paradigms (Cloud, Edge, Mobile, and TinyML) that span nine orders of magnitude in power and memory, explaining why each exists and how to choose among them.
Welcome to AI Engineering.