refactor: finalize the 'Engineering Crux' terminology (Hardware -> Systems -> Workloads -> Missions) across both volumes

This commit is contained in:
Vijay Janapa Reddi
2026-02-24 21:08:05 -05:00
parent fdfd91bf03
commit cc7c54e4ed
2 changed files with 12 additions and 12 deletions

View File

@@ -1404,15 +1404,15 @@ ML systems engineering is the discipline of keeping all three axes in balance.
 The D·A·M taxonomy provides the diagnostic lens, but to build systems, we must organize these axes into a reproducible hierarchy. We formalize this throughout the book as the **Engineering Crux**: a four-layer stack that transforms raw physical constraints into functional user applications.
-### The Engineering Crux: A Hierarchy of Components {#sec-introduction-engineering-crux}
+### The Engineering Crux: A Hierarchy of Architecture {#sec-introduction-engineering-crux}
 \index{Engineering Crux!hierarchy}
-Every machine learning system analyzed in this text is constructed from four hierarchical layers. By standardizing these components, we ensure that a technical decision made at the silicon level (Machine) is traceable to its impact on the final application (Scenario).
+Every machine learning system analyzed in this text is constructed from four hierarchical layers. This **Engineering Crux** transforms raw physical constraints into functional user applications, ensuring that a decision made at the silicon level is traceable to its impact on the final mission.
-1. **Hardware (The Silicon)**: The physical foundation. This layer defines the raw capabilities: $R_{peak}$, $BW$, and memory capacity. Throughout the labs and examples, we reference real-world hardware "Twins" such as the **NVIDIA H100**, **Jetson Orin**, and **ESP32-CAM**.
-2. **Models (The Weights)**: The algorithmic representation. This layer defines the workload: parameter count, operation count ($O$), and layer architecture. We use **Lighthouse Models** like **GPT-4**, **ResNet-50**, and **Wake Vision** as the standard "Software" for our benchmarks.
-3. **Systems (The Envelopes)**: The archetypal configurations. This layer bundles Hardware and Models into standardized deployment "Envelopes" (Cloud, Edge, Mobile, TinyML). A **System Archetype** defines the global constraints, such as power budget and network bandwidth, within which the model must operate.
-4. **Scenarios (The Missions)**: The application context. This is the top of the stack, where a system is deployed to solve a specific problem. A **Scenario**—such as the **Smart Doorbell** or **Autonomous Vehicle**—introduces high-level missions (e.g., "1-year battery life") that force specific trade-offs down through the lower layers.
+1. **Hardware (The Silicon)**: The physical foundation (The Engine). This layer defines the raw capabilities: $R_{peak}$, $BW$, and memory capacity. We use real-world hardware "Twins" like the **NVIDIA H100** and **ESP32-S3**.
+2. **Systems (The Platforms)**: The integrated deployment unit (The Car). This layer defines the "Envelope" in which hardware operates: power budget, thermal limits, and node-level interconnects. Examples include the **Training Cluster Node** or the **Sub-Watt Sensor Node**.
+3. **Workloads (The Models)**: The algorithmic demand (The Route). This layer defines the mathematical workload: operation count ($O$), parameter volume ($D_{vol}$), and data layout. We use **Lighthouse Workloads** like **GPT-4** and **Wake Vision**.
+4. **Missions (The Scenarios)**: The application context (The Destination). This is the top of the stack, where a system is deployed to solve a specific problem. A **Mission**—such as the **Smart Doorbell**—introduces high-level requirements (e.g., "1-year battery life") that dictate the configuration of every layer below.
 This hierarchy ensures that when we build a lab or a case study, we are not starting from scratch. We are "inheriting" the constraints of a System Archetype and applying a Lighthouse Model to a specific mission. For instance, the **Smart Doorbell** scenario (@sec-introduction-deployment-case-studies-636f) inherits the **TinyML Archetype**, uses the **Wake Vision** model, and operates on **ESP32** hardware. This structured approach allows us to reason about the "Physics of ML" across any application domain.
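The four-layer stack in this hunk can be made concrete with a minimal sketch. Using illustrative, order-of-magnitude figures for an ESP32-class Hardware Twin and the Wake Vision workload (the numbers below are assumptions for demonstration, not values from the text), a roofline-style bound $\max(O/R_{peak},\ D_{vol}/BW)$ gives the per-inference latency floor that the Mission's requirements must tolerate:

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    """The Silicon layer: raw physical capability."""
    name: str
    r_peak: float  # peak compute rate, FLOP/s
    bw: float      # memory bandwidth, bytes/s

@dataclass
class Workload:
    """The algorithmic demand: operation count O and data volume D_vol."""
    name: str
    ops: float    # O: FLOPs per inference
    d_vol: float  # D_vol: bytes moved per inference

def roofline_latency(hw: Hardware, wl: Workload) -> float:
    """Lower-bound latency: the slower of compute time and memory-traffic time."""
    return max(wl.ops / hw.r_peak, wl.d_vol / hw.bw)

# Illustrative numbers only (assumed for this sketch, not sourced from the text):
esp32 = Hardware("ESP32-S3", r_peak=2e9, bw=4e8)
wake_vision = Workload("Wake Vision", ops=1e7, d_vol=3e5)

latency = roofline_latency(esp32, wake_vision)  # seconds per inference, lower bound
```

With these assumed figures the compute term dominates, so the node is compute-bound; a Mission constraint such as "1-year battery life" would then trade duty cycle against this latency floor.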

View File

@@ -380,15 +380,15 @@ These scale-induced challenges drive infrastructure investment by the largest AI
 Existing distributed systems like Apache Spark and standard web microservices cannot run the Machine Learning Fleet because the **workload characteristics** of ML systems are fundamentally different from traditional distributed systems, even though the underlying hardware—network, compute, storage—is identical.
-## The Engineering Crux: A Hierarchy of Scale {#sec-vol2-introduction-engineering-crux}
+## The Engineering Crux: A Hierarchy of Architecture {#sec-vol2-introduction-engineering-crux}
 \index{Engineering Crux!hierarchy of scale}
-Building machine learning systems at fleet scale requires a reproducible hierarchy of components. We formalize this throughout Volume II as the **Engineering Crux**: a four-layer stack that transforms raw cluster resources into global-scale AI applications. This structured approach ensures that when we discuss a distributed algorithm (like Ring AllReduce), we are doing so within a specific physical and application context.
+Building machine learning systems at fleet scale requires a reproducible hierarchy of components. We formalize this throughout Volume II as the **Engineering Crux**: a four-layer stack that transforms raw cluster resources into global-scale AI applications.
-1. **Hardware (The Silicon)**: The physical foundation of the fleet. This layer defines the raw capabilities of individual nodes ($R_{peak}$, $\text{BW}$) and their internal memory hierarchies. In Volume II, our primary "Hardware Twins" are the **NVIDIA H100** and **B200 (Blackwell)**.
-2. **Models (The Workloads)**: The algorithmic representation. This layer defines the mathematical workload sharded across the cluster. We use **Fleet-Scale Models** like **GPT-4 (Archetype A)** and **DLRM (Archetype B)** as our standard benchmarks.
-3. **Systems (The Fleets)**: The archetypal clusters. This layer bundles Hardware and Models into standardized "Fleet Envelopes." A **System Archetype** (e.g., **Cloud Cluster**, **Edge Robotics Pod**) defines the global constraints, such as bisection bandwidth, power usage effectiveness (PUE), and mean time between failures (MTBF).
-4. **Scenarios (The Missions)**: The global application context. This is the top of the stack, where a fleet is deployed to solve a mission-critical problem. A **Scenario**—such as **Frontier Model Training** or **Autonomous Fleet Management**—introduces high-level requirements (e.g., "99.99% service availability") that dictate the configuration of every layer below.
+1. **Hardware (The Silicon)**: The physical foundation (The Engine). This layer defines the raw capabilities of individual nodes ($R_{peak}$, $\text{BW}$). Our primary "Hardware Twins" are the **NVIDIA H100** and **B200**.
+2. **Systems (The Platforms)**: The integrated deployment unit (The Car). This layer defines the cluster "Envelope": bisection bandwidth, power usage effectiveness (PUE), and failure rates (MTBF). Examples include the **H100 Training Cluster**.
+3. **Workloads (The Models)**: The algorithmic demand (The Route). This layer defines the mathematical workload sharded across the cluster ($O$, $D_{vol}$, $CI$). We use **Lighthouse Workloads** like **GPT-4** and **DLRM**.
+4. **Missions (The Scenarios)**: The global application context (The Destination). This is the top of the stack, where a fleet is deployed to solve a mission-critical problem. A **Mission**—such as **Frontier Model Training**—introduces high-level requirements (e.g., "99.99% service availability") that dictate the configuration of every layer below.
 This hierarchy ensures that every distributed engineering decision is grounded in its "Mission Context." For example, the **Frontier Training** mission inherits the **Cloud Archetype**, uses the **GPT-4** model, and operates on a cluster of **H100** hardware. By standardizing these protagonists, we ensure that the "Physics of Scale" remains traceable across every chapter.
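The $CI$ term introduced in the Workloads layer of this hunk can be sketched numerically. Using assumed H100-class figures (roughly 1 PFLOP/s dense compute and 3.35 TB/s HBM bandwidth; illustrative values, not taken from the text), a workload is compute-bound when its intensity $O/D_{vol}$ clears the hardware's ridge point $R_{peak}/BW$:

```python
def compute_intensity(ops: float, d_vol: float) -> float:
    """CI = O / D_vol: FLOPs performed per byte of data moved."""
    return ops / d_vol

def is_compute_bound(ci: float, r_peak: float, bw: float) -> bool:
    """A workload saturates compute when its CI exceeds the ridge point R_peak / BW."""
    return ci >= r_peak / bw

# Illustrative H100-class figures (assumptions for this sketch, not from the text):
R_PEAK = 1.0e15  # ~1 PFLOP/s dense compute
BW = 3.35e12     # ~3.35 TB/s HBM bandwidth; ridge point ~ 300 FLOPs/byte

ci_large_gemm = compute_intensity(ops=2.0e12, d_vol=1.0e9)  # ~2000 FLOPs/byte
ci_embedding = compute_intensity(ops=1.0e6, d_vol=1.0e6)    # ~1 FLOP/byte

gemm_bound = is_compute_bound(ci_large_gemm, R_PEAK, BW)    # GEMM-heavy phases
lookup_bound = is_compute_bound(ci_embedding, R_PEAK, BW)   # DLRM-style lookups
```

Under these assumptions the GEMM-dominated phases of a GPT-4-style workload land well above the ridge point, while DLRM-style embedding lookups fall far below it and are bandwidth-bound, which is why the two serve as contrasting Lighthouse Workloads.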