mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
refactor: finalize the 'Engineering Crux' terminology (Hardware -> Systems -> Workloads -> Missions) across both volumes
@@ -1404,15 +1404,15 @@ ML systems engineering is the discipline of keeping all three axes in balance.

-The D·A·M taxonomy provides the diagnostic lens, but to build systems, we must organize these axes into a reproducible hierarchy. We formalize this throughout the book as the **Engineering Crux**: a four-layer stack that transforms raw physical constraints into functional user applications.
-
-### The Engineering Crux: A Hierarchy of Components {#sec-introduction-engineering-crux}
+### The Engineering Crux: A Hierarchy of Architecture {#sec-introduction-engineering-crux}

\index{Engineering Crux!hierarchy}

-Every machine learning system analyzed in this text is constructed from four hierarchical layers. By standardizing these components, we ensure that a technical decision made at the silicon level (Machine) is traceable to its impact on the final application (Scenario).
+Every machine learning system analyzed in this text is constructed from four hierarchical layers. This **Engineering Crux** transforms raw physical constraints into functional user applications, ensuring that a decision made at the silicon level is traceable to its impact on the final mission.

-1. **Hardware (The Silicon)**: The physical foundation. This layer defines the raw capabilities: $R_{peak}$, $BW$, and memory capacity. Throughout the labs and examples, we reference real-world hardware "Twins" such as the **NVIDIA H100**, **Jetson Orin**, and **ESP32-CAM**.
-2. **Models (The Weights)**: The algorithmic representation. This layer defines the workload: parameter count, operation count ($O$), and layer architecture. We use **Lighthouse Models** like **GPT-4**, **ResNet-50**, and **Wake Vision** as the standard "Software" for our benchmarks.
-3. **Systems (The Envelopes)**: The archetypal configurations. This layer bundles Hardware and Models into standardized deployment "Envelopes" (Cloud, Edge, Mobile, TinyML). A **System Archetype** defines the global constraints, such as power budget and network bandwidth, within which the model must operate.
-4. **Scenarios (The Missions)**: The application context. This is the top of the stack, where a system is deployed to solve a specific problem. A **Scenario**—such as the **Smart Doorbell** or **Autonomous Vehicle**—introduces high-level missions (e.g., "1-year battery life") that force specific trade-offs down through the lower layers.
+1. **Hardware (The Silicon)**: The physical foundation (The Engine). This layer defines the raw capabilities: $R_{peak}$, $BW$, and memory capacity. We use real-world hardware "Twins" like the **NVIDIA H100** and **ESP32-S3**.
+2. **Systems (The Platforms)**: The integrated deployment unit (The Car). This layer defines the "Envelope" in which hardware operates: power budget, thermal limits, and node-level interconnects. Examples include the **Training Cluster Node** or the **Sub-Watt Sensor Node**.
+3. **Workloads (The Models)**: The algorithmic demand (The Route). This layer defines the mathematical workload: operation count ($O$), parameter volume ($D_{vol}$), and data layout. We use **Lighthouse Workloads** like **GPT-4** and **Wake Vision**.
+4. **Missions (The Scenarios)**: The application context (The Destination). This is the top of the stack, where a system is deployed to solve a specific problem. A **Mission**—such as the **Smart Doorbell**—introduces high-level requirements (e.g., "1-year battery life") that dictate the configuration of every layer below.

This hierarchy ensures that when we build a lab or a case study, we are not starting from scratch. We are "inheriting" the constraints of a System Archetype and applying a Lighthouse Model to a specific mission. For instance, the **Smart Doorbell** scenario (@sec-introduction-deployment-case-studies-636f) inherits the **TinyML Archetype**, uses the **Wake Vision** model, and operates on **ESP32** hardware. This structured approach allows us to reason about the "Physics of ML" across any application domain.
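The inheritance described in this paragraph can be sketched as a small data model: a Mission composes a System (which wraps Hardware) with a Workload, and checks a requirement against a roofline-style latency bound. This is an illustrative sketch only — the class names and all numeric values below are hypothetical, not figures from the book's labs.

```python
from dataclasses import dataclass

@dataclass
class Hardware:              # Layer 1: the Silicon
    name: str
    peak_ops_per_s: float    # R_peak
    mem_bandwidth: float     # BW, bytes/s

@dataclass
class System:                # Layer 2: the Platform / deployment Envelope
    name: str
    hardware: Hardware
    power_budget_w: float

@dataclass
class Workload:              # Layer 3: the algorithmic demand
    name: str
    ops_per_inference: float    # O
    data_volume_bytes: float    # D_vol

@dataclass
class Mission:               # Layer 4: the application context
    name: str
    system: System
    workload: Workload
    max_latency_s: float

    def latency_bound(self) -> float:
        """Roofline-style lower bound: the slower of compute time and memory-traffic time."""
        hw = self.system.hardware
        return max(self.workload.ops_per_inference / hw.peak_ops_per_s,
                   self.workload.data_volume_bytes / hw.mem_bandwidth)

# A Smart-Doorbell-style mission inheriting a sub-watt envelope (illustrative numbers).
esp32 = Hardware("ESP32-S3", peak_ops_per_s=1e9, mem_bandwidth=1e8)
tinyml_node = System("Sub-Watt Sensor Node", esp32, power_budget_w=0.5)
wake_vision = Workload("Wake Vision", ops_per_inference=2e7, data_volume_bytes=3e5)
doorbell = Mission("Smart Doorbell", tinyml_node, wake_vision, max_latency_s=0.1)

# The mission-level requirement constrains every layer below it.
assert doorbell.latency_bound() <= doorbell.max_latency_s
```

The point of the sketch is the direction of the arrows: the Mission owns the requirement, and satisfying it is evaluated against properties inherited from the System and Hardware layers.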
@@ -380,15 +380,15 @@ These scale-induced challenges drive infrastructure investment by the largest AI

Existing distributed systems like Apache Spark and standard web microservices cannot run the Machine Learning Fleet because the **workload characteristics** of ML systems are fundamentally different from traditional distributed systems, even though the underlying hardware—network, compute, storage—is identical.

-## The Engineering Crux: A Hierarchy of Scale {#sec-vol2-introduction-engineering-crux}
+## The Engineering Crux: A Hierarchy of Architecture {#sec-vol2-introduction-engineering-crux}

\index{Engineering Crux!hierarchy of scale}

-Building machine learning systems at fleet scale requires a reproducible hierarchy of components. We formalize this throughout Volume II as the **Engineering Crux**: a four-layer stack that transforms raw cluster resources into global-scale AI applications. This structured approach ensures that when we discuss a distributed algorithm (like Ring AllReduce), we are doing so within a specific physical and application context.
+Building machine learning systems at fleet scale requires a reproducible hierarchy of components. We formalize this throughout Volume II as the **Engineering Crux**: a four-layer stack that transforms raw cluster resources into global-scale AI applications.

-1. **Hardware (The Silicon)**: The physical foundation of the fleet. This layer defines the raw capabilities of individual nodes ($R_{peak}$, $\text{BW}$) and their internal memory hierarchies. In Volume II, our primary "Hardware Twins" are the **NVIDIA H100** and **B200 (Blackwell)**.
-2. **Models (The Workloads)**: The algorithmic representation. This layer defines the mathematical workload sharded across the cluster. We use **Fleet-Scale Models** like **GPT-4 (Archetype A)** and **DLRM (Archetype B)** as our standard benchmarks.
-3. **Systems (The Fleets)**: The archetypal clusters. This layer bundles Hardware and Models into standardized "Fleet Envelopes." A **System Archetype** (e.g., **Cloud Cluster**, **Edge Robotics Pod**) defines the global constraints, such as bisection bandwidth, power usage effectiveness (PUE), and mean time between failures (MTBF).
-4. **Scenarios (The Missions)**: The global application context. This is the top of the stack, where a fleet is deployed to solve a mission-critical problem. A **Scenario**—such as **Frontier Model Training** or **Autonomous Fleet Management**—introduces high-level requirements (e.g., "99.99% service availability") that dictate the configuration of every layer below.
+1. **Hardware (The Silicon)**: The physical foundation (The Engine). This layer defines the raw capabilities of individual nodes ($R_{peak}$, $\text{BW}$). Our primary "Hardware Twins" are the **NVIDIA H100** and **B200**.
+2. **Systems (The Platforms)**: The integrated deployment unit (The Car). This layer defines the cluster "Envelope": bisection bandwidth, power usage effectiveness (PUE), and failure rates (MTBF). Examples include the **H100 Training Cluster**.
+3. **Workloads (The Models)**: The algorithmic demand (The Route). This layer defines the mathematical workload sharded across the cluster ($O$, $D_{vol}$, $CI$). We use **Lighthouse Workloads** like **GPT-4** and **DLRM**.
+4. **Missions (The Scenarios)**: The global application context (The Destination). This is the top of the stack, where a fleet is deployed to solve a mission-critical problem. A **Mission**—such as **Frontier Model Training**—introduces high-level requirements (e.g., "99.99% service availability") that dictate the configuration of every layer below.

This hierarchy ensures that every distributed engineering decision is grounded in its "Mission Context." For example, the **Frontier Training** mission inherits the **Cloud Archetype**, uses the **GPT-4** model, and operates on a cluster of **H100** hardware. By standardizing these protagonists, we ensure that the "Physics of Scale" remains traceable across every chapter.
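The pairing above of Hardware capability ($R_{peak}$, $BW$) with Workload demand ($O$, $D_{vol}$, and their ratio $CI = O / D_{vol}$) can be made concrete with a minimal roofline calculation. The H100 figures below are approximate public specifications (SXM, dense BF16) used purely for illustration; the compute-intensity values for the two workload styles are assumptions, not measured numbers.

```python
# Approximate public H100 SXM figures (dense BF16), for illustration only.
R_PEAK = 989e12   # FLOP/s
BW = 3.35e12      # bytes/s (HBM3)

def attainable(ci: float) -> float:
    """Roofline: attainable throughput (FLOP/s) for a workload
    with compute intensity ci = O / D_vol (FLOP/byte)."""
    return min(R_PEAK, ci * BW)

# The ridge point separates memory-bound from compute-bound regimes.
ridge = R_PEAK / BW   # roughly 295 FLOP/byte for these figures

# A high-intensity workload (large dense matmuls, GPT-4-style training)
# sits right of the ridge and is compute-bound:
assert attainable(600.0) == R_PEAK

# A low-intensity workload (DLRM-style embedding lookups) sits left of
# the ridge and is memory-bound:
assert attainable(10.0) == 10.0 * BW
```

This is why the hierarchy insists on tracing a Mission down to silicon: the same H100 fleet delivers its peak FLOP rate on one Lighthouse Workload and a small fraction of it on another, purely because of where each workload's $CI$ falls relative to the hardware's ridge point.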