---
title: "The 5-Layer Architecture"
subtitle: "Progressive Lowering: From Abstract Demand to Concrete Supply"
---

MLSYSIM organizes the full ML systems domain into five composable layers. Abstract workload *demand* (Layer A) is progressively mapped onto concrete hardware *supply* (Layers B--D) through analytical *solvers* (Layer E). Each layer corresponds directly to chapters in the [Machine Learning Systems](https://mlsysbook.ai) textbook and its companion [lecture slides](https://mlsysbook.ai/slides/).

Understanding this stack is the key to mastering both this library and the textbook it accompanies.

## The Stack at a Glance

```{mermaid}
%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A · Workloads</b> — <i>Demand</i><br/>TransformerWorkload · CNNWorkload<br/>Parameters · FLOPs · Arithmetic Intensity"]
    B["<b>Layer B · Hardware</b> — <i>Silicon</i><br/>HardwareNode · ComputeCore · MemoryHierarchy<br/>Peak FLOP/s · Bandwidth · Capacity · TDP"]
    C["<b>Layer C · Infrastructure</b> — <i>Environment</i><br/>GridProfile · Datacenter<br/>Carbon Intensity · PUE · WUE"]
    D["<b>Layer D · Systems</b> — <i>Topology</i><br/>Node · Fleet · NetworkFabric<br/>Topology · Accelerators/Node · Fabric BW"]
    E["<b>Layer E · Solvers</b> — <i>Analysis</i><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
    F["<b>Results</b><br/>PerformanceProfile · Dict[str, Quantity]"]

    A --> E
    B --> D
    C --> D
    D --> E
    E --> F
```

The flow reads top-to-bottom: a **Workload** (what you want to compute) and a **System** (the hardware, network, and environment you have) feed into a **Solver** (the analytical engine), which returns quantitative results you can act on.

---

## Layer A: Workloads (Demand) {#sec-layer-a}

A **Workload** is a hardware-agnostic description of computational demand. The question is not "How fast is Llama-3?" but rather "How many FLOPs and memory bytes does Llama-3 *require*?"

MLSYSIM provides two workload types:

- **`TransformerWorkload`** — LLMs and attention-based models (GPT, Llama, BERT). Defined by parameter count, layer count, hidden dimension, attention heads, and sequence length. Supports KV-cache size estimation for autoregressive inference.
- **`CNNWorkload`** — Vision and convolutional models (ResNet, MobileNet, YOLO). Defined by parameter count and total FLOPs per forward pass.

The crucial step happens when a workload is "lowered" at a specific numerical precision (FP32, FP16, INT8, INT4). This precision lowering determines the **Arithmetic Intensity** (FLOPs/Byte) — the single ratio that decides whether a model will be compute-bound or memory-bound on any given hardware target.

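As a rough illustration of why precision moves arithmetic intensity, the sketch below uses plain Python rather than the MLSYSIM API. It assumes a batch-1 decode step in which every weight is read once and contributes about two FLOPs (one multiply, one add). The helper names and the 70B parameter count are illustrative choices, not library values.

```python
# Bytes per parameter at each precision (INT4 packs two values per byte).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return FLOPs per byte of memory traffic."""
    return flops / bytes_moved


def decode_intensity(num_params: float, precision: str) -> float:
    """Intensity of one batch-1 decode step under the stated assumptions.

    Assumes ~2 FLOPs per parameter and that weight reads dominate traffic;
    activation and KV-cache reads are ignored.
    """
    flops = 2.0 * num_params
    bytes_moved = num_params * BYTES_PER_PARAM[precision]
    return arithmetic_intensity(flops, bytes_moved)


if __name__ == "__main__":
    # Halving the bytes per parameter doubles the intensity.
    for precision in BYTES_PER_PARAM:
        value = decode_intensity(70e9, precision)
        print(f"{precision}: {value:.1f} FLOPs/byte")
```
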
::: {.callout-note}
## Textbook and slides
Layer A concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [NN Computation](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 5) | The math behind the model | FLOPs, parameter counting, activation memory, forward/backward pass cost |
| [NN Architectures](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 6) | Architecture as infrastructure | Computational signatures of CNNs, RNNs, Transformers; quadratic attention scaling |
| [Model Compression](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 10) | From benchmark winner to production model | Pruning, distillation, quantization (FP32 to INT4); precision as a workload knob |
| [Model Training](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 8) | The physics of learning | Iron Law of training, memory breakdown (activations, gradients, optimizer states) |
:::

*See the [Model Zoo](zoo/models.qmd) for vetted workloads spanning language, vision, recommendation, and TinyML models.*

---

## Layer B: Hardware (Silicon) {#sec-layer-b}

A **`HardwareNode`** represents a single physical accelerator — an H100 GPU, an Apple M3 chip, a Jetson Orin, or even an ESP32 microcontroller. It provides the raw physical supply:

- **Compute:** Peak throughput (TFLOP/s) across precisions (FP32, FP16, INT8), modeled via `ComputeCore`.
- **Memory:** HBM or SRAM capacity and transfer bandwidth (TB/s), modeled via `MemoryHierarchy`.
- **Power:** Thermal Design Power (TDP) in watts.

Every piece of silicon has a **ridge point** — the ratio of peak compute to memory bandwidth ($I^* = \text{Peak FLOP/s} / \text{BW}$), measured in FLOPs per byte. When a workload's arithmetic intensity falls below this ridge point, the workload is memory-bound: the compute units starve waiting for data. When it lies above the ridge point, the workload is compute-bound: memory delivers data faster than the cores can consume it.

This is the Roofline model — the diagnostic framework at the heart of Layer E's `SingleNodeModel`.

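The ridge-point test can be sketched in a few lines of plain Python. This is a minimal illustration, not the `SingleNodeModel` implementation: the function names are hypothetical, and the hardware figures are the H100 numbers quoted in the worked example later on this page (990 TFLOP/s at FP16, 3.35 TB/s of HBM3 bandwidth).

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Ridge point I* in FLOPs/byte: peak FLOP/s divided by bytes/s."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth * intensity)


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Label a workload by comparing its intensity with the ridge point."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


PEAK = 990e12  # FLOP/s at FP16
BW = 3.35e12   # bytes/s of HBM bandwidth

if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK, BW):.1f} FLOPs/byte")
    for intensity in (1.0, 100.0, 1000.0):
        bound = attainable_flops(intensity, PEAK, BW)
        label = classify(intensity, PEAK, BW)
        print(f"I = {intensity:7.1f} -> {bound / 1e12:7.1f} TFLOP/s ({label})")
```
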
::: {.callout-note}
## Textbook and slides
Layer B concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | Moving data costs more than computing it | Systolic arrays, tensor cores, Roofline model, accelerator spectrum (GPU to ASIC) |
| [Compute Infrastructure](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 2) | The physics of the ML fleet | HBM generations, bandwidth hierarchy (HBM to NVLink to InfiniBand), TCO analysis |
| [Benchmarking](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 12) | Measuring what matters | MLPerf, roofline profiling, statistical rigor, the lab-to-production gap |
:::

*See the [Silicon Zoo](zoo/hardware.qmd) for 18+ vetted hardware specs from cloud GPUs to microcontrollers.*

---

## Layer C: Infrastructure (Environment) {#sec-layer-c}

Hardware does not run in a vacuum. It runs in datacenters plugged into regional power grids, cooled by air or liquid systems, and constrained by physical energy budgets.

A **`GridProfile`** captures this physical context:

- **Carbon intensity** (g CO₂/kWh) — varies by nearly two orders of magnitude across regions: Quebec's hydroelectric grid produces ~8 g CO₂/kWh; Poland's coal-dominated grid produces ~700 g CO₂/kWh.
- **Power Usage Effectiveness (PUE)** — the ratio of total facility power to IT equipment power. A PUE of 1.1 means 10% overhead for cooling and infrastructure; 1.6 means 60%.
- **Water Usage Effectiveness (WUE)** — liters of water consumed per kWh, determined by cooling technology (air, liquid, evaporative).

A 1000-watt GPU running identical computations in Quebec versus Poland produces vastly different carbon footprints. This layer makes that difference quantifiable.

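To put rough numbers on that comparison, the sketch below multiplies IT energy by PUE and by grid carbon intensity. It is plain Python with a hypothetical helper name, not the `SustainabilityModel` API, and it assumes the GPU draws its full 1000 W for a whole year in a facility with a PUE of 1.1 in both regions.

```python
HOURS_PER_YEAR = 8760


def annual_carbon_kg(
    it_power_w: float,
    pue: float,
    grid_g_co2_per_kwh: float,
    hours: float = HOURS_PER_YEAR,
) -> float:
    """Operational carbon in kg CO2 for a constant IT load.

    Facility energy = IT energy * PUE; carbon = facility energy * grid intensity.
    """
    it_energy_kwh = it_power_w / 1000.0 * hours
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * grid_g_co2_per_kwh / 1000.0


if __name__ == "__main__":
    # Grid intensities are the approximate figures quoted in the list above.
    for region, intensity in (("Quebec", 8.0), ("Poland", 700.0)):
        kg = annual_carbon_kg(1000.0, 1.1, intensity)
        print(f"{region}: {kg:,.0f} kg CO2 per year")
```

Under these assumptions the same workload emits about 77 kg of CO₂ per year in Quebec and about 6,745 kg in Poland.
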
::: {.callout-note}
## Textbook and slides
Layer C concepts are covered in the following lecture deck:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Sustainable AI](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 15) | Energy as a first-class engineering constraint | Lifecycle carbon accounting, PUE modeling, carbon-aware scheduling, the 4 Ms framework |
:::

*See the [Infrastructure Zoo](zoo/infra.qmd) for regional grid profiles and datacenter configurations.*

---

## Layer D: Systems (Topology) {#sec-layer-d}

A single GPU cannot train a 100-billion-parameter model. Layer D composes individual `HardwareNode`s from Layer B into distributed clusters, connected by the network fabrics that determine communication overhead.

Three types define a system:

- **`Node`** — A single compute server grouping accelerators within a physical chassis (e.g., 8x H100 GPUs connected by 900 GB/s NVLink). Includes intra-node bandwidth and NIC count.
- **`NetworkFabric`** — The inter-node interconnect: topology (fat-tree, rail-optimized), bandwidth (e.g., 400 Gbps InfiniBand NDR), latency, and oversubscription ratio.
- **`Fleet`** — A collection of `Node`s connected by a `NetworkFabric`, deployed in a specific `Datacenter` (Layer C). This is the complete system that solvers operate on.

The way you structure this system — how many GPUs per node, what fabric connects them, which parallelism strategy you apply — determines your communication overhead and scaling efficiency.

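One way to see how the fabric drives communication overhead is the standard alpha-beta estimate for a ring all-reduce. The sketch below is a simplified illustration in plain Python, not the `DistributedModel` implementation. It treats the quoted NVLink and InfiniBand figures as the usable per-link bandwidth, which is optimistic, and the 140 GB gradient size (70B parameters at FP16) is a hypothetical input.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    num_workers: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Alpha-beta estimate for a ring all-reduce.

    The ring runs 2 * (N - 1) steps (reduce-scatter, then all-gather),
    each moving a chunk of message_bytes / N over one link.
    """
    if num_workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    steps = 2 * (num_workers - 1)
    chunk_bytes = message_bytes / num_workers
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


GRADIENT_BYTES = 140e9     # hypothetical: 70B gradients at FP16
NVLINK_BW = 900e9          # bytes/s, intra-node figure quoted above
INFINIBAND_BW = 400e9 / 8  # 400 Gbps converted to bytes/s

if __name__ == "__main__":
    for name, bw in (("NVLink", NVLINK_BW), ("InfiniBand NDR", INFINIBAND_BW)):
        seconds = ring_allreduce_seconds(GRADIENT_BYTES, 8, bw)
        print(f"{name}: {seconds:.2f} s per all-reduce across 8 workers")
```
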
::: {.callout-note}
## Textbook and slides
Layer D concepts span four lecture decks covering the full distributed systems stack:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Network Fabrics](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 3) | The synchronization backbone | $\alpha$-$\beta$ model, topology (fat-tree, dragonfly), RDMA, congestion control, bisection bandwidth |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | The physics of scaling | Data/tensor/pipeline parallelism, ZeRO/FSDP, Communication-Computation Ratio (CCR), 3D hybrid strategies |
| [Collective Communication](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 6) | The traffic patterns of distributed ML | AllReduce algorithms (ring, tree), gradient compression, hierarchical communication, computation-communication overlap |
| [Fleet Orchestration](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 8) | Extracting useful work from shared infrastructure | Slurm vs. Kubernetes, topology-aware scheduling, elastic training, multi-tenancy |
:::

*See the [Fleet Zoo](zoo/fleets.qmd) for production cluster topologies from 256-GPU research clusters to 8192-GPU frontier fleets.*

---

## Layer E: Solvers (Analysis) {#sec-layer-e}

The previous four layers are static definitions — nouns. **Solvers** are the analytical engines — verbs — that bridge demand and supply to answer specific questions.

Each solver implements closed-form equations from peer-reviewed systems literature. No simulation, no benchmarking, no hardware required.

| Solver | Bridges | Core Equation | Key Output |
|:-------|:--------|:--------------|:-----------|
| **SingleNodeModel** | A $\to$ B | Roofline / Iron Law (Williams et al., 2009) | Latency, throughput, bottleneck classification |
| **DistributedModel** | A $\to$ D | Ring All-Reduce + Pipeline schedules | Scaling efficiency, communication overhead, bubble fraction |
| **ServingModel** | A $\to$ B | Pre-fill / Decode phase decomposition | TTFT, inter-token latency, KV-cache memory |
| **EconomicsModel** | D $\to$ cost | CapEx + OpEx over time horizon | TCO in USD, cost per query |
| **SustainabilityModel** | D $\to$ C | PUE $\times$ grid carbon intensity | Energy (kWh), carbon (kg CO₂e), water (L) |
| **ReliabilityModel** | D $\to$ uptime | Young-Daly optimal checkpointing | Fleet MTBF, failure probability, checkpoint interval |

Solvers are composable. To answer "What is the most sustainable way to serve Llama-70B?", chain `ServingModel` (feasibility and latency) into `EconomicsModel` (cost) into `SustainabilityModel` (carbon). Each solver's typed output feeds naturally into the next.

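To give a flavor of what "closed-form" means here, the sketch below evaluates the first-order Young-Daly checkpoint interval named in the table. It is plain Python with hypothetical function names, not the `ReliabilityModel` API. It assumes independent node failures, so fleet MTBF is node MTBF divided by node count, and the 50,000-hour node MTBF, 1,024-node fleet, and 60-second checkpoint cost are made-up inputs.

```python
import math


def fleet_mtbf_seconds(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF assuming independent, identically reliable nodes."""
    return node_mtbf_hours * 3600.0 / num_nodes


def young_daly_interval_seconds(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young-Daly optimum: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    mtbf_s = fleet_mtbf_seconds(node_mtbf_hours=50_000, num_nodes=1024)
    interval_s = young_daly_interval_seconds(checkpoint_cost_s=60.0, mtbf_s=mtbf_s)
    print(f"fleet MTBF: {mtbf_s / 3600:.1f} h")
    print(f"checkpoint every {interval_s / 60:.1f} min")
```
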
::: {.callout-note}
## Textbook and slides
Layer E concepts draw from multiple lecture decks that teach the diagnostic and optimization frameworks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | The Roofline model | Arithmetic intensity, ridge points, memory-bound vs. compute-bound diagnosis |
| [Model Serving](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 13) | Inverting every training priority | Latency budgets, queuing theory (Little's Law), continuous batching, training-serving skew |
| [Inference at Scale](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 9) | Where ML systems live or die economically | Serving economics, KV-cache bottleneck, batching strategies, autoscaling |
| [Performance Engineering](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 10) | Match the software to the silicon | Operator fusion, FlashAttention, mixed precision, systematic profiling workflow |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | Scaling efficiency analysis | Communication-Computation Ratio, parallelism strategy selection, scaling laws |
:::

*See the [Solver Guide](solver-guide.qmd) for a decision guide on choosing the right solver, and [Math Foundations](math.qmd) for the complete equations.*

---

## Progressive Lowering in Action

The architecture is best understood through a concrete example. Consider the question: **"Can I serve Llama-3-70B on a cluster of 4 H100s within a $50K/year budget while minimizing carbon?"**

This single question touches all five layers:

```
1. Layer A — Llama-3-70B workload: 70B parameters, GQA with 8 KV heads,
             ~140 GB at FP16 precision

2. Layer B — H100 hardware: 990 TFLOP/s (FP16), 3.35 TB/s HBM3,
             80 GB capacity, 700W TDP

3. Layer C — Infrastructure: choose between Quebec (8 g CO₂/kWh)
             and US Average (385 g CO₂/kWh)

4. Layer D — System: 1 node × 4 H100s, NVLink 900 GB/s intra-node,
             tensor parallelism across 4 GPUs

5. Layer E — Chain three solvers:
             ServingModel → TTFT, ITL, KV-cache feasibility
             EconomicsModel → TCO over 1 year
             SustainabilityModel → carbon footprint by region
```

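A few of the figures in this walkthrough can be sanity-checked with back-of-envelope arithmetic. The sketch below is plain Python, not the MLSYSIM solver chain, and its function names are hypothetical. It assumes the weights are sharded evenly across the four GPUs and that each decoded token reads every weight once, ignoring KV-cache traffic and inter-GPU communication, so the latency it reports is only a lower bound.

```python
PARAMS = 70e9
BYTES_PER_PARAM_FP16 = 2
NUM_GPUS = 4
HBM_CAPACITY_BYTES = 80e9  # per GPU
HBM_BANDWIDTH = 3.35e12    # bytes/s per GPU


def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Total memory needed to hold the model weights."""
    return params * bytes_per_param


def fits_in_memory(total_bytes: float, num_gpus: int, capacity_per_gpu: float) -> bool:
    """True if the weights fit in the aggregate HBM of the node."""
    return total_bytes <= num_gpus * capacity_per_gpu


def decode_latency_floor_s(total_bytes: float, num_gpus: int, bandwidth_per_gpu: float) -> float:
    """Lower bound on per-token latency when decode is bandwidth-limited."""
    return total_bytes / (num_gpus * bandwidth_per_gpu)


if __name__ == "__main__":
    weights = weight_bytes(PARAMS, BYTES_PER_PARAM_FP16)
    headroom = NUM_GPUS * HBM_CAPACITY_BYTES - weights
    floor_s = decode_latency_floor_s(weights, NUM_GPUS, HBM_BANDWIDTH)
    print(f"weights: {weights / 1e9:.0f} GB")
    print(f"fits on {NUM_GPUS} GPUs: {fits_in_memory(weights, NUM_GPUS, HBM_CAPACITY_BYTES)}")
    print(f"headroom for KV cache and activations: {headroom / 1e9:.0f} GB")
    print(f"per-token latency floor: {floor_s * 1e3:.1f} ms")
```
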
Each layer contributes a different piece of the answer. No single layer is sufficient alone. This is why MLSYSIM separates concerns into five composable layers rather than offering a monolithic "predict performance" function.

---

## Slide Deck Quick Reference {#sec-slides-reference}

For instructors and students using the companion [lecture slides](https://mlsysbook.ai/slides/), the tables below map each slide deck to the MLSYSIM layer(s) it covers.

### Volume I: Foundations

| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 5 | NN Computation | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf) |
| 6 | NN Architectures | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf) |
| 8 | Model Training | A (Workloads) + B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf) |
| 10 | Model Compression | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf) |
| 11 | Hardware Acceleration | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf) |
| 12 | Benchmarking | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf) |
| 13 | Model Serving | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf) |

### Volume II: At Scale

| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 2 | Compute Infrastructure | B (Hardware) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf) |
| 3 | Network Fabrics | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf) |
| 5 | Distributed Training | D (Systems) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf) |
| 6 | Collective Communication | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf) |
| 8 | Fleet Orchestration | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_08_fleet_orchestration.pdf) |
| 9 | Inference at Scale | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf) |
| 10 | Performance Engineering | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf) |
| 15 | Sustainable AI | C (Infrastructure) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf) |

---

## Where to Go Next

- **[Getting Started](getting-started.qmd)** — Install MLSYSIM and run your first analysis in 5 minutes.
- **[Solver Guide](solver-guide.qmd)** — Decision guide for choosing the right analytical engine.
- **[Math Foundations](math.qmd)** — The closed-form equations behind every solver.
- **[Tutorials](tutorials/index.qmd)** — Hands-on notebooks for roofline analysis, distributed training, LLM serving, and sustainability.
- **[Zoo Overview](zoo/index.qmd)** — Browse the registries of vetted models, hardware, fleets, and infrastructure.