---
title: "The 5-Layer Architecture"
subtitle: "Progressive Lowering: From Abstract Demand to Concrete Supply"
---
MLSYSIM organizes the full ML systems domain into five composable layers. Abstract workload *demand* (Layer A) is progressively mapped onto concrete hardware *supply* (Layers B--D) through analytical *solvers* (Layer E). Each layer corresponds directly to chapters in the [Machine Learning Systems](https://mlsysbook.ai) textbook and its companion [lecture slides](https://mlsysbook.ai/slides/).
Understanding this stack is the key to mastering both this library and the textbook it accompanies.
## The Stack at a Glance
```{mermaid}
%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
A["<b>Layer A · Workloads</b> — <i>Demand</i><br/>TransformerWorkload · CNNWorkload<br/>Parameters · FLOPs · Arithmetic Intensity"]
B["<b>Layer B · Hardware</b> — <i>Silicon</i><br/>HardwareNode · ComputeCore · MemoryHierarchy<br/>Peak FLOP/s · Bandwidth · Capacity · TDP"]
C["<b>Layer C · Infrastructure</b> — <i>Environment</i><br/>GridProfile · Datacenter<br/>Carbon Intensity · PUE · WUE"]
D["<b>Layer D · Systems</b> — <i>Topology</i><br/>Node · Fleet · NetworkFabric<br/>Topology · Accelerators/Node · Fabric BW"]
E["<b>Layer E · Solvers</b> — <i>Analysis</i><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
F["<b>Results</b><br/>PerformanceProfile · Dict[str, Quantity]"]
A --> E
B --> D
C --> D
D --> E
E --> F
```
The flow reads top-to-bottom: a **Workload** (what you want to compute) and a **System** (the hardware, network, and environment you have) feed into a **Solver** (the analytical engine), which returns quantitative results you can act on.
---
## Layer A: Workloads (Demand) {#sec-layer-a}
A **Workload** is a hardware-agnostic description of computational demand. The question is not "How fast is Llama-3?" but rather "How many FLOPs and memory bytes does Llama-3 *require*?"
MLSYSIM provides two workload types:
- **`TransformerWorkload`** — LLMs and attention-based models (GPT, Llama, BERT). Defined by parameter count, layer count, hidden dimension, attention heads, and sequence length. Supports KV-cache size estimation for autoregressive inference.
- **`CNNWorkload`** — Vision and convolutional models (ResNet, MobileNet, YOLO). Defined by parameter count and total FLOPs per forward pass.
The crucial step happens when a workload is "lowered" to a specific numerical precision (FP32, FP16, INT8, INT4). Precision sets the number of bytes moved per parameter and therefore the **Arithmetic Intensity** (FLOPs/Byte), the ratio that, compared against a hardware target's ridge point, determines whether a model is compute-bound or memory-bound on that target.
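
As a rough illustration of why precision matters, the sketch below computes arithmetic intensity for a single decoded token at batch size 1. It assumes about 2 FLOPs per parameter and that every weight is read from memory exactly once; activation and KV-cache traffic are ignored. The helper name and the byte-width table are hypothetical and are not part of the MLSYSIM API.

```python
# Hypothetical helper, not the MLSYSIM API: shows how precision lowering
# changes arithmetic intensity for a weight-dominated workload.

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def decode_arithmetic_intensity(n_params: float, precision: str) -> float:
    """FLOPs per byte for one decoded token at batch size 1.

    Assumes ~2 FLOPs per parameter (multiply + add) and that each weight
    is read once; activations and KV-cache traffic are ignored.
    """
    flops = 2.0 * n_params
    bytes_moved = n_params * BYTES_PER_PARAM[precision]
    return flops / bytes_moved


# 70e9 parameters, as in the Llama-3-70B example later on this page.
intensities = {
    precision: decode_arithmetic_intensity(70e9, precision)
    for precision in BYTES_PER_PARAM
}

for precision, intensity in intensities.items():
    print(f"{precision}: {intensity:.1f} FLOPs/byte")
```

Under these assumptions the parameter count cancels out, so intensity depends only on bytes per parameter: halving the precision doubles the FLOPs available per byte moved.
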
::: {.callout-note}
## Textbook and slides
Layer A concepts are covered in the following lecture decks:
| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [NN Computation](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 5) | The math behind the model | FLOPs, parameter counting, activation memory, forward/backward pass cost |
| [NN Architectures](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 6) | Architecture as infrastructure | Computational signatures of CNNs, RNNs, Transformers; quadratic attention scaling |
| [Model Training](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 8) | The physics of learning | Iron Law of training, memory breakdown (activations, gradients, optimizer states) |
| [Model Compression](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 10) | From benchmark winner to production model | Pruning, distillation, quantization (FP32 to INT4); precision as a workload knob |
:::
*See the [Model Zoo](zoo/models.qmd) for vetted workloads spanning language, vision, recommendation, and TinyML models.*
---
## Layer B: Hardware (Silicon) {#sec-layer-b}
A **`HardwareNode`** represents a single physical accelerator — an H100 GPU, an Apple M3 chip, a Jetson Orin, or even an ESP32 microcontroller. It provides the raw physical supply:
- **Compute:** Peak throughput (TFLOP/s) across precisions (FP32, FP16, INT8), modeled via `ComputeCore`.
- **Memory:** HBM or SRAM capacity and transfer bandwidth (TB/s), modeled via `MemoryHierarchy`.
- **Power:** Thermal Design Power (TDP) in watts.
Every piece of silicon has a **ridge point**: the ratio of peak compute to memory bandwidth ($I^* = \text{Peak FLOP/s} / \text{BW}$), expressed in FLOPs per byte. When a workload's arithmetic intensity falls below the ridge point, the workload is memory-bound: the compute units starve waiting for data. When it lies above, the workload is compute-bound: memory delivers data faster than the cores can consume it.
This is the Roofline model — the diagnostic framework at the heart of Layer E's `SingleNodeModel`.
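
The ridge-point test can be sketched in a few lines of plain Python. The function names below are hypothetical and do not reflect the `SingleNodeModel` interface; the hardware figures are the H100 values quoted in the worked example on this page (990 TFLOP/s at FP16, 3.35 TB/s HBM3).

```python
# Hypothetical helpers illustrating the Roofline model; not the MLSYSIM API.


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """I* = peak FLOP/s divided by memory bandwidth in bytes/s."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * bandwidth)


def bottleneck(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Classify a workload by comparing its intensity to the ridge point."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


H100_PEAK_FLOPS = 990e12  # FP16 FLOP/s, figure quoted on this page
H100_BANDWIDTH = 3.35e12  # bytes/s of HBM3, figure quoted on this page

i_star = ridge_point(H100_PEAK_FLOPS, H100_BANDWIDTH)
print(f"Ridge point: {i_star:.1f} FLOPs/byte")

for intensity in (1.0, 500.0):
    ceiling = attainable_flops(intensity, H100_PEAK_FLOPS, H100_BANDWIDTH)
    label = bottleneck(intensity, H100_PEAK_FLOPS, H100_BANDWIDTH)
    print(f"I = {intensity:6.1f} -> {ceiling / 1e12:7.2f} TFLOP/s ceiling ({label})")
```

With these figures the ridge point is close to 300 FLOPs/byte, so a workload at 1 FLOP/byte can reach only a small fraction of peak compute regardless of how well its kernels are written.
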
::: {.callout-note}
## Textbook and slides
Layer B concepts are covered in the following lecture decks:
| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | Moving data costs more than computing it | Systolic arrays, tensor cores, Roofline model, accelerator spectrum (GPU to ASIC) |
| [Compute Infrastructure](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 2) | The physics of the ML fleet | HBM generations, bandwidth hierarchy (HBM to NVLink to InfiniBand), TCO analysis |
| [Benchmarking](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 12) | Measuring what matters | MLPerf, roofline profiling, statistical rigor, the lab-to-production gap |
:::
*See the [Silicon Zoo](zoo/hardware.qmd) for 18+ vetted hardware specs from cloud GPUs to microcontrollers.*
---
## Layer C: Infrastructure (Environment) {#sec-layer-c}
Hardware does not run in a vacuum. It runs in datacenters plugged into regional power grids, cooled by air or liquid systems, and constrained by physical energy budgets.
A **`GridProfile`** captures this physical context:
- **Carbon intensity** (g CO₂/kWh) — varies by nearly two orders of magnitude across regions: Quebec's hydroelectric grid produces ~8 g CO₂/kWh, while Poland's coal-dominated grid produces ~700 g CO₂/kWh (roughly 90x higher).
- **Power Usage Effectiveness (PUE)** — the ratio of total facility power to IT equipment power. A PUE of 1.1 means 10% overhead for cooling and infrastructure; 1.6 means 60%.
- **Water Usage Effectiveness (WUE)** — liters of water consumed per kWh, determined by cooling technology (air, liquid, evaporative).
A 1000-watt GPU running identical computations in Quebec versus Poland produces vastly different carbon footprints. This layer makes that difference quantifiable.
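
The Quebec versus Poland comparison can be worked through directly. The sketch below assumes the GPU draws its full 1000 W around the clock for a year and that both sites have a PUE of 1.1; both are simplifying assumptions for illustration, and the function is a hypothetical helper rather than the `SustainabilityModel` interface.

```python
# Hypothetical helper, not the MLSYSIM API: facility energy times grid
# carbon intensity, using the regional figures quoted above.

HOURS_PER_YEAR = 8760.0


def annual_carbon_kg(it_power_w: float, pue: float, grid_g_per_kwh: float) -> float:
    """Annual operational carbon in kg CO2 for a constant IT load."""
    it_energy_kwh = it_power_w / 1000.0 * HOURS_PER_YEAR
    facility_energy_kwh = it_energy_kwh * pue  # PUE adds cooling and overhead
    return facility_energy_kwh * grid_g_per_kwh / 1000.0  # grams -> kilograms


ASSUMED_PUE = 1.1  # assumption: same facility efficiency in both regions

quebec_kg = annual_carbon_kg(1000.0, ASSUMED_PUE, 8.0)
poland_kg = annual_carbon_kg(1000.0, ASSUMED_PUE, 700.0)

print(f"Quebec: {quebec_kg:8.1f} kg CO2/year")
print(f"Poland: {poland_kg:8.1f} kg CO2/year")
print(f"Ratio:  {poland_kg / quebec_kg:8.1f}x")
```

Because power draw and PUE are held equal, the ratio between the two footprints reduces to the ratio of the grid carbon intensities.
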
::: {.callout-note}
## Textbook and slides
Layer C concepts are covered in the following lecture deck:
| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Sustainable AI](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 15) | Energy as a first-class engineering constraint | Lifecycle carbon accounting, PUE modeling, carbon-aware scheduling, the 4 Ms framework |
:::
*See the [Infrastructure Zoo](zoo/infra.qmd) for regional grid profiles and datacenter configurations.*
---
## Layer D: Systems (Topology) {#sec-layer-d}
A single GPU cannot train a 100-billion-parameter model. Layer D composes individual `HardwareNode`s from Layer B into distributed clusters, connected by the network fabrics that determine communication overhead.
Three types define a system:
- **`Node`** — A single compute server grouping accelerators within a physical chassis (e.g., 8x H100 GPUs connected by 900 GB/s NVLink). Includes intra-node bandwidth and NIC count.
- **`NetworkFabric`** — The inter-node interconnect: topology (fat-tree, rail-optimized), bandwidth (e.g., 400 Gbps InfiniBand NDR), latency, and oversubscription ratio.
- **`Fleet`** — A collection of `Node`s connected by a `NetworkFabric`, deployed in a specific `Datacenter` (Layer C). This is the complete system that solvers operate on.
The way you structure this system — how many GPUs per node, what fabric connects them, which parallelism strategy you apply — determines your communication overhead and scaling efficiency.
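
To make the effect of the fabric concrete, the sketch below estimates ring all-reduce time under the $\alpha$-$\beta$ model: $2(N-1)$ steps, each paying a fixed latency $\alpha$ and moving $1/N$ of the message. The function name, the 1 GB message size, and the 5 µs per-step latency are assumptions chosen for illustration. Treating the quoted 900 GB/s NVLink and 400 Gbps InfiniBand figures as the per-step link bandwidth is also a simplification, and none of this reflects the `DistributedModel` interface.

```python
# Hypothetical helper, not the MLSYSIM API: alpha-beta cost of a ring
# all-reduce across n_ranks participants.


def ring_allreduce_seconds(
    message_bytes: float,
    n_ranks: int,
    bandwidth_bytes_per_s: float,
    alpha_s: float,
) -> float:
    """Time = 2(N-1) * alpha + 2(N-1)/N * message_bytes / bandwidth."""
    if n_ranks < 2:
        return 0.0  # nothing to reduce across a single rank
    steps = 2 * (n_ranks - 1)
    latency_term = steps * alpha_s
    bandwidth_term = steps / n_ranks * message_bytes / bandwidth_bytes_per_s
    return latency_term + bandwidth_term


MESSAGE_BYTES = 1e9  # assumption: 1 GB of gradients
ALPHA_S = 5e-6  # assumption: 5 microseconds per step
NVLINK_BYTES_PER_S = 900e9  # 900 GB/s, figure quoted above
INFINIBAND_BYTES_PER_S = 400e9 / 8  # 400 Gbps converted to bytes/s

intra_node_s = ring_allreduce_seconds(MESSAGE_BYTES, 8, NVLINK_BYTES_PER_S, ALPHA_S)
inter_node_s = ring_allreduce_seconds(MESSAGE_BYTES, 8, INFINIBAND_BYTES_PER_S, ALPHA_S)

print(f"8 ranks over NVLink:     {intra_node_s * 1e3:.2f} ms")
print(f"8 ranks over InfiniBand: {inter_node_s * 1e3:.2f} ms")
```

For large messages the bandwidth term dominates, which is why the same collective is far slower across nodes than within one.
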
::: {.callout-note}
## Textbook and slides
Layer D concepts span four lecture decks covering the full distributed systems stack:
| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Network Fabrics](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 3) | The synchronization backbone | $\alpha$-$\beta$ model, topology (fat-tree, dragonfly), RDMA, congestion control, bisection bandwidth |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | The physics of scaling | Data/tensor/pipeline parallelism, ZeRO/FSDP, Communication-Computation Ratio (CCR), 3D hybrid strategies |
| [Collective Communication](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 6) | The traffic patterns of distributed ML | AllReduce algorithms (ring, tree), gradient compression, hierarchical communication, computation-communication overlap |
| [Fleet Orchestration](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 8) | Extracting useful work from shared infrastructure | Slurm vs. Kubernetes, topology-aware scheduling, elastic training, multi-tenancy |
:::
*See the [Fleet Zoo](zoo/fleets.qmd) for production cluster topologies from 256-GPU research clusters to 8192-GPU frontier fleets.*
---
## Layer E: Solvers (Analysis) {#sec-layer-e}
The previous four layers are static definitions — nouns. **Solvers** are the analytical engines — verbs — that bridge demand and supply to answer specific questions.
Each solver implements closed-form equations from peer-reviewed systems literature. No simulation, no benchmarking, no hardware required.
| Solver | Bridges | Core Equation | Key Output |
|:-------|:--------|:--------------|:-----------|
| **SingleNodeModel** | A $\to$ B | Roofline / Iron Law (Williams et al., 2009) | Latency, throughput, bottleneck classification |
| **DistributedModel** | A $\to$ D | Ring All-Reduce + Pipeline schedules | Scaling efficiency, communication overhead, bubble fraction |
| **ServingModel** | A $\to$ B | Pre-fill / Decode phase decomposition | TTFT, inter-token latency, KV-cache memory |
| **EconomicsModel** | D $\to$ cost | CapEx + OpEx over time horizon | TCO in USD, cost per query |
| **SustainabilityModel** | D $\to$ C | PUE $\times$ grid carbon intensity | Energy (kWh), carbon (kg CO₂e), water (L) |
| **ReliabilityModel** | D $\to$ uptime | Young-Daly optimal checkpointing | Fleet MTBF, failure probability, checkpoint interval |
Solvers are composable. To answer "What is the most sustainable way to serve Llama-70B?", chain `ServingModel` (feasibility and latency) into `EconomicsModel` (cost) into `SustainabilityModel` (carbon). Each solver's typed output feeds naturally into the next.
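
As an example of the kind of closed-form equation a solver wraps, the sketch below applies Young's first-order approximation for the optimal checkpoint interval, $\tau \approx \sqrt{2\,\delta\,M}$, where $\delta$ is the time to write one checkpoint and $M$ is the fleet MTBF. Daly's refinement adds higher-order terms that are omitted here. The function names, the 50,000-hour node MTBF, the 1024-node fleet, and the 5-minute checkpoint cost are all illustrative assumptions, not values from the `ReliabilityModel`.

```python
# Hypothetical helpers, not the MLSYSIM API: fleet MTBF and Young's
# first-order checkpoint interval.
import math


def fleet_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """Fleet MTBF assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / n_nodes


def young_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: tau = sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


NODE_MTBF_HOURS = 50_000.0  # assumption for illustration
N_NODES = 1024  # assumption for illustration
CHECKPOINT_COST_S = 300.0  # assumption: 5 minutes to write a checkpoint

mtbf_hours = fleet_mtbf_hours(NODE_MTBF_HOURS, N_NODES)
interval_s = young_checkpoint_interval_s(CHECKPOINT_COST_S, mtbf_hours * 3600.0)

print(f"Fleet MTBF: {mtbf_hours:.1f} hours")
print(f"Checkpoint roughly every {interval_s / 3600.0:.2f} hours")
```

The fleet MTBF shrinks linearly with node count, so the checkpoint interval shrinks with the square root of the fleet size.
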
::: {.callout-note}
## Textbook and slides
Layer E concepts draw from multiple lecture decks that teach the diagnostic and optimization frameworks:
| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | The Roofline model | Arithmetic intensity, ridge points, memory-bound vs. compute-bound diagnosis |
| [Model Serving](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 13) | Inverting every training priority | Latency budgets, queuing theory (Little's Law), continuous batching, training-serving skew |
| [Inference at Scale](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 9) | Where ML systems live or die economically | Serving economics, KV-cache bottleneck, batching strategies, autoscaling |
| [Performance Engineering](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 10) | Match the software to the silicon | Operator fusion, FlashAttention, mixed precision, systematic profiling workflow |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | Scaling efficiency analysis | Communication-Computation Ratio, parallelism strategy selection, scaling laws |
:::
*See the [Solver Guide](solver-guide.qmd) for a decision guide on choosing the right solver, and [Math Foundations](math.qmd) for the complete equations.*
---
## Progressive Lowering in Action
The architecture is best understood through a concrete example. Consider the question: **"Can I serve Llama-3-70B on a cluster of 4 H100s within a $50K/year budget while minimizing carbon?"**
This single question touches all five layers:
```
1. Layer A — Llama-3-70B workload: 70B parameters, GQA with 8 KV heads,
   ~140 GB at FP16 precision
2. Layer B — H100 hardware: 990 TFLOP/s (FP16), 3.35 TB/s HBM3,
   80 GB capacity, 700 W TDP
3. Layer C — Infrastructure: choose between Quebec (8 g CO₂/kWh)
   and US Average (385 g CO₂/kWh)
4. Layer D — System: 1 node × 4 H100s, NVLink 900 GB/s intra-node,
   tensor parallelism across 4 GPUs
5. Layer E — Chain three solvers:
   ServingModel        → TTFT, ITL, KV-cache feasibility
   EconomicsModel      → TCO over 1 year
   SustainabilityModel → carbon footprint by region
```
```
Each layer contributes a different piece of the answer. No single layer is sufficient alone. This is why MLSYSIM separates concerns into five composable layers rather than offering a monolithic "predict performance" function.
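
The numbers in this example can be checked with a back-of-the-envelope script before reaching for any solver. The sketch below is plain Python, not the MLSYSIM API. It assumes 80 transformer layers and a head dimension of 128 for Llama-3-70B (architecture details not stated on this page), perfect tensor-parallel sharding, constant power draw at TDP, and no PUE overhead. It omits the cost check because the page gives no hardware or energy prices.

```python
# Back-of-the-envelope walk through Layers A-E; all names are hypothetical.

# Layer A: workload
PARAMS = 70e9
BYTES_PER_VALUE = 2  # FP16
N_LAYERS = 80  # assumption: Llama-3-70B layer count
N_KV_HEADS = 8  # GQA, from the example above
HEAD_DIM = 128  # assumption: Llama-3-70B head dimension

# Layer B and D: four H100s in one node
N_GPUS = 4
GPU_MEMORY_BYTES = 80e9
GPU_BANDWIDTH = 3.35e12  # bytes/s
GPU_TDP_W = 700.0

# Layer C: grid carbon intensity in g CO2/kWh
GRIDS = {"Quebec": 8.0, "US Average": 385.0}

# Feasibility: do the weights fit, and how much room is left for KV-cache?
weights_bytes = PARAMS * BYTES_PER_VALUE
total_memory_bytes = N_GPUS * GPU_MEMORY_BYTES
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
kv_token_budget = (total_memory_bytes - weights_bytes) / kv_bytes_per_token

# Decode is memory-bound: each token reads every weight once, split across GPUs.
itl_lower_bound_s = weights_bytes / (N_GPUS * GPU_BANDWIDTH)

# Sustainability: one year at TDP, IT energy only.
energy_kwh = N_GPUS * GPU_TDP_W / 1000.0 * 8760.0
carbon_kg = {region: energy_kwh * g / 1000.0 for region, g in GRIDS.items()}

print(f"Weights: {weights_bytes / 1e9:.0f} GB of {total_memory_bytes / 1e9:.0f} GB")
print(f"KV-cache: {kv_bytes_per_token} bytes/token, ~{kv_token_budget:,.0f} tokens fit")
print(f"Inter-token latency lower bound: {itl_lower_bound_s * 1e3:.2f} ms")
for region, kg in carbon_kg.items():
    print(f"{region}: {kg:,.0f} kg CO2/year")
```

Each quantity draws on a different layer, which mirrors the point above: the feasibility check needs Layers A, B, and D, while the carbon comparison needs Layer C as well.
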
---
## Slide Deck Quick Reference {#sec-slides-reference}
For instructors and students using the companion [lecture slides](https://mlsysbook.ai/slides/), the table below maps every MLSYSIM layer to the relevant slide decks.
### Volume I: Foundations
| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 5 | NN Computation | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf) |
| 6 | NN Architectures | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf) |
| 8 | Model Training | A + B + E | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf) |
| 10 | Model Compression | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf) |
| 11 | Hardware Acceleration | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf) |
| 12 | Benchmarking | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf) |
| 13 | Model Serving | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf) |
### Volume II: At Scale
| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 2 | Compute Infrastructure | B (Hardware) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf) |
| 3 | Network Fabrics | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf) |
| 5 | Distributed Training | D (Systems) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf) |
| 6 | Collective Communication | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf) |
| 8 | Fleet Orchestration | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_08_fleet_orchestration.pdf) |
| 9 | Inference at Scale | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf) |
| 10 | Performance Engineering | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf) |
| 15 | Sustainable AI | C (Infrastructure) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf) |
---
## Where to Go Next
- **[Getting Started](getting-started.qmd)** — Install MLSYSIM and run your first analysis in 5 minutes.
- **[Solver Guide](solver-guide.qmd)** — Decision guide for choosing the right analytical engine.
- **[Math Foundations](math.qmd)** — The closed-form equations behind every solver.
- **[Tutorials](tutorials/index.qmd)** — Hands-on notebooks for roofline analysis, distributed training, LLM serving, and sustainability.
- **[Zoo Overview](zoo/index.qmd)** — Browse the registries of vetted models, hardware, fleets, and infrastructure.