---
title: "The 5-Layer Architecture"
subtitle: "Progressive Lowering: From Abstract Demand to Concrete Supply"
---

MLSYSIM organizes the full ML systems domain into five composable layers. Abstract workload *demand* (Layer A) is progressively mapped onto concrete hardware *supply* (Layers B--D) through analytical *solvers* (Layer E). Each layer corresponds directly to chapters in the [Machine Learning Systems](https://mlsysbook.ai) textbook and its companion [lecture slides](https://mlsysbook.ai/slides/).

Understanding this stack is the key to mastering both this library and the textbook it accompanies.

## The Stack at a Glance

```{mermaid}
%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A · Workloads</b> — <i>Demand</i><br/>TransformerWorkload · CNNWorkload<br/>Parameters · FLOPs · Arithmetic Intensity"]
    B["<b>Layer B · Hardware</b> — <i>Silicon</i><br/>HardwareNode · ComputeCore · MemoryHierarchy<br/>Peak FLOP/s · Bandwidth · Capacity · TDP"]
    C["<b>Layer C · Infrastructure</b> — <i>Environment</i><br/>GridProfile · Datacenter<br/>Carbon Intensity · PUE · WUE"]
    D["<b>Layer D · Systems</b> — <i>Topology</i><br/>Node · Fleet · NetworkFabric<br/>Topology · Accelerators/Node · Fabric BW"]
    E["<b>Layer E · Solvers</b> — <i>Analysis</i><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
    F["<b>Results</b><br/>PerformanceProfile · Dict[str, Quantity]"]

    A --> E
    B --> D
    C --> D
    D --> E
    E --> F
```

The flow reads top-to-bottom: a **Workload** (what you want to compute) and a **System** (the hardware, network, and environment you have) feed into a **Solver** (the analytical engine), which returns quantitative results you can act on.

---

## Layer A: Workloads (Demand) {#sec-layer-a}

A **Workload** is a hardware-agnostic description of computational demand. The question is not "How fast is Llama-3?" but rather "How many FLOPs and memory bytes does Llama-3 *require*?"

MLSYSIM provides two workload types:

- **`TransformerWorkload`** — LLMs and attention-based models (GPT, Llama, BERT). Defined by parameter count, layer count, hidden dimension, attention heads, and sequence length. Supports KV-cache size estimation for autoregressive inference.
- **`CNNWorkload`** — Vision and convolutional models (ResNet, MobileNet, YOLO). Defined by parameter count and total FLOPs per forward pass.

A workload becomes concrete when it is "lowered" to a specific numerical precision (FP32, FP16, INT8, INT4). Precision fixes the number of bytes per parameter, and therefore the bytes that must move through memory for a given amount of compute. That sets the **Arithmetic Intensity** (FLOPs/Byte), the ratio that decides whether a model will be compute-bound or memory-bound on a given hardware target.
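
A minimal sketch of how precision changes both the weight footprint and the arithmetic intensity. It is plain Python, not the MLSYSIM API: `BYTES_PER_PARAM`, `weight_bytes`, and `decode_arithmetic_intensity` are hypothetical names, and the intensity estimate assumes a single decode step that costs about 2 FLOPs per parameter per token and reads every weight from memory once.

```python
# Bytes needed to store one parameter at each precision named above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_bytes(num_params: float, precision: str) -> float:
    """Memory footprint of the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


def decode_arithmetic_intensity(precision: str, batch_size: int = 1) -> float:
    """FLOPs per byte for one decode step, counting weight traffic only.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, and
    each weight is read once per step regardless of batch size.
    """
    flops_per_param = 2.0 * batch_size
    return flops_per_param / BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    for precision in BYTES_PER_PARAM:
        gigabytes = weight_bytes(params, precision) / 1e9
        intensity = decode_arithmetic_intensity(precision)
        print(f"{precision}: {gigabytes:.0f} GB weights, {intensity:.1f} FLOPs/byte")
```

Under these assumptions, halving the bytes per parameter doubles the arithmetic intensity, which is why quantization can shift where a workload sits relative to a given hardware target.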

::: {.callout-note}
## Textbook and slides
Layer A concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [NN Computation](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 5) | The math behind the model | FLOPs, parameter counting, activation memory, forward/backward pass cost |
| [NN Architectures](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 6) | Architecture as infrastructure | Computational signatures of CNNs, RNNs, Transformers; quadratic attention scaling |
| [Model Training](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 8) | The physics of learning | Iron Law of training, memory breakdown (activations, gradients, optimizer states) |
| [Model Compression](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 10) | From benchmark winner to production model | Pruning, distillation, quantization (FP32 to INT4); precision as a workload knob |
:::

*See the [Model Zoo](zoo/models.qmd) for vetted workloads spanning language, vision, recommendation, and TinyML models.*

---

## Layer B: Hardware (Silicon) {#sec-layer-b}

A **`HardwareNode`** represents a single physical accelerator — an H100 GPU, an Apple M3 chip, a Jetson Orin, or even an ESP32 microcontroller. It provides the raw physical supply:

- **Compute:** Peak throughput (TFLOP/s) across precisions (FP32, FP16, INT8), modeled via `ComputeCore`.
- **Memory:** HBM or SRAM capacity and transfer bandwidth (TB/s), modeled via `MemoryHierarchy`.
- **Power:** Thermal Design Power (TDP) in watts.

Every piece of silicon has a **ridge point**: the ratio of peak compute throughput to memory bandwidth ($I^* = \text{Peak FLOP/s} / \text{BW}$), measured in FLOPs per byte. When a workload's arithmetic intensity falls below this ridge point, the workload is memory-bound: the compute units starve waiting for data. When it rises above the ridge point, the workload is compute-bound: memory delivers data faster than the cores can consume it.

This is the Roofline model — the diagnostic framework at the heart of Layer E's `SingleNodeModel`.
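
The ridge-point test can be sketched in a few lines. This is an illustrative stand-alone version, not the `SingleNodeModel` implementation: the function names are hypothetical, and the H100 figures (990 TFLOP/s FP16, 3.35 TB/s) are the ones quoted in the worked example later on this page.

```python
def ridge_point(peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_flops_per_s / mem_bw_bytes_per_s


def attainable_flops_per_s(intensity: float, peak: float, bw: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak, intensity * bw)


def classify(intensity: float, peak: float, bw: float) -> str:
    """Label a workload by which side of the ridge point it falls on."""
    return "memory-bound" if intensity < ridge_point(peak, bw) else "compute-bound"


if __name__ == "__main__":
    H100_PEAK = 990e12  # FLOP/s at FP16
    H100_BW = 3.35e12   # bytes/s of HBM3 bandwidth
    print(f"ridge point: {ridge_point(H100_PEAK, H100_BW):.1f} FLOPs/byte")
    for intensity in (1.0, 100.0, 500.0):
        bound = classify(intensity, H100_PEAK, H100_BW)
        tflops = attainable_flops_per_s(intensity, H100_PEAK, H100_BW) / 1e12
        print(f"I={intensity:6.1f}: {bound}, at most {tflops:.1f} TFLOP/s")
```

With these figures the ridge point is about 296 FLOPs/byte, so a workload at 1 FLOP/byte can reach at most 3.35 TFLOP/s, a small fraction of the compute roof.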

::: {.callout-note}
## Textbook and slides
Layer B concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | Moving data costs more than computing it | Systolic arrays, tensor cores, Roofline model, accelerator spectrum (GPU to ASIC) |
| [Compute Infrastructure](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 2) | The physics of the ML fleet | HBM generations, bandwidth hierarchy (HBM to NVLink to InfiniBand), TCO analysis |
| [Benchmarking](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 12) | Measuring what matters | MLPerf, roofline profiling, statistical rigor, the lab-to-production gap |
:::

*See the [Silicon Zoo](zoo/hardware.qmd) for 18+ vetted hardware specs from cloud GPUs to microcontrollers.*

---

## Layer C: Infrastructure (Environment) {#sec-layer-c}

Hardware does not run in a vacuum. It runs in datacenters plugged into regional power grids, cooled by air or liquid systems, and constrained by physical energy budgets.

A **`GridProfile`** captures this physical context:

- **Carbon intensity** (g CO₂/kWh) varies by nearly 90x between the two grids cited here: Quebec's hydroelectric grid produces ~8 g CO₂/kWh; Poland's coal-dominated grid produces ~700 g CO₂/kWh.
- **Power Usage Effectiveness (PUE)** — the ratio of total facility power to IT equipment power. A PUE of 1.1 means 10% overhead for cooling and infrastructure; 1.6 means 60%.
- **Water Usage Effectiveness (WUE)** — liters of water consumed per kWh, determined by cooling technology (air, liquid, evaporative).

A 1000-watt GPU running identical computations in Quebec versus Poland produces vastly different carbon footprints. This layer makes that difference quantifiable.
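
The arithmetic behind that comparison is short enough to write out. The sketch below is not the `SustainabilityModel` code; `operational_carbon_g` is a hypothetical helper, and the 24-hour duration and the shared PUE of 1.1 are assumptions chosen only to illustrate the calculation.

```python
def operational_carbon_g(
    it_power_w: float, hours: float, pue: float, grid_g_per_kwh: float
) -> float:
    """Operational carbon in grams of CO2 for a constant IT load."""
    it_energy_kwh = it_power_w / 1000.0 * hours
    facility_energy_kwh = it_energy_kwh * pue  # adds cooling and overhead
    return facility_energy_kwh * grid_g_per_kwh


if __name__ == "__main__":
    # Carbon intensities quoted above; duration and PUE are assumptions.
    grids = {"Quebec": 8.0, "Poland": 700.0}
    for region, intensity in grids.items():
        grams = operational_carbon_g(1000.0, 24.0, 1.1, intensity)
        print(f"{region}: {grams / 1000.0:.2f} kg CO2 per day")
```

Because PUE and energy are held equal here, the ratio between the two results is simply the ratio of the grid intensities.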

::: {.callout-note}
## Textbook and slides
Layer C concepts are covered in the following lecture deck:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Sustainable AI](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 15) | Energy as a first-class engineering constraint | Lifecycle carbon accounting, PUE modeling, carbon-aware scheduling, the 4 Ms framework |
:::

*See the [Infrastructure Zoo](zoo/infra.qmd) for regional grid profiles and datacenter configurations.*

---

## Layer D: Systems (Topology) {#sec-layer-d}

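A single GPU cannot train a 100-billion-parameter model: the FP16 weights alone (~200 GB) exceed the 80 GB of an H100, before counting gradients, optimizer states, and activations. Layer D composes individual `HardwareNode`s from Layer B into distributed clusters, connected by the network fabrics that determine communication overhead.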

Three types define a system:

- **`Node`** — A single compute server grouping accelerators within a physical chassis (e.g., 8x H100 GPUs connected by 900 GB/s NVLink). Includes intra-node bandwidth and NIC count.
- **`NetworkFabric`** — The inter-node interconnect: topology (fat-tree, rail-optimized), bandwidth (e.g., 400 Gbps InfiniBand NDR), latency, and oversubscription ratio.
- **`Fleet`** — A collection of `Node`s connected by a `NetworkFabric`, deployed in a specific `Datacenter` (Layer C). This is the complete system that solvers operate on.

The way you structure this system — how many GPUs per node, what fabric connects them, which parallelism strategy you apply — determines your communication overhead and scaling efficiency.
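
To see why the fabric matters, consider a rough cost estimate for a ring all-reduce under the $\alpha$-$\beta$ model. This is a sketch, not the `DistributedModel` implementation: `ring_allreduce_seconds` is a hypothetical helper, the 1 GB message size is an arbitrary illustration, and treating the quoted 900 GB/s NVLink and 400 Gbps InfiniBand figures as the usable per-rank bandwidth is a simplifying assumption.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    num_ranks: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Alpha-beta estimate for a ring all-reduce.

    The ring runs 2 * (N - 1) steps (reduce-scatter, then all-gather),
    and each step moves one chunk of size message_bytes / N.
    """
    if num_ranks < 2:
        return 0.0  # nothing to communicate with a single rank
    steps = 2 * (num_ranks - 1)
    chunk_bytes = message_bytes / num_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    message = 1e9          # 1 GB of gradients (illustrative)
    nvlink = 900e9         # bytes/s, intra-node figure quoted above
    infiniband = 400e9 / 8  # 400 Gbps converted to bytes/s
    for name, bw in (("NVLink", nvlink), ("InfiniBand NDR", infiniband)):
        ms = ring_allreduce_seconds(message, 8, bw) * 1e3
        print(f"{name}: {ms:.2f} ms for 8 ranks")
```

Under these assumptions the same collective takes 18 times longer over the inter-node fabric than within a node, which is the kind of gap that drives parallelism strategy choices.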

::: {.callout-note}
## Textbook and slides
Layer D concepts span four lecture decks covering the full distributed systems stack:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Network Fabrics](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 3) | The synchronization backbone | $\alpha$-$\beta$ model, topology (fat-tree, dragonfly), RDMA, congestion control, bisection bandwidth |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | The physics of scaling | Data/tensor/pipeline parallelism, ZeRO/FSDP, Communication-Computation Ratio (CCR), 3D hybrid strategies |
| [Collective Communication](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 6) | The traffic patterns of distributed ML | AllReduce algorithms (ring, tree), gradient compression, hierarchical communication, computation-communication overlap |
| [Fleet Orchestration](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 8) | Extracting useful work from shared infrastructure | Slurm vs. Kubernetes, topology-aware scheduling, elastic training, multi-tenancy |
:::

*See the [Fleet Zoo](zoo/fleets.qmd) for production cluster topologies from 256-GPU research clusters to 8192-GPU frontier fleets.*

---

## Layer E: Solvers (Analysis) {#sec-layer-e}

The previous four layers are static definitions — nouns. **Solvers** are the analytical engines — verbs — that bridge demand and supply to answer specific questions.

Each solver implements closed-form equations from peer-reviewed systems literature. No simulation, no benchmarking, no hardware required.

| Solver | Bridges | Core Equation | Key Output |
|:-------|:--------|:--------------|:-----------|
| **SingleNodeModel** | A $\to$ B | Roofline / Iron Law (Williams et al., 2009) | Latency, throughput, bottleneck classification |
| **DistributedModel** | A $\to$ D | Ring All-Reduce + Pipeline schedules | Scaling efficiency, communication overhead, bubble fraction |
| **ServingModel** | A $\to$ B | Pre-fill / Decode phase decomposition | TTFT, inter-token latency, KV-cache memory |
| **EconomicsModel** | D $\to$ cost | CapEx + OpEx over time horizon | TCO in USD, cost per query |
| **SustainabilityModel** | D $\to$ C | PUE $\times$ grid carbon intensity | Energy (kWh), carbon (kg CO₂e), water (L) |
| **ReliabilityModel** | D $\to$ uptime | Young-Daly optimal checkpointing | Fleet MTBF, failure probability, checkpoint interval |

Solvers are composable. To answer "What is the most sustainable way to serve Llama-70B?", chain `ServingModel` (feasibility and latency) into `EconomicsModel` (cost) into `SustainabilityModel` (carbon). Each solver's typed output feeds naturally into the next.
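
As one example of the closed-form style in the table above, the Young-Daly rule used for checkpointing can be sketched as follows. This is not the `ReliabilityModel` code: the function names are hypothetical, the fleet MTBF formula assumes independent node failures at a constant rate, and the numeric inputs are made-up illustrative values rather than measured data.

```python
import math


def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF, assuming independent, constant-rate node failures."""
    return node_mtbf_hours / num_nodes


def young_daly_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Illustrative inputs only: 50,000 h per-node MTBF, 1,000 nodes,
    # and 36 s (0.01 h) to write one checkpoint.
    mtbf = fleet_mtbf_hours(50_000.0, 1_000)
    interval = young_daly_interval_hours(0.01, mtbf)
    print(f"fleet MTBF: {mtbf:.0f} h, checkpoint every {interval:.2f} h")
```

The output of one function feeds the next, which mirrors on a small scale how solver outputs are chained.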

::: {.callout-note}
## Textbook and slides
Layer E concepts draw from multiple lecture decks that teach the diagnostic and optimization frameworks:

| Deck | Topic | Key Concepts |
|:-----|:------|:-------------|
| [Hardware Acceleration](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 11) | The Roofline model | Arithmetic intensity, ridge points, memory-bound vs. compute-bound diagnosis |
| [Model Serving](https://mlsysbook.ai/slides/vol1.html) (Vol I, Ch 13) | Inverting every training priority | Latency budgets, queuing theory (Little's Law), continuous batching, training-serving skew |
| [Inference at Scale](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 9) | Where ML systems live or die economically | Serving economics, KV-cache bottleneck, batching strategies, autoscaling |
| [Performance Engineering](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 10) | Match the software to the silicon | Operator fusion, FlashAttention, mixed precision, systematic profiling workflow |
| [Distributed Training](https://mlsysbook.ai/slides/vol2.html) (Vol II, Ch 5) | Scaling efficiency analysis | Communication-Computation Ratio, parallelism strategy selection, scaling laws |
:::

*See the [Solver Guide](solver-guide.qmd) for a decision guide on choosing the right solver, and [Math Foundations](math.qmd) for the complete equations.*

---

## Progressive Lowering in Action

The architecture is best understood through a concrete example. Consider the question: **"Can I serve Llama-3-70B on a cluster of 4 H100s within a $50K/year budget while minimizing carbon?"**

This single question touches all five layers:

```
1. Layer A — Llama-3-70B workload: 70B parameters, GQA with 8 KV heads,
             ~140 GB at FP16 precision

2. Layer B — H100 hardware: 990 TFLOP/s (FP16), 3.35 TB/s HBM3,
             80 GB capacity, 700W TDP

3. Layer C — Infrastructure: choose between Quebec (8 g CO₂/kWh)
             and US Average (385 g CO₂/kWh)

4. Layer D — System: 1 node × 4 H100s, NVLink 900 GB/s intra-node,
             tensor parallelism across 4 GPUs

5. Layer E — Chain three solvers:
             ServingModel  → TTFT, ITL, KV-cache feasibility
             EconomicsModel → TCO over 1 year
             SustainabilityModel → carbon footprint by region
```

Each layer contributes a different piece of the answer. No single layer is sufficient alone. This is why MLSYSIM separates concerns into five composable layers rather than offering a monolithic "predict performance" function.
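
A back-of-the-envelope version of this walkthrough, in plain Python, shows how the layers combine. It is not the MLSYSIM API and all function names are hypothetical. The layer count (80) and head dimension (128) are taken from the published Llama-3-70B configuration, not from this page, and the estimates assume weights sharded evenly across GPUs, GPUs drawing full TDP all year, and no PUE or host overhead. Cost is left out because this page quotes no prices.

```python
# Figures quoted in the walkthrough above.
WEIGHT_BYTES_FP16 = 140e9
NUM_GPUS = 4
HBM_BYTES_PER_GPU = 80e9
HBM_BW_PER_GPU = 3.35e12  # bytes/s
TDP_W_PER_GPU = 700.0
NUM_KV_HEADS = 8
GRID_G_PER_KWH = {"Quebec": 8.0, "US Average": 385.0}

# Assumptions not stated on this page.
NUM_LAYERS = 80
HEAD_DIM = 128
BYTES_FP16 = 2
HOURS_PER_YEAR = 8760


def kv_bytes_per_token() -> int:
    """Keys plus values, for every layer and KV head, at FP16."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16


def max_cached_tokens() -> int:
    """Tokens of KV cache that fit in the memory left after weights."""
    free_bytes = NUM_GPUS * HBM_BYTES_PER_GPU - WEIGHT_BYTES_FP16
    return int(free_bytes // kv_bytes_per_token())


def decode_floor_s() -> float:
    """Lower bound on per-token latency: read all weights once."""
    return WEIGHT_BYTES_FP16 / (NUM_GPUS * HBM_BW_PER_GPU)


def annual_energy_kwh() -> float:
    """IT energy for one year at full TDP."""
    return NUM_GPUS * TDP_W_PER_GPU * HOURS_PER_YEAR / 1000.0


if __name__ == "__main__":
    print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    print(f"KV cache capacity:  {max_cached_tokens():,} tokens")
    print(f"decode floor:       {decode_floor_s() * 1e3:.2f} ms/token")
    for region, intensity in GRID_G_PER_KWH.items():
        kg = annual_energy_kwh() * intensity / 1000.0
        print(f"{region}: {kg:,.0f} kg CO2/year")
```

Even this crude version needs inputs from the workload, the silicon, the system, and the grid before it can say anything about feasibility or carbon.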

---

## Slide Deck Quick Reference {#sec-slides-reference}

For instructors and students using the companion [lecture slides](https://mlsysbook.ai/slides/), the tables below map each slide deck to the MLSYSIM layer(s) it covers.

### Volume I: Foundations

| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 5 | NN Computation | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf) |
| 6 | NN Architectures | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf) |
| 8 | Model Training | A + B + E | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf) |
| 10 | Model Compression | A (Workloads) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf) |
| 11 | Hardware Acceleration | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf) |
| 12 | Benchmarking | B (Hardware) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf) |
| 13 | Model Serving | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf) |

### Volume II: At Scale

| Ch | Deck | MLSYSIM Layer(s) | Download |
|:---|:-----|:------------------|:---------|
| 2 | Compute Infrastructure | B (Hardware) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf) |
| 3 | Network Fabrics | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf) |
| 5 | Distributed Training | D (Systems) + E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf) |
| 6 | Collective Communication | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf) |
| 8 | Fleet Orchestration | D (Systems) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_08_fleet_orchestration.pdf) |
| 9 | Inference at Scale | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf) |
| 10 | Performance Engineering | E (Solvers) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf) |
| 15 | Sustainable AI | C (Infrastructure) | [PDF](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf) |

---

## Where to Go Next

- **[Getting Started](getting-started.qmd)** — Install MLSYSIM and run your first analysis in 5 minutes.
- **[Solver Guide](solver-guide.qmd)** — Decision guide for choosing the right analytical engine.
- **[Math Foundations](math.qmd)** — The closed-form equations behind every solver.
- **[Tutorials](tutorials/index.qmd)** — Hands-on notebooks for roofline analysis, distributed training, LLM serving, and sustainability.
- **[Zoo Overview](zoo/index.qmd)** — Browse the registries of vetted models, hardware, fleets, and infrastructure.