---
title: "Tutorials"
subtitle: "Learn ML systems reasoning through the 22 Systems Walls."
---

These tutorials teach you to reason quantitatively about ML infrastructure using
**MLSys·im**, the first-principles infrastructure modeling engine behind the
*Machine Learning Systems* textbook. They are organized by the six domains of the
[22 Systems Walls taxonomy](../architecture.qmd).

**Each tutorial answers one question.** Start at the beginning for a guided path, or
jump to any domain that matches your interest.

::: {.callout-tip}
## How to Use These Tutorials

- **Every tutorial runs on a laptop in under 30 seconds.** No GPU required.
- **Code cells are executable.** Clone the repo and run them, or follow along on the website.
- **Exercises ask you to predict first, then verify.** This builds intuition faster than reading alone.
- **Tutorials within a cluster build on each other**, but clusters are largely independent.
- **Time estimates** are for reading + running code. Add 15–20 min if you do all exercises.
:::

---

## Start Here {#sec-start}

Before diving into any domain, complete this introduction to the roofline model.

::: {.tutorial-grid}

::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 0 · Hello, Roofline

**Question:** *How do I predict whether my model is memory-bound or compute-bound?*

Five lines of code, one answer. The foundation for everything that follows.
⏱ ~10 min

[Start Tutorial →](00_hello_roofline.qmd){.tutorial-arrow}
:::

:::
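
As a taste of what that tutorial covers, the roofline test can be sketched in plain Python. This is a minimal illustration, not the MLSys·im API: the function names and the hardware numbers (300 TFLOP/s, 2 TB/s) are assumptions chosen for round arithmetic.

```python
def attainable_flops(peak_flops: float, mem_bw: float, intensity: float) -> float:
    """Roofline ceiling: min(peak compute, bandwidth x arithmetic intensity)."""
    return min(peak_flops, mem_bw * intensity)


def regime(peak_flops: float, mem_bw: float, intensity: float) -> str:
    """Classify a workload by comparing its intensity to the ridge point."""
    ridge = peak_flops / mem_bw  # FLOPs per byte where the two ceilings meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Illustrative hardware numbers (assumptions, not values from MLSys·im).
PEAK = 300e12  # peak compute, FLOP/s
BW = 2e12      # memory bandwidth, bytes/s

low_reuse = regime(PEAK, BW, intensity=1.0)     # about 1 FLOP per byte moved
high_reuse = regime(PEAK, BW, intensity=500.0)  # each byte is reused heavily
print(low_reuse, high_reuse)
```

With these numbers the ridge point sits at 150 FLOPs per byte, so the low-reuse workload lands under the bandwidth ceiling and the high-reuse one under the compute ceiling.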

---

## Cluster 1: Node — Walls 1–3 {#sec-node}

*One accelerator, one model. Where is the ceiling?*

These tutorials explore the three walls that constrain a single accelerator: **compute throughput** (Wall 1), **memory capacity** (Wall 2), and **memory bandwidth** (Wall 3). Understanding which wall binds — and why — is the most fundamental skill in ML systems reasoning.

::: {.tutorial-grid}

::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 1 · The Memory Wall

**Question:** *Why doesn't 3.2× more FLOPS give a 3.2× speedup?*

Compare A100 → H100 and discover that for LLM inference, bandwidth, not compute, is the binding constraint. Assuming that speedup tracks peak FLOPS is the most important fallacy in ML systems.
⏱ ~15 min

[Start Tutorial →](01_memory_wall.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 2 · Two Phases, One Request

**Question:** *Why is LLM serving fundamentally different from CNN inference?*

The same model on the same GPU hits two different ceilings: prefill is compute-bound, decode is memory-bound. This is why LLM serving requires its own analysis.
⏱ ~15 min

[Start Tutorial →](02_two_phases.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 3 · KV-Cache: The Hidden Tax

**Question:** *What actually limits how many users I can serve concurrently?*

At 128K context length, the KV-cache alone can fill an 80 GB GPU. Explore how sequence length, batch size, and paged attention interact to constrain serving capacity.
⏱ ~20 min

[Start Tutorial →](03_kv_cache.qmd){.tutorial-arrow}
:::

:::
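
The KV-cache pressure described in Tutorial 3 follows from a short formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The sketch below assumes a hypothetical 32-layer model with 32 KV heads of dimension 128 in FP16, plus a hypothetical 14 GB of weights. None of these values come from the tutorial itself.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


def max_concurrent_sequences(hbm_bytes: float, weight_bytes: float,
                             per_sequence_bytes: int) -> int:
    """How many sequences fit in the memory left over after the weights."""
    return int((hbm_bytes - weight_bytes) // per_sequence_bytes)


GIB = 2**30

# Hypothetical model configuration (an assumption for illustration).
long_context = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                              seq_len=128 * 1024)
short_context = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                               seq_len=8 * 1024)

print(long_context / GIB)  # cache size in GiB for one 128K-token sequence
print(max_concurrent_sequences(80e9, 14e9, short_context))
```

For this configuration a single 128K-token sequence needs 64 GiB (about 69 GB) of cache, most of an 80 GB device before any weights are loaded. At 8K tokens per sequence, 15 sequences fit alongside the assumed weights.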

---

## Cluster 2: Data — Walls 8–10 {#sec-data}

*The GPU is fast. Is the pipeline faster?*

Even the fastest accelerator sits idle if the data pipeline cannot keep up. These walls cover **ingestion** (Wall 8), **transformation** (Wall 9), and **storage bandwidth** (Wall 10).

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 4 · Starving the GPU

**Question:** *Why is my GPU utilization only 40%?*

A100 compute takes 48 ms per step, but the CPU augmentation pipeline is the true bottleneck. The binding constraint is JPEG decoding, not silicon.
⏱ ~15 min

[Start Tutorial →](04_starving_the_gpu.qmd){.tutorial-arrow}
:::

:::
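
The 40% figure in Tutorial 4 is what a simple overlap model predicts when the data pipeline is slower than the accelerator. The sketch below assumes compute and data loading overlap perfectly, so the step time is the slower of the two. The 120 ms data-pipeline time is a hypothetical value chosen so that the result matches the 40% utilization in the question. Only the 48 ms compute time comes from the card above.

```python
def gpu_utilization(compute_ms: float, data_ms: float) -> float:
    """Fraction of each step the GPU is busy, assuming perfect overlap."""
    step_ms = max(compute_ms, data_ms)  # the slower stage sets the pace
    return compute_ms / step_ms


def required_pipeline_speedup(compute_ms: float, data_ms: float) -> float:
    """How much faster the data pipeline must be to stop starving the GPU."""
    return max(1.0, data_ms / compute_ms)


COMPUTE_MS = 48.0  # from the tutorial card
DATA_MS = 120.0    # hypothetical CPU pipeline time per step

utilization = gpu_utilization(COMPUTE_MS, DATA_MS)
speedup_needed = required_pipeline_speedup(COMPUTE_MS, DATA_MS)
print(f"{utilization:.0%} utilized, pipeline must be {speedup_needed:.1f}x faster")
```

Under these assumptions the GPU is busy 40% of the time, and the pipeline needs to be 2.5× faster before the accelerator becomes the bottleneck again.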

---

## Cluster 3: Algorithm — Walls 11–13 {#sec-algorithm}

*Can I make the model smaller or the training cheaper?*

These walls govern **scaling laws** (Wall 11), **compression and quantization** (Wall 12), and **architecture efficiency** (Wall 13).

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 5 · Quantization: Not a Free Lunch

**Question:** *Does INT4 always give 4× speedup?*

For memory-bound decode: nearly 4×. For compute-bound training: essentially no speedup (about 1×). The *regime* determines whether quantization helps, and the roofline tells you which regime you're in.
⏱ ~20 min

[Start Tutorial →](05_quantization.qmd){.tutorial-arrow}
:::

:::
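
The regime dependence in Tutorial 5 can be reproduced with a two-term roofline estimate. This sketch is illustrative only: the 7B parameter count and hardware numbers are assumptions, a large batch stands in for a compute-bound workload, and INT4 is assumed to shrink weight traffic by 4× without raising peak compute (real hardware may offer faster low-precision math).

```python
def step_time_s(flops: float, bytes_moved: float,
                peak_flops: float, mem_bw: float) -> float:
    """Roofline-style time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bw)


# Illustrative values (assumptions, not taken from the tutorial).
PARAMS = 7e9   # model parameters
PEAK = 300e12  # peak compute, FLOP/s
BW = 2e12      # memory bandwidth, bytes/s


def speedup_from_int4(tokens_per_step: int) -> float:
    """FP16 time divided by INT4 time when only weight bytes change."""
    flops = 2 * PARAMS * tokens_per_step  # about 2 FLOPs per parameter per token
    fp16 = step_time_s(flops, PARAMS * 2.0, PEAK, BW)  # 2 bytes per weight
    int4 = step_time_s(flops, PARAMS * 0.5, PEAK, BW)  # 0.5 bytes per weight
    return fp16 / int4


decode_speedup = speedup_from_int4(tokens_per_step=1)      # memory-bound
batched_speedup = speedup_from_int4(tokens_per_step=8192)  # compute-bound
print(decode_speedup, batched_speedup)
```

The single-token step is limited by weight traffic, so cutting bytes by 4× cuts time by 4×. The large batch is limited by arithmetic, so the same change leaves the step time unchanged.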

---

## Cluster 4: Fleet — Walls 14–16 {#sec-fleet}

*Scaling past one machine. Where does efficiency go?*

Distributed training introduces three new walls: **communication overhead** (Wall 14), **synchronization cost** (Wall 15), and **reliability** (Wall 16).

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 6 · Scaling to 1000 GPUs

**Question:** *Where does my training efficiency disappear at scale?*

At 1024 GPUs, AllReduce communication overhead erodes scaling efficiency. But the real hidden cost is reliability: cluster MTBF drops to ~20 hours, forcing frequent checkpoints that consume more wall-clock time than communication itself.
⏱ ~20 min

[Start Tutorial →](06_scaling_1000_gpus.qmd){.tutorial-arrow}
:::

:::
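
The reliability effect in Tutorial 6 can be approximated with two textbook formulas. The sketch assumes independent device failures, so cluster MTBF is device MTBF divided by device count, and it uses Young's approximation for the checkpoint interval, which is a standard result brought in here for illustration rather than something the tutorial is confirmed to use. The per-device MTBF and the 5-minute checkpoint cost are hypothetical.

```python
import math


def cluster_mtbf_hours(device_mtbf_hours: float, n_devices: int) -> float:
    """Cluster MTBF when any single device failure interrupts the job."""
    return device_mtbf_hours / n_devices


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation: sqrt(2 x checkpoint cost x MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


DEVICE_MTBF_H = 20_480.0  # hypothetical per-GPU MTBF (about 2.3 years)
CHECKPOINT_H = 5.0 / 60   # hypothetical 5-minute checkpoint write

mtbf = cluster_mtbf_hours(DEVICE_MTBF_H, n_devices=1024)
interval = young_interval_hours(CHECKPOINT_H, mtbf)
print(f"cluster MTBF {mtbf:.0f} h, checkpoint every {interval:.2f} h")
```

With these inputs the 1024-GPU cluster fails roughly every 20 hours, and the suggested checkpoint interval is a little under two hours.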

---

## Cluster 5: Ops — Walls 17–20 {#sec-ops}

*What does it cost — in dollars, carbon, and water?*

Operational walls cover **energy** (Wall 17), **sustainability** (Wall 18), **economic cost** (Wall 19), and **safety** (Wall 20). These are the walls that determine whether a technically feasible system is actually deployable.

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 7 · Geography is a Systems Variable

**Question:** *Does it matter where I train?*

Same 256-GPU cluster, same model, same duration: 412 tonnes CO₂ in Iowa vs. 10 tonnes in Québec. A 40× difference from geography alone.
⏱ ~15 min

[Start Tutorial →](07_geography.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 8 · The $9M Question

**Question:** *How much does chain-of-thought reasoning actually cost?*

K=8 reasoning steps multiply your serving bill by 7.6× — from $1.2M to $9.1M per year. A seemingly simple algorithmic choice becomes a capital expenditure decision.
⏱ ~20 min

[Start Tutorial →](08_nine_million_dollar.qmd){.tutorial-arrow}
:::

:::
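
The arithmetic behind both cards is simple enough to check by hand. In the sketch below, the job energy (1000 MWh) and the two grid intensities are placeholder inputs chosen only so that the outputs reproduce the 412 and 10 tonne figures quoted above; they are not the tutorial's actual inputs. The cost line uses the $1.2M baseline and 7.6× multiplier from the card.

```python
def emissions_tonnes(energy_mwh: float, intensity_kg_per_mwh: float) -> float:
    """CO2 emitted for a given energy use and grid carbon intensity."""
    return energy_mwh * intensity_kg_per_mwh / 1000.0  # kg -> tonnes


def annual_cost_usd(base_cost_usd: float, multiplier: float) -> float:
    """Serving cost after scaling the per-request work by a multiplier."""
    return base_cost_usd * multiplier


ENERGY_MWH = 1000.0  # placeholder energy for the whole training run

high_carbon = emissions_tonnes(ENERGY_MWH, 412.0)  # placeholder grid intensity
low_carbon = emissions_tonnes(ENERGY_MWH, 10.0)    # placeholder grid intensity
ratio = high_carbon / low_carbon

reasoning_cost = annual_cost_usd(1.2e6, 7.6)  # figures from the card above
print(f"{ratio:.1f}x emissions gap, ${reasoning_cost / 1e6:.1f}M per year")
```

Because the energy term is identical in both regions, the emissions ratio equals the ratio of grid intensities, here 41.2×, which the card rounds to 40×.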

---

## Cluster 6: Analysis — Walls 21–22 {#sec-analysis}

*Cross-cutting diagnostics. Which knob matters most?*

These walls provide the tools for **sensitivity analysis** (Wall 21) and **synthesis** (Wall 22) — the ability to ask "what if?" and "what must be true?"

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 9 · Where to Invest: Sensitivity Analysis

**Question:** *Should I buy more FLOPS or more bandwidth?*

∂T/∂BW = −0.88 vs. ∂T/∂FLOPS = −0.06. For LLM inference, a 10% bandwidth increase yields 15× more improvement than a 10% compute increase. Then use inverse Roofline to derive the minimum hardware spec from an SLA.
⏱ ~20 min

[Start Tutorial →](09_sensitivity.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 10 · GPU vs. Wafer-Scale

**Question:** *Can a fundamentally different architecture change which wall binds?*

Cerebras eliminates the HBM memory wall entirely — but the binding constraint shifts to injection bandwidth. A qualitative regime change, not just a speedup.
⏱ ~20 min

[Start Tutorial →](10_gpu_vs_wafer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 11 · Design Space Exploration

**Question:** *How do I search for the best architecture without writing nested loops?*

Learn to use the declarative DSE Engine to navigate complex ML trade-offs and filter configurations based on SLA constraints.
⏱ ~20 min

[Start Tutorial →](12_design_space_exploration.qmd){.tutorial-arrow}
:::

:::
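
One way to read the sensitivities quoted in Tutorial 9 is as elasticities: the percentage change in latency per percentage change in a resource. That reading is an assumption about the notation. The sketch below estimates elasticities by finite differences on a hypothetical additive latency model whose time shares (0.88 memory, 0.06 compute, 0.06 fixed overhead) were picked to match the quoted values. The tutorial's actual model may differ.

```python
import math
from typing import Callable


def elasticity(f: Callable[[float], float], x: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of d ln f / d ln x at x."""
    up, down = f(x * (1 + h)), f(x * (1 - h))
    return (math.log(up) - math.log(down)) / (math.log(1 + h) - math.log(1 - h))


# Hypothetical time shares in ms (assumptions chosen to match the quoted values).
MEM_MS, COMPUTE_MS, FIXED_MS = 0.88, 0.06, 0.06


def latency_ms(bw_scale: float = 1.0, flops_scale: float = 1.0) -> float:
    """Additive latency model: memory time + compute time + fixed overhead."""
    return MEM_MS / bw_scale + COMPUTE_MS / flops_scale + FIXED_MS


e_bw = elasticity(lambda s: latency_ms(bw_scale=s), 1.0)
e_flops = elasticity(lambda s: latency_ms(flops_scale=s), 1.0)
print(f"bandwidth {e_bw:.2f}, compute {e_flops:.2f}, ratio {e_bw / e_flops:.1f}")
```

In this model each elasticity is simply the negative share of time spent on that resource, which is why bandwidth dominates by a factor of about 15.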

---

## Cluster 7: Capstone {#sec-capstone}

*Compose everything. All six domains, one analysis.*

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 12 · Full-Stack Audit: LLaMA-70B Training

**Question:** *What does a complete systems analysis look like?*

Trace LLaMA-70B training through all six domains: Node → Data → Algorithm → Fleet → Ops → Analysis. Twelve of the 22 walls exercised in one coherent analysis.
⏱ ~30 min

[Start Tutorial →](12_full_stack_audit.qmd){.tutorial-arrow}
:::

:::

---

## Extending MLSys·im

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### The Differential Explainer

**Question:** *How do I automatically explain why a hardware upgrade didn't work?*

Learn to use the Differential Explainer to compare two configurations and generate a human-readable explanation of Regime Shifts.
⏱ ~10 min

[Start Tutorial →](02_differential_explainer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Custom Solvers & Hardware

Learn to contribute new hardware specifications to the Silicon Zoo or build your own analytical solvers using the 5-layer architecture.
⏱ ~15 min

[Start Guide →](../contributing.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Composable Pipelines & Callbacks

Learn how to snap custom analytical solvers into the MLSys·im Pipeline and use middleware hooks to log data to external MLOps platforms.
⏱ ~15 min

[Start Tutorial →](01_pipeline_callbacks.qmd){.tutorial-arrow}
:::

:::

---

## Learning Paths

Choose a path based on your role:

### Path A: First-Time Learner (~2 hours)
> *"I'm new to ML systems and want to build intuition from scratch."*

0 → 1 → 2 → 3 → 4 → 5 → 7 → 12

*Why this order:* Start with single-node physics (roofline, memory wall, two phases),
understand the KV-cache memory constraint that dominates LLM serving, see the data pipeline
bottleneck, understand quantization regimes, learn that geography matters, then compose it
all in the capstone. Tutorial 6 (distributed reliability) is deferred — come back to it
after the capstone.

### Path B: ML Engineer (~2 hours)
> *"I deploy models in production and need to make hardware decisions."*

0 → 1 → 2 → 3 → 8 → 9 → 12

### Path C: Researcher (~2.5 hours)
> *"I evaluate hardware architectures and need quantitative tools."*

0 → 1 → 5 → 9 → 10 → 6 → 11 → 12

### Path D: Conference Tutorial (90 min)
> *"I'm attending a live tutorial at ISCA / MLSys / ASPLOS."*

| Time | Tutorial | Core Message | Format |
|------|----------|-------------|--------|
| 0–15 min | **0. Hello, Roofline** | The equation, the tool, the regime | Live coding, audience predicts |
| 15–35 min | **1. The Memory Wall** | Why 3.2× FLOPS ≠ 3.2× speedup | Live coding, audience verifies |
| 35–45 min | **Break + Q&A** | | |
| 45–60 min | **9. Sensitivity** | Bandwidth is 15× more valuable | Live coding + derivation |
| 60–75 min | **6. Scaling to 1000 GPUs** | Reliability dominates communication | Pre-computed results, discussion |
| 75–85 min | **12. Full-Stack Audit** | Composing all six domains | Summary table walkthrough |
| 85–90 min | **Where to learn more** | Take-home exercises | |

*Why 5 tutorials, not 8:* Each tutorial needs enough time for the
predict-compute-reflect cycle to land. Tutorials 2, 5, and 7 are available as
take-home exercises for attendees who want to continue after the session.

### Path D+: Half-Day Workshop (3 hours)
> *"I'm attending a half-day conference tutorial."*

A 10-tutorial sequence with hands-on exercises: all 8 tutorials from the
original conference path (the five above plus take-home Tutorials 2, 5, and 7),
plus Tutorials 3 (KV-Cache) and 10 (Wafer-Scale), with 15-minute breaks between
clusters.

---

> **All tutorials are Quarto-compatible.** Run them locally after `pip install mlsysim`,
> or browse the rendered versions on this website.