cs249r_book/mlsysim/docs/tutorials/index.qmd
Vijay Janapa Reddi 1eb30f5f86 fix(mlsysim): harden release QA and paper artifacts
Align the MLSys·im code, docs, paper, website, workflows, and lab wheel for the 0.1.1 release. This also fixes runtime/API issues found during release review and prepares the paper PDF plus archive package.
2026-04-25 10:06:01 -04:00

---
title: "Tutorials"
subtitle: "Learn ML systems reasoning through the 22 Systems Walls."
---
These tutorials teach you to reason quantitatively about ML infrastructure using
**MLSys·im**, the first-principles infrastructure modeling engine behind the
*Machine Learning Systems* textbook. They are organized by the six domains of the
[22 Systems Walls taxonomy](../architecture.qmd).
**Each tutorial answers one question.** Start at the beginning for a guided path, or
jump to any domain that matches your interest.
::: {.callout-tip}
## How to Use These Tutorials
- **Every tutorial runs on a laptop in under 30 seconds.** No GPU required.
- **Code cells are executable.** Clone the repo and run them, or follow along on the website.
- **Exercises ask you to predict first, then verify.** This builds intuition faster than reading alone.
- **Tutorials within a cluster build on each other**, but clusters are largely independent.
- **Time estimates** are for reading + running code. Add 15–20 min if you do all exercises.
:::
---
## Start Here {#sec-start}
Before diving into any domain, complete this introduction to the roofline model.
::: {.tutorial-grid}
::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}
### 0 · Hello, Roofline
**Question:** *How do I predict whether my model is memory-bound or compute-bound?*
Five lines of code, one answer. The foundation for everything that follows.
⏱ ~10 min
[Start Tutorial →](00_hello_roofline.qmd){.tutorial-arrow}
:::
:::
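As a rough illustration of the question Tutorial 0 asks, the basic roofline model can be sketched in plain Python. This sketch does not use the MLSys·im API; the `Accelerator` class, the `roofline` function, and the hardware numbers are hypothetical stand-ins chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Peak specs for a hypothetical accelerator (illustrative numbers only)."""
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s


def roofline(flops: float, bytes_moved: float, hw: Accelerator) -> dict:
    """Classify a workload with the basic roofline model."""
    intensity = flops / bytes_moved             # FLOP per byte
    ridge = hw.peak_flops / hw.mem_bandwidth    # intensity where the two ceilings meet
    t_compute = flops / hw.peak_flops
    t_memory = bytes_moved / hw.mem_bandwidth
    return {
        "intensity": intensity,
        "ridge_point": ridge,
        "time_s": max(t_compute, t_memory),
        "regime": "compute-bound" if intensity >= ridge else "memory-bound",
    }


# Hypothetical device: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
hw = Accelerator(peak_flops=300e12, mem_bandwidth=2e12)

# Single-token decode of a 7e9-parameter model with 16-bit weights:
# roughly 2 FLOPs per parameter, and every weight byte is read once.
decode = roofline(flops=2 * 7e9, bytes_moved=2 * 7e9, hw=hw)
print(decode["regime"], f"{decode['time_s'] * 1e3:.1f} ms")
```

A workload whose arithmetic intensity falls below the ridge point is limited by memory traffic, which is the regime the later tutorials keep returning to.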
---
## Cluster 1: Node (Walls 1–3) {#sec-node}
*One accelerator, one model. Where is the ceiling?*
These tutorials explore the three walls that constrain a single accelerator: **compute throughput** (Wall 1), **memory capacity** (Wall 2), and **memory bandwidth** (Wall 3). Understanding which wall binds — and why — is the most fundamental skill in ML systems reasoning.
::: {.tutorial-grid}
::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}
### 1 · The Memory Wall
**Question:** *Why doesn't 3.2× more FLOPS give 3.2× speedup?*
Compare A100 → H100 and discover that for LLM inference, bandwidth — not compute — is the binding constraint. The most important fallacy in ML systems.
⏱ ~15 min
[Start Tutorial →](01_memory_wall.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 2 · Two Phases, One Request
**Question:** *Why is LLM serving fundamentally different from CNN inference?*
The same model on the same GPU hits two different ceilings: prefill is compute-bound, decode is memory-bound. This is why LLM serving requires its own analysis.
⏱ ~15 min
[Start Tutorial →](02_two_phases.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 3 · KV-Cache: The Hidden Tax
**Question:** *What actually limits how many users I can serve concurrently?*
At 128K context length, the KV-cache alone fills an 80 GB GPU. Explore how sequence length, batch size, and paged attention interact to constrain serving capacity.
⏱ ~20 min
[Start Tutorial →](03_kv_cache.qmd){.tutorial-arrow}
:::
:::
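To make the KV-cache pressure from Tutorial 3 concrete, the standard sizing formula (keys plus values, per layer, per KV head, per token) can be evaluated directly. The model dimensions below are assumptions for a hypothetical 7B-class decoder with full multi-head attention and an FP16 cache; the tutorial itself may use a different model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes for keys and values across all layers (the leading 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical 7B-class decoder with full multi-head attention, FP16 cache.
cfg = dict(layers=32, kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **cfg)
full_ctx = kv_cache_bytes(seq_len=128 * 1024, **cfg)

print(f"{per_token / 2**10:.0f} KiB per token")
print(f"{full_ctx / 2**30:.0f} GiB at 128K tokens")
```

Under these assumptions a single 128K-token sequence needs 64 GiB (about 69 GB) of cache before any weights are loaded, so one long request can exhaust an 80 GB device. Grouped-query attention reduces `kv_heads` and shrinks the cache proportionally.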
---
## Cluster 2: Data (Walls 8–10) {#sec-data}
*The GPU is fast. Is the pipeline faster?*
Even the fastest accelerator sits idle if the data pipeline cannot keep up. These walls cover **ingestion** (Wall 8), **transformation** (Wall 9), and **storage bandwidth** (Wall 10).
::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 4 · Starving the GPU
**Question:** *Why is my GPU utilization only 40%?*
A100 compute takes 48 ms per step, but the CPU augmentation pipeline is the true bottleneck. The binding constraint is JPEG decoding, not silicon.
⏱ ~15 min
[Start Tutorial →](04_starving_the_gpu.qmd){.tutorial-arrow}
:::
:::
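The link between a slow input pipeline and low GPU utilization can be sketched with one line of arithmetic. The sketch assumes data loading overlaps with compute (prefetching), so the slower stage sets the step time. The 48 ms compute time comes from the card above; the 120 ms pipeline time is a hypothetical value chosen because it reproduces 40% utilization.

```python
def gpu_utilization(compute_ms: float, data_ms: float) -> float:
    """Fraction of each step the GPU is busy when loading overlaps with compute.

    With prefetching, step time is set by the slower stage.
    """
    step_ms = max(compute_ms, data_ms)
    return compute_ms / step_ms


compute_ms = 48.0    # per-step GPU time quoted in the card above
data_ms = 120.0      # hypothetical CPU decode + augmentation time per batch

util = gpu_utilization(compute_ms, data_ms)
print(f"GPU utilization: {util:.0%}")

# How many times more input-pipeline throughput is needed to stop starving the GPU.
speedup_needed = data_ms / compute_ms
print(f"Pipeline must get {speedup_needed:.1f}x faster")
```

In this model a faster accelerator would only lower utilization further, since the step time is already set by the CPU side.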
---
## Cluster 3: Algorithm (Walls 11–13) {#sec-algorithm}
*Can I make the model smaller or the training cheaper?*
These walls govern **scaling laws** (Wall 11), **compression and quantization** (Wall 12), and **architecture efficiency** (Wall 13).
::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 5 · Quantization: Not a Free Lunch
**Question:** *Does INT4 always give 4× speedup?*
For memory-bound decode: nearly 4×. For compute-bound training: essentially no speedup. The *regime* determines whether quantization helps, and the roofline tells you which regime you're in.
⏱ ~20 min
[Start Tutorial →](05_quantization.qmd){.tutorial-arrow}
:::
:::
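The regime dependence described in Tutorial 5 can be sketched with the same roofline arithmetic. This is a simplified model, not the MLSys·im implementation: it assumes INT4 only shrinks weight traffic and that arithmetic still runs at the same peak rate (weights dequantized on the fly). The hardware and model sizes are hypothetical.

```python
def step_time(flops, weight_bytes, peak_flops, bandwidth):
    """Roofline time: the slower of compute and weight traffic."""
    return max(flops / peak_flops, weight_bytes / bandwidth)


def int4_speedup(flops, params, peak_flops, bandwidth):
    """Speedup from shrinking weights from 16 bits to 4 bits.

    Assumes arithmetic still runs at the same peak rate, i.e. weights are
    dequantized on the fly and no faster INT4 math units are used.
    """
    t_fp16 = step_time(flops, params * 2.0, peak_flops, bandwidth)
    t_int4 = step_time(flops, params * 0.5, peak_flops, bandwidth)
    return t_fp16 / t_int4


PEAK, BW = 300e12, 2e12   # hypothetical accelerator
PARAMS = 7e9              # hypothetical model size

# Decode, batch 1: about 2 FLOPs per parameter per token (memory-bound).
decode = int4_speedup(2 * PARAMS, PARAMS, PEAK, BW)

# Large-batch step with 4096 tokens sharing one weight read (compute-bound).
batched = int4_speedup(2 * PARAMS * 4096, PARAMS, PEAK, BW)

print(f"decode: {decode:.1f}x, batched: {batched:.1f}x")
```

A result of 1.0× means no gain: once the step is compute-bound, moving fewer bytes does not shorten it.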
---
## Cluster 4: Fleet (Walls 14–16) {#sec-fleet}
*Scaling past one machine. Where does efficiency go?*
Distributed training introduces three new walls: **communication overhead** (Wall 14), **synchronization cost** (Wall 15), and **reliability** (Wall 16).
::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### 6 · Scaling to 1000 GPUs
**Question:** *Where does my training efficiency disappear at scale?*
At 1024 GPUs, AllReduce communication overhead erodes scaling efficiency. But the real hidden cost is reliability: cluster MTBF drops to ~20 hours, forcing frequent checkpoints that consume more wall-clock time than communication itself.
⏱ ~20 min
[Start Tutorial →](06_scaling_1000_gpus.qmd){.tutorial-arrow}
:::
:::
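One way to see why reliability becomes expensive at scale is to combine the usual MTBF scaling rule with Young's first-order checkpoint interval. The sketch assumes independent failures at a constant rate per device; the per-GPU MTBF and checkpoint write time are hypothetical values, with the former chosen so the cluster MTBF lands near the ~20 hours quoted above.

```python
import math


def cluster_mtbf_hours(device_mtbf_hours: float, n_devices: int) -> float:
    """MTBF of the whole job if any single device failure stops it.

    Assumes independent failures with a constant rate per device.
    """
    return device_mtbf_hours / n_devices


def young_interval_hours(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def overhead_fraction(interval_h: float, checkpoint_h: float, mtbf_h: float) -> float:
    """Approximate time lost to writing checkpoints plus redoing lost work."""
    return checkpoint_h / interval_h + interval_h / (2.0 * mtbf_h)


n_gpus = 1024
device_mtbf = 20_480.0     # hypothetical per-GPU MTBF, chosen to give ~20 h at 1024 GPUs
checkpoint_h = 0.1         # hypothetical 6-minute checkpoint write

mtbf = cluster_mtbf_hours(device_mtbf, n_gpus)
tau = young_interval_hours(checkpoint_h, mtbf)
lost = overhead_fraction(tau, checkpoint_h, mtbf)

print(f"cluster MTBF: {mtbf:.0f} h")
print(f"checkpoint every {tau:.1f} h, about {lost:.0%} of wall-clock time lost")
```

Because cluster MTBF shrinks linearly with device count, the lost fraction grows roughly with the square root of the number of GPUs even when the checkpoint write time stays fixed.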
---
## Cluster 5: Ops (Walls 17–20) {#sec-ops}
*What does it cost — in dollars, carbon, and water?*
Operational walls cover **energy** (Wall 17), **sustainability** (Wall 18), **economic cost** (Wall 19), and **safety** (Wall 20). These are the walls that determine whether a technically feasible system is actually deployable.
::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 7 · Geography is a Systems Variable
**Question:** *Does it matter where I train?*
Same 256-GPU cluster, same model, same duration: 412 tonnes CO₂ in Iowa vs. 10 tonnes in Québec. A 40× difference from geography alone.
⏱ ~15 min
[Start Tutorial →](07_geography.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### 8 · The $9M Question
**Question:** *How much does chain-of-thought reasoning actually cost?*
K=8 reasoning steps multiply your serving bill by 7.6× — from $1.2M to $9.1M per year. A seemingly simple algorithmic choice becomes a capital expenditure decision.
⏱ ~20 min
[Start Tutorial →](08_nine_million_dollar.qmd){.tutorial-arrow}
:::
:::
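The arithmetic behind both cards is simple enough to check by hand. In the sketch below, operational emissions are energy multiplied by grid carbon intensity. The energy total and the two intensities are placeholders chosen only so the results line up with the 412-tonne and 10-tonne figures above; they are not measured grid data. The cost multiplier uses the two dollar figures quoted in the card.

```python
def emissions_tonnes(energy_kwh: float, kg_co2_per_kwh: float) -> float:
    """Operational emissions: energy drawn times grid carbon intensity."""
    return energy_kwh * kg_co2_per_kwh / 1000.0


# Hypothetical inputs, chosen only so the totals line up with the card above.
# They are not measured grid data.
energy_kwh = 1_000_000.0
grid_a = 0.412   # kg CO2 per kWh, carbon-heavy grid (placeholder)
grid_b = 0.010   # kg CO2 per kWh, hydro-dominated grid (placeholder)

t_a = emissions_tonnes(energy_kwh, grid_a)
t_b = emissions_tonnes(energy_kwh, grid_b)
print(f"{t_a:.0f} t vs {t_b:.0f} t, ratio {t_a / t_b:.0f}x")

# Serving cost figures quoted above: $1.2M baseline, $9.1M with K=8 reasoning steps.
cost_ratio = 9.1 / 1.2
print(f"cost multiplier: {cost_ratio:.1f}x")
```

With energy held constant, the emissions ratio equals the ratio of grid intensities, which is why location alone can move the result by more than an order of magnitude.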
---
## Cluster 6: Analysis (Walls 21–22) {#sec-analysis}
*Cross-cutting diagnostics. Which knob matters most?*
These walls provide the tools for **sensitivity analysis** (Wall 21) and **synthesis** (Wall 22) — the ability to ask "what if?" and "what must be true?"
::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### 9 · Where to Invest: Sensitivity Analysis
**Question:** *Should I buy more FLOPS or more bandwidth?*
∂T/∂BW = 0.88 vs. ∂T/∂FLOPS = 0.06. For LLM inference, a 10% bandwidth increase yields 15× more improvement than a 10% compute increase. Then use inverse Roofline to derive the minimum hardware spec from an SLA.
⏱ ~20 min
[Start Tutorial →](09_sensitivity.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### 10 · GPU vs. Wafer-Scale
**Question:** *Can a fundamentally different architecture change which wall binds?*
Cerebras eliminates the HBM memory wall entirely — but the binding constraint shifts to injection bandwidth. A qualitative regime change, not just a speedup.
⏱ ~20 min
[Start Tutorial →](10_gpu_vs_wafer.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### 11 · Design Space Exploration
**Question:** *How do I search for the best architecture without writing nested loops?*
Learn to use the declarative DSE Engine to navigate complex ML trade-offs and filter configurations based on SLA constraints.
⏱ ~20 min
[Start Tutorial →](12_design_space_exploration.qmd){.tutorial-arrow}
:::
:::
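The sensitivities quoted for Tutorial 9 can be read as elasticities: the percentage change in latency per percentage change in a resource. A minimal sketch, assuming a toy additive latency model whose memory, compute, and fixed shares (0.88, 0.06, 0.06) are chosen to reproduce those magnitudes; the tutorial's own model may differ. The function names are hypothetical.

```python
def latency(bw, flops, mem_work=0.88, compute_work=0.06, fixed=0.06):
    """Toy additive latency model, normalized so latency(1, 1) == 1.

    bw and flops are multipliers on baseline bandwidth and compute.
    """
    return mem_work / bw + compute_work / flops + fixed


def elasticity(f, x0, eps=1e-6):
    """Central-difference estimate of d ln f / d ln x at x0."""
    return (f(x0 * (1 + eps)) - f(x0 * (1 - eps))) / (2 * eps * f(x0))


e_bw = elasticity(lambda bw: latency(bw, 1.0), 1.0)
e_flops = elasticity(lambda fl: latency(1.0, fl), 1.0)

print(f"bandwidth elasticity: {e_bw:.2f}")
print(f"compute elasticity:   {e_flops:.2f}")
print(f"ratio: {e_bw / e_flops:.1f}")
```

The elasticities come out negative because latency falls as a resource grows; their magnitudes are what the card compares. In an additive model like this one, each elasticity equals the share of time spent on that resource.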
---
## Cluster 7: Capstone {#sec-capstone}
*Compose everything. All six domains, one analysis.*
::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}
### 12 · Full-Stack Audit: LLaMA-70B Training
**Question:** *What does a complete systems analysis look like?*
Trace LLaMA-70B training through all six domains: Node → Data → Algorithm → Fleet → Ops → Analysis. Twelve of the 22 walls exercised in one coherent analysis.
⏱ ~30 min
[Start Tutorial →](12_full_stack_audit.qmd){.tutorial-arrow}
:::
:::
---
## Extending MLSys·im
::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}
### The Differential Explainer
**Question:** *How do I automatically explain why a hardware upgrade didn't work?*
Learn to use the Differential Explainer to compare two configurations and generate a human-readable explanation of Regime Shifts.
⏱ ~10 min
[Start Tutorial →](02_differential_explainer.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}
### Custom Solvers & Hardware
Learn to contribute new hardware specifications to the Silicon Zoo or build your own analytical solvers using the 5-layer architecture.
⏱ ~15 min
[Start Guide →](../contributing.qmd){.tutorial-arrow}
:::
::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}
### Composable Pipelines & Callbacks
Learn how to snap custom analytical solvers into the MLSys·im Pipeline and use middleware hooks to log data to external MLOps platforms.
⏱ ~15 min
[Start Tutorial →](01_pipeline_callbacks.qmd){.tutorial-arrow}
:::
:::
---
## Learning Paths
Choose a path based on your role:
### Path A: First-Time Learner (~2 hours)
> *"I'm new to ML systems and want to build intuition from scratch."*
0 → 1 → 2 → 3 → 4 → 5 → 7 → 12
*Why this order:* Start with single-node physics (roofline, memory wall, two phases),
understand the KV-cache memory constraint that dominates LLM serving, see the data pipeline
bottleneck, understand quantization regimes, learn that geography matters, then compose it
all in the capstone. Tutorial 6 (distributed reliability) is deferred — come back to it
after the capstone.
### Path B: ML Engineer (~2.5 hours)
> *"I deploy models in production and need to make hardware decisions."*
0 → 1 → 2 → 3 → 8 → 9 → 12
### Path C: Researcher (~2 hours)
> *"I evaluate hardware architectures and need quantitative tools."*
0 → 1 → 5 → 9 → 10 → 6 → 11 → 12
### Path D: Conference Tutorial (90 min)
> *"I'm attending a live tutorial at ISCA / MLSys / ASPLOS."*
| Time | Tutorial | Core Message | Format |
|------|----------|-------------|--------|
| 0–15 min | **0. Hello, Roofline** | The equation, the tool, the regime | Live coding, audience predicts |
| 15–35 min | **1. The Memory Wall** | Why 3.2× FLOPS ≠ 3.2× speedup | Live coding, audience verifies |
| 35–45 min | **Break + Q&A** | | |
| 45–60 min | **9. Sensitivity** | Bandwidth is 15× more valuable | Live coding + derivation |
| 60–75 min | **6. Scaling to 1000 GPUs** | Reliability dominates communication | Pre-computed results, discussion |
| 75–85 min | **12. Full-Stack Audit** | Composing all six domains | Summary table walkthrough |
| 85–90 min | **Where to learn more** | Take-home exercises | |
*Why 5 tutorials, not 8:* Each tutorial needs enough time for the
predict-compute-reflect cycle to land. Tutorials 2, 5, and 7 are available as
take-home exercises for attendees who want to continue after the session.
### Path D+: Half-Day Workshop (3 hours)
> *"I'm attending a half-day conference tutorial."*
A 10-tutorial sequence with hands-on exercises: all 8 tutorials from the
conference path (the five live sessions plus the three take-home tutorials), plus
Tutorials 3 (KV-Cache) and 10 (Wafer-Scale), with 15-minute breaks between clusters.
---
> **All tutorials are Quarto-compatible.** Run them locally after `pip install mlsysim`,
> or browse the rendered versions on this website.