---
title: "Tutorials"
subtitle: "Learn ML systems reasoning through the 22 Systems Walls."
---

These tutorials teach you to reason quantitatively about ML infrastructure using
**MLSys·im**, the first-principles infrastructure modeling engine behind the
*Machine Learning Systems* textbook. They are organized by the six domains of the
[22 Systems Walls taxonomy](../architecture.qmd).

**Each tutorial answers one question.** Start at the beginning for a guided path, or
jump to any domain that matches your interest.

::: {.callout-tip}
## How to Use These Tutorials

- **Every tutorial runs on a laptop in under 30 seconds.** No GPU required.
- **Code cells are executable.** Clone the repo and run them, or follow along on the website.
- **Exercises ask you to predict first, then verify.** This builds intuition faster than reading alone.
- **Tutorials within a cluster build on each other**, but clusters are largely independent.
- **Time estimates** are for reading + running code. Add 15–20 min if you do all exercises.
:::

---

## Start Here {#sec-start}

Before diving into any domain, complete this introduction to the roofline model.

::: {.tutorial-grid}

::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 0 · Hello, Roofline

**Question:** *How do I predict whether my model is memory-bound or compute-bound?*

Five lines of code, one answer. The foundation for everything that follows.
⏱ ~10 min

[Start Tutorial →](00_hello_roofline.qmd){.tutorial-arrow}
:::

:::

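
The roofline question reduces to one comparison: a workload's arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point (peak FLOP/s divided by memory bandwidth). The sketch below is plain Python rather than the MLSys·im API; the function name `roofline_regime`, the accelerator figures, and both workloads are illustrative assumptions.

```python
def roofline_regime(flops, bytes_moved, peak_flops, mem_bw):
    """Classify a workload with the roofline model.

    flops       -- floating-point operations performed
    bytes_moved -- bytes transferred to and from memory
    peak_flops  -- accelerator peak throughput in FLOP/s
    mem_bw      -- memory bandwidth in bytes/s
    Returns (regime, lower bound on execution time in seconds).
    """
    intensity = flops / bytes_moved   # FLOPs per byte
    ridge = peak_flops / mem_bw       # intensity where the two ceilings meet
    t_compute = flops / peak_flops
    t_memory = bytes_moved / mem_bw
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return regime, max(t_compute, t_memory)


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s bandwidth (ridge = 150 FLOPs/byte).
PEAK, BW = 300e12, 2e12

# Assumed batch-1 decode step of a 7e9-parameter FP16 model:
# about 2 FLOPs per parameter, and each 2-byte weight is read once.
decode = roofline_regime(flops=2 * 7e9, bytes_moved=2 * 7e9, peak_flops=PEAK, mem_bw=BW)

# Assumed large matrix multiply with heavy data reuse (500 FLOPs/byte).
matmul = roofline_regime(flops=5e12, bytes_moved=1e10, peak_flops=PEAK, mem_bw=BW)

print(decode)
print(matmul)
```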
---

## Cluster 1: Node (Walls 1–3) {#sec-node}

*One accelerator, one model. Where is the ceiling?*

These tutorials explore the three walls that constrain a single accelerator: **compute throughput** (Wall 1), **memory capacity** (Wall 2), and **memory bandwidth** (Wall 3). Understanding which wall binds, and why, is the most fundamental skill in ML systems reasoning.

::: {.tutorial-grid}

::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 1 · The Memory Wall

**Question:** *Why doesn't 3.2× more FLOPS give 3.2× speedup?*

Compare A100 → H100 and discover that for LLM inference, bandwidth, not compute, is the binding constraint. Assuming that more FLOPS means proportionally more speed is the most important fallacy in ML systems.
⏱ ~15 min

[Start Tutorial →](01_memory_wall.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 2 · Two Phases, One Request

**Question:** *Why is LLM serving fundamentally different from CNN inference?*

The same model on the same GPU hits two different ceilings: prefill is compute-bound, decode is memory-bound. This is why LLM serving requires its own analysis.
⏱ ~15 min

[Start Tutorial →](02_two_phases.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 3 · KV-Cache: The Hidden Tax

**Question:** *What actually limits how many users I can serve concurrently?*

At 128K context length, the KV-cache alone fills an 80 GB GPU. Explore how sequence length, batch size, and paged attention interact to constrain serving capacity.
⏱ ~20 min

[Start Tutorial →](03_kv_cache.qmd){.tutorial-arrow}
:::

:::

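
The KV-cache pressure described in Tutorial 3 can be estimated with the standard sizing formula: two tensors (keys and values) per layer, each of shape KV heads × head dimension × sequence length × batch. The sketch below is a hedged back-of-envelope calculation, not the tutorial's code; the model configuration and the 40 GiB cache budget are assumptions chosen for illustration. Under this configuration a single 128K-token sequence needs 40 GiB, so two of them would already exceed an 80 GB device before any weights are loaded.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes held by keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical 70B-class decoder with grouped-query attention and an FP16 cache.
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_seq = kv_cache_bytes(seq_len=128 * 1024, batch_size=1, **cfg)
print(f"one 128K-token sequence: {per_seq / GIB:.0f} GiB")

# Memory assumed to remain for the cache once weights are resident (not a measured value).
cache_budget = 40 * GIB
max_concurrent_8k = cache_budget // kv_cache_bytes(seq_len=8 * 1024, batch_size=1, **cfg)
print(f"8K-token sequences that fit in the budget: {max_concurrent_8k}")
```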
---

## Cluster 2: Data (Walls 8–10) {#sec-data}

*The GPU is fast. Is the pipeline faster?*

Even the fastest accelerator sits idle if the data pipeline cannot keep up. These walls cover **ingestion** (Wall 8), **transformation** (Wall 9), and **storage bandwidth** (Wall 10).

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 4 · Starving the GPU

**Question:** *Why is my GPU utilization only 40%?*

A100 compute takes 48 ms per step, but the CPU augmentation pipeline is the true bottleneck. The binding constraint is JPEG decoding, not silicon.
⏱ ~15 min

[Start Tutorial →](04_starving_the_gpu.qmd){.tutorial-arrow}
:::

:::

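
The 40% figure in Tutorial 4 follows from a simple pipeline argument: when data loading overlaps with compute, each step takes as long as the slower stage, so utilization is compute time divided by step time. The sketch below uses the 48 ms compute time quoted above; the 960 ms single-worker preprocessing cost and the assumption of perfect scaling across CPU workers are hypothetical values picked so that 8 workers reproduce a 40% utilization.

```python
import math


def gpu_utilization(compute_ms, data_ms):
    """Fraction of each step the GPU spends computing when loading overlaps compute.

    With prefetching, step time is set by the slower stage: max(compute, data).
    """
    step_ms = max(compute_ms, data_ms)
    return compute_ms / step_ms


def workers_needed(single_worker_ms, compute_ms):
    """CPU workers required for the data stage to keep pace with compute,
    assuming preprocessing parallelizes perfectly across workers."""
    return math.ceil(single_worker_ms / compute_ms)


COMPUTE_MS = 48.0         # per-step GPU compute time quoted in the tutorial summary
SINGLE_WORKER_MS = 960.0  # assumed time for one CPU worker to prepare one batch

for workers in (8, 20):
    data_ms = SINGLE_WORKER_MS / workers
    util = gpu_utilization(COMPUTE_MS, data_ms)
    print(f"{workers} workers: data stage {data_ms:.0f} ms, GPU utilization {util:.0%}")

print("workers needed to stop starving the GPU:", workers_needed(SINGLE_WORKER_MS, COMPUTE_MS))
```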
---

## Cluster 3: Algorithm (Walls 11–13) {#sec-algorithm}

*Can I make the model smaller or the training cheaper?*

These walls govern **scaling laws** (Wall 11), **compression and quantization** (Wall 12), and **architecture efficiency** (Wall 13).

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 5 · Quantization: Not a Free Lunch

**Question:** *Does INT4 always give 4× speedup?*

For memory-bound decode: nearly 4×. For compute-bound training: essentially no speedup. The *regime* determines whether quantization helps, and the roofline tells you which regime you're in.
⏱ ~20 min

[Start Tutorial →](05_quantization.qmd){.tutorial-arrow}
:::

:::

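
The regime dependence of quantization can be sketched with the same roofline bound: shrinking weights by 4× cuts memory traffic, but step time only improves while memory is the slower term. The code below is an illustrative model, not the tutorial's implementation. It assumes quantization leaves the FLOP count and peak throughput unchanged, and the accelerator figures and workloads are hypothetical.

```python
def step_time(flops, weight_bytes, peak_flops, mem_bw):
    """Roofline lower bound on step time: the slower of compute and memory."""
    return max(flops / peak_flops, weight_bytes / mem_bw)


def quantization_speedup(flops, weight_bytes, peak_flops, mem_bw, compression=4.0):
    """Speedup from shrinking weight traffic by `compression`.

    Simplifying assumption: FLOPs and peak throughput stay fixed, so only
    the memory term changes.
    """
    before = step_time(flops, weight_bytes, peak_flops, mem_bw)
    after = step_time(flops, weight_bytes / compression, peak_flops, mem_bw)
    return before / after


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK, BW = 300e12, 2e12

memory_bound = quantization_speedup(flops=14e9, weight_bytes=14e9, peak_flops=PEAK, mem_bw=BW)
near_ridge = quantization_speedup(flops=3e12, weight_bytes=4e10, peak_flops=PEAK, mem_bw=BW)
compute_bound = quantization_speedup(flops=5e12, weight_bytes=1e10, peak_flops=PEAK, mem_bw=BW)

print(f"memory-bound decode: {memory_bound:.2f}x")
print(f"near the ridge:      {near_ridge:.2f}x")
print(f"compute-bound:       {compute_bound:.2f}x")
```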
---

## Cluster 4: Fleet (Walls 14–16) {#sec-fleet}

*Scaling past one machine. Where does efficiency go?*

Distributed training introduces three new walls: **communication overhead** (Wall 14), **synchronization cost** (Wall 15), and **reliability** (Wall 16).

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 6 · Scaling to 1000 GPUs

**Question:** *Where does my training efficiency disappear at scale?*

At 1024 GPUs, AllReduce communication overhead erodes scaling efficiency. But the real hidden cost is reliability: cluster MTBF drops to ~20 hours, forcing frequent checkpoints that consume more wall-clock time than communication itself.
⏱ ~20 min

[Start Tutorial →](06_scaling_1000_gpus.qmd){.tutorial-arrow}
:::

:::

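
The reliability argument in Tutorial 6 can be approximated in a few lines. If failures are independent and any single failure stops the job, cluster MTBF is the per-device MTBF divided by the device count. The checkpoint interval below uses Young's first-order approximation, which is a swapped-in textbook formula and not necessarily what the tutorial uses. The per-GPU MTBF is an assumed value chosen so that 1024 GPUs give the ~20 hours quoted above, and the 5-minute checkpoint cost is hypothetical.

```python
import math


def cluster_mtbf_hours(device_mtbf_hours, n_devices):
    """Job-level MTBF, assuming independent failures and that any failure stops the job."""
    return device_mtbf_hours / n_devices


def young_checkpoint_interval(checkpoint_hours, mtbf_hours):
    """Young's first-order optimum: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2 * checkpoint_hours * mtbf_hours)


def overhead_fraction(checkpoint_hours, interval_hours, mtbf_hours):
    """Approximate share of wall-clock time lost: checkpoint writes plus,
    on average, half an interval of recomputation per failure."""
    return checkpoint_hours / interval_hours + interval_hours / (2 * mtbf_hours)


DEVICE_MTBF_H = 20_480.0   # assumed per-GPU MTBF (about 2.3 years)
CHECKPOINT_H = 5 / 60      # assumed 5-minute checkpoint write

mtbf = cluster_mtbf_hours(DEVICE_MTBF_H, 1024)
interval = young_checkpoint_interval(CHECKPOINT_H, mtbf)
lost = overhead_fraction(CHECKPOINT_H, interval, mtbf)

print(f"cluster MTBF: {mtbf:.1f} h")
print(f"checkpoint every: {interval:.2f} h")
print(f"time lost to checkpoints and rework: {lost:.1%}")
```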
---

## Cluster 5: Ops (Walls 17–20) {#sec-ops}

*What does it cost in dollars, carbon, and water?*

Operational walls cover **energy** (Wall 17), **sustainability** (Wall 18), **economic cost** (Wall 19), and **safety** (Wall 20). These are the walls that determine whether a technically feasible system is actually deployable.

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 7 · Geography is a Systems Variable

**Question:** *Does it matter where I train?*

Same 256-GPU cluster, same model, same duration: 412 tonnes CO₂ in Iowa vs. 10 tonnes in Québec. A 40× difference from geography alone.
⏱ ~15 min

[Start Tutorial →](07_geography.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 8 · The $9M Question

**Question:** *How much does chain-of-thought reasoning actually cost?*

K=8 reasoning steps multiply your serving bill by 7.6×, from $1.2M to $9.1M per year. A seemingly simple algorithmic choice becomes a capital expenditure decision.
⏱ ~20 min

[Start Tutorial →](08_nine_million_dollar.qmd){.tutorial-arrow}
:::

:::

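
Both Ops headlines are simple products, which makes them easy to sanity-check. Operational emissions are energy multiplied by grid carbon intensity, so for a fixed job the ratio between two sites equals the ratio of their grid intensities. In the sketch below the energy total and the two intensities are placeholder values, not the tutorial's inputs; only the last two lines use figures quoted on this page.

```python
def emissions_tonnes(energy_kwh, grid_g_per_kwh):
    """CO2 in tonnes: energy (kWh) times grid carbon intensity (gCO2/kWh)."""
    return energy_kwh * grid_g_per_kwh / 1e6


# Placeholder inputs: 1 GWh training run, one carbon-heavy grid and one hydro-heavy grid.
ENERGY_KWH = 1_000_000
high = emissions_tonnes(ENERGY_KWH, 400.0)
low = emissions_tonnes(ENERGY_KWH, 10.0)
print(f"{high:.0f} t vs. {low:.0f} t, ratio {high / low:.0f}x")

# Figures quoted on this page.
geography_ratio = 412 / 10        # Iowa vs. Québec
reasoning_bill = 1.2e6 * 7.6      # baseline annual bill times the K=8 multiplier
print(f"geography ratio: {geography_ratio:.1f}x")
print(f"annual bill with reasoning: ${reasoning_bill / 1e6:.2f}M")
```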
---

## Cluster 6: Analysis (Walls 21–22) {#sec-analysis}

*Cross-cutting diagnostics. Which knob matters most?*

These walls provide the tools for **sensitivity analysis** (Wall 21) and **synthesis** (Wall 22): the ability to ask "what if?" and "what must be true?"

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 9 · Where to Invest: Sensitivity Analysis

**Question:** *Should I buy more FLOPS or more bandwidth?*

∂T/∂BW = −0.88 vs. ∂T/∂FLOPS = −0.06. For LLM inference, a 10% bandwidth increase yields 15× more improvement than a 10% compute increase. Then use inverse Roofline to derive the minimum hardware spec from an SLA.
⏱ ~20 min

[Start Tutorial →](09_sensitivity.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 10 · GPU vs. Wafer-Scale

**Question:** *Can a fundamentally different architecture change which wall binds?*

Cerebras eliminates the HBM memory wall entirely, but the binding constraint shifts to injection bandwidth. A qualitative regime change, not just a speedup.
⏱ ~20 min

[Start Tutorial →](10_gpu_vs_wafer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 11 · Design Space Exploration

**Question:** *How do I search for the best architecture without writing nested loops?*

Learn to use the declarative DSE Engine to navigate complex ML trade-offs and filter configurations based on SLA constraints.
⏱ ~20 min

[Start Tutorial →](12_design_space_exploration.qmd){.tutorial-arrow}
:::

:::

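
Sensitivities like those quoted in Tutorial 9 can be read as elasticities: the percentage change in latency per percentage change in a hardware parameter. The sketch below estimates them by central differences in log space on a deliberately simple additive latency model. The model and its baseline times are assumptions chosen so the output lands near the quoted −0.88 and −0.06; they are not the tutorial's actual model or the MLSys·im API.

```python
import math


def latency_ms(flops_scale=1.0, bw_scale=1.0, t_compute=0.6, t_memory=8.8, t_fixed=0.6):
    """Assumed additive latency model: the compute and memory terms shrink as
    their resource is scaled up, and a fixed overhead stays constant."""
    return t_compute / flops_scale + t_memory / bw_scale + t_fixed


def elasticity(param, eps=1e-4):
    """Central-difference estimate of d ln(T) / d ln(param) at the baseline."""
    up = latency_ms(**{param: 1 + eps})
    down = latency_ms(**{param: 1 - eps})
    return (math.log(up) - math.log(down)) / (math.log(1 + eps) - math.log(1 - eps))


e_bw = elasticity("bw_scale")
e_flops = elasticity("flops_scale")

print(f"bandwidth elasticity: {e_bw:.2f}")
print(f"compute elasticity:   {e_flops:.2f}")
print(f"bandwidth is {e_bw / e_flops:.1f}x more valuable per percent")
```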
---

## Cluster 7: Capstone {#sec-capstone}

*Compose everything. Six domains, one analysis.*

::: {.tutorial-grid}

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 12 · Full-Stack Audit: LLaMA-70B Training

**Question:** *What does a complete systems analysis look like?*

Trace LLaMA-70B training through all six domains: Node → Data → Algorithm → Fleet → Ops → Analysis. Twelve of the 22 walls are exercised in one coherent analysis.
⏱ ~30 min

[Start Tutorial →](12_full_stack_audit.qmd){.tutorial-arrow}
:::

:::

---

## Extending MLSys·im

::: {.tutorial-grid}

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### The Differential Explainer

**Question:** *How do I automatically explain why a hardware upgrade didn't work?*

Learn to use the Differential Explainer to compare two configurations and generate a human-readable explanation of Regime Shifts.
⏱ ~10 min

[Start Tutorial →](02_differential_explainer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Custom Solvers & Hardware

Learn to contribute new hardware specifications to the Silicon Zoo or build your own analytical solvers using the 5-layer architecture.
⏱ ~15 min

[Start Guide →](../contributing.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Composable Pipelines & Callbacks

Learn how to snap custom analytical solvers into the MLSys·im Pipeline and use middleware hooks to log data to external MLOps platforms.
⏱ ~15 min

[Start Tutorial →](01_pipeline_callbacks.qmd){.tutorial-arrow}
:::

:::

---

## Learning Paths

Choose a path based on your role:

### Path A: First-Time Learner (~2 hours)
> *"I'm new to ML systems and want to build intuition from scratch."*

0 → 1 → 2 → 3 → 4 → 5 → 7 → 12

*Why this order:* Start with single-node physics (roofline, memory wall, two phases),
understand the KV-cache memory constraint that dominates LLM serving, see the data pipeline
bottleneck, understand quantization regimes, learn that geography matters, then compose it
all in the capstone. Tutorial 6 (distributed reliability) is deferred; come back to it
after the capstone.

### Path B: ML Engineer (~2.5 hours)
> *"I deploy models in production and need to make hardware decisions."*

0 → 1 → 2 → 3 → 8 → 9 → 12

### Path C: Researcher (~2 hours)
> *"I evaluate hardware architectures and need quantitative tools."*

0 → 1 → 5 → 9 → 10 → 6 → 11 → 12

### Path D: Conference Tutorial (90 min)
> *"I'm attending a live tutorial at ISCA / MLSys / ASPLOS."*

| Time | Tutorial | Core Message | Format |
|------|----------|-------------|--------|
| 0–15 min | **0. Hello, Roofline** | The equation, the tool, the regime | Live coding, audience predicts |
| 15–35 min | **1. The Memory Wall** | Why 3.2× FLOPS ≠ 3.2× speedup | Live coding, audience verifies |
| 35–45 min | **Break + Q&A** | | |
| 45–60 min | **9. Sensitivity** | Bandwidth is 15× more valuable | Live coding + derivation |
| 60–75 min | **6. Scaling to 1000 GPUs** | Reliability dominates communication | Pre-computed results, discussion |
| 75–85 min | **12. Full-Stack Audit** | Composing all six domains | Summary table walkthrough |
| 85–90 min | **Where to learn more** | Take-home exercises | |

*Why 5 tutorials, not 8:* Each tutorial needs enough time for the
predict-compute-reflect cycle to land. Tutorials 2, 5, and 7 are available as
take-home exercises for attendees who want to continue after the session.

### Path D+: Half-Day Workshop (3 hours)
> *"I'm attending a half-day conference tutorial."*

An extended 10-tutorial sequence with hands-on exercises: the 8 tutorials from the
conference path (the five live sessions plus take-home Tutorials 2, 5, and 7), plus
Tutorials 3 (KV-Cache) and 10 (Wafer-Scale), with 15-minute breaks between clusters.

---

> **All tutorials are Quarto-compatible.** Run them locally after `pip install mlsysim`,
> or browse the rendered versions on this website.