---
title: "Tutorials"
subtitle: "Learn ML systems reasoning through the 22 Systems Walls."
---

These tutorials teach you to reason quantitatively about ML infrastructure using **MLSys·im**, the first-principles infrastructure modeling engine behind the *Machine Learning Systems* textbook. They are organized by the six domains of the [22 Systems Walls taxonomy](../architecture.qmd).

**Each tutorial answers one question.** Start at the beginning for a guided path, or jump to any domain that matches your interest.

::: {.callout-tip}
## How to Use These Tutorials

- **Every tutorial's code runs on a laptop in under 30 seconds.** No GPU required.
- **Code cells are executable.** Clone the repo and run them, or follow along on the website.
- **Exercises ask you to predict first, then verify.** This builds intuition faster than reading alone.
- **Tutorials within a cluster build on each other**, but clusters are largely independent.
- **Time estimates** are for reading + running code. Add 15–20 min if you do all exercises.
:::

---

## Start Here {#sec-start}

Before diving into any domain, complete this introduction to the roofline model.

::: {.tutorial-grid}
::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 0 · Hello, Roofline

**Question:** *How do I predict whether my model is memory-bound or compute-bound?*

Five lines of code, one answer. The foundation for everything that follows.

⏱ ~10 min

[Start Tutorial →](00_hello_roofline.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 1: Node (Walls 1–3) {#sec-node}

*One accelerator, one model. Where is the ceiling?*

These tutorials explore the three walls that constrain a single accelerator: **compute throughput** (Wall 1), **memory capacity** (Wall 2), and **memory bandwidth** (Wall 3). Understanding which wall binds, and why, is the most fundamental skill in ML systems reasoning.

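
As a rough illustration of how the three walls interact, the sketch below checks capacity first and then compares compute time against memory-transfer time, in the spirit of a roofline analysis. It is plain Python, not the MLSys·im API: the `Accelerator` class, the `binding_wall` function, and all hardware and workload numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical hardware description (not an MLSys·im type)."""
    peak_flops: float     # FLOP/s, Wall 1: compute throughput
    mem_capacity: float   # bytes, Wall 2: memory capacity
    mem_bandwidth: float  # bytes/s, Wall 3: memory bandwidth


def binding_wall(hw: Accelerator, flops: float, bytes_moved: float,
                 resident_bytes: float) -> str:
    """Return the name of the wall that binds for one step of work."""
    # Wall 2: if the working set does not fit, nothing else matters.
    if resident_bytes > hw.mem_capacity:
        return "memory capacity"
    # Roofline view: a step takes at least as long as the slower resource.
    t_compute = flops / hw.peak_flops
    t_memory = bytes_moved / hw.mem_bandwidth
    return "compute throughput" if t_compute >= t_memory else "memory bandwidth"


# Assumed round numbers for an 80 GB accelerator: 300 TFLOP/s, 2 TB/s.
gpu = Accelerator(peak_flops=300e12, mem_capacity=80e9, mem_bandwidth=2e12)

# Assumed workload: a 7B-parameter model with 16-bit weights (14 GB),
# costing about 2 FLOPs per parameter per token.
weights = 7e9 * 2
decode = binding_wall(gpu, flops=2 * 7e9, bytes_moved=weights,
                      resident_bytes=weights)
prefill = binding_wall(gpu, flops=2 * 7e9 * 2048, bytes_moved=weights,
                       resident_bytes=weights)
# Assumed 70B-parameter model with 16-bit weights (140 GB): does not fit.
too_big = binding_wall(gpu, flops=2 * 70e9, bytes_moved=70e9 * 2,
                       resident_bytes=70e9 * 2)

print(decode)   # memory bandwidth
print(prefill)  # compute throughput
print(too_big)  # memory capacity
```

In this toy setup, single-token decode and a 2048-token prefill hit different walls on the same hardware, which is the pattern Tutorials 1 and 2 examine with the real engine.
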

::: {.tutorial-grid}
::: {.tutorial-card}
[Beginner]{.tutorial-level .level-beginner}

### 1 · The Memory Wall

**Question:** *Why doesn't 3.2× more FLOPS give 3.2× speedup?*

Compare A100 → H100 and discover that for LLM inference, bandwidth, not compute, is the binding constraint. Assuming that speedup tracks FLOPS is the most important fallacy in ML systems.

⏱ ~15 min

[Start Tutorial →](01_memory_wall.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 2 · Two Phases, One Request

**Question:** *Why is LLM serving fundamentally different from CNN inference?*

The same model on the same GPU hits two different ceilings: prefill is compute-bound, decode is memory-bound. This is why LLM serving requires its own analysis.

⏱ ~15 min

[Start Tutorial →](02_two_phases.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 3 · KV-Cache: The Hidden Tax

**Question:** *What actually limits how many users I can serve concurrently?*

At 128K context length, the KV-cache alone fills an 80 GB GPU. Explore how sequence length, batch size, and paged attention interact to constrain serving capacity.

⏱ ~20 min

[Start Tutorial →](03_kv_cache.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 2: Data (Walls 8–10) {#sec-data}

*The GPU is fast. Is the pipeline faster?*

Even the fastest accelerator sits idle if the data pipeline cannot keep up. These walls cover **ingestion** (Wall 8), **transformation** (Wall 9), and **storage bandwidth** (Wall 10).

::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 4 · Starving the GPU

**Question:** *Why is my GPU utilization only 40%?*

A100 compute takes 48 ms per step, but the CPU augmentation pipeline is the true bottleneck. The binding constraint is JPEG decoding, not silicon.

⏱ ~15 min

[Start Tutorial →](04_starving_the_gpu.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 3: Algorithm (Walls 11–13) {#sec-algorithm}

*Can I make the model smaller or the training cheaper?*

These walls govern **scaling laws** (Wall 11), **compression and quantization** (Wall 12), and **architecture efficiency** (Wall 13).

::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 5 · Quantization: Not a Free Lunch

**Question:** *Does INT4 always give 4× speedup?*

For memory-bound decode: nearly 4×. For compute-bound training: essentially none. The *regime* determines whether quantization helps, and the roofline tells you which regime you're in.

⏱ ~20 min

[Start Tutorial →](05_quantization.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 4: Fleet (Walls 14–16) {#sec-fleet}

*Scaling past one machine. Where does efficiency go?*

Distributed training introduces three new walls: **communication overhead** (Wall 14), **synchronization cost** (Wall 15), and **reliability** (Wall 16).

::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 6 · Scaling to 1000 GPUs

**Question:** *Where does my training efficiency disappear at scale?*

At 1024 GPUs, AllReduce communication overhead erodes scaling efficiency. But the real hidden cost is reliability: cluster MTBF drops to ~20 hours, forcing frequent checkpoints that consume more wall-clock time than communication itself.

⏱ ~20 min

[Start Tutorial →](06_scaling_1000_gpus.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 5: Ops (Walls 17–20) {#sec-ops}

*What does it cost in dollars, carbon, and water?*

Operational walls cover **energy** (Wall 17), **sustainability** (Wall 18), **economic cost** (Wall 19), and **safety** (Wall 20). These are the walls that determine whether a technically feasible system is actually deployable.

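
To a first approximation, the energy, sustainability, and cost walls chain together by multiplication: power × time gives energy, energy × grid carbon intensity gives emissions, and energy × electricity price gives an energy bill. The sketch below assumes that simple linear model. The function names, the PUE factor, and every numeric input are placeholders for illustration, not MLSys·im outputs or measurements of any real region.

```python
def training_energy_kwh(num_gpus: int, watts_per_gpu: float,
                        hours: float, pue: float = 1.0) -> float:
    """Facility energy in kWh; PUE scales IT power to include overheads."""
    return num_gpus * watts_per_gpu * hours * pue / 1000.0


def emissions_tonnes(energy_kwh: float, grams_co2_per_kwh: float) -> float:
    """CO2 in metric tonnes for a given grid carbon intensity."""
    return energy_kwh * grams_co2_per_kwh / 1e6


def energy_cost_usd(energy_kwh: float, usd_per_kwh: float) -> float:
    """Electricity bill for the run at a flat price per kWh."""
    return energy_kwh * usd_per_kwh


# Placeholder inputs (assumptions): 256 GPUs at 500 W for 30 days, PUE 1.2.
energy = training_energy_kwh(256, 500.0, hours=30 * 24, pue=1.2)

# Two hypothetical grids that differ only in carbon intensity (gCO2/kWh).
dirty = emissions_tonnes(energy, grams_co2_per_kwh=400.0)
clean = emissions_tonnes(energy, grams_co2_per_kwh=10.0)

print(f"energy: {energy:,.0f} kWh")
print(f"emissions: {dirty:.1f} t vs {clean:.1f} t")
print(f"ratio: {dirty / clean:.0f}x")
```

Because emissions are linear in grid intensity under this model, the ratio between two sites depends only on their intensities, not on cluster size or training duration.
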

::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 7 · Geography is a Systems Variable

**Question:** *Does it matter where I train?*

Same 256-GPU cluster, same model, same duration: 412 tonnes CO₂ in Iowa vs. 10 tonnes in Québec. A 40× difference from geography alone.

⏱ ~15 min

[Start Tutorial →](07_geography.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### 8 · The $9M Question

**Question:** *How much does chain-of-thought reasoning actually cost?*

K=8 reasoning steps multiply your serving bill by 7.6×, from $1.2M to $9.1M per year. A seemingly simple algorithmic choice becomes a capital expenditure decision.

⏱ ~20 min

[Start Tutorial →](08_nine_million_dollar.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 6: Analysis (Walls 21–22) {#sec-analysis}

*Cross-cutting diagnostics. Which knob matters most?*

These walls provide the tools for **sensitivity analysis** (Wall 21) and **synthesis** (Wall 22): the ability to ask "what if?" and "what must be true?"

::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 9 · Where to Invest: Sensitivity Analysis

**Question:** *Should I buy more FLOPS or more bandwidth?*

∂T/∂BW = −0.88 vs. ∂T/∂FLOPS = −0.06. For LLM inference, a 10% bandwidth increase yields 15× more improvement than a 10% compute increase. Then use inverse Roofline to derive the minimum hardware spec from an SLA.

⏱ ~20 min

[Start Tutorial →](09_sensitivity.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 10 · GPU vs. Wafer-Scale

**Question:** *Can a fundamentally different architecture change which wall binds?*

Cerebras eliminates the HBM memory wall entirely, but the binding constraint shifts to injection bandwidth. A qualitative regime change, not just a speedup.

⏱ ~20 min

[Start Tutorial →](10_gpu_vs_wafer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 11 · Design Space Exploration

**Question:** *How do I search for the best architecture without writing nested loops?*

Learn to use the declarative DSE Engine to navigate complex ML trade-offs and filter configurations based on SLA constraints.

⏱ ~20 min

[Start Tutorial →](12_design_space_exploration.qmd){.tutorial-arrow}
:::
:::

---

## Cluster 7: Capstone {#sec-capstone}

*Compose everything. Six domains, one analysis.*

::: {.tutorial-grid}
::: {.tutorial-card}
[Advanced]{.tutorial-level .level-advanced}

### 12 · Full-Stack Audit: LLaMA-70B Training

**Question:** *What does a complete systems analysis look like?*

Trace LLaMA-70B training through all six domains: Node → Data → Algorithm → Fleet → Ops → Analysis. Twelve of the 22 walls are exercised in one coherent analysis.

⏱ ~30 min

[Start Tutorial →](12_full_stack_audit.qmd){.tutorial-arrow}
:::
:::

---

## Extending MLSys·im

::: {.tutorial-grid}
::: {.tutorial-card}
[Intermediate]{.tutorial-level .level-intermediate}

### The Differential Explainer

**Question:** *How do I automatically explain why a hardware upgrade didn't work?*

Learn to use the Differential Explainer to compare two configurations and generate a human-readable explanation of Regime Shifts.

⏱ ~10 min

[Start Tutorial →](02_differential_explainer.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Custom Solvers & Hardware

Learn to contribute new hardware specifications to the Silicon Zoo or build your own analytical solvers using the 5-layer architecture.

⏱ ~15 min

[Start Guide →](../contributing.qmd){.tutorial-arrow}
:::

::: {.tutorial-card}
[Developer]{.tutorial-level .level-developer}

### Composable Pipelines & Callbacks

Learn how to snap custom analytical solvers into the MLSys·im Pipeline and use middleware hooks to log data to external MLOps platforms.

⏱ ~15 min

[Start Tutorial →](01_pipeline_callbacks.qmd){.tutorial-arrow}
:::
:::

---

## Learning Paths

Choose a path based on your role:

### Path A: First-Time Learner (~2 hours)

> *"I'm new to ML systems and want to build intuition from scratch."*

0 → 1 → 2 → 3 → 4 → 5 → 7 → 12

*Why this order:* Start with single-node physics (roofline, memory wall, two phases), understand the KV-cache memory constraint that dominates LLM serving, see the data pipeline bottleneck, understand quantization regimes, learn that geography matters, then compose it all in the capstone. Tutorial 6 (distributed reliability) is deferred; come back to it after the capstone.

### Path B: ML Engineer (~2.5 hours)

> *"I deploy models in production and need to make hardware decisions."*

0 → 1 → 2 → 3 → 8 → 9 → 12

### Path C: Researcher (~2 hours)

> *"I evaluate hardware architectures and need quantitative tools."*

0 → 1 → 5 → 9 → 10 → 6 → 11 → 12

### Path D: Conference Tutorial (90 min)

> *"I'm attending a live tutorial at ISCA / MLSys / ASPLOS."*

| Time | Tutorial | Core Message | Format |
|------|----------|-------------|--------|
| 0–15 min | **0. Hello, Roofline** | The equation, the tool, the regime | Live coding, audience predicts |
| 15–35 min | **1. The Memory Wall** | Why 3.2× FLOPS ≠ 3.2× speedup | Live coding, audience verifies |
| 35–45 min | **Break + Q&A** | | |
| 45–60 min | **9. Sensitivity** | Bandwidth is 15× more valuable | Live coding + derivation |
| 60–75 min | **6. Scaling to 1000 GPUs** | Reliability dominates communication | Pre-computed results, discussion |
| 75–85 min | **12. Full-Stack Audit** | Composing all six domains | Summary table walkthrough |
| 85–90 min | **Where to learn more** | Take-home exercises | |

*Why 5 tutorials, not 8:* Each tutorial needs enough time for the predict-compute-reflect cycle to land. Tutorials 2, 5, and 7 are available as take-home exercises for attendees who want to continue after the session.

### Path D+: Half-Day Workshop (3 hours)

> *"I'm attending a half-day conference tutorial."*

A 10-tutorial sequence with hands-on exercises: all 8 tutorials from the original conference path (the 5 presented live plus take-home Tutorials 2, 5, and 7), plus Tutorials 3 (KV-Cache) and 10 (Wafer-Scale), with 15-minute breaks between clusters.

---

> **All tutorials are Quarto-compatible.** Run them locally after `pip install mlsysim`,
> or browse the rendered versions on this website.