cs249r_book/mlsysim/tutorial/instructor-quickstart.md
Vijay Janapa Reddi f62dc8cca2 docs(tutorial): instructor quick-start guide — 15 min adoption path
Covers: install, first demo, first homework, semester plan, auto-grading
hints, and material inventory. Designed to minimize instructor adoption
friction for ML systems courses.
2026-04-01 23:19:21 -04:00


Instructor Quick-Start Guide

Get mlsysim into your ML Systems course in 15 minutes.


The 15-Minute Adoption Path

Minutes 0–3: Install and Verify

pip install mlsysim
python3 -c "import mlsysim; print(mlsysim.__version__)"
# Should print: 0.1.0

Minutes 3–8: Your First Live Demo

Copy this into a Jupyter cell or Python script. This IS your first lecture demo:

import mlsysim

# "Is Llama-3 8B compute-bound or memory-bound on an H100?"
profile = mlsysim.Engine.solve(
    mlsysim.Models.Language.Llama3_8B,
    mlsysim.Hardware.Cloud.H100,
    batch_size=1,
)

print(f"Bottleneck: {profile.bottleneck}")  # → Memory
print(f"MFU:        {profile.mfu:.3f}")     # → 0.003 (nearly idle compute!)
print(f"Latency:    {profile.latency:.2f}") # → 5.60 ms

# Now change batch_size to 256 and watch the bottleneck shift...

The teaching moment: Students predict "Compute-bound, because GPUs are fast." They run it and see "Memory-bound, MFU=0.3%." Their intuition breaks. You rebuild it with the Roofline model. That's the entire first lecture.
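The roofline arithmetic behind that surprise fits in a few lines of plain Python, with no mlsysim dependency. The cost model here is an assumption for illustration (2 FLOPs per parameter per token, FP16 weights streamed from HBM once per decode step), not a description of mlsysim's internals; the H100 specs are its public datasheet numbers:

```python
# Back-of-envelope roofline check for Llama-3 8B decode on an H100.
# Assumed cost model (illustrative, not mlsysim's internals): 2 FLOPs per
# parameter per token, FP16 weights (2 bytes/param) read from HBM once per step.
PARAMS = 8e9          # Llama-3 8B parameter count
PEAK_FLOPS = 989e12   # H100 dense FP16, FLOPs/s
PEAK_BW = 3.35e12     # H100 HBM3 bandwidth, bytes/s

ridge = PEAK_FLOPS / PEAK_BW  # machine balance: ~295 FLOPs per byte

def mfu(batch_size):
    flops = 2 * PARAMS * batch_size    # work for one decode step of the batch
    bytes_moved = 2 * PARAMS           # weights read once, shared by the batch
    intensity = flops / bytes_moved    # simplifies to batch_size FLOPs/byte
    return min(1.0, intensity / ridge) # fraction of peak FLOPs achievable

print(f"ridge point:     {ridge:.0f} FLOPs/byte")
print(f"MFU @ batch=1:   {mfu(1):.3f}")    # ~0.003: compute is nearly idle
print(f"MFU @ batch=256: {mfu(256):.3f}")  # intensity approaches the ridge
```

At batch=1 the arithmetic intensity is ~1 FLOP/byte against a ridge of ~295, which is exactly why the demo prints an MFU near 0.003.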

Minutes 8–12: Your First Homework Problem

Problem: The H100 has 3.2× more FLOPS than the A100 (989 vs 312 TFLOPS FP16 dense). Use Engine.solve() to compare Llama-3-8B inference at batch=1 on both. How much faster is H100? Why isn't it 3.2×?

Expected answer: ~1.7× faster. Memory-bound workload scales with bandwidth ratio (3.35/2.04 ≈ 1.64×), not FLOPS ratio. Students learn that advertising FLOPS means nothing for memory-bound workloads.
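The expected answer reduces to one line of arithmetic: under the simplifying assumption that a batch=1 decode step is purely bandwidth-limited, step time is roughly weight_bytes / bandwidth on each GPU, so the speedup is just the bandwidth ratio:

```python
# Why H100 beats A100 by ~1.6x, not the FLOPS ratio, at batch=1.
# Assumption: the step is bandwidth-limited, so time ~ weight_bytes / bandwidth.
WEIGHT_BYTES = 8e9 * 2                    # Llama-3 8B in FP16, bytes
BW = {"A100": 2.04e12, "H100": 3.35e12}   # HBM bandwidth, bytes/s

latency_ms = {gpu: WEIGHT_BYTES / bw * 1e3 for gpu, bw in BW.items()}
speedup = latency_ms["A100"] / latency_ms["H100"]

print(f"A100: {latency_ms['A100']:.2f} ms, H100: {latency_ms['H100']:.2f} ms")
print(f"Speedup: {speedup:.2f}x")  # → 1.64x, the bandwidth ratio
```

Real measurements land near ~1.7× because of second-order effects this sketch ignores, but the dominant term is the 1.64× bandwidth ratio, not the FLOPS ratio.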

Minutes 12–15: Plan Your Semester

| Week | Topic | mlsysim Exercise |
|------|-------|------------------|
| 2 | Roofline Model | Batch-size sweep: find the compute↔memory crossover |
| 4 | LLM Serving | KV cache capacity: how many concurrent requests? |
| 6 | Quantization | Compress Llama-3 to INT4: what's the fleet impact? |
| 8 | Distributed Training | Scale from 8 to 256 GPUs: find the efficiency cliff |
| 10 | TCO & Carbon | Move training to Quebec: how much carbon saved? |
| 12 | Design Challenge | $5M budget, Llama-3 70B at 1000 QPS: design the fleet |

Each exercise takes 20–30 minutes of class time. Solutions are in exercises.md.
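As a worked sketch of the Week 2 exercise, the batch-size sweep can also be done analytically in a few lines. The cost model is an illustrative assumption (2 FLOPs per parameter per token, FP16 weights read from HBM once per decode step), not mlsysim's internal model:

```python
# Week 2 sketch: sweep batch size and find the compute<->memory crossover.
# Assumed cost model: 2 FLOPs/param/token, FP16 weights streamed once per step.
PARAMS, PEAK_FLOPS, PEAK_BW = 8e9, 989e12, 3.35e12  # Llama-3 8B on an H100

for batch in (1, 8, 64, 256, 512):
    t_compute = 2 * PARAMS * batch / PEAK_FLOPS  # time if FLOPs-limited, s
    t_memory = 2 * PARAMS / PEAK_BW              # time if bandwidth-limited, s
    bound = "Compute" if t_compute > t_memory else "Memory"
    print(f"batch={batch:4d}  {bound}-bound")
# The crossover lands where batch ~= PEAK_FLOPS / PEAK_BW (~295 here).
```

Running the sweep through Engine.solve() instead lets students check how far mlsysim's answer deviates from this idealized crossover.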


What Students Need

  • Python 3.10+
  • pip install mlsysim
  • No GPU required. No cloud account. Everything runs on a laptop CPU.
  • See prerequisites.md for detailed setup.

What You Get

| Material | File | Description |
|----------|------|-------------|
| Tutorial slides | slides/tutorial_part1.tex + tutorial_part2.tex | 102 Beamer slides with speaker notes |
| 8 exercises | exercises.md | Hands-on problems with expected answers |
| Cheat sheet | cheatsheet.md | Single-page reference (Iron Law + key equations) |
| Pre/post quiz | assessment/quiz.md | 10-question assessment with distractor analysis |
| Backward design | DESIGN.md | Learning goals, "aha moments," facilitation notes |
| 6 SVG figures | slides/images/svg/ | Publication-quality diagrams (Roofline, AllReduce, etc.) |

Auto-Grading Hint

mlsysim returns typed Pydantic objects. You can auto-grade by checking:

# Student submits their analysis
result = mlsysim.Engine.solve(model, hardware, batch_size=student_batch)

# Auto-grade: check the bottleneck label
assert result.bottleneck == "Memory", "Expected Memory-bound at batch=1"

# Auto-grade: check MFU is in the right range
assert 0.001 < result.mfu < 0.01, f"MFU {result.mfu:.3f} outside expected range"

# Auto-grade: check feasibility
assert result.feasible, "Model should fit on this hardware"

This works with Gradescope, nbgrader, or any Python-based autograder.
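Those assertions can be bundled into a rubric function. The sketch below is hypothetical: the `Profile` dataclass stands in for mlsysim's typed Pydantic result, and `grade_submission` and its point values are illustrative, not part of the package:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Stand-in for mlsysim's typed result object (illustrative only)."""
    bottleneck: str
    mfu: float
    feasible: bool

def grade_submission(result: Profile) -> tuple[int, list[str]]:
    """Score a batch=1 Llama-3 8B / H100 analysis out of 3 points."""
    score, feedback = 0, []
    if result.bottleneck == "Memory":
        score += 1
    else:
        feedback.append(f"Expected Memory-bound, got {result.bottleneck}")
    if 0.001 < result.mfu < 0.01:
        score += 1
    else:
        feedback.append(f"MFU {result.mfu:.3f} outside expected range")
    if result.feasible:
        score += 1
    else:
        feedback.append("Model should fit on this hardware")
    return score, feedback

score, feedback = grade_submission(Profile("Memory", 0.003, True))
print(score, feedback)  # → 3 []
```

Returning feedback strings alongside the score maps directly onto Gradescope's per-test output and nbgrader's cell-level comments.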

The Pedagogical Framework

mlsysim is organized around the Iron Law of ML Systems:

Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)

Every concept in your course maps to one term in this equation. Every homework problem is about understanding which term is the bottleneck and how to improve it. The 22-wall taxonomy provides the vocabulary; the Iron Law provides the structure.
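To make the equation concrete, here is the Iron Law evaluated for a hypothetical training run; every number below is illustrative, chosen only to show how the denominator terms multiply:

```python
# Iron Law: Time = FLOPs / (N * Peak * MFU * eta_scaling * Goodput)
# All values are illustrative, not measurements.
FLOPS_TOTAL = 1e24    # total training FLOPs for the run
N = 256               # number of accelerators
PEAK = 989e12         # per-device peak FLOPs/s (H100 dense FP16)
MFU = 0.40            # model FLOPs utilization
ETA_SCALING = 0.90    # multi-node scaling efficiency
GOODPUT = 0.95        # fraction of wall-clock doing useful work

time_s = FLOPS_TOTAL / (N * PEAK * MFU * ETA_SCALING * GOODPUT)
print(f"{time_s / 86400:.1f} days")
# Halving any denominator term doubles the time; each course week attacks
# one term (Peak -> hardware, MFU -> roofline, eta_scaling -> distributed...).
```

Having students re-run this with one term degraded at a time is a quick way to show that the terms compose multiplicatively.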

See laws-explained.md for plain-English explanations of all 22 constraints.

Getting Help