mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-06 17:49:07 -05:00


Vijay Janapa Reddi f62dc8cca2 docs(tutorial): instructor quick-start guide — 15 min adoption path

Covers: install, first demo, first homework, semester plan, auto-grading
hints, and material inventory. Designed to minimize instructor adoption
friction for ML systems courses.

2026-04-01 23:19:21 -04:00


Instructor Quick-Start Guide

Get mlsysim into your ML Systems course in 15 minutes.

The 15-Minute Adoption Path

Minute 0–3: Install and Verify

pip install mlsysim
python3 -c "import mlsysim; print(mlsysim.__version__)"
# Should print: 0.1.0

Minute 3–8: Your First Live Demo

Copy this into a Jupyter cell or Python script. This IS your first lecture demo:

import mlsysim

# "Is Llama-3 8B compute-bound or memory-bound on an H100?"
profile = mlsysim.Engine.solve(
    mlsysim.Models.Language.Llama3_8B,
    mlsysim.Hardware.Cloud.H100,
    batch_size=1,
)

print(f"Bottleneck: {profile.bottleneck}")  # → Memory
print(f"MFU:        {profile.mfu:.3f}")     # → 0.003 (nearly idle compute!)
print(f"Latency:    {profile.latency:.2f}") # → 5.60 ms

# Now change batch_size to 256 and watch the bottleneck shift...

The teaching moment: Students predict "Compute-bound, because GPUs are fast." They run it and see "Memory-bound, MFU=0.3%." Their intuition breaks. You rebuild it with the Roofline model. That's the entire first lecture.
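
The roofline argument behind this demo can be checked by hand before running the simulator. The sketch below is a back-of-envelope model, not mlsysim's implementation. It assumes roughly 8e9 parameters stored in FP16, about 2 FLOPs per parameter per generated token, weights streamed from memory once per decoding step regardless of batch size, and the H100 peak figures quoted in this guide (989 TFLOPS, 3.35 TB/s). KV-cache and activation traffic are ignored, and the function name `roofline_decode` is hypothetical.

```python
# Back-of-envelope roofline for one decoding step (illustrative, not mlsysim's model).
PARAMS = 8e9            # assumed parameter count for Llama-3 8B
BYTES_PER_PARAM = 2     # FP16 weights
PEAK_FLOPS = 989e12     # H100 dense FP16 peak, FLOP/s
PEAK_BW = 3.35e12       # H100 HBM bandwidth, bytes/s


def roofline_decode(batch_size: int) -> dict:
    """Estimate the bottleneck, step time, and MFU for one decoding step."""
    flops = 2 * PARAMS * batch_size          # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights streamed once per step
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    step_time = max(t_compute, t_memory)     # the slower resource sets the pace
    return {
        "bottleneck": "Compute" if t_compute > t_memory else "Memory",
        "latency_ms": step_time * 1e3,
        "mfu": t_compute / step_time,
    }


for batch in (1, 32, 512):
    r = roofline_decode(batch)
    print(f"batch={batch:>3}  {r['bottleneck']:<7}  "
          f"{r['latency_ms']:.2f} ms  MFU={r['mfu']:.3f}")
```

In this idealized model the crossover sits at the ridge point, PEAK_FLOPS / PEAK_BW, which is about 295 FLOP/byte. mlsysim applies its own efficiency assumptions, so its latency and crossover need not match these estimates.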

Minute 8–12: Your First Homework Problem

Problem: The H100 has about 3.2× the peak FLOPS of the A100 (989 vs 312 TFLOPS, FP16 dense). Use Engine.solve() to compare Llama-3-8B inference at batch=1 on both. How much faster is the H100? Why isn't it 3.2×?

Expected answer: ~1.7× faster. A memory-bound workload scales with the memory-bandwidth ratio (3.35 / 2.04 TB/s ≈ 1.64×), not the FLOPS ratio. Students learn that advertised peak FLOPS say little about memory-bound workloads.
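
The bandwidth ratio in the expected answer follows from a one-line estimate. As a minimal sketch, assume batch-1 decoding is limited purely by streaming the FP16 weights from HBM (about 16 GB for an assumed 8e9 parameters) and use the two bandwidth figures quoted above. The helper name `time_per_token_ms` is hypothetical.

```python
# Memory-bound estimate: time per token = bytes of weights / memory bandwidth.
WEIGHT_BYTES = 8e9 * 2      # assumed: 8e9 parameters in FP16

BANDWIDTH = {               # HBM bandwidth in bytes/s, as quoted in the expected answer
    "A100": 2.04e12,
    "H100": 3.35e12,
}


def time_per_token_ms(gpu: str) -> float:
    """Lower-bound decode latency when weight streaming dominates."""
    return WEIGHT_BYTES / BANDWIDTH[gpu] * 1e3


a100_ms = time_per_token_ms("A100")
h100_ms = time_per_token_ms("H100")
speedup = a100_ms / h100_ms

print(f"A100: {a100_ms:.2f} ms   H100: {h100_ms:.2f} ms   speedup: {speedup:.2f}x")
```

The model size cancels out of the ratio, which is why the estimated speedup depends on bandwidth alone.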

Minute 12–15: Plan Your Semester

Week	Topic	mlsysim Exercise
2	Roofline Model	Batch-size sweep: find the compute↔memory crossover
4	LLM Serving	KV cache capacity: how many concurrent requests?
6	Quantization	Compress Llama-3 to INT4: what's the fleet impact?
8	Distributed Training	Scale from 8 to 256 GPUs: find the efficiency cliff
10	TCO & Carbon	Move training to Quebec: how much carbon saved?
12	Design Challenge	$5M budget, Llama-3 70B at 1000 QPS — design the fleet

Each exercise takes 20–30 minutes of class time. Solutions are in exercises.md.
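
As an example of the arithmetic behind the Week 4 exercise, the sketch below estimates KV-cache capacity by hand. It is not taken from exercises.md. It assumes the published Llama-3 8B shape (32 layers, 8 KV heads, head dimension 128), an FP16 cache, an 80 GB accelerator, about 16 GB of weights, and a hypothetical 8,192-token context per request, with no allowance for activations or fragmentation.

```python
# Hand estimate of how many requests fit in accelerator memory (illustrative assumptions).
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # FP16 cache

HBM_BYTES = 80e9             # assumed 80 GB accelerator
WEIGHT_BYTES = 16e9          # assumed ~8e9 parameters in FP16


def kv_bytes_per_token() -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def max_concurrent_requests(context_len: int) -> int:
    """Requests that fit once the weights are resident, ignoring other overheads."""
    free_bytes = HBM_BYTES - WEIGHT_BYTES
    per_request = kv_bytes_per_token() * context_len
    return int(free_bytes // per_request)


print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"Concurrent 8K-context requests: {max_concurrent_requests(8192)}")
```

Students can compare a hand estimate like this against the simulator's answer and discuss which overheads account for any gap.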

What Students Need

Python 3.10+
pip install mlsysim
No GPU required. No cloud account. Everything runs on a laptop CPU.
See prerequisites.md for detailed setup.

What You Get

Material	File	Description
Tutorial slides	`slides/tutorial_part1.tex` + `tutorial_part2.tex`	102 Beamer slides with speaker notes
8 exercises	`exercises.md`	Hands-on problems with expected answers
Cheat sheet	`cheatsheet.md`	Single-page reference (Iron Law + key equations)
Pre/post quiz	`assessment/quiz.md`	10-question assessment with distractor analysis
Backward design	`DESIGN.md`	Learning goals, "aha moments," facilitation notes
6 SVG figures	`slides/images/svg/`	Publication-quality diagrams (Roofline, AllReduce, etc.)

Auto-Grading Hint

mlsysim returns typed Pydantic objects. You can auto-grade by checking:

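import mlsysim

# Reference configuration for the graded problem (same as the live demo)
model = mlsysim.Models.Language.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100
student_batch = 1  # batch size taken from the student's submission

# Re-run the student's configuration
result = mlsysim.Engine.solve(model, hardware, batch_size=student_batch)

# Auto-grade: check the bottleneck label
assert result.bottleneck == "Memory", "Expected Memory-bound at batch=1"

# Auto-grade: check MFU is in the right range
assert 0.001 < result.mfu < 0.01, f"MFU {result.mfu:.3f} outside expected range"

# Auto-grade: check feasibility
assert result.feasible, "Model should fit on this hardware"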

This works with Gradescope, nbgrader, or any Python-based autograder.
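
For partial credit, the same checks can be wrapped in a function that collects failures instead of stopping at the first assert. This is a sketch under assumptions: the function name `grade_profile` and the one-point-per-check scoring are hypothetical, and a `SimpleNamespace` stands in for the object returned by Engine.solve() so the example runs without mlsysim installed. Only the three fields used above (bottleneck, mfu, feasible) are accessed.

```python
from types import SimpleNamespace


def grade_profile(result, expected_bottleneck="Memory", mfu_range=(0.001, 0.01)):
    """Return (score, feedback) for a result exposing bottleneck, mfu, and feasible."""
    score, feedback = 0, []

    if result.bottleneck == expected_bottleneck:
        score += 1
    else:
        feedback.append(f"Expected {expected_bottleneck}-bound, got {result.bottleneck}")

    low, high = mfu_range
    if low < result.mfu < high:
        score += 1
    else:
        feedback.append(f"MFU {result.mfu:.3f} outside ({low}, {high})")

    if result.feasible:
        score += 1
    else:
        feedback.append("Model should fit on this hardware")

    return score, feedback


# Stand-in for a real Engine.solve() result, using the demo's batch=1 values.
fake_result = SimpleNamespace(bottleneck="Memory", mfu=0.003, feasible=True)
score, feedback = grade_profile(fake_result)
print(f"{score}/3", feedback)
```

Returning feedback strings rather than raising lets the autograder show students every failed check in one pass.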

The Pedagogical Framework

mlsysim is organized around the Iron Law of ML Systems:

Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)

Every concept in your course maps to one term in this equation. Every homework problem is about understanding which term is the bottleneck and how to improve it. The 22-wall taxonomy provides the vocabulary; the Iron Law provides the structure.

See laws-explained.md for plain-English explanations of all 22 constraints.
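
To make the equation concrete, the sketch below evaluates it for a made-up training run. Every number is illustrative rather than drawn from mlsysim: the FLOP count uses the common approximation of 6 FLOPs per parameter per training token, and the efficiency terms are placeholders. The function name `iron_law_seconds` is hypothetical.

```python
def iron_law_seconds(flops, n_devices, peak_flops, mfu, eta_scaling, goodput):
    """Time = FLOPs / (N x Peak x MFU x eta_scaling x Goodput)."""
    return flops / (n_devices * peak_flops * mfu * eta_scaling * goodput)


# Illustrative inputs (assumptions, not mlsysim defaults).
total_flops = 6 * 8e9 * 15e12    # ~6 FLOPs per parameter per training token
baseline = dict(
    flops=total_flops,
    n_devices=256,
    peak_flops=989e12,           # H100 dense FP16 peak quoted in this guide
    mfu=0.40,
    eta_scaling=0.90,
    goodput=0.95,
)

days = iron_law_seconds(**baseline) / 86_400
print(f"Baseline: {days:.1f} days")

# Improving any one denominator term shortens the run by the same factor.
better_mfu = {**baseline, "mfu": 0.50}
print(f"MFU 0.40 -> 0.50: {iron_law_seconds(**better_mfu) / 86_400:.1f} days")
```

Because the terms multiply, a homework problem can vary one factor at a time and ask students which change buys the most time.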

Getting Help

Issues: github.com/harvard-edge/cs249r_book/issues (use the mlsysim template)
Documentation: mlsysbook.ai/mlsysim
Citation: See CITATION.cff in the package root
