# Instructor Quick-Start Guide

> **Get mlsysim into your ML Systems course in 15 minutes.**

---

## The 15-Minute Adoption Path

### Minute 0–3: Install and Verify

```bash
pip install mlsysim
python3 -c "import mlsysim; print(mlsysim.__version__)"
# Should print: 0.1.0
```

### Minute 3–8: Your First Live Demo

Copy this into a Jupyter cell or Python script. This IS your first lecture demo:

```python
import mlsysim

# "Is Llama-3 8B compute-bound or memory-bound on an H100?"
profile = mlsysim.Engine.solve(
    mlsysim.Models.Language.Llama3_8B,
    mlsysim.Hardware.Cloud.H100,
    batch_size=1,
)

print(f"Bottleneck: {profile.bottleneck}")   # → Memory
print(f"MFU: {profile.mfu:.3f}")             # → 0.003 (nearly idle compute!)
print(f"Latency: {profile.latency:.2f} ms")  # → 5.60 ms

# Now change batch_size to 256 and watch the bottleneck shift...
```

**The teaching moment:** Students predict "Compute-bound, because GPUs are fast."
They run it and see "Memory-bound, MFU = 0.3%." Their intuition breaks. You rebuild
it with the Roofline model. That's the entire first lecture.

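The "Memory" verdict is easy to sanity-check on the board. Here is a minimal back-of-the-envelope sketch, assuming FP16 weights streamed once per generated token and H100 datasheet numbers (989 TFLOPS, 3.35 TB/s); mlsysim's internal model is more detailed, so exact latencies will differ:

```python
# Back-of-the-envelope roofline check (assumed numbers, not mlsysim internals):
# 8e9 params, FP16 weights read once per token, 989 TFLOPS peak, 3.35 TB/s HBM.
params = 8e9
flops_per_token = 2 * params      # ~2 FLOPs per parameter per generated token
bytes_per_token = 2 * params      # every FP16 weight (2 bytes) streamed once

compute_time = flops_per_token / 989e12   # seconds spent on math
memory_time = bytes_per_token / 3.35e12   # seconds spent moving weights

latency = max(compute_time, memory_time)
mfu = compute_time / latency              # fraction of peak compute actually used

print(memory_time > compute_time)  # → True (memory-bound)
print(f"{mfu:.4f}")                # → 0.0034
```

The ratio comes out to roughly 0.3% of peak compute, which is exactly the MFU the demo reports.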
### Minute 8–12: Your First Homework Problem

> **Problem:** The H100 has 6.3× more FLOPS than the A100 (989 vs 156 TFLOPS FP16 dense).
> Use `Engine.solve()` to compare Llama-3-8B inference at batch=1 on both.
> How much faster is the H100? Why isn't it 6.3×?

**Expected answer:** ~1.7× faster. A memory-bound workload scales with the bandwidth
ratio (3.35/2.04 ≈ 1.64×), not the FLOPS ratio. Students learn that advertised
FLOPS mean nothing for memory-bound workloads.

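The expected speedup is just the ratio of the two memory bandwidths, which students can verify in two lines (datasheet values assumed; mlsysim's internal numbers may differ slightly):

```python
h100_bw = 3.35e12  # H100 HBM3 bandwidth, bytes/s
a100_bw = 2.04e12  # A100 80GB HBM2e bandwidth, bytes/s

# A memory-bound workload speeds up by the bandwidth ratio, not the FLOPS ratio.
speedup = h100_bw / a100_bw
print(f"{speedup:.2f}x")  # → 1.64x
```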
### Minute 12–15: Plan Your Semester

| Week | Topic | mlsysim Exercise |
|------|-------|------------------|
| 2 | Roofline Model | Batch-size sweep: find the compute↔memory crossover |
| 4 | LLM Serving | KV cache capacity: how many concurrent requests? |
| 6 | Quantization | Compress Llama-3 to INT4: what's the fleet impact? |
| 8 | Distributed Training | Scale from 8 to 256 GPUs: find the efficiency cliff |
| 10 | TCO & Carbon | Move training to Quebec: how much carbon saved? |
| 12 | Design Challenge | $5M budget, Llama-3 70B at 1000 QPS; design the fleet |

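The Week 2 sweep has a closed-form target you can derive on the board. Under a simple weight-streaming model (per-step compute grows with batch size while weight traffic does not), the crossover batch is roughly peak FLOPS divided by memory bandwidth. A sketch with assumed H100 datasheet numbers, not mlsysim's exact model:

```python
peak = 989e12        # H100 FP16 dense, FLOP/s (assumed datasheet value)
bandwidth = 3.35e12  # H100 HBM3, bytes/s (assumed datasheet value)

# compute_time = 2*P*B/peak and memory_time = 2*P/bandwidth for FP16 weights,
# so the two curves cross at B ≈ peak/bandwidth, independent of model size P.
crossover_batch = peak / bandwidth
print(round(crossover_batch))  # → 295
```

That is why the demo's suggestion to try `batch_size=256` lands right at the edge of the bottleneck shift.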

Each exercise takes 20–30 minutes of class time. Solutions are in
[exercises.md](exercises.md).

---

## What Students Need

- Python 3.10+
- `pip install mlsysim`
- No GPU required. No cloud account. Everything runs on a laptop CPU.
- See [prerequisites.md](prerequisites.md) for detailed setup.

## What You Get

| Material | File | Description |
|----------|------|-------------|
| Tutorial slides | `slides/tutorial_part1.tex` + `tutorial_part2.tex` | 102 Beamer slides with speaker notes |
| 8 exercises | `exercises.md` | Hands-on problems with expected answers |
| Cheat sheet | `cheatsheet.md` | Single-page reference (Iron Law + key equations) |
| Pre/post quiz | `assessment/quiz.md` | 10-question assessment with distractor analysis |
| Backward design | `DESIGN.md` | Learning goals, "aha moments," facilitation notes |
| 6 SVG figures | `slides/images/svg/` | Publication-quality diagrams (Roofline, AllReduce, etc.) |

## Auto-Grading Hint

mlsysim returns typed Pydantic objects. You can auto-grade by checking:

```python
# Student submits their analysis
result = mlsysim.Engine.solve(model, hardware, batch_size=student_batch)

# Auto-grade: check the bottleneck label
assert result.bottleneck == "Memory", "Expected Memory-bound at batch=1"

# Auto-grade: check MFU is in the right range
assert 0.001 < result.mfu < 0.01, f"MFU {result.mfu:.3f} outside expected range"

# Auto-grade: check feasibility
assert result.feasible, "Model should fit on this hardware"
```

This works with Gradescope, nbgrader, or any Python-based autograder.

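For partial credit, the same checks can be folded into a scoring function. A sketch using a `SimpleNamespace` stand-in for the profile object (the values are hypothetical; in a real grader you would pass the object returned by `Engine.solve`):

```python
from types import SimpleNamespace

def grade(result):
    """Score one submission; returns (points, feedback) for the autograder."""
    checks = [
        (result.bottleneck == "Memory", "bottleneck label"),
        (0.001 < result.mfu < 0.01, "MFU range"),
        (result.feasible, "model fits on hardware"),
    ]
    points = sum(ok for ok, _ in checks)
    feedback = [f"failed: {name}" for ok, name in checks if not ok]
    return points, feedback

# Stand-in profile with illustrative values, not real mlsysim output
result = SimpleNamespace(bottleneck="Memory", mfu=0.003, feasible=True)
print(grade(result))  # → (3, [])
```

Returning `(points, feedback)` instead of raising on the first failure lets Gradescope and nbgrader report every missed check at once.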
## The Pedagogical Framework

mlsysim is organized around the **Iron Law of ML Systems**:

```
Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)
```

Every concept in your course maps to one term in this equation. Every homework
problem is about understanding which term is the bottleneck and how to improve it.
The 22-wall taxonomy provides the vocabulary; the Iron Law provides the structure.

See [laws-explained.md](../docs/laws-explained.md) for plain-English explanations
of all 22 constraints.

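As a board example, plugging numbers into the Iron Law turns it into a one-line time estimate. All values below are illustrative assumptions, not mlsysim defaults:

```python
# Iron Law: Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)
flops = 4.8e22        # e.g. ~6 * 8e9 params * 1e12 training tokens (assumed)
n_gpus = 256
peak = 989e12         # per-GPU peak FLOP/s (H100 FP16 dense, assumed)
mfu = 0.40            # achieved fraction of peak compute
eta_scaling = 0.90    # multi-GPU scaling efficiency
goodput = 0.95        # wall-clock fraction not lost to failures/restarts

seconds = flops / (n_gpus * peak * mfu * eta_scaling * goodput)
print(f"{seconds / 86400:.1f} days")  # → 6.4 days
```

Halving any denominator term doubles the estimate, which is why each week of the semester plan attacks exactly one of these factors.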
## Getting Help

- **Issues:** [github.com/harvard-edge/cs249r_book/issues](https://github.com/harvard-edge/cs249r_book/issues) (use the mlsysim template)
- **Documentation:** [mlsysbook.ai/mlsysim](https://mlsysbook.ai/mlsysim)
- **Citation:** See `CITATION.cff` in the package root