# Instructor Quick-Start Guide

> **Get mlsysim into your ML Systems course in 15 minutes.**

---

## The 15-Minute Adoption Path

### Minute 0–3: Install and Verify

```bash
pip install mlsysim
python3 -c "import mlsysim; print(mlsysim.__version__)"
# Should print: 0.1.0
```

### Minute 3–8: Your First Live Demo

Copy this into a Jupyter cell or Python script. This IS your first lecture demo:

```python
import mlsysim

# "Is Llama-3 8B compute-bound or memory-bound on an H100?"
profile = mlsysim.Engine.solve(
    mlsysim.Models.Language.Llama3_8B,
    mlsysim.Hardware.Cloud.H100,
    batch_size=1,
)

print(f"Bottleneck: {profile.bottleneck}")   # → Memory
print(f"MFU: {profile.mfu:.3f}")             # → 0.003 (nearly idle compute!)
print(f"Latency: {profile.latency:.2f} ms")  # → 5.60 ms

# Now change batch_size to 256 and watch the bottleneck shift...
```

**The teaching moment:** Students predict "Compute-bound, because GPUs are fast."
They run it and see "Memory-bound, MFU = 0.3%." Their intuition breaks. You rebuild
it with the Roofline model. That's the entire first lecture.

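The "Memory" verdict is easy to sanity-check on the board. Here is a minimal back-of-the-envelope sketch, assuming FP16 weights streamed once per generated token and H100 datasheet numbers (989 TFLOPS, 3.35 TB/s); mlsysim's internal model is more detailed, so exact latencies will differ:

```python
# Back-of-the-envelope roofline check (assumed numbers, not mlsysim internals):
# 8e9 params, FP16 weights read once per token, 989 TFLOPS peak, 3.35 TB/s HBM.
params = 8e9
flops_per_token = 2 * params      # ~2 FLOPs per parameter per generated token
bytes_per_token = 2 * params      # every FP16 weight (2 bytes) streamed once

compute_time = flops_per_token / 989e12   # seconds spent on math
memory_time = bytes_per_token / 3.35e12   # seconds spent moving weights

latency = max(compute_time, memory_time)
mfu = compute_time / latency              # fraction of peak compute actually used

print(memory_time > compute_time)  # → True (memory-bound)
print(f"{mfu:.4f}")                # → 0.0034
```

The ratio comes out to roughly 0.3% of peak compute, which is exactly the MFU the demo reports.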
### Minute 8–12: Your First Homework Problem

> **Problem:** The H100 has 6.3× more FLOPS than the A100 (989 vs 156 TFLOPS FP16 dense).
> Use `Engine.solve()` to compare Llama-3-8B inference at batch=1 on both.
> How much faster is the H100? Why isn't it 6.3×?

**Expected answer:** ~1.7× faster. A memory-bound workload scales with the bandwidth
ratio (3.35/2.04 ≈ 1.64×), not the FLOPS ratio. Students learn that advertised
FLOPS mean nothing for memory-bound workloads.

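The expected speedup is just the ratio of the two memory bandwidths, which students can verify in two lines (datasheet values assumed; mlsysim's internal numbers may differ slightly):

```python
h100_bw = 3.35e12  # H100 HBM3 bandwidth, bytes/s
a100_bw = 2.04e12  # A100 80GB HBM2e bandwidth, bytes/s

# A memory-bound workload speeds up by the bandwidth ratio, not the FLOPS ratio.
speedup = h100_bw / a100_bw
print(f"{speedup:.2f}x")  # → 1.64x
```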
### Minute 12–15: Plan Your Semester

| Week | Topic | mlsysim Exercise |
|------|-------|------------------|
| 2 | Roofline Model | Batch-size sweep: find the compute↔memory crossover |
| 4 | LLM Serving | KV cache capacity: how many concurrent requests? |
| 6 | Quantization | Compress Llama-3 to INT4: what's the fleet impact? |
| 8 | Distributed Training | Scale from 8 to 256 GPUs: find the efficiency cliff |
| 10 | TCO & Carbon | Move training to Quebec: how much carbon saved? |
| 12 | Design Challenge | $5M budget, Llama-3 70B at 1000 QPS; design the fleet |

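The Week 2 sweep has a closed-form target you can derive on the board. Under a simple weight-streaming model (per-step compute grows with batch size while weight traffic does not), the crossover batch is roughly peak FLOPS divided by memory bandwidth. A sketch with assumed H100 datasheet numbers, not mlsysim's exact model:

```python
peak = 989e12        # H100 FP16 dense, FLOP/s (assumed datasheet value)
bandwidth = 3.35e12  # H100 HBM3, bytes/s (assumed datasheet value)

# compute_time = 2*P*B/peak and memory_time = 2*P/bandwidth for FP16 weights,
# so the two curves cross at B ≈ peak/bandwidth, independent of model size P.
crossover_batch = peak / bandwidth
print(round(crossover_batch))  # → 295
```

That is why the demo's suggestion to try `batch_size=256` lands right at the edge of the bottleneck shift.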

Each exercise takes 20–30 minutes of class time. Solutions are in
[exercises.md](exercises.md).

---

## What Students Need

- Python 3.10+
- `pip install mlsysim`
- No GPU required. No cloud account. Everything runs on a laptop CPU.
- See [prerequisites.md](prerequisites.md) for detailed setup.

## What You Get

| Material | File | Description |
|----------|------|-------------|
| Tutorial slides | `slides/tutorial_part1.tex` + `tutorial_part2.tex` | 102 Beamer slides with speaker notes |
| 8 exercises | `exercises.md` | Hands-on problems with expected answers |
| Cheat sheet | `cheatsheet.md` | Single-page reference (Iron Law + key equations) |
| Pre/post quiz | `assessment/quiz.md` | 10-question assessment with distractor analysis |
| Backward design | `DESIGN.md` | Learning goals, "aha moments," facilitation notes |
| 6 SVG figures | `slides/images/svg/` | Publication-quality diagrams (Roofline, AllReduce, etc.) |

## Auto-Grading Hint

mlsysim returns typed Pydantic objects. You can auto-grade by checking:

```python
# Student submits their analysis
result = mlsysim.Engine.solve(model, hardware, batch_size=student_batch)

# Auto-grade: check the bottleneck label
assert result.bottleneck == "Memory", "Expected Memory-bound at batch=1"

# Auto-grade: check MFU is in the right range
assert 0.001 < result.mfu < 0.01, f"MFU {result.mfu:.3f} outside expected range"

# Auto-grade: check feasibility
assert result.feasible, "Model should fit on this hardware"
```

This works with Gradescope, nbgrader, or any Python-based autograder.

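For partial credit, the same checks can be folded into a scoring function. A sketch using a `SimpleNamespace` stand-in for the profile object (the values are hypothetical; in a real grader you would pass the object returned by `Engine.solve`):

```python
from types import SimpleNamespace

def grade(result):
    """Score one submission; returns (points, feedback) for the autograder."""
    checks = [
        (result.bottleneck == "Memory", "bottleneck label"),
        (0.001 < result.mfu < 0.01, "MFU range"),
        (result.feasible, "model fits on hardware"),
    ]
    points = sum(ok for ok, _ in checks)
    feedback = [f"failed: {name}" for ok, name in checks if not ok]
    return points, feedback

# Stand-in profile with illustrative values, not real mlsysim output
result = SimpleNamespace(bottleneck="Memory", mfu=0.003, feasible=True)
print(grade(result))  # → (3, [])
```

Returning `(points, feedback)` instead of raising on the first failure lets Gradescope and nbgrader report every missed check at once.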
## The Pedagogical Framework

mlsysim is organized around the **Iron Law of ML Systems**:

```
Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)
```

Every concept in your course maps to one term in this equation. Every homework
problem is about understanding which term is the bottleneck and how to improve it.
The 22-wall taxonomy provides the vocabulary; the Iron Law provides the structure.

See [laws-explained.md](../docs/laws-explained.md) for plain-English explanations
of all 22 constraints.

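As a board example, plugging numbers into the Iron Law turns it into a one-line time estimate. All values below are illustrative assumptions, not mlsysim defaults:

```python
# Iron Law: Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)
flops = 4.8e22        # e.g. ~6 * 8e9 params * 1e12 training tokens (assumed)
n_gpus = 256
peak = 989e12         # per-GPU peak FLOP/s (H100 FP16 dense, assumed)
mfu = 0.40            # achieved fraction of peak compute
eta_scaling = 0.90    # multi-GPU scaling efficiency
goodput = 0.95        # wall-clock fraction not lost to failures/restarts

seconds = flops / (n_gpus * peak * mfu * eta_scaling * goodput)
print(f"{seconds / 86400:.1f} days")  # → 6.4 days
```

Halving any denominator term doubles the estimate, which is why each week of the semester plan attacks exactly one of these factors.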
## Getting Help

- **Issues:** [github.com/harvard-edge/cs249r_book/issues](https://github.com/harvard-edge/cs249r_book/issues) (use the mlsysim template)
- **Documentation:** [mlsysbook.ai/mlsysim](https://mlsysbook.ai/mlsysim)
- **Citation:** See `CITATION.cff` in the package root