---
title: "TA Guide"
subtitle: "Everything you need to run labs, grade assignments, and support students"
---

Welcome to the teaching team. This guide covers what you need to know to be an effective TA for the ML Systems course.

---

## Before the Semester

### TA Preparation Checklist

Complete these before Week 1:

- [ ] **Read** the textbook chapters you will be covering (at minimum, the Part you are assigned)
- [ ] **Complete** TinyTorch Modules 01–08 yourself (Foundations semester)
- [ ] **Run** all labs for your assigned weeks — note where students will get stuck
- [ ] **Read** the [Pedagogy Guide](pedagogy.qmd) — understand Prediction Locks, Decision Logs, and the A-B-C structure
- [ ] **Read** the [Assessment & Grading](assessment.qmd) guide and internalize the Decision Log rubric
- [ ] **Attend** the grading calibration session (Week 0)
- [ ] **Set up** nbgrader on your machine (see [TinyTorch Instructor Guide](https://mlsysbook.ai/tinytorch/INSTRUCTOR.html))

---

## Grading Decision Logs

Decision Logs are the most important written artifact in the course. Every student submits one per week, so plan your grading time accordingly. Here is how to grade them efficiently.

### The 3-Question Speed Rubric

For each Decision Log, ask three questions:

1. **Numbers?** Did the student cite specific values from the instruments? (latency, memory, throughput, accuracy)
2. **Why?** Did the student use Iron Law terminology to explain the cause?
3. **Tradeoff?** Did the student acknowledge what they sacrificed for what they gained?

**Yes to all three → 27–30 points. Missing one → 18–22. Missing two or more → 6–12.**

See [Assessment & Grading](assessment.qmd) for the full rubric and sample student work at each quality level.
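
The point bands above reduce to a small lookup. The following is a minimal sketch, assuming the three answers are recorded as booleans. The function name `speed_rubric_band` is hypothetical, and the bands are copied from the rule above. Choosing the exact score inside a band is still a judgment call guided by the full rubric.

```python
from typing import Tuple


def speed_rubric_band(numbers: bool, why: bool, tradeoff: bool) -> Tuple[int, int]:
    """Return the (low, high) point band for one Decision Log.

    Hypothetical helper: the bands mirror the speed rubric. The grader
    still picks the exact score within the band.
    """
    missing = 3 - sum([numbers, why, tradeoff])
    if missing == 0:
        return (27, 30)
    if missing == 1:
        return (18, 22)
    return (6, 12)  # missing two or more


# A log that cites numbers and explains why, but never names the tradeoff.
print(speed_rubric_band(numbers=True, why=True, tradeoff=False))  # (18, 22)
```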

### Time Budget

Expect 3–5 minutes per Decision Log using the speed rubric. For a 30-student section, that is 1.5–2.5 hours per week, or about 2 hours on average. See [Assessment & Grading](assessment.qmd) for full grading load estimates across all assignment types.
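
To adapt the estimate to a different section size, the arithmetic can be sketched as follows. The function name `weekly_grading_hours` is hypothetical, and the 3 and 5 minute bounds are the per-log figures quoted above.

```python
def weekly_grading_hours(students, minutes_low=3, minutes_high=5):
    """Return (low, high) weekly grading hours for one section.

    Assumes one Decision Log per student per week and the per-log
    time range from the speed rubric.
    """
    return (students * minutes_low / 60, students * minutes_high / 60)


low, high = weekly_grading_hours(30)
print(f"{low:.1f} to {high:.1f} hours per week")  # 1.5 to 2.5 hours per week
```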

### Batch Grading Tips

- Grade all Decision Logs for one week in a single sitting (consistency matters)
- Read the excellent sample first to calibrate your expectations
- Mark the first 5, then check with another TA — align before continuing
- Flag borderline cases for the instructor rather than agonizing

---

## Grading TinyTorch

### Auto-Graded (70 points)

- Run `pytest` on student submissions — pass/fail per test
- Students who pass all tests get 70/70; no partial credit per test
- If a student passes 90% or more of the tests but not all of them, check whether the failures are edge cases or fundamental errors
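
One way to spot submissions in that 90% range is to have pytest write a JUnit-style report (for example with `pytest --junitxml=report.xml`) and summarize it. The sketch below is an assumption about workflow, not the course's official tooling: `summarize_junit` is a hypothetical helper, the sample report is invented for illustration, and skipped tests are counted as not passed. It leaves the score for partial passes to course policy.

```python
import xml.etree.ElementTree as ET

AUTO_GRADED_POINTS = 70   # from the rubric above
REVIEW_THRESHOLD = 0.90   # pass rate that triggers a closer look


def summarize_junit(xml_text):
    """Summarize a JUnit-style XML report (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    # A test case counts as not passed if it has a failure, error, or skipped child.
    failed = [
        case.get("name")
        for case in cases
        if any(case.find(tag) is not None for tag in ("failure", "error", "skipped"))
    ]
    total = len(cases)
    passed = total - len(failed)
    all_passed = total > 0 and not failed
    return {
        "passed": passed,
        "total": total,
        "failed_tests": failed,
        # Only the all-pass case is scored here; partial passes need a human decision.
        "auto_points": AUTO_GRADED_POINTS if all_passed else None,
        "needs_manual_review": bool(failed) and passed / total >= REVIEW_THRESHOLD,
    }


# Invented report: ten tests, one failing edge case.
sample = (
    "<testsuites><testsuite name='module03'>"
    + "".join(f"<testcase name='test_{i}'/>" for i in range(9))
    + "<testcase name='test_empty_input'><failure message='shape mismatch'/></testcase>"
    + "</testsuite></testsuites>"
)
summary = summarize_junit(sample)
print(summary["passed"], "of", summary["total"], "passed")  # 9 of 10 passed
print("review:", summary["failed_tests"])  # review: ['test_empty_input']
```

The names of the failing tests are usually enough to tell an edge case from a fundamental error before opening the student's code.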

### Systems Thinking Questions (30 points)

Each module has three manually graded questions (10 points each). Use this scale:

| Score | What It Looks Like |
|:---|:---|
| **10** | Correct reasoning + quantitative estimate + hardware awareness |
| **7** | Right direction, missing numbers or hardware specifics |
| **4** | Partially correct with significant conceptual gaps |
| **1** | Attempted but fundamentally wrong |

---

## Running Lab Sections (50 minutes)

### Recommended Structure

| Time | Activity | Your Role |
|:---|:---|:---|
| 0-5 min | Prediction Lock | Collect predictions; don't reveal answers |
| 5-15 min | Part A walkthrough | Circulate; help students who can't get instruments running |
| 15-30 min | Part B exploration | Ask probing questions (see below) |
| 30-45 min | Part C design challenge | Let students struggle; intervene only if truly stuck |
| 45-50 min | Debrief | Revisit predictions; discuss surprises |

### Probing Questions to Use While Circulating

| Situation | What to Ask |
|:---|:---|
| Student says "it's faster" | "How much faster? Which Iron Law term changed?" |
| Student hits an OOM error | "Find the exact value where it breaks. What constraint did you hit?" |
| Student doesn't know what to try | "Change one variable. What happened? Now try a different one." |
| Student finishes Part B early | "Can you find a configuration 2x better than your best? What's the limit?" |
| Student's prediction was wrong | "What did you assume that turned out to be false?" |

---

## Common Student Struggles by Week

### Semester 1 (Foundations)

| Week | Common Issue | How to Help |
|:---|:---|:---|
| 1-2 | "What is a system? I thought this was an ML class." | Redirect: "The model is just one layer. What carries the data to the model? What executes the math?" |
| 5-6 | TinyTorch Module 03 (Layers) — broadcasting bugs | Check tensor shapes at each step; remind students that NumPy broadcasting rules apply |
| 6-8 | TinyTorch Module 06 (Autograd) — wrong gradients | Most common cause: incorrect topological sort order. Have them draw the computation graph on paper first |
| 8 | "My training loop is slow" | Ask: "Is the GPU actually busy? Check utilization. The bottleneck is usually data loading, not compute." |
| 10 | Lab 09 (Quantization) — "INT8 destroyed my model" | Check if they are quantizing batch norm layers. Remind them to use calibration data |
| 13-16 | Capstone overwhelm | Break it down: "First, meet the accuracy target. Then optimize for latency. Then for memory. One constraint at a time." |
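
For the autograd row, it can help to have a reference ordering to compare against a student's paper drawing. The sketch below is not TinyTorch's implementation. It is a generic depth-first topological sort over a hypothetical graph, where `parents` maps each node to the inputs it was computed from.

```python
def topo_order(output, parents):
    """Return nodes so that every node appears after all of its inputs.

    `parents` maps a node name to the list of nodes it was computed from.
    Leaf nodes (inputs, weights) do not need an entry.
    """
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for parent in parents.get(node, ()):
            visit(parent)
        order.append(node)  # appended only after all inputs are placed

    visit(output)
    return order


# Hypothetical graph for y = (a * b) + a. Note that `a` feeds two nodes.
parents = {"y": ["mul", "a"], "mul": ["a", "b"]}

forward_order = topo_order("y", parents)
backward_order = list(reversed(forward_order))
print(forward_order)   # ['a', 'b', 'mul', 'y']
print(backward_order)  # ['y', 'mul', 'b', 'a']
```

The backward pass walks the reversed order, so `a` is visited only after both `y` and `mul` have contributed to its gradient. A student whose ordering visits `a` before `mul` will see a gradient that is missing one term.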

### Semester 2 (Scale)

| Week | Common Issue | How to Help |
|:---|:---|:---|
| 5 | "Which parallelism should I use?" | "Calculate communication-to-computation ratio for each strategy. The math tells you." |
| 6 | "AllReduce is confusing" | Draw the ring on the whiteboard with 4 nodes. Walk through one full cycle |
| 7 | "Why checkpoint so often?" | "Calculate expected time-to-failure for 1000 GPUs. Now multiply the GPU-hours lost since the last checkpoint by cost per GPU-hour." |
| 10 | KV-cache memory confusion | "How many bytes per token per layer? Multiply by sequence length times batch size times number of layers." |
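
Two of the back-of-envelope calculations in the table can be sketched as follows, so you have numbers ready when a student asks. Everything here is an illustrative assumption: the function names are hypothetical, the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is a generic example, and the per-GPU failure rate and price are made up. The failure estimate also assumes GPUs fail independently at a constant rate.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """Total KV-cache size in bytes.

    Per token per layer, the cache stores one key and one value vector
    for each KV head, hence the factor of 2.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_element
    return per_token_per_layer * seq_len * batch_size * num_layers


def cluster_mtbf_hours(per_gpu_mtbf_hours, num_gpus):
    """Expected time between failures anywhere in the cluster.

    Assumes independent failures at a constant per-GPU rate.
    """
    return per_gpu_mtbf_hours / num_gpus


def cost_of_lost_work(hours_since_checkpoint, num_gpus, dollars_per_gpu_hour):
    """Cost of recomputing everything done since the last checkpoint."""
    return hours_since_checkpoint * num_gpus * dollars_per_gpu_hour


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, 16-bit elements.
cache = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(cache / 2**30, "GiB")  # 2.0 GiB

# Hypothetical cluster: 1000 GPUs, 50,000-hour per-GPU MTBF, $2 per GPU-hour.
print(cluster_mtbf_hours(50_000, 1000), "hours")     # 50.0 hours
print(cost_of_lost_work(1.0, 1000, 2.0), "dollars")  # 2000.0 dollars
```

With these made-up numbers, a failure is expected roughly every 50 hours, and each hour of uncheckpointed work costs about $2,000 to redo.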

---

## Office Hours Protocol

### How Much Help Is Too Much?

- **Do:** Ask clarifying questions. Help students debug their approach, not their code.
- **Do:** Point students to the right textbook section or lab instrument.
- **Don't:** Write code for students. Don't give away Part C answers.
- **Don't:** Debug TinyTorch implementations line by line — have them add print statements and explain what they see.

### The 10-Minute Rule

If a student has been stuck for 10+ minutes during office hours:

1. Ask them to explain what they've tried (this often unsticks them)
2. If still stuck, narrow the problem: "Is it a shape error, a value error, or a logic error?"
3. If still stuck after 15 minutes, give a directed hint: "Look at how the gradient flows through this specific node"

### Escalation

- **Grading disputes**: Flag for the instructor. Do not change a grade you assigned without discussing it first.
- **Academic integrity concerns**: Flag for the instructor immediately. Do not confront the student.
- **Accessibility needs**: Refer to the instructor and campus disability services.

---

## Quick Reference: What's Due Each Week

See the full syllabi for detailed weekly breakdowns:

- [Foundations Syllabus](foundations-syllabus.qmd) — every week has a table with Read / Lab / Build / Due
- [Scale Syllabus](scale-syllabus.qmd) — every week has a table with Read / Lab / Due

Each week, students typically submit:

1. **A Decision Log** (200 words) for the lab they completed
2. **A TinyTorch module** (Foundations only), auto-graded via pytest
3. **A Design Challenge** (every other week) for the open-ended Part C problems