---
title: "TA Guide"
subtitle: "Everything you need to run labs, grade assignments, and support students"
---

Welcome to the teaching team. This guide covers what you need to know to be an effective TA for the ML Systems course.

---

## Before the Semester

### TA Preparation Checklist

Complete these before Week 1:

- [ ] **Read** the textbook chapters you will be covering (at minimum, the Part you are assigned)
- [ ] **Complete** TinyTorch Modules 01–08 yourself (Foundations semester)
- [ ] **Run** all labs for your assigned weeks and note where students will get stuck
- [ ] **Read** the [Pedagogy Guide](pedagogy.qmd) to understand Prediction Locks, Decision Logs, and the A-B-C structure
- [ ] **Read** the [Assessment & Grading](assessment.qmd) page and internalize the Decision Log rubric
- [ ] **Attend** the grading calibration session (Week 0)
- [ ] **Set up** nbgrader on your machine (see [TinyTorch Instructor Guide](https://mlsysbook.ai/tinytorch/INSTRUCTOR.html))

---

## Grading Decision Logs

Decision Logs are the most important written artifact in the course. Every student submits one per week, so plan your grading time accordingly. Here is how to grade them efficiently.

### The 3-Question Speed Rubric

For each Decision Log, ask three questions:

1. **Numbers?** Did the student cite specific values from the instruments? (latency, memory, throughput, accuracy)
2. **Why?** Did the student use Iron Law terminology to explain the cause?
3. **Tradeoff?** Did the student acknowledge what they sacrificed for what they gained?

**Yes to all three → 27-30 points. Missing one → 18-22. Missing two+ → 6-12.**

See [Assessment & Grading](assessment.qmd) for the full rubric and sample student work at each quality level.

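
The three questions above map onto a small lookup. The sketch below is only an illustration of that mapping: the function name `speed_rubric_band` and its parameter names are hypothetical, and the exact score within a band is still the grader's call under the full rubric.

```python
# Point bands copied from the speed rubric; choosing a score inside a band
# is left to the grader and the full rubric in assessment.qmd.
BANDS_BY_MISSING = {0: (27, 30), 1: (18, 22)}
TWO_OR_MORE_MISSING = (6, 12)


def speed_rubric_band(has_numbers: bool, explains_why: bool, names_tradeoff: bool) -> tuple[int, int]:
    """Return the (low, high) point band for one Decision Log."""
    missing = [has_numbers, explains_why, names_tradeoff].count(False)
    return BANDS_BY_MISSING.get(missing, TWO_OR_MORE_MISSING)


# A log with numbers and a cause, but no stated tradeoff:
print(speed_rubric_band(True, True, False))  # (18, 22)
```
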

### Time Budget

Expect 3–5 minutes per Decision Log using the speed rubric. For a 30-student section, that works out to 90–150 minutes, or roughly 2 hours per week. See [Assessment & Grading](assessment.qmd) for full grading load estimates across all assignment types.

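
To adapt the estimate to a different section size, the arithmetic can be sketched as follows. The helper name `weekly_grading_minutes` is hypothetical; the 3 and 5 minute bounds are the ones quoted above.

```python
def weekly_grading_minutes(num_students: int, minutes_per_log: tuple[float, float] = (3, 5)) -> tuple[float, float]:
    """Return the (low, high) weekly grading time in minutes for one section."""
    low, high = minutes_per_log
    return num_students * low, num_students * high


low, high = weekly_grading_minutes(30)
print(f"{low / 60:.1f} to {high / 60:.1f} hours per week")  # 1.5 to 2.5 hours per week
```
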

### Batch Grading Tips

- Grade all Decision Logs for one week in a single sitting (consistency matters)
- Read the excellent sample first to calibrate your expectations
- Mark the first 5, then check with another TA and align before continuing
- Flag borderline cases for the instructor rather than agonizing

---

## Grading TinyTorch

### Auto-Graded (70 points)

- Run `pytest` on student submissions; each test is pass/fail
- Students who pass all tests get 70/70; there is no partial credit within a test
- If a student passes 90%+ of tests, check whether the failures are edge cases or fundamental errors

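
The 90% check above can be turned into a quick filter when triaging a batch of submissions. This is a sketch under assumptions: `needs_manual_review` is a hypothetical helper, not part of pytest or nbgrader, and it only flags submissions for a human look; it does not assign points.

```python
def needs_manual_review(tests_passed: int, tests_total: int, threshold: float = 0.9) -> bool:
    """Flag submissions that pass most tests but not all of them."""
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    pass_rate = tests_passed / tests_total
    # A full pass needs no review; a low pass rate points to fundamental errors.
    return threshold <= pass_rate < 1.0


print(needs_manual_review(19, 20))  # True
print(needs_manual_review(20, 20))  # False
```
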

### Systems Thinking Questions (30 points)

Each module has 3 manually graded questions (10 points each). Use this scale:

| Score | What It Looks Like |
|:---|:---|
| **10** | Correct reasoning + quantitative estimate + hardware awareness |
| **7** | Right direction, missing numbers or hardware specifics |
| **4** | Partially correct with significant conceptual gaps |
| **1** | Attempted but fundamentally wrong |

---

## Running Lab Sections (50 minutes)

### Recommended Structure

| Time | Activity | Your Role |
|:---|:---|:---|
| 0-5 min | Prediction Lock | Collect predictions; don't reveal answers |
| 5-15 min | Part A walkthrough | Circulate; help students who can't get instruments running |
| 15-30 min | Part B exploration | Ask probing questions (see below) |
| 30-45 min | Part C design challenge | Let students struggle; intervene only if truly stuck |
| 45-50 min | Debrief | Revisit predictions; discuss surprises |

### Probing Questions to Use While Circulating

| Situation | What to Ask |
|:---|:---|
| Student says "it's faster" | "How much faster? Which Iron Law term changed?" |
| Student hits an OOM error | "Find the exact value where it breaks. What constraint did you hit?" |
| Student doesn't know what to try | "Change one variable. What happened? Now try a different one." |
| Student finishes Part B early | "Can you find a configuration 2x better than your best? What's the limit?" |
| Student's prediction was wrong | "What did you assume that turned out to be false?" |

---

## Common Student Struggles by Week

### Semester 1 (Foundations)

| Week | Common Issue | How to Help |
|:---|:---|:---|
| 1-2 | "What is a system? I thought this was an ML class." | Redirect: "The model is just one layer. What carries the data to the model? What executes the math?" |
| 5-6 | TinyTorch Module 03 (Layers): broadcasting bugs | Check tensor shapes at each step; remind students that NumPy broadcasting rules apply |
| 6-8 | TinyTorch Module 06 (Autograd): wrong gradients | Most common cause: incorrect topological sort order. Have them draw the computation graph on paper first |
| 8 | "My training loop is slow" | Ask: "Is the GPU actually busy? Check utilization. The bottleneck is usually data loading, not compute." |
| 10 | Lab 09 (Quantization): "INT8 destroyed my model" | Check if they are quantizing batch norm layers. Remind them to use calibration data |
| 13-16 | Capstone overwhelm | Break it down: "First, meet the accuracy target. Then optimize for latency. Then for memory. One constraint at a time." |

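
For the autograd row above, it can help to have a tiny reference to compare against a student's paper graph. The sketch below assumes a hypothetical scalar `Node` class with `add` and `mul` helpers; it is not TinyTorch's actual API. It shows why gradients must be propagated in reverse topological order: `x` feeds two operations, so every consumer of a node has to finish contributing to its gradient before that gradient is passed on.

```python
class Node:
    """Hypothetical scalar graph node; not TinyTorch's actual Tensor class."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0


def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))


def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))


def topological_order(output):
    """Depth-first post-order: every node appears after all of its inputs."""
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    return order


def backward(output):
    output.grad = 1.0
    # Reverse order: a node's gradient is complete before it is pushed to its inputs.
    for node in reversed(topological_order(output)):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad


x = Node(3.0)
y = add(mul(x, x), x)  # y = x*x + x, so dy/dx = 2x + 1
backward(y)
print(x.grad)  # 7.0
```
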

### Semester 2 (Scale)

| Week | Common Issue | How to Help |
|:---|:---|:---|
| 5 | "Which parallelism should I use?" | "Calculate communication-to-computation ratio for each strategy. The math tells you." |
| 6 | "AllReduce is confusing" | Draw the ring on the whiteboard with 4 nodes. Walk through one full cycle |
| 7 | "Why checkpoint so often?" | "Calculate expected time-to-failure for 1000 GPUs. Now multiply by cost per GPU-hour." |
| 10 | KV-cache memory confusion | "How many bytes per token per layer? Multiply by sequence length times batch size times number of layers." |

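
The KV-cache prompt in the table can be worked as a short calculation. This is a sketch under stated assumptions: the function name `kv_cache_bytes` is hypothetical, values are assumed to be stored in 2 bytes each (for example FP16), and the model shape in the example is invented for round numbers, not taken from any course lab.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV-cache size: bytes per token per layer, scaled up."""
    # The factor 2 covers one key vector and one value vector per token per layer.
    bytes_per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token_per_layer * seq_len * batch_size * num_layers


# Hypothetical model shape chosen for round numbers.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)
print(f"{total / 2**30:.1f} GiB")  # 16.0 GiB
```
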
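
For the AllReduce row, a small simulation can back up the whiteboard walkthrough. The sketch below is an illustrative, single-process model of a sum-AllReduce on a ring, assuming each node's data is split into as many chunks as there are nodes. The function name `ring_allreduce` is hypothetical and no real communication library is involved.

```python
def ring_allreduce(node_data):
    """Simulate sum-AllReduce on a ring; node_data[i][c] is chunk c on node i."""
    n = len(node_data)
    data = [list(chunks) for chunks in node_data]

    # Phase 1, reduce-scatter: in each step, node i sends one chunk to node i + 1,
    # which adds it to its own copy. Sends are collected first so that all nodes
    # act on the same snapshot, as they would when sending simultaneously.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Node i now holds the fully reduced chunk (i + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring, overwriting.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


# Four nodes, four chunks each; every node should end with the column sums.
start = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
for row in ring_allreduce(start):
    print(row)  # [1111, 2222, 3333, 4444] on every node
```
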

---

## Office Hours Protocol

### How Much Help is Too Much?

- **Do:** Ask clarifying questions. Help students debug their approach, not their code.
- **Do:** Point students to the right textbook section or lab instrument.
- **Don't:** Write code for students. Don't give away Part C answers.
- **Don't:** Debug TinyTorch implementations line by line. Have them add print statements and explain what they see.

### The 10-Minute Rule

If a student has been stuck for 10+ minutes during office hours:

1. Ask them to explain what they've tried (this often unsticks them)
2. If still stuck, narrow the problem: "Is it a shape error, a value error, or a logic error?"
3. If still stuck after 15 minutes, give a directed hint: "Look at how the gradient flows through this specific node"

### Escalation

- **Grading disputes**: Flag for the instructor. Do not overrule your own grade without discussion.
- **Academic integrity concerns**: Flag for the instructor immediately. Do not confront the student.
- **Accessibility needs**: Refer to the instructor and campus disability services.

---

## Quick Reference: What's Due Each Week

See the full syllabi for detailed weekly breakdowns:

- [Foundations Syllabus](foundations-syllabus.qmd): every week has a table with Read / Lab / Build / Due
- [Scale Syllabus](scale-syllabus.qmd): every week has a table with Read / Lab / Due

Each week, students typically submit:

1. **A Decision Log** (200 words) for the lab they completed
2. **A TinyTorch module** (Foundations only) auto-graded via pytest
3. **A Design Challenge** (every other week) for the open-ended Part C problems