---
title: "Assessment & Grading"
subtitle: "Rubrics, examples, and grading strategies"
---

This guide provides everything you need to evaluate student mastery: weighted grade structures, detailed rubrics, sample student work, and the AI Olympics capstone specification.

---

## 1. The Three-Tier Assessment Model



We recommend a 100-point scale per assignment, distributed across three tiers:

| Tier | Method | Weight | What It Tests |
|:---|:---|:---|:---|
| **Correctness** | Auto-graded tests (TinyTorch, Lab answers) | 40% | Can the student implement the logic? |
| **Systems Thinking** | Decision Logs (200-word justifications) | 30% | Does the student understand *why*? |
| **Mastery** | Design Challenges (open-ended optimization) | 30% | Can the student apply theory to new problems? |

### Grading Load Estimates

Plan your TA staffing based on these estimates:

| Task | Time per Student | Frequency | 30-Student Section | 60-Student Class |
|:---|:---|:---|:---|:---|
| Decision Log grading | 3–5 min | Weekly | ~2 hrs/week | ~4 hrs/week |
| TinyTorch systems questions | 5–8 min | Weekly | ~3 hrs/week | ~6 hrs/week |
| Design Challenges | 10–15 min | Bi-weekly | ~4 hrs bi-weekly | ~8 hrs bi-weekly |
| TinyTorch auto-grading | Automated | Weekly | ~10 min (review flags) | ~20 min |
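
The section and class columns follow directly from per-student minutes times enrollment. A minimal sketch, assuming the midpoint of each time range, that you can adapt when budgeting TA hours:

```python
# Back-of-envelope TA grading load: per-student minutes -> hours per grading pass.
def grading_hours(minutes_per_student: float, n_students: int) -> float:
    return minutes_per_student * n_students / 60.0

# Decision Logs at ~4 min/student (midpoint of 3-5 min):
print(grading_hours(4, 30))    # ~2.0 hrs/week for a 30-student section
print(grading_hours(4, 60))    # ~4.0 hrs/week for a 60-student class
# TinyTorch systems questions at ~6.5 min/student:
print(grading_hours(6.5, 30))  # ~3.3 hrs/week for a 30-student section
```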

::: {.callout-tip}
## Efficiency Tip
Use the [3-question speed rubric](ta-guide.qmd) for Decision Logs — it cuts grading time by ~40% while maintaining consistency. Grade all logs for one week in a single sitting.
:::

### Semester-Level Grade Breakdown

**Semester 1 (Foundations):**

| Component | Weight | Count |
|:---|:---|:---|
| TinyTorch Modules | 35% | 8 modules |
| Lab Decision Logs | 25% | 15 labs |
| Design Challenges (Part C) | 20% | 4 major (one per Part) |
| Capstone (AI Olympics) | 20% | 1 |

**Semester 2 (Scale):**

| Component | Weight | Count |
|:---|:---|:---|
| Lab Decision Logs | 40% | 15 labs |
| Design Challenges | 30% | 4 major (one per Part) |
| Capstone (Fleet Synthesis) | 30% | 1 |
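
These weights translate into a plain weighted average of component averages. A minimal sketch of the Semester 1 computation; the component scores below are hypothetical placeholders:

```python
# Semester 1 (Foundations) final grade from the component weights above.
SEM1_WEIGHTS = {
    "tinytorch_modules": 0.35,   # averaged over 8 modules
    "decision_logs": 0.25,       # averaged over 15 labs
    "design_challenges": 0.20,   # averaged over 4 challenges
    "capstone": 0.20,            # AI Olympics
}

example_scores = {               # 0-100 scale; placeholder values only
    "tinytorch_modules": 88.0,
    "decision_logs": 92.0,
    "design_challenges": 85.0,
    "capstone": 90.0,
}

final = sum(w * example_scores[c] for c, w in SEM1_WEIGHTS.items())
print(f"Final grade: {final:.1f}")   # 88.8 with these placeholder scores
```

For Semester 2, swap in the 40/30/30 weights from the Scale table.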

---

## 2. Decision Log Rubric (30 Points)

The Decision Log is the most important metacognitive artifact in the course. Use this rubric for consistent grading:

| Criterion | Excellent (10) | Adequate (6) | Insufficient (2) |
|:---|:---|:---|:---|
| **Quantitative Evidence** | Cites 3+ specific numbers from instruments (latency, memory, throughput, accuracy). | Cites 1–2 numbers. | No numbers; vague assertions ("it was faster"). |
| **Causal Reasoning** | Explains *why* using Iron Law terms or Roofline terminology. Maps the optimization to a specific equation term. | Describes *what* happened but misses the "why." | Lists observations without explanation. |
| **Constraint Awareness** | Explicitly addresses tradeoffs (e.g., "Trading 1.2% accuracy for 4x throughput to meet the 50ms SLA"). | Mentions constraints but doesn't quantify the tradeoff. | Ignores constraints entirely. |

### Sample Decision Log: Excellent (30/30)

> **Lab 09: Quantization — Part C Decision Log**
>
> I recommend deploying the MobileNetV2 model with INT8 post-training quantization for the Seeed XIAO target.
>
> **Evidence:** INT8 quantization reduced model size from 14.2 MB to 3.6 MB (3.9x compression), meeting the 4 MB flash constraint. Inference latency dropped from 48.3ms to 12.1ms (3.99x speedup), well under the 50ms SLA. Accuracy decreased from 91.4% to 90.2% (1.2% drop).
>
> **Reasoning:** The dominant bottleneck on this device is the $D_{vol}/BW$ term — the 256KB SRAM cannot hold FP32 activations for the depthwise separable convolutions. INT8 reduces $D_{vol}$ by 4x, which directly unblocks the memory bandwidth constraint. The accuracy loss is negligible because MobileNetV2's architecture is inherently quantization-friendly (no large dynamic ranges in the depthwise layers).
>
> **Tradeoff:** I rejected INT4 quantization despite its additional 2x memory savings because accuracy dropped to 84.7% (6.7% loss) — below the 88% project threshold. The marginal memory gain (3.6 MB → 1.8 MB) was not worth the accuracy cliff.

### Sample Decision Log: Insufficient (6/30)

> The quantized model was smaller and faster. I used INT8 because it seemed like a good balance. The model still worked after quantization.

*Why insufficient:* No specific numbers. No Iron Law mapping. No tradeoff analysis. No evidence of understanding *why* it worked.

---

## 3. TinyTorch Grading (100 Points per Module)

TinyTorch modules test both implementation correctness and systems understanding:

### Auto-Graded (70 Points)

- Standard unit test pass/fail via `pytest` or `nbgrader` (see the sketch after this list)
- Tests cover: correctness, edge cases, numerical stability
- All-or-nothing per test (no partial credit for failing tests)
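
For reference, a minimal sketch of what one auto-graded test might look like. The import path, `Tensor` API, and test names are illustrative placeholders, not the actual TinyTorch suite:

```python
# Illustrative pytest-style auto-graded tests (placeholder names and API).
# Each test is pass/fail -- no partial credit.
import numpy as np
import pytest

from tinytorch.tensor import Tensor  # hypothetical import path


def test_elementwise_add_matches_numpy():
    """Correctness: Tensor addition must agree with NumPy."""
    a, b = np.random.randn(4, 3), np.random.randn(4, 3)
    out = (Tensor(a) + Tensor(b)).data
    np.testing.assert_allclose(out, a + b, rtol=1e-6)


def test_add_shape_mismatch_raises():
    """Edge case: incompatible shapes should fail loudly (illustrative behavior)."""
    with pytest.raises(ValueError):
        Tensor(np.zeros((2, 3))) + Tensor(np.zeros((5,)))
```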

### Systems Thinking Questions (30 Points)

Each module includes 3 manually graded questions (10 points each). These ask students to reason about the systems implications of their code.

**Example questions by module:**

| Module | Sample Question |
|:---|:---|
| 01: Tensor | "Your tensor stores data in row-major order. How would inference latency change if you switched to column-major? For which operations would it matter?" |
| 06: Autograd | "Your `backward()` builds a full computation graph in memory. At what model size (number of parameters) would this become the memory bottleneck on a 16GB GPU? Show your calculation." |
| 08: Training | "Your training loop processes one batch at a time. Describe two concrete changes that would improve GPU utilization, and estimate the speedup for each." |
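
For calibration when grading the Module 06 question, here is one back-of-envelope answer a strong response might contain, assuming FP32 parameters with gradients and Adam optimizer state (roughly 16 bytes per parameter) and ignoring activations:

```python
# Rough capacity estimate for the Module 06 question. Assumptions: FP32 weight
# (4 B) + gradient (4 B) + Adam first/second moments (8 B) = 16 B per parameter;
# activations and framework overhead ignored.
GPU_MEMORY_BYTES = 16e9        # 16 GB card
BYTES_PER_PARAM = 4 + 4 + 8

max_params = GPU_MEMORY_BYTES / BYTES_PER_PARAM
print(f"~{max_params / 1e9:.1f}B parameters")   # ~1.0B parameters
```

A full-credit answer would likely also note that activation memory grows with batch size, so the practical ceiling is lower.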

### Grading Scale for Systems Thinking Questions

| Score | Criteria |
|:---|:---|
| **10** | Correct reasoning with quantitative estimate. References specific hardware constraints. |
| **7** | Correct direction of reasoning but missing quantitative support or hardware specifics. |
| **4** | Partially correct but with significant conceptual errors. |
| **1** | Attempted but fundamentally misunderstands the systems implication. |

---

## 4. Design Challenge Rubric (50 Points)

Design Challenges form Part C of each lab: open-ended optimization problems with real constraints.

| Criterion | Excellent (12–13) | Adequate (7–9) | Insufficient (1–4) |
|:---|:---|:---|:---|
| **Solution Quality** | Meets all constraints (latency, memory, accuracy). Near-optimal configuration. | Meets most constraints. Reasonable but suboptimal. | Fails to meet one or more constraints. |
| **Methodology** | Systematic exploration: tried multiple approaches, justified final choice. | Tried 2–3 approaches but justification is weak. | Trial-and-error with no systematic approach. |
| **Iron Law Mapping** | Every optimization traced to a specific Iron Law term. | Some optimizations mapped, others missing. | No Iron Law references. |
| **Documentation** | Decision Log meets Excellent criteria (see above). | Decision Log meets Adequate criteria. | No meaningful documentation. |

---

## 5. The AI Olympics — Capstone Specification (Semester 1)

The course culminates in a competition: deploy the "Smart Doorbell" application across multiple hardware tracks.

### The Challenge

Design and optimize an image classification pipeline (person detection + recognition) that meets strict deployment constraints:

| Track | Device | Latency Budget | Memory Budget | Target Accuracy |
|:---|:---|:---|:---|:---|
| **Cloud** | A100 GPU | < 10ms P99 | 16 GB | > 95% |
| **Edge** | Raspberry Pi + Coral | < 50ms P99 | 1 GB | > 90% |
| **Mobile** | Simulated phone SoC | < 30ms P99 | 256 MB | > 88% |
| **Tiny** | Seeed XIAO ESP32S3 | < 100ms P99 | 256 KB | > 80% |

### Scoring (100 Points)

| Component | Points | Description |
|:---|:---|:---|
| **Metric Achievement** | 40 | Score = (Accuracy + Throughput) / (Energy + Latency). Higher is better. |
| **Engineering Report** | 40 | 1,000-word design document. Every decision mapped to Iron Law. Rubric: 10 pts quantitative evidence, 10 pts causal reasoning, 10 pts constraint awareness, 10 pts clarity. |
| **Robustness** | 20 | System is tested with distribution shift (brightness, noise, rotation). Graceful degradation > catastrophic failure. |
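
A minimal sketch of the metric-achievement score. Because the four quantities carry different units, this assumes each term is first normalized against its track budget; the normalization scheme and the example numbers are illustrative, not prescriptive:

```python
# Composite capstone score: (Accuracy + Throughput) / (Energy + Latency).
# Assumption: every term is normalized to its track budget so the ratio is
# dimensionless. Targets and budgets below are illustrative placeholders.
def olympics_score(accuracy, throughput, energy, latency,
                   acc_target, tput_target, energy_budget, latency_budget):
    acc = accuracy / acc_target          # > 1 means target exceeded
    tput = throughput / tput_target
    eng = energy / energy_budget         # < 1 means under budget
    lat = latency / latency_budget
    return (acc + tput) / (eng + lat)

# Edge track, hypothetical entry: 91% accuracy vs 90% target, 25 fps vs a 20 fps
# target, 80% of the energy budget, 40 ms vs the 50 ms latency budget.
print(f"{olympics_score(0.91, 25, 0.8, 40, 0.90, 20, 1.0, 50):.2f}")   # ~1.41
```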

### Deliverables

1. **Code submission**: Model, optimization scripts, deployment configuration
2. **Design report**: 1,000 words, structured as Problem → Approach → Results → Tradeoffs
3. **Live demo**: 5-minute presentation showing deployment on target hardware (or simulation)

---

## 6. Semester 2 Assessment: Scale-Specific Criteria

As the course moves to distributed systems, assessment shifts toward reliability and economics:

### TCO Audit (Design Challenge)

*Can the student calculate the 3-year Total Cost of Ownership of their cluster design?*

Students must account for: hardware acquisition, power consumption (including PUE), network equipment, cooling, maintenance, and opportunity cost of downtime.
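
A minimal worked skeleton of the expected calculation; every input below is an illustrative placeholder rather than a reference answer, and students should substitute their own prices and power figures:

```python
# Illustrative 3-year TCO skeleton (all inputs are placeholder values).
YEARS = 3
HOURS_PER_YEAR = 8760

n_nodes = 64
node_price = 250_000                   # $ per GPU node
network_cooling_capex = 2_000_000      # switches, cabling, cooling plant

node_power_kw = 6.5                    # average draw per node
pue = 1.2                              # facility overhead multiplier
electricity_price = 0.08               # $ per kWh

maintenance_rate = 0.05                # fraction of hardware capex per year
downtime_cost = 150_000                # expected annual cost of lost runs

capex = n_nodes * node_price + network_cooling_capex
energy = n_nodes * node_power_kw * pue * HOURS_PER_YEAR * electricity_price * YEARS
opex = (maintenance_rate * n_nodes * node_price + downtime_cost) * YEARS

print(f"3-year TCO: ${(capex + energy + opex) / 1e6:.1f}M")   # ~$21.9M here
```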

### Failure Recovery Analysis (Design Challenge)

*Can the student articulate how their system recovers from a node failure without restarting a 90-day training run?*

Students must specify: checkpoint frequency, recovery time, wasted computation per failure, and the cost-optimal checkpoint interval.
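
For the cost-optimal interval, one standard starting point is the Young–Daly approximation, $\tau_{opt} \approx \sqrt{2\,\delta\,M}$, where $\delta$ is the time to write a checkpoint and $M$ is the mean time between failures. A minimal sketch with placeholder values:

```python
import math

# Young-Daly approximation: checkpoint interval that roughly minimizes expected
# lost work. Inputs below are placeholders, not a reference answer.
checkpoint_write_s = 300        # 5 minutes to serialize and flush a checkpoint
mtbf_s = 24 * 3600              # about one failure per day across the job

tau_opt = math.sqrt(2 * checkpoint_write_s * mtbf_s)
print(f"Checkpoint every ~{tau_opt / 3600:.1f} hours")   # ~2.0 hours

# Expected rework per failure is roughly half an interval plus recovery time.
expected_rework_s = tau_opt / 2
```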

### Carbon Budget (Design Challenge)

*Can the student justify their datacenter location based on carbon footprint?*

Students must compare at least 3 regions (e.g., Quebec, Virginia, Iowa) using grid carbon intensity data, PUE, and renewable energy availability.
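
A minimal sketch of the comparison; the PUE and grid-intensity values below are rough illustrative placeholders, and students should pull current numbers from a grid-intensity dataset:

```python
# Annual operational carbon by region (all inputs are illustrative placeholders).
it_load_kw = 500                       # average IT power draw
hours = 8760

regions = {                            # (PUE, kg CO2e per kWh) -- placeholders
    "Quebec":   (1.15, 0.03),          # hydro-heavy grid
    "Virginia": (1.20, 0.35),
    "Iowa":     (1.15, 0.25),          # significant wind share
}

for name, (pue, intensity) in regions.items():
    tonnes = it_load_kw * hours * pue * intensity / 1000
    print(f"{name}: ~{tonnes:,.0f} t CO2e/yr")
```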