mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-04-29 17:20:21 -05:00
475 lines
21 KiB
Markdown
475 lines
21 KiB
Markdown
# Lab Developer Protocol: Gold Standard Specification
|
||
|
||
This is the authoritative instruction set for any agent or developer building an interactive lab for the *Machine Learning Systems* textbook. Labs are not demos. They are **pedagogical instruments**: every element — slider range, chart axis, prediction question, reflection prompt — must be traceable to a specific quantitative claim in the chapter.
|
||
|
||
A lab that cannot cite its chapter source for every number is not finished.
|
||
|
||
---
|
||
|
||
## Part I: What These Labs Are For
|
||
|
||
The textbook teaches students to reason quantitatively about ML infrastructure. The labs force students to *experience* the consequences of the invariants they just read about. The sequence is:
|
||
|
||
> **Read** the chapter → **Predict** what will happen → **Discover** that reality differs → **Explain** why using the chapter's math.
|
||
|
||
The prediction step is the most important. A student who predicts wrong and then discovers why has learned more than a student who reads a correct answer. Every lab must manufacture that productive failure.
|
||
|
||
Labs are **not**:
|
||
- Demos that illustrate concepts students already accept
|
||
- Tutorials that walk students through known steps
|
||
- Exploratory sandboxes with no expected destination
|
||
|
||
Labs **are**:
|
||
- Structured confrontations with a quantitative reality that surprises
|
||
- Diagnosis instruments that surface root causes students couldn't see in the text
|
||
- Design challenges where constraints collide and every choice has a cost
|
||
|
||
---
|
||
|
||
## Part II: The Invariants (Non-Negotiable Quality Gates)
|
||
|
||
Every lab plan and every lab implementation must satisfy **all** of the following. These are not suggestions.
|
||
|
||
### Invariant 1: Every Number Has a Source
|
||
|
||
Before writing a single line of the lab plan, the developer must extract the **actual quantitative claims** from the chapter text. Then every number in the lab must trace to a specific claim.
|
||
|
||
The plan must contain a **Traceability Table** (Section 8 of the plan template) that maps:
|
||
|
||
| Lab Element | Chapter Section | Exact Claim Being Tested |
|
||
|---|---|---|
|
||
| `[slider range / chart value / threshold]` | `[@sec-... or line number]` | `"[exact quote or formula from chapter]"` |
|
||
|
||
A plan without a complete traceability table is **rejected**.
|
||
|
||
### Invariant 2: Structured Predictions, Never Free Text
|
||
|
||
The `mo.stop(prediction == "", ...)` gate currently accepts "idk" and unlocks all instruments. This defeats the entire pedagogical purpose.
|
||
|
||
**Required prediction formats:**
|
||
|
||
| Format | When to Use | Implementation |
|
||
|---|---|---|
|
||
| **Multiple choice (4 options)** | When the answer is a specific ratio, threshold, or category | Radio buttons; exactly one correct answer; distractor options at 5×, 10×, 50× etc. |
|
||
| **Numeric range entry** | When the answer is a quantity the student can estimate (memory, latency, FLOPS) | Number input field; system records estimate for later comparison overlay |
|
||
| **Sentence completion with dropdown** | When the reflection requires using chapter terminology | Partial sentence with 3–4 dropdown options; only one is semantically correct |
|
||
|
||
**Never use:** open text fields for predictions, "type your hypothesis," or gates that accept any non-empty string.
|
||
|
||
After the act completes, the lab **must overlay the student's prediction** on the actual result with an annotation:
|
||
> "You predicted [X]. The actual value is [Y]. You were off by [Z]×."
|
||
|
||
This overlay is the learning moment. Do not skip it.
|
||
|
||
### Invariant 3: Failure States Are Mandatory
|
||
|
||
Every lab must have at least one instrument that can reach a visually distinct **failure state** when the student's design violates a physical constraint. Failure states teach constraints more effectively than captions.
|
||
|
||
Required failure states by constraint type:
|
||
|
||
| Constraint | Failure State Visual | Trigger Condition |
|
||
|---|---|---|
|
||
| Memory wall | Bar chart turns red; banner: **"OOM — Training infeasible on this device"** | `memory_footprint > device.ram` |
|
||
| Latency budget | Timeline bar turns red; banner: **"SLA violated — P99 exceeds budget"** | `latency_p99 > sla_budget` |
|
||
| Power/thermal | Gauge turns orange → red; banner: **"Thermal throttle — sustained throughput drops to [X]%"** | `power > tdp` |
|
||
| Compute budget | ROI gauge drops below zero; banner: **"Negative ROI — cost exceeds benefit"** | `cost > revenue_gain` |
|
||
|
||
The failure state must be **reversible**: students should be able to pull sliders back and watch the system recover. The point is to find the boundary, not to punish the student.
|
||
|
||
### Invariant 4: 2-Act Structure (Not 3 KATs)
|
||
|
||
Labs use a **2-Act structure**. The 3-KAT (three 15-minute Key Analysis Tasks) format produces 45–90 minute sessions, which students abandon mid-lab.
|
||
|
||
```
|
||
Act 1: Calibration (10–15 minutes)
|
||
- One focused prediction question
|
||
- One primary instrument with 1–2 controls
|
||
- One structured reflection
|
||
- Outcome: Student has a wrong prior corrected by data
|
||
|
||
Act 2: Design Challenge (18–25 minutes)
|
||
- One numeric prediction
|
||
- The full instrument set (2–3 charts, multiple controls)
|
||
- One scaling challenge: push the system to its physical limit
|
||
- Structured multi-choice reflection
|
||
- Outcome: Student has made a design decision with quantified trade-offs
|
||
```
|
||
|
||
**Total target: 35–40 minutes.** If a lab plan requires more than 40 minutes to complete both acts, it must be trimmed.
|
||
|
||
### Invariant 5: 2 Deployment Contexts (Not 4 Narrative Tracks)
|
||
|
||
The original 4-track system (Cloud Titan, Edge Guardian, Mobile Nomad, Tiny Pioneer) requires 128 track-specific scenarios across 16 labs. This is unsustainable to maintain and does not provide proportional pedagogical value.
|
||
|
||
**Replacement:** Every lab uses **2 deployment contexts** as a comparison toggle, not as a persistent narrative identity.
|
||
|
||
The two contexts for Volume 1 are always drawn from:
|
||
|
||
| Context Pool | Device | Key Constraint |
|
||
|---|---|---|
|
||
| Training Node | H100 (80 GB) | Maximize throughput; memory is abundant |
|
||
| Edge Inference | Mobile GPU (2 GB) | Minimize latency; memory is the wall |
|
||
| MCU | ARM Cortex-M (256 KB SRAM) | Sub-1mW; only quantized INT8 models fit |
|
||
|
||
Each lab chooses the **two contexts most relevant to its chapter's invariant** and presents them as a comparison toggle. Students see the same system behave differently under different constraints. This is more instructive than narrative identity.
|
||
|
||
The Design Ledger carries the student's chapter-5 deployment context into chapter-8, so cross-lab continuity is preserved without maintaining 4 parallel track scripts.
|
||
|
||
### Invariant 6: No Instruments Before Chapter Introduction
|
||
|
||
Progressive disclosure is enforced strictly:
|
||
|
||
| Lab | Instruments Available |
|
||
|---|---|
|
||
| lab_01 | Magnitude Gap slider, D·A·M triangle |
|
||
| lab_02 | + Latency Waterfall |
|
||
| lab_05 | + Activation Comparator, Memory Ledger |
|
||
| lab_09 | + Pareto Curve |
|
||
| lab_10 | + Compression Trade-off Frontier |
|
||
| lab_11 | + Roofline Model |
|
||
| lab_13 | + P99 Latency Histogram, Little's Law Calculator |
|
||
|
||
Agents **must not** use an instrument in lab N if it is introduced in the chapter for lab N+k. Verify against the chapter text.
|
||
|
||
---
|
||
|
||
## Part III: The Plan Template (Required Structure)
|
||
|
||
Every lab plan must contain exactly these 8 sections in this order. A plan missing any section is not ready for implementation.
|
||
|
||
---
|
||
|
||
### Section 1: Chapter Alignment
|
||
```
|
||
- Chapter: [Title] (`@sec-[slug]`)
|
||
- Core Invariant: [One sentence. This is the chapter's central quantitative claim.]
|
||
- Central Tension: [Two sentences. What wrong prior does the student bring? What does the data reveal?]
|
||
- Target Duration: [X–Y minutes (2 acts)]
|
||
```
|
||
|
||
### Section 2: The Two-Act Structure Overview
|
||
One paragraph per act stating the pedagogical goal of that act in plain English. No bullet lists. Write it as a statement of what the student will experience and what they will learn.
|
||
|
||
### Section 3: Act 1 — Calibration
|
||
Required subsections:
|
||
1. **Pedagogical Goal** — one paragraph stating the wrong prior and what will correct it
|
||
2. **The Lock (Structured Prediction)** — the exact prediction question, with all answer choices listed, the correct answer marked, and the reason each distractor is plausible
|
||
3. **The Instrument** — describe every control (slider range, toggle options, selectors) and every output (chart type, axis labels, what updates on each control change)
|
||
4. **The Reveal** — the exact overlay text shown after interaction, including the prediction-vs-reality gap annotation
|
||
5. **Reflection (Structured)** — sentence completion with dropdown, or 4-option multiple choice. The exact text of the prompt and all options, with correct answer marked.
|
||
6. **Math Peek** — the LaTeX formula that governs this act (collapsible panel)
|
||
|
||
### Section 4: Act 2 — Design Challenge
|
||
Required subsections:
|
||
1. **Pedagogical Goal** — one paragraph
|
||
2. **The Lock (Numeric Prediction)** — exact question text, expected answer range, what the system will show afterward
|
||
3. **The Instrument** — complete control inventory with ranges, and complete output inventory with formulas
|
||
4. **The Scaling Challenge** — a specific target the student must hit by exploring the design space (e.g., "find maximum W where training fits on a Laptop GPU")
|
||
5. **The Failure State** — exact trigger condition, exact visual change, exact banner text
|
||
6. **Structured Reflection** — exact prompt and all options
|
||
7. **Math Peek** — the governing equation
|
||
|
||
### Section 5: Visual Layout Specification
|
||
List every chart in priority order. For each chart:
|
||
- Chart type (histogram, stacked bar, scatter, waterfall, etc.)
|
||
- X axis: label, range, units
|
||
- Y axis: label, range, units
|
||
- What data series are shown
|
||
- When/how it enters failure state (if applicable)
|
||
|
||
### Section 6: Deployment Context Definitions
|
||
Table with exactly 2 rows (the two chosen contexts):
|
||
| Context | Device | RAM | Power Budget | Key Constraint |
|
||
Explain in one sentence what distinguishes the two contexts for this chapter's invariant.
|
||
|
||
### Section 7: Design Ledger Output
|
||
The exact JSON fields the lab records at completion, and which future labs read those fields. If no future lab reads a field, it should not be recorded.
|
||
|
||
```json
|
||
{
|
||
"chapter": N,
|
||
"field_name": "<value>",
|
||
...
|
||
}
|
||
```
|
||
|
||
### Section 8: Traceability Table (Mandatory)
|
||
Every quantitative value in the lab must appear in this table.
|
||
|
||
| Lab Element | Chapter Section | Exact Claim Being Tested |
|
||
|---|---|---|
|
||
| [every slider range, threshold, formula, ratio] | [@sec-... or line number] | ["exact quote"] |
|
||
|
||
Rows without a chapter source are placeholder content and must be replaced with chapter-grounded values before the plan is complete.
|
||
|
||
---
|
||
|
||
## Part IV: Technical Implementation Specification
|
||
|
||
### 4.1 Format
|
||
- Single-file Marimo Notebook (`.py`)
|
||
- WASM-first: zero local file I/O; all data in native Python dicts
|
||
- Named `lab_NN_slug.py` with underscore-separated slug matching chapter slug
|
||
|
||
### 4.2 Physics Engine
|
||
All computations must call `mlsys.Engine.solve()`. No hardcoded physics constants outside `mlsys/constants.py`.
|
||
|
||
```python
|
||
from mlsys import Engine, Models, Systems
|
||
|
||
profile = Engine.solve(
|
||
model=Models.ResNet50,
|
||
system=Systems.Mobile,
|
||
batch_size=32,
|
||
precision="int8", # "fp32" | "fp16" | "int8"
|
||
efficiency=0.5 # float 0.0–1.0
|
||
)
|
||
|
||
# Guaranteed fields on the returned PerformanceProfile:
|
||
# .latency (Pint ms)
|
||
# .latency_compute (Pint ms)
|
||
# .latency_memory (Pint ms)
|
||
# .latency_overhead (Pint ms) — dispatch tax; always shown
|
||
# .throughput (Pint samples/sec)
|
||
# .bottleneck (str: "Compute" | "Memory")
|
||
# .energy (Pint joule)
|
||
# .memory_footprint (Pint byte)
|
||
# .feasible (bool: memory_footprint <= system.ram)
|
||
```
|
||
|
||
### 4.3 Prediction Lock Implementation
|
||
|
||
The current `mo.stop(prediction == "", ...)` implementation is **not compliant**. The compliant prediction lock must:
|
||
|
||
1. For **multiple choice**: use a radio group; `mo.stop` fires until a radio option is selected (not just any text)
|
||
2. For **numeric entry**: use a number input; `mo.stop` fires until the value is non-null and within a plausible order-of-magnitude range
|
||
3. After act completion: **always** show the prediction-vs-reality overlay in a dedicated card above the reflection prompt
|
||
|
||
```python
|
||
# Compliant multiple-choice lock
|
||
prediction_choice = mo.ui.radio(
|
||
options={"A) ~1-2×": "1x", "B) ~5×": "5x", "C) ~20×": "20x", "D) ~50×": "50x"},
|
||
label="How much more expensive is Sigmoid than ReLU in transistors?"
|
||
)
|
||
mo.stop(prediction_choice.value is None, mo.md("⚠️ Select your prediction to unlock instruments."))
|
||
|
||
# Reveal overlay (shown after Act 1 completes)
|
||
actual = 50
|
||
predicted = {"1x": 1, "5x": 5, "20x": 20, "50x": 50}[prediction_choice.value]
|
||
gap = actual / predicted
|
||
mo.md(f"**You predicted {predicted}×. Actual: {actual}×. You were off by {gap:.1f}×.**")
|
||
```
|
||
|
||
### 4.4 Failure State Implementation
|
||
|
||
```python
|
||
# OOM failure state pattern
|
||
memory_total = weights + gradients + optimizer_state + activations
|
||
oom = memory_total > system.ram
|
||
|
||
if oom:
|
||
# Chart bars turn red (update Plotly trace colors)
|
||
bar_colors = ["#CB202D"] * 4 # RedLine
|
||
banner = mo.callout(
|
||
mo.md("🔴 **OOM — Training infeasible on this device.** "
|
||
f"Required: {memory_total:.1f} GB | Available: {system.ram:.1f} GB"),
|
||
kind="danger"
|
||
)
|
||
else:
|
||
bar_colors = ["#006395", "#008F45", "#CC5500", "#4B0082"] # BlueLine, GreenLine, OrangeLine, Purple
|
||
banner = mo.md("")
|
||
```
|
||
|
||
### 4.5 Variable Naming
|
||
Marimo treats the notebook as a single dataflow graph. Variable names must be unique across all cells. Use cell-specific prefixes:
|
||
|
||
```python
|
||
# Act 1 variables
|
||
act1_prediction = mo.ui.radio(...)
|
||
act1_depth_slider = mo.ui.slider(...)
|
||
act1_fig = go.Figure(...)
|
||
|
||
# Act 2 variables
|
||
act2_prediction = mo.ui.number(...)
|
||
act2_batch_slider = mo.ui.slider(...)
|
||
act2_fig_memory = go.Figure(...)
|
||
```
|
||
|
||
### 4.6 Visual Identity
|
||
Import all components from `labs.core`. Use the canonical color palette:
|
||
|
||
| Token | Hex | Use |
|
||
|---|---|---|
|
||
| BlueLine | #006395 | Primary data, healthy state |
|
||
| GreenLine | #008F45 | Target/goal, success state |
|
||
| RedLine | #CB202D | Failure state, violation |
|
||
| OrangeLine | #CC5500 | Warning, secondary constraint |
|
||
|
||
Every chart must include a `MathPeek` toggle revealing the governing equation. Every latency waterfall must show the overhead/dispatch tax term.
|
||
|
||
---
|
||
|
||
## Part V: Developer Workflow
|
||
|
||
Every lab goes through exactly these steps in order. No step may be skipped.
|
||
|
||
### Step 1: Read the Chapter Completely
|
||
Read the full chapter text, not the learning objectives summary. Extract:
|
||
- Every quantitative claim with its value and units
|
||
- Every formula with variable definitions
|
||
- Every named invariant or law
|
||
- Every footnote with a numerical assertion
|
||
|
||
Do not write the plan until this extraction is complete.
|
||
|
||
### Step 2: Identify the Central Tension
|
||
Answer these two questions:
|
||
1. What does a typical student *believe* before reading this chapter?
|
||
2. What does the chapter's data reveal that contradicts that belief?
|
||
|
||
The answer to (2) minus the answer to (1) is the lab's pedagogical purpose. Every element of the lab must serve this gap.
|
||
|
||
### Step 3: Build the Traceability Table First
|
||
Before writing any other section, fill in the traceability table with the chapter's quantitative claims. If you can fill fewer than 4 rows, the chapter may not have enough quantitative content for a lab, or you have not read the chapter thoroughly enough. Re-read.
|
||
|
||
### Step 4: Design the Prediction Questions
|
||
For each act, write the prediction question and all answer options. The correct answer should be surprising — students who are overconfident about intuition should be wrong. The distractor options should map to common misconceptions (e.g., "about the same," "2× more expensive," "scales quadratically").
|
||
|
||
Test the question against the chapter: can a student who read the chapter carefully get the answer right? If yes, proceed. If the answer requires outside knowledge, revise.
|
||
|
||
### Step 5: Design the Instruments
|
||
For each instrument, specify:
|
||
- Every slider: min, max, step, default value, and the formula that maps slider position to output
|
||
- Every chart: x-axis, y-axis, all data series, the formula for each series
|
||
- Every threshold line: value, units, source in chapter
|
||
|
||
### Step 6: Design the Failure State
|
||
Identify the physical boundary the student will cross. Write the exact trigger condition as a Python boolean expression. Write the exact banner text. Test that the failure state is reachable within the instrument's slider ranges.
|
||
|
||
### Step 7: Write the Full Plan
|
||
Write all 8 sections of the plan template. Every number must be in the traceability table. Every prediction question must have all options listed. Every reflection must have all options listed.
|
||
|
||
### Step 8: Depth Check
|
||
Verify the plan meets minimum depth:
|
||
- ≥ 150 lines
|
||
- ≥ 4 rows in the traceability table
|
||
- All 8 sections present
|
||
- No placeholder text ("TBD", "see chapter", "varies")
|
||
- Every slider range is a specific number, not "appropriate range"
|
||
- Every act has a prediction question, instrument description, failure state (Act 2), and structured reflection
|
||
|
||
A plan that fails this check is not submitted for implementation.
|
||
|
||
---
|
||
|
||
## Part VI: The Design Ledger (Persistence Schema)
|
||
|
||
The Design Ledger carries student decisions across labs. It is a Python dict (not file I/O) persisted via Marimo's reactive state.
|
||
|
||
```python
|
||
# Schema — all fields are optional except chapter and timestamp
|
||
ledger = {
|
||
"chapter": int, # current chapter number
|
||
"context": str, # deployment context chosen: "training_node" | "edge_inference" | "mcu"
|
||
"timestamp": str, # ISO 8601
|
||
|
||
# Chapter-specific fields (added by each lab, never overwritten)
|
||
"ch05": {
|
||
"activation_choice": str, # "relu" | "sigmoid" | "tanh" | "gelu"
|
||
"max_trainable_width_laptop_gpu": int,
|
||
"training_memory_estimate_error_kb": float,
|
||
"batch_size_chosen": int
|
||
},
|
||
# ch06, ch07, ch08, ... added by subsequent labs
|
||
}
|
||
```
|
||
|
||
**Rules:**
|
||
- Each lab adds exactly one `chNN` key. It never modifies prior chapter keys.
|
||
- Downstream labs READ prior chapter values to initialize their default slider positions.
|
||
- Example: lab_10 reads `ledger["ch05"]["activation_choice"]` to set the default activation in the compression comparison.
|
||
- If the ledger is empty (student starting mid-book), labs initialize from `mlsys.Systems.DefaultPreset`.
|
||
|
||
---
|
||
|
||
## Part VII: Validation Checklist
|
||
|
||
Before submitting a plan for implementation, verify every item:
|
||
|
||
**Content**
|
||
- [ ] Core Invariant is stated in one sentence with a specific quantitative claim
|
||
- [ ] Both prediction questions are structured (not free text)
|
||
- [ ] Each prediction question has exactly 4 options with one correct answer marked
|
||
- [ ] Correct answer is surprising (students who haven't read chapter likely get it wrong)
|
||
- [ ] Traceability table has ≥ 4 rows, all with chapter citations
|
||
- [ ] No placeholder values in slider ranges or chart axes
|
||
|
||
**Structure**
|
||
- [ ] All 8 plan sections present
|
||
- [ ] 2-Act structure (not 3 KATs)
|
||
- [ ] Target duration is 35–40 minutes
|
||
- [ ] 2 deployment contexts defined (not 4 narrative tracks)
|
||
|
||
**Instruments**
|
||
- [ ] Every slider has specific min, max, step, default values
|
||
- [ ] Every chart has labeled axes with units
|
||
- [ ] Act 2 has at least one failure state with trigger condition and banner text
|
||
- [ ] Prediction-vs-reality overlay is specified for both acts
|
||
|
||
**Design Ledger**
|
||
- [ ] Output fields are listed in Section 7
|
||
- [ ] At least one future lab reads a field from this lab's ledger output
|
||
|
||
**Implementation Readiness**
|
||
- [ ] Plan is ≥ 150 lines
|
||
- [ ] All numbers are chapter-grounded (no invented values)
|
||
- [ ] No instruments used that are introduced in a later chapter
|
||
|
||
---
|
||
|
||
## Appendix A: Lab Slug List
|
||
|
||
**Volume 1 (Foundations)**
|
||
|
||
| Lab | Slug | Chapter |
|
||
|---|---|---|
|
||
| 00 | lab_00_the_map | Architect's Portal (special case: no prediction lock, track orientation only) |
|
||
| 01 | lab_01_ml_intro | ML Introduction |
|
||
| 02 | lab_02_ml_systems | ML Systems |
|
||
| 03 | lab_03_ml_workflow | ML Workflow |
|
||
| 04 | lab_04_data_engr | Data Engineering |
|
||
| 05 | lab_05_nn_compute | Neural Computation |
|
||
| 06 | lab_06_nn_arch | NN Architectures |
|
||
| 07 | lab_07_ml_frameworks | ML Frameworks |
|
||
| 08 | lab_08_model_train | Training |
|
||
| 09 | lab_09_data_selection | Data Selection |
|
||
| 10 | lab_10_model_compress | Model Compression |
|
||
| 11 | lab_11_hw_accel | HW Acceleration |
|
||
| 12 | lab_12_perf_bench | Performance Benchmarking |
|
||
| 13 | lab_13_model_serving | Model Serving |
|
||
| 14 | lab_14_ml_ops | ML Operations |
|
||
| 15 | lab_15_responsible_engr | Responsible Engineering |
|
||
| 16 | lab_16_ml_conclusion | Conclusion (special case: synthesis across all invariants) |
|
||
|
||
**Volume 2 (Scale)** — same conventions apply; labs prefix `v2_`.
|
||
|
||
---
|
||
|
||
## Appendix B: Instrument Library
|
||
|
||
Available instruments from `labs.core`. Use only what is listed here. Do not invent new component names.
|
||
|
||
| Component | Import | Chapter First Available |
|
||
|---|---|---|
|
||
| `LatencyWaterfall` | `labs.core.components` | lab_02 |
|
||
| `MathPeek` | `labs.core.components` | lab_01 |
|
||
| `ComparisonRow` | `labs.core.components` | lab_01 |
|
||
| `MetricRow` | `labs.core.components` | lab_01 |
|
||
| `StakeholderMessage` | `labs.core.components` | lab_03 |
|
||
| `RooflineVisualizer` | `labs.core.components` | lab_11 |
|
||
| `PredictionLock` | `labs.core.components` | lab_01 (use compliant version from §4.3) |
|
||
|
||
---
|
||
|
||
*This document governs all lab development. When in doubt, ask: "Could Hennessy and Patterson point to the chapter line that justifies this slider range?" If not, it doesn't belong in the lab.*
|