mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-03-09 07:15:51 -05:00


Vijay Janapa Reddi d299e49d10 update

2026-02-28 16:25:00 -05:00


Lab Developer Protocol: Gold Standard Specification

This is the authoritative instruction set for any agent or developer building an interactive lab for the Machine Learning Systems textbook. Labs are not demos. They are pedagogical instruments: every element — slider range, chart axis, prediction question, reflection prompt — must be traceable to a specific quantitative claim in the chapter.

A lab that cannot cite its chapter source for every number is not finished.

Part I: What These Labs Are For

The textbook teaches students to reason quantitatively about ML infrastructure. The labs force students to experience the consequences of the invariants they just read about. The sequence is:

Read the chapter → Predict what will happen → Discover that reality differs → Explain why using the chapter's math.

The prediction step is the most important. A student who predicts wrong and then discovers why has learned more than a student who reads a correct answer. Every lab must manufacture that productive failure.

Labs are not:

Demos that illustrate concepts students already accept
Tutorials that walk students through known steps
Exploratory sandboxes with no expected destination

Labs are:

Structured confrontations with a quantitative reality that surprises
Diagnosis instruments that surface root causes students couldn't see in the text
Design challenges where constraints collide and every choice has a cost

Part II: The Invariants (Non-Negotiable Quality Gates)

Every lab plan and every lab implementation must satisfy all of the following. These are not suggestions.

Invariant 1: Every Number Has a Source

Before writing a single line of the lab plan, the developer must extract the actual quantitative claims from the chapter text. Then every number in the lab must trace to a specific claim.

The plan must contain a Traceability Table (Section 8 of the plan template) that maps:

Lab Element	Chapter Section	Exact Claim Being Tested
`[slider range / chart value / threshold]`	`[@sec-... or line number]`	`"[exact quote or formula from chapter]"`

A plan without a complete traceability table is rejected.
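
As a minimal sketch of how this gate could be checked mechanically, the snippet below models each table row as a small record and reports rows with a blank source or claim. `TraceRow`, `missing_sources`, and the `@sec-example` reference are hypothetical names used only for illustration; they are not part of the lab tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRow:
    """One row of the Traceability Table (hypothetical helper type)."""
    lab_element: str      # e.g. a slider range or threshold
    chapter_section: str  # e.g. an @sec-... reference or a line number
    exact_claim: str      # quote or formula copied from the chapter


def missing_sources(rows: list[TraceRow]) -> list[str]:
    """Return the lab elements whose source or claim is blank."""
    return [
        row.lab_element
        for row in rows
        if not row.chapter_section.strip() or not row.exact_claim.strip()
    ]


rows = [
    TraceRow("batch slider range", "@sec-example", "quoted claim goes here"),
    TraceRow("OOM threshold", "", ""),  # unsourced: the plan would be rejected
]
unsourced = missing_sources(rows)
print(unsourced)
```

A check like this only confirms that a citation is present. Whether the quoted claim actually appears in the chapter still needs a human or a text search against the chapter source.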

Invariant 2: Structured Predictions, Never Free Text

The mo.stop(prediction == "", ...) gate currently accepts any non-empty string, including "idk", and then unlocks all instruments. This defeats the entire pedagogical purpose.

Required prediction formats:

Format	When to Use	Implementation
Multiple choice (4 options)	When the answer is a specific ratio, threshold, or category	Radio buttons; exactly one correct answer; distractor options at 5×, 10×, 50× etc.
Numeric range entry	When the answer is a quantity the student can estimate (memory, latency, FLOPS)	Number input field; system records estimate for later comparison overlay
Sentence completion with dropdown	When the reflection requires using chapter terminology	Partial sentence with 3–4 dropdown options; only one is semantically correct

Never use: open text fields for predictions, "type your hypothesis," or gates that accept any non-empty string.

After the act completes, the lab must overlay the student's prediction on the actual result with an annotation:

"You predicted [X]. The actual value is [Y]. You were off by [Z]×."

This overlay is the learning moment. Do not skip it.
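
One way the annotation could be produced is sketched below. The function name `overlay_text` is hypothetical, and the symmetric "off by" factor (larger value divided by smaller) is an assumption: the protocol does not say how over-predictions should be reported, and a plain `actual / predicted` ratio drops below 1 when the student guesses too high.

```python
def overlay_text(predicted: float, actual: float) -> str:
    """Format the prediction-vs-reality annotation.

    Assumption: the "off by" factor is reported symmetrically, so over- and
    under-estimates both yield a factor >= 1.
    """
    if predicted <= 0 or actual <= 0:
        raise ValueError("predicted and actual must be positive")
    factor = max(predicted, actual) / min(predicted, actual)
    return (
        f"You predicted {predicted:g}. The actual value is {actual:g}. "
        f"You were off by {factor:.1f}×."
    )


print(overlay_text(5, 50))    # under-prediction
print(overlay_text(200, 50))  # over-prediction
```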

Invariant 3: Failure States Are Mandatory

Every lab must have at least one instrument that can reach a visually distinct failure state when the student's design violates a physical constraint. Failure states teach constraints more effectively than captions.

Required failure states by constraint type:

Constraint	Failure State Visual	Trigger Condition
Memory wall	Bar chart turns red; banner: "OOM — Training infeasible on this device"	`memory_footprint > device.ram`
Latency budget	Timeline bar turns red; banner: "SLA violated — P99 exceeds budget"	`latency_p99 > sla_budget`
Power/thermal	Gauge turns orange → red; banner: "Thermal throttle — sustained throughput drops to [X]%"	`power > tdp`
Compute budget	ROI gauge drops below zero; banner: "Negative ROI — cost exceeds benefit"	`cost > revenue_gain`

The failure state must be reversible: students should be able to pull sliders back and watch the system recover. The point is to find the boundary, not to punish the student.
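
Reversibility follows naturally if the failure state is derived from the current control values rather than stored. The sketch below covers two rows of the table under that assumption; `DesignPoint` and `failure_banners` are illustrative names, not components of `labs.core`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """Values read from the instruments (all field names are illustrative)."""
    memory_footprint_gb: float
    device_ram_gb: float
    latency_p99_ms: float
    sla_budget_ms: float


def failure_banners(point: DesignPoint) -> list[str]:
    """Return the banner text for every violated constraint.

    The function is stateless, so moving a slider back below the limit
    clears the banner on the next evaluation.
    """
    banners = []
    if point.memory_footprint_gb > point.device_ram_gb:
        banners.append("OOM — Training infeasible on this device")
    if point.latency_p99_ms > point.sla_budget_ms:
        banners.append("SLA violated — P99 exceeds budget")
    return banners


over = DesignPoint(3.0, 2.0, 40.0, 50.0)   # footprint exceeds RAM
back = DesignPoint(1.5, 2.0, 40.0, 50.0)   # slider pulled back
print(failure_banners(over))
print(failure_banners(back))
```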

Invariant 4: 2-Act Structure (Not 3 KATs)

Labs use a 2-Act structure. The 3-KAT (three 15-minute Key Analysis Tasks) format produces 45–90 minute sessions, which students abandon mid-lab.

Act 1: Calibration (10–15 minutes)
  - One focused prediction question
  - One primary instrument with 1–2 controls
  - One structured reflection
  - Outcome: Student has a wrong prior corrected by data

Act 2: Design Challenge (18–25 minutes)
  - One numeric prediction
  - The full instrument set (2–3 charts, multiple controls)
  - One scaling challenge: push the system to its physical limit
  - Structured multi-choice reflection
  - Outcome: Student has made a design decision with quantified trade-offs

Total target: 35–40 minutes. If a lab plan requires more than 40 minutes to complete both acts, it must be trimmed.

Invariant 5: 2 Deployment Contexts (Not 4 Narrative Tracks)

The original 4-track system (Cloud Titan, Edge Guardian, Mobile Nomad, Tiny Pioneer) requires 128 track-specific scenarios across 16 labs. This is unsustainable to maintain and does not provide proportional pedagogical value.

Replacement: Every lab uses 2 deployment contexts as a comparison toggle, not as a persistent narrative identity.

The two contexts for Volume 1 are always drawn from:

Context Pool	Device	Key Constraint
Training Node	H100 (80 GB)	Maximize throughput; memory is abundant
Edge Inference	Mobile GPU (2 GB)	Minimize latency; memory is the wall
MCU	ARM Cortex-M (256 KB SRAM)	Sub-1 mW; only quantized INT8 models fit

Each lab chooses the two contexts most relevant to its chapter's invariant and presents them as a comparison toggle. Students see the same system behave differently under different constraints. This is more instructive than narrative identity.

The Design Ledger carries the student's chapter-5 deployment context into chapter-8, so cross-lab continuity is preserved without maintaining 4 parallel track scripts.

Invariant 6: No Instruments Before Chapter Introduction

Progressive disclosure is enforced strictly:

Lab	Instruments Available
lab_01	Magnitude Gap slider, D·A·M triangle
lab_02	+ Latency Waterfall
lab_05	+ Activation Comparator, Memory Ledger
lab_09	+ Pareto Curve
lab_10	+ Compression Trade-off Frontier
lab_11	+ Roofline Model
lab_13	+ P99 Latency Histogram, Little's Law Calculator

Agents must not use an instrument in lab N if it is introduced in the chapter for lab N+k. Verify against the chapter text.
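
The rule can be checked against the table with a simple lookup. The following is a sketch only: the dictionary restates the table, and `premature_instruments` is a hypothetical helper name. It assumes instruments are identified by the display names used in the table.

```python
# Lab number in which each instrument first becomes available (from the table above).
FIRST_AVAILABLE = {
    "Magnitude Gap slider": 1,
    "D·A·M triangle": 1,
    "Latency Waterfall": 2,
    "Activation Comparator": 5,
    "Memory Ledger": 5,
    "Pareto Curve": 9,
    "Compression Trade-off Frontier": 10,
    "Roofline Model": 11,
    "P99 Latency Histogram": 13,
    "Little's Law Calculator": 13,
}


def premature_instruments(lab_number: int, used: list[str]) -> list[str]:
    """Return instruments used in `lab_number` before their introduction.

    Unknown names are also reported, since they cannot be verified.
    """
    return [
        name
        for name in used
        if name not in FIRST_AVAILABLE or FIRST_AVAILABLE[name] > lab_number
    ]


print(premature_instruments(5, ["Latency Waterfall", "Roofline Model"]))
```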

Part III: The Plan Template (Required Structure)

Every lab plan must contain exactly these 8 sections in this order. A plan missing any section is not ready for implementation.

Section 1: Chapter Alignment

- Chapter: [Title] (`@sec-[slug]`)
- Core Invariant: [One sentence. This is the chapter's central quantitative claim.]
- Central Tension: [Two sentences. What wrong prior does the student bring? What does the data reveal?]
- Target Duration: [X–Y minutes (2 acts)]

Section 2: The Two-Act Structure Overview

One paragraph per act stating the pedagogical goal of that act in plain English. No bullet lists. Write it as a statement of what the student will experience and what they will learn.

Section 3: Act 1 — Calibration

Required subsections:

Pedagogical Goal — one paragraph stating the wrong prior and what will correct it
The Lock (Structured Prediction) — the exact prediction question, with all answer choices listed, the correct answer marked, and the reason each distractor is plausible
The Instrument — describe every control (slider range, toggle options, selectors) and every output (chart type, axis labels, what updates on each control change)
The Reveal — the exact overlay text shown after interaction, including the prediction-vs-reality gap annotation
Reflection (Structured) — sentence completion with dropdown, or 4-option multiple choice. The exact text of the prompt and all options, with correct answer marked.
Math Peek — the LaTeX formula that governs this act (collapsible panel)

Section 4: Act 2 — Design Challenge

Required subsections:

Pedagogical Goal — one paragraph
The Lock (Numeric Prediction) — exact question text, expected answer range, what the system will show afterward
The Instrument — complete control inventory with ranges, and complete output inventory with formulas
The Scaling Challenge — a specific target the student must hit by exploring the design space (e.g., "find maximum W where training fits on a Laptop GPU")
The Failure State — exact trigger condition, exact visual change, exact banner text
Structured Reflection — exact prompt and all options
Math Peek — the governing equation

Section 5: Visual Layout Specification

List every chart in priority order. For each chart:

Chart type (histogram, stacked bar, scatter, waterfall, etc.)
X axis: label, range, units
Y axis: label, range, units
What data series are shown
When/how it enters failure state (if applicable)

Section 6: Deployment Context Definitions

Section 7: Design Ledger Output

The exact JSON fields the lab records at completion, and which future labs read those fields. If no future lab reads a field, it should not be recorded.

{
  "chapter": N,
  "field_name": "<value>",
  ...
}

Section 8: Traceability Table (Mandatory)

Every quantitative value in the lab must appear in this table.

Lab Element	Chapter Section	Exact Claim Being Tested
[every slider range, threshold, formula, ratio]	[@sec-... or line number]	["exact quote"]

Rows without a chapter source are placeholder content and must be replaced with chapter-grounded values before the plan is complete.

Part IV: Technical Implementation Specification

4.1 Format

Single-file Marimo Notebook (.py)
WASM-first: zero local file I/O; all data in native Python dicts
Named lab_NN_slug.py with underscore-separated slug matching chapter slug
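
The naming rule lends itself to a quick pattern check. This is a sketch under two assumptions that the protocol does not state explicitly: NN is exactly two digits, and slugs use lowercase letters and digits only. The Volume 2 `v2_` prefix is not covered here.

```python
import re

# Assumption: NN is exactly two digits and the slug is lowercase
# alphanumerics separated by underscores, as in Appendix A.
LAB_FILENAME = re.compile(r"lab_\d{2}_[a-z0-9]+(?:_[a-z0-9]+)*\.py")


def is_valid_lab_filename(name: str) -> bool:
    """Check a file name against the lab_NN_slug.py convention."""
    return LAB_FILENAME.fullmatch(name) is not None


print(is_valid_lab_filename("lab_05_nn_compute.py"))
print(is_valid_lab_filename("lab_5_nn-compute.py"))
```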

4.2 Physics Engine

All computations must call mlsys.Engine.solve(). No hardcoded physics constants outside mlsys/constants.py.

from mlsys import Engine, Models, Systems

profile = Engine.solve(
    model=Models.ResNet50,
    system=Systems.Mobile,
    batch_size=32,
    precision="int8",     # "fp32" | "fp16" | "int8"
    efficiency=0.5        # float 0.0–1.0
)

# Guaranteed fields on the returned PerformanceProfile:
# .latency            (Pint ms)
# .latency_compute    (Pint ms)
# .latency_memory     (Pint ms)
# .latency_overhead   (Pint ms) — dispatch tax; always shown
# .throughput         (Pint samples/sec)
# .bottleneck         (str: "Compute" | "Memory")
# .energy             (Pint joule)
# .memory_footprint   (Pint byte)
# .feasible           (bool: memory_footprint <= system.ram)

4.3 Prediction Lock Implementation

The current mo.stop(prediction == "", ...) implementation is not compliant. The compliant prediction lock must:

For multiple choice: use a radio group; mo.stop fires until a radio option is selected (not just any text)
For numeric entry: use a number input; mo.stop fires until the value is non-null and within a plausible order-of-magnitude range
After act completion: always show the prediction-vs-reality overlay in a dedicated card above the reflection prompt

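import marimo as mo

# Cell A: compliant multiple-choice lock. Define the radio group in its own cell;
# marimo re-runs the cells that read a UI element, not the cell that defines it.
prediction_choice = mo.ui.radio(
    options={"A) ~1-2×": "1x", "B) ~5×": "5x", "C) ~20×": "20x", "D) ~50×": "50x"},
    label="How much more expensive is Sigmoid than ReLU in transistors?"
)

# Cell B: gate plus reveal overlay (shown after Act 1 completes).
# mo.stop halts this cell, and every cell that depends on it, until an option is selected.
mo.stop(prediction_choice.value is None, mo.md("⚠️ Select your prediction to unlock instruments."))
actual = 50  # example value; in a real lab it must appear in the Traceability Table
predicted = {"1x": 1, "5x": 5, "20x": 20, "50x": 50}[prediction_choice.value]
gap = actual / predicted
mo.md(f"**You predicted {predicted}×. Actual: {actual}×. You were off by {gap:.1f}×.**")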
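
For the numeric-entry case, the "non-null and within a plausible order-of-magnitude range" test could be expressed as a small predicate like the one sketched here. `numeric_lock_open` is a hypothetical helper, and the two-decade window is an assumption, since the protocol does not fix the width. The `reference` argument is a rough magnitude chosen by the lab author, not necessarily the correct answer.

```python
import math
from typing import Optional


def numeric_lock_open(value: Optional[float], reference: float, decades: float = 2.0) -> bool:
    """Return True when a numeric prediction should unlock the instruments.

    Assumption: "plausible" means a positive value within `decades` orders of
    magnitude of `reference`; the protocol does not fix the window width.
    """
    if value is None or value <= 0 or reference <= 0:
        return False
    return abs(math.log10(value / reference)) <= decades


print(numeric_lock_open(None, 50))  # nothing entered yet
print(numeric_lock_open(5, 50))     # one decade low: accepted
print(numeric_lock_open(1e6, 50))   # more than four decades high: rejected
```

In a notebook, the predicate would be negated and passed as the first argument of `mo.stop`, reading the value of a number input such as `act2_prediction` from §4.5.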

4.4 Failure State Implementation

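# OOM failure state pattern
# Assumes every term below, and system.ram, is a plain float in GB;
# convert Pint quantities to GB magnitudes before comparing or formatting.
memory_total = weights + gradients + optimizer_state + activations
oom = memory_total > system.ram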

if oom:
    # Chart bars turn red (update Plotly trace colors)
    bar_colors = ["#CB202D"] * 4  # RedLine
    banner = mo.callout(
        mo.md("🔴 **OOM — Training infeasible on this device.**  "
              f"Required: {memory_total:.1f} GB | Available: {system.ram:.1f} GB"),
        kind="danger"
    )
else:
    bar_colors = ["#006395", "#008F45", "#CC5500", "#4B0082"]  # BlueLine, GreenLine, OrangeLine, Purple
    banner = mo.md("")

4.5 Variable Naming

Marimo treats the notebook as a single dataflow graph, so each global variable may be defined in only one cell. Variable names must therefore be unique across all cells. Use act-specific prefixes:

# Act 1 variables
act1_prediction = mo.ui.radio(...)
act1_depth_slider = mo.ui.slider(...)
act1_fig = go.Figure(...)

# Act 2 variables
act2_prediction = mo.ui.number(...)
act2_batch_slider = mo.ui.slider(...)
act2_fig_memory = go.Figure(...)
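
Name collisions can be caught before the notebook is opened. The sketch below uses the standard-library `ast` module on a list of cell source strings; `duplicate_globals` is a hypothetical helper, and it deliberately inspects only plain top-level `name = ...` assignments (imports, function definitions, and tuple unpacking are ignored).

```python
import ast
from collections import defaultdict


def duplicate_globals(cells: list[str]) -> dict[str, list[int]]:
    """Map each name assigned at top level in more than one cell to its cell indices.

    Simplification: only plain `name = ...` assignments are inspected.
    """
    owners = defaultdict(set)
    for index, source in enumerate(cells):
        for node in ast.parse(source).body:
            if isinstance(node, ast.Assign):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        owners[target.id].add(index)
    return {name: sorted(idx) for name, idx in owners.items() if len(idx) > 1}


# `fig` is assigned in two cells; the prefixed names are unique.
cells = ["fig = 1\nact1_fig = fig", "fig = 2", "act2_fig_memory = 3"]
print(duplicate_globals(cells))
```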

4.6 Visual Identity

Import all components from labs.core. Use the canonical color palette:

Token	Hex	Use
BlueLine	#006395	Primary data, healthy state
GreenLine	#008F45	Target/goal, success state
RedLine	#CB202D	Failure state, violation
OrangeLine	#CC5500	Warning, secondary constraint

Every chart must include a MathPeek toggle revealing the governing equation. Every latency waterfall must show the overhead/dispatch tax term.

Part V: Developer Workflow

Every lab goes through exactly these steps in order. No step may be skipped.

Step 1: Read the Chapter Completely

Read the full chapter text, not the learning objectives summary. Extract:

Every quantitative claim with its value and units
Every formula with variable definitions
Every named invariant or law
Every footnote with a numerical assertion

Do not write the plan until this extraction is complete.

Step 2: Identify the Central Tension

Answer these two questions:

What does a typical student believe before reading this chapter?
What does the chapter's data reveal that contradicts that belief?

The answer to (2) minus the answer to (1) is the lab's pedagogical purpose. Every element of the lab must serve this gap.

Step 3: Build the Traceability Table First

Before writing any other section, fill in the traceability table with the chapter's quantitative claims. If you can fill fewer than 4 rows, the chapter may not have enough quantitative content for a lab, or you have not read the chapter thoroughly enough. Re-read.

Step 4: Design the Prediction Questions

For each act, write the prediction question and all answer options. The correct answer should be surprising — students who are overconfident about intuition should be wrong. The distractor options should map to common misconceptions (e.g., "about the same," "2× more expensive," "scales quadratically").

Test the question against the chapter: can a student who read the chapter carefully get the answer right? If yes, proceed. If the answer requires outside knowledge, revise.

Step 5: Design the Instruments

For each instrument, specify:

Every slider: min, max, step, default value, and the formula that maps slider position to output
Every chart: x-axis, y-axis, all data series, the formula for each series
Every threshold line: value, units, source in chapter

Step 6: Design the Failure State

Identify the physical boundary the student will cross. Write the exact trigger condition as a Python boolean expression. Write the exact banner text. Test that the failure state is reachable within the instrument's slider ranges.
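
The reachability test can be sketched as a sweep over every slider position. `failure_reachable` is a hypothetical helper, and the linear memory formula is made up purely for illustration; only the 2 GB limit echoes the Edge Inference context from Invariant 5.

```python
def failure_reachable(trigger, minimum: float, maximum: float, step: float) -> bool:
    """Return True if `trigger(value)` fires for any slider position."""
    steps = int(round((maximum - minimum) / step))
    return any(trigger(minimum + i * step) for i in range(steps + 1))


# Illustrative numbers only: a 2 GB device and a made-up linear memory model.
DEVICE_RAM_GB = 2.0


def oom(batch_size: float) -> bool:
    memory_gb = 0.5 + 0.05 * batch_size  # hypothetical footprint formula
    return memory_gb > DEVICE_RAM_GB


print(failure_reachable(oom, 1, 64, 1))  # slider range crosses the boundary
print(failure_reachable(oom, 1, 16, 1))  # slider range stops short of it
```

If the sweep returns False, either the slider maximum or the trigger condition needs revisiting before the plan moves on.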

Step 7: Write the Full Plan

Write all 8 sections of the plan template. Every number must be in the traceability table. Every prediction question must have all options listed. Every reflection must have all options listed.

Step 8: Depth Check

Verify the plan meets minimum depth:

≥ 150 lines
≥ 4 rows in the traceability table
All 8 sections present
No placeholder text ("TBD", "see chapter", "varies")
Every slider range is a specific number, not "appropriate range"
Every act has a prediction question, instrument description, failure state (Act 2), and structured reflection

A plan that fails this check is not submitted for implementation.
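
Part of this check is mechanical and could be automated along the lines sketched below. `depth_check` is a hypothetical helper; it assumes section headings begin a line with "Section N:" and uses crude substring matching for placeholders, so a legitimate use of a word like "varies" would be flagged. Traceability row counts and per-act completeness are not covered.

```python
import re

PLACEHOLDERS = ("TBD", "see chapter", "varies", "appropriate range")
MIN_LINES = 150
REQUIRED_SECTIONS = 8


def depth_check(plan_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    lines = plan_text.splitlines()
    if len(lines) < MIN_LINES:
        problems.append(f"plan has {len(lines)} lines, needs >= {MIN_LINES}")
    # Assumption: section headings start a line with "Section N:".
    found = {int(n) for n in re.findall(r"^Section (\d+):", plan_text, flags=re.MULTILINE)}
    missing = sorted(set(range(1, REQUIRED_SECTIONS + 1)) - found)
    if missing:
        problems.append(f"missing sections: {missing}")
    lowered = plan_text.lower()
    for phrase in PLACEHOLDERS:
        if phrase.lower() in lowered:
            problems.append(f"placeholder text found: {phrase!r}")
    return problems


draft = "Section 1: Chapter Alignment\nSlider range: TBD"
for problem in depth_check(draft):
    print(problem)
```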

Part VI: The Design Ledger (Persistence Schema)

The Design Ledger carries student decisions across labs. It is a Python dict (not file I/O) persisted via Marimo's reactive state.

# Schema — all fields are optional except chapter and timestamp
ledger = {
    "chapter": int,           # current chapter number
    "context": str,           # deployment context chosen: "training_node" | "edge_inference" | "mcu"
    "timestamp": str,         # ISO 8601

    # Chapter-specific fields (added by each lab, never overwritten)
    "ch05": {
        "activation_choice": str,           # "relu" | "sigmoid" | "tanh" | "gelu"
        "max_trainable_width_laptop_gpu": int,
        "training_memory_estimate_error_kb": float,
        "batch_size_chosen": int
    },
    # ch06, ch07, ch08, ... added by subsequent labs
}

Rules:

Each lab adds exactly one chNN key. It never modifies prior chapter keys.
Downstream labs READ prior chapter values to initialize their default slider positions.
Example: lab_10 reads ledger["ch05"]["activation_choice"] to set the default activation in the compression comparison.
If the ledger is empty (student starting mid-book), labs initialize from mlsys.Systems.DefaultPreset.
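
These rules can be sketched as two small helpers operating on a plain dict. `record_chapter` and `read_prior` are hypothetical names, the `default` argument stands in for whatever `mlsys.Systems.DefaultPreset` would supply, and raising on a repeated key is one possible reading of "never overwritten".

```python
from datetime import datetime, timezone


def record_chapter(ledger: dict, chapter: int, fields: dict) -> dict:
    """Return a new ledger with one chNN key added; prior keys are untouched."""
    key = f"ch{chapter:02d}"
    if key in ledger:
        raise KeyError(f"{key} already recorded; labs never overwrite prior keys")
    updated = dict(ledger)
    updated[key] = dict(fields)
    updated["chapter"] = chapter
    updated["timestamp"] = datetime.now(timezone.utc).isoformat()  # ISO 8601
    return updated


def read_prior(ledger: dict, chapter: int, field: str, default):
    """Read a prior lab's value, falling back to `default` for an empty ledger."""
    return ledger.get(f"ch{chapter:02d}", {}).get(field, default)


ledger = record_chapter({}, 5, {"activation_choice": "gelu"})
print(read_prior(ledger, 5, "activation_choice", "relu"))  # value recorded by lab 5
print(read_prior({}, 5, "activation_choice", "relu"))      # empty ledger: default
```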

Part VII: Validation Checklist

Before submitting a plan for implementation, verify every item:

Content

Core Invariant is stated in one sentence with a specific quantitative claim
Both prediction questions are structured (not free text)
Each multiple-choice prediction question has exactly 4 options with one correct answer marked; each numeric prediction states its expected answer range
Correct answer is surprising (students who haven't read the chapter likely get it wrong)
Traceability table has ≥ 4 rows, all with chapter citations
No placeholder values in slider ranges or chart axes

Structure

All 8 plan sections present
2-Act structure (not 3 KATs)
Target duration is 35–40 minutes
2 deployment contexts defined (not 4 narrative tracks)

Instruments

Every slider has specific min, max, step, default values
Every chart has labeled axes with units
Act 2 has at least one failure state with trigger condition and banner text
Prediction-vs-reality overlay is specified for both acts

Design Ledger

Output fields are listed in Section 7
At least one future lab reads a field from this lab's ledger output

Implementation Readiness

Plan is ≥ 150 lines
All numbers are chapter-grounded (no invented values)
No instruments used that are introduced in a later chapter

Appendix A: Lab Slug List

Volume 1 (Foundations)

Lab	Slug	Chapter
00	lab_00_the_map	Architect's Portal (special case: no prediction lock, track orientation only)
01	lab_01_ml_intro	ML Introduction
02	lab_02_ml_systems	ML Systems
03	lab_03_ml_workflow	ML Workflow
04	lab_04_data_engr	Data Engineering
05	lab_05_nn_compute	Neural Computation
06	lab_06_nn_arch	NN Architectures
07	lab_07_ml_frameworks	ML Frameworks
08	lab_08_model_train	Training
09	lab_09_data_selection	Data Selection
10	lab_10_model_compress	Model Compression
11	lab_11_hw_accel	HW Acceleration
12	lab_12_perf_bench	Performance Benchmarking
13	lab_13_model_serving	Model Serving
14	lab_14_ml_ops	ML Operations
15	lab_15_responsible_engr	Responsible Engineering
16	lab_16_ml_conclusion	Conclusion (special case: synthesis across all invariants)

Volume 2 (Scale): the same conventions apply; lab names take the v2_ prefix.

Appendix B: Instrument Library

Available instruments from labs.core. Use only what is listed here. Do not invent new component names.

Component	Import	Chapter First Available
`LatencyWaterfall`	`labs.core.components`	lab_02
`MathPeek`	`labs.core.components`	lab_01
`ComparisonRow`	`labs.core.components`	lab_01
`MetricRow`	`labs.core.components`	lab_01
`StakeholderMessage`	`labs.core.components`	lab_03
`RooflineVisualizer`	`labs.core.components`	lab_11
`PredictionLock`	`labs.core.components`	lab_01 (use compliant version from §4.3)

This document governs all lab development. When in doubt, ask: "Could Hennessy and Patterson point to the chapter line that justifies this slider range?" If not, it doesn't belong in the lab.
