Add all Vol1 (labs 01-16) and Vol2 (labs 01-17) interactive Marimo labs as the first-pass implementation of the ML Systems curriculum labs.

Each lab follows the PROTOCOL 2-Act structure (35-40 min):
- Act I: Calibration with prediction lock → instruments → overlay
- Act II: Design challenge with failure states and reflection

Key pedagogical instruments introduced progressively:
- Vol1: D·A·M Triad, Iron Law, Memory Ledger, Roofline, Amdahl's Law, Little's Law, P99 Histogram, Compression Frontier, Chouldechova theorem
- Vol2: NVLink vs PCIe cliff, Bisection BW, Young-Daly T*, Parallelism Paradox, AllReduce ring vs tree, KV-cache model, Jevons Paradox, DP ε-δ tradeoff, SLO composition, Adversarial Pareto, two-volume synthesis capstone

All 35 staged files pass AST syntax verification (36/36 including lab_00).

Also includes:
- labs/LABS_SPEC.md: authoritative sub-agent brief for all lab conventions
- labs/core/style.py: expanded unified design system with semantic color tokens
MLSys Labs — Sub-Agent Build Specification
Gold Standard: Every Lab, Both Volumes
READ THIS ENTIRE DOCUMENT BEFORE WRITING A SINGLE LINE OF CODE.
This spec overrides all earlier plan documents.
Who You Are
You are a specialist lab developer for the Machine Learning Systems two-volume textbook.
Your job: write ONE complete, runnable Marimo lab (.py file) that is the gold standard
of pedagogical interactive content. Think: the best CS lab you ever encountered,
combined with a real engineering cockpit.
You are NOT writing a demo. You are writing a structured confrontation with physics.
The Non-Negotiable Rules (PROTOCOL invariants)
Rule 1: 2-Act structure, 35-40 minutes total
Act I — Calibration (12-15 min)
One prediction lock → instruments reveal → structured reflection
Act II — Design Challenge (20-25 min)
One numeric/radio prediction → full instrument set → failure state → reflection
No 3-KAT format. No 45-minute labs. If you write 3 acts, you have failed.
Rule 2: Structured predictions only — never free text
- Use mo.ui.radio(options={...}) with exactly 4 options, one correct
- Or mo.ui.number(start=X, stop=Y, step=Z) for bounded numeric entry
- Gate with mo.stop(prediction.value is None, mo.callout(mo.md("Select your prediction to continue."), kind="warn"))
- AFTER the act: always show the prediction-vs-reality overlay with exact gap
Rule 3: Every check feedback uses mo.callout(mo.md(...))
NEVER inject markdown text into raw HTML strings. The markdown is not parsed there, so bold renders as literal asterisks. Correct pattern:
mo.callout(mo.md("**Correct.** The explanation here with *italic* and **bold**."), kind="success")
mo.callout(mo.md("**Not quite.** The explanation here."), kind="warn")
Rule 4: At least one failure state in Act II
Every Act II must have an instrument that turns red / shows a banner when the student's design violates a physical constraint. The failure must be reversible.
_oom = memory_gb > device_ram_gb
if _oom:
    mo.callout(mo.md(f"🔴 **OOM: infeasible.** Required: {memory_gb:.1f} GB | Available: {device_ram_gb:.1f} GB"), kind="danger")
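A minimal sketch of the same failure check as a pure function, so the logic can be exercised outside Marimo. The name `memory_status` and its `(kind, message)` return value are illustrative assumptions, not part of `labs.core`; in a lab the tuple would feed `mo.callout(mo.md(message), kind=kind)`.

```python
def memory_status(memory_gb: float, device_ram_gb: float) -> tuple[str, str]:
    """Return (callout kind, markdown message) for a memory budget check.

    Hypothetical helper: it holds no state, so the failure clears as soon as
    the student's design drops back under the device limit.
    """
    if memory_gb > device_ram_gb:
        return (
            "danger",
            f"**OOM: infeasible.** Required: {memory_gb:.1f} GB | "
            f"Available: {device_ram_gb:.1f} GB",
        )
    return (
        "success",
        f"**Fits.** Required: {memory_gb:.1f} GB | "
        f"Available: {device_ram_gb:.1f} GB",
    )


# A 96 GB design fails on an 80 GB H100; shrinking it to 40 GB clears the failure.
print(memory_status(96.0, 80.0)[0])  # danger
print(memory_status(40.0, 80.0)[0])  # success
```

Because the status is recomputed from the current inputs on every run, the reversibility requirement holds without any reset logic.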
Rule 5: 2 deployment contexts as comparison toggle, NOT 4 narrative tracks
Each lab picks the 2 contexts most relevant to its chapter invariant:
- Cloud: H100 (80 GB HBM, 3.35 TB/s BW, 700W TDP)
- Edge: Jetson Orin NX (16 GB, 102 GB/s BW, 25W TDP)
- Mobile: Smartphone NPU (8 GB, 68 GB/s BW, 5W sustained)
- TinyML: Cortex-M7 (256 KB SRAM, 0.05 GB/s BW, 0.1W)
Toggle pattern:
context_toggle = mo.ui.radio(
options={"☁️ Cloud (H100)": "cloud", "🤖 Edge (Jetson Orin NX)": "edge"},
label="Deployment context:", inline=True
)
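One possible way to back the toggle with data is a lookup keyed by the radio values. The figures below are copied from the context list in this rule; the `CONTEXTS` layout and the `compare` helper are assumptions made for illustration.

```python
# Figures from the Rule 5 context list; the dict layout itself is hypothetical.
CONTEXTS = {
    "cloud": {"name": "H100", "ram_gb": 80.0, "bw_gbs": 3350.0, "tdp_w": 700.0},
    "edge": {"name": "Jetson Orin NX", "ram_gb": 16.0, "bw_gbs": 102.0, "tdp_w": 25.0},
    "mobile": {"name": "Smartphone NPU", "ram_gb": 8.0, "bw_gbs": 68.0, "tdp_w": 5.0},
    # 256 KB of SRAM expressed in decimal GB
    "tiny": {"name": "Cortex-M7", "ram_gb": 256 / 1e6, "bw_gbs": 0.05, "tdp_w": 0.1},
}


def compare(a: str, b: str, field: str) -> float:
    """Ratio of one spec field between two contexts (a divided by b)."""
    return CONTEXTS[a][field] / CONTEXTS[b][field]


print(f"{compare('cloud', 'edge', 'bw_gbs'):.1f}x bandwidth")  # 32.8x bandwidth
print(f"{compare('cloud', 'edge', 'tdp_w'):.0f}x power")       # 28x power
```

The value selected in `context_toggle` ("cloud" or "edge") can index this dict directly, which keeps the two-context comparison in one place.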
Rule 6: Zero instruments before their chapter introduction
| Lab | First new instrument |
|---|---|
| 01 | Magnitude Gap slider, D·A·M comparison |
| 02 | Latency Waterfall |
| 05 | Memory Ledger, Activation Comparator |
| 09 | Pareto Curve |
| 10 | Compression Trade-off Frontier |
| 11 | Roofline Model |
| 13 | P99 Latency Histogram |
Rule 7: Every number traces to a chapter claim
Never invent thresholds or slider ranges. Every value must come from the chapter text. Comment each constant with its source:
H100_BW_GBS = 3350 # H100 SXM5 HBM3, NVIDIA spec
SRAM_WALL_KB = 256 # Cortex-M7 typical on-chip SRAM ceiling
Rule 8: hide_code=True on all cells except the setup cell
Students see outputs, not implementation. Every @app.cell decorator becomes:
@app.cell(hide_code=True)
Exception: the first imports cell — leave it visible so instructors can inspect.
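This rule can be checked mechanically. The sketch below uses only the standard `ast` module and assumes cells are top-level functions decorated with `@app.cell` or `@app.cell(...)`; the function name `cells_missing_hide_code` is hypothetical.

```python
import ast


def cells_missing_hide_code(source: str) -> list[int]:
    """Return `def` line numbers of cells, after the first, lacking hide_code=True."""
    missing = []
    seen_first = False
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # Handle both the bare `@app.cell` and the called `@app.cell(...)` forms.
            target = dec.func if isinstance(dec, ast.Call) else dec
            if not (isinstance(target, ast.Attribute) and target.attr == "cell"):
                continue
            if not seen_first:
                seen_first = True  # the setup cell is allowed to stay visible
                break
            hidden = isinstance(dec, ast.Call) and any(
                kw.arg == "hide_code"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in dec.keywords
            )
            if not hidden:
                missing.append(node.lineno)
            break
    return missing


SAMPLE = '''
@app.cell
def _():
    return

@app.cell(hide_code=True)
def _():
    return

@app.cell
def _():
    return
'''
print(cells_missing_hide_code(SAMPLE))  # one entry: the third cell
```

The source is only parsed, never executed, so the check runs without Marimo installed.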
Rule 9: All markdown feedback via mo.md(), all text in mo.callout()
The pattern for every concept explanation:
mo.callout(mo.md("**Key insight:** explanation with *emphasis* and `code` notation."), kind="info")
Rule 10: MathPeek accordion on every act
mo.accordion({
"📐 The governing equation": mo.md("""
**Formula:** `T = D/BW + O/R + L`
- **T** — total latency ...
""")
})
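A worked instance of the formula shown in the accordion may help when choosing slider ranges. The spec only defines T; the reading of the other symbols here (D as bytes moved, BW as memory bandwidth, O as operations, R as compute rate, L as fixed overhead) is an assumption, and the workload numbers are hypothetical.

```python
def iron_law_latency(d_bytes, bw_bytes_per_s, ops, rate_ops_per_s, overhead_s):
    """T = D/BW + O/R + L, with every term expressed in seconds."""
    return d_bytes / bw_bytes_per_s + ops / rate_ops_per_s + overhead_s


# Hypothetical workload: move 16 GB, execute 16 GFLOP, pay 10 microseconds of overhead,
# on the H100 figures used in this spec (3350 GB/s, 1979 TFLOPS FP16).
t = iron_law_latency(16e9, 3350e9, 16e9, 1979e12, 10e-6)
print(f"{t * 1e3:.3f} ms")  # 4.794 ms
```

For these inputs the data-movement term (about 4.78 ms) dwarfs the compute term (about 0.008 ms), the kind of gap a prediction lock can expose.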
File Structure Template
import marimo

__generated_with = "0.19.6"
app = marimo.App(width="full")

# ─── CELL 0: SETUP (hide_code=False, leave visible) ─────────────────────────
@app.cell
def _():
    import marimo as mo
    import sys
    from pathlib import Path
    import plotly.graph_objects as go
    import numpy as np

    # Make the repository root importable so labs.* and mlsysim.* resolve
    _root = Path(__file__).resolve().parents[2]
    if str(_root) not in sys.path:
        sys.path.insert(0, str(_root))

    from labs.core.state import DesignLedger
    from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
    from mlsysim.core.hardware import Hardware
    from mlsysim.core.models import Models

    ledger = DesignLedger()
    return mo, ledger, COLORS, LAB_CSS, apply_plotly_theme, Hardware, Models, go, np

# ─── CELL 1: HEADER (hide_code=True) ────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, LAB_CSS, ledger):
    # Dark gradient header with constraint badges
    # See lab_00_introduction.py for reference
    return

# ─── CELL 2: RECOMMENDED READING (hide_code=True) ───────────────────────────
@app.cell(hide_code=True)
def _(mo):
    mo.callout(mo.md("""
    📖 **Recommended Reading** — Complete the following chapter sections before this lab:
    - Section X: [Topic] — [one-line description of what to read]
    - Section Y: [Topic] — [one-line description]
    """), kind="info")
    return

# ─── CELL 3: CONTEXT TOGGLE + LOAD LEDGER (hide_code=True) ─────────────────
@app.cell(hide_code=True)
def _(mo, ledger):
    # 2-context comparison toggle
    # Load deployment context from Design Ledger
    return

# ─── ACT I CELLS ─────────────────────────────────────────────────────────────
# Concept intro → prediction lock → instruments → reveal → reflection → MathPeek

# ─── ACT II CELLS ────────────────────────────────────────────────────────────
# Design challenge intro → prediction → instruments → failure state → reflection

# ─── LEDGER SAVE + HUD (hide_code=True) ─────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, ledger, COLORS):
    # Save chapter results to Design Ledger
    # Render HUD footer
    return

if __name__ == "__main__":
    app.run()
Design Language (CSS Classes from labs/core/style.py)
# Import once in setup cell:
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
# Color tokens:
COLORS['BlueLine'] # #006395 primary data
COLORS['GreenLine'] # #008F45 success / target met
COLORS['RedLine'] # #CB202D failure / violation
COLORS['OrangeLine'] # #CC5500 warning / caution
# Deployment regime accent colors:
COLORS['Cloud'] # #6366f1 indigo
COLORS['Edge'] # #CB202D red
COLORS['Mobile'] # #CC5500 orange
COLORS['Tiny'] # #008F45 green
Constraint badge HTML pattern (use in header):
<span class="badge badge-ok">✅ Latency < 100ms</span>
<span class="badge badge-fail">❌ Power > Budget</span>
The Stakeholder Message Pattern
Every lab opens Act I with a stakeholder message that sets the scenario:
_color = COLORS["BlueLine"] # or regime-specific color
mo.Html(f"""
<div style="border-left:4px solid {_color}; background:{COLORS['BlueL']};
border-radius:0 10px 10px 0; padding:16px 22px; margin:12px 0;">
<div style="font-size:0.72rem; font-weight:700; color:{_color};
text-transform:uppercase; letter-spacing:0.1em; margin-bottom:6px;">
Incoming Message · [Persona Title]
</div>
<div style="font-style:italic; font-size:1.0rem; color:#1e293b; line-height:1.65;">
"[Specific, quantified, urgent message from a named stakeholder]"
</div>
</div>
""")
The Prediction-vs-Reality Overlay Pattern
After Act I instruments run, always show:
_predicted = {"option_a": 10, "option_b": 100, "option_c": 1000}[act1_pred.value]
_actual = computed_value # from physics engine
_ratio = _actual / _predicted if _predicted > 0 else float('inf')
mo.callout(mo.md(
f"**You predicted {_predicted:,}. The actual value is {_actual:,.0f}. "
f"You were off by {_ratio:.1f}×.** [One sentence explaining the gap.]"
), kind="success" if abs(_ratio - 1) < 0.3 else "warn")
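The gap rule in the overlay can be isolated as a plain function for testing. This is a sketch under the same threshold as the pattern above; the name `prediction_gap` and the `tolerance` parameter are assumptions.

```python
def prediction_gap(predicted: float, actual: float, tolerance: float = 0.3):
    """Return (ratio, callout kind) using the same rule as the overlay pattern."""
    ratio = actual / predicted if predicted > 0 else float("inf")
    kind = "success" if abs(ratio - 1) < tolerance else "warn"
    return ratio, kind


print(prediction_gap(100, 120))  # (1.2, 'success')
print(prediction_gap(10, 1000))  # (100.0, 'warn')
```

Note that the threshold is asymmetric in fold terms: a ratio of 0.75 (the prediction was a third too high) passes, while a ratio of 1.33 (the prediction was a quarter too low) does not. If symmetric treatment matters for a lab, comparing `abs(log(ratio))` against a bound is one alternative.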
Volume 1 Lab Assignments
| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|---|---|---|---|---|
| 01 | lab_01_ml_intro.py | introduction.qmd | D·A·M Triad, 9-order magnitude gap | Cloud vs TinyML |
| 02 | lab_02_ml_systems.py | ml_systems.qmd | Iron Law T=D/BW+O/R+L, Memory Wall | Cloud vs Edge |
| 03 | lab_03_ml_workflow.py | ml_workflow.qmd | MLOps feedback loop, silent degradation | Cloud vs Mobile |
| 04 | lab_04_data_engr.py | data_engineering.qmd | Data gravity, pipeline bottlenecks | Cloud vs Edge |
| 05 | lab_05_nn_compute.py | nn_computation.qmd | Activation cost, memory hierarchy | Cloud vs Mobile |
| 06 | lab_06_nn_arch.py | nn_architectures.qmd | Transformer attention O(n²), depth vs width | Cloud vs Edge |
| 07 | lab_07_ml_frameworks.py | frameworks.qmd | Kernel fusion, dispatch overhead | Cloud vs Edge |
| 08 | lab_08_model_train.py | training.qmd | Memory = weights+grads+optimizer+activations | Cloud vs Mobile |
| 09 | lab_09_data_selection.py | data_selection.qmd | Curriculum learning, selection cost | Cloud vs Edge |
| 10 | lab_10_model_compress.py | optimizations.qmd (model_compression) | Quantization/pruning Pareto frontier | Cloud vs Mobile |
| 11 | lab_11_hw_accel.py | hw_acceleration.qmd | Roofline Model, ridge point, MFU | Cloud vs Edge |
| 12 | lab_12_perf_bench.py | benchmarking.qmd | Benchmark validity, Amdahl's Law | Cloud vs Edge |
| 13 | lab_13_model_serving.py | model_serving.qmd | Little's Law, P99 vs avg latency | Cloud vs Mobile |
| 14 | lab_14_ml_ops.py | ml_ops.qmd | Drift detection, retraining cost | Cloud vs Edge |
| 15 | lab_15_responsible_engr.py | responsible_engr.qmd | Fairness-accuracy tradeoff, audit cost | Cloud vs Mobile |
| 16 | lab_16_ml_conclusion.py | conclusion.qmd | Synthesis: all invariants, cross-lab ledger | All 4 |
Volume 2 Lab Assignments
| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|---|---|---|---|---|
| 01 | lab_01_introduction.py | introduction.qmd | Scale laws: single-node → fleet | Cloud vs Fleet |
| 02 | lab_02_compute_infra.py | compute_infrastructure.qmd | NVLink vs PCIe BW, interconnect wall | Single-node vs Multi-node |
| 03 | lab_03_network_fabrics.py | network_fabrics.qmd | Bisection BW, fat-tree topology | 8-GPU vs 1024-GPU |
| 04 | lab_04_data_storage.py | data_storage.qmd | Data gravity, I/O bottleneck | NVMe vs distributed FS |
| 05 | lab_05_dist_train.py | distributed_training.qmd | Parallelism Paradox, MFU at scale | DP vs 3D-Parallel |
| 06 | lab_06_collective_comms.py | collective_communication.qmd | AllReduce bandwidth, ring vs tree | Ring vs Tree topology |
| 07 | lab_07_fault_tolerance.py | fault_tolerance.qmd | Young-Daly optimal checkpoint interval | 8-GPU vs 16k-GPU |
| 08 | lab_08_fleet_orch.py | fleet_orchestration.qmd | Utilization vs queue latency | FIFO vs priority sched |
| 09 | lab_09_perf_engr.py | performance_engineering.qmd | Profile-guided optimization, Amdahl | Batch vs streaming |
| 10 | lab_10_dist_inference.py | inference.qmd | KV-cache memory, continuous batching | Latency vs throughput |
| 11 | lab_11_edge_intelligence.py | edge_intelligence.qmd | Federated learning communication cost | Centralized vs federated |
| 12 | lab_12_ops_scale.py | ops_scale.qmd | SLO budget allocation, cascading failure | K8s vs bare metal |
| 13 | lab_13_security_privacy.py | security_privacy.qmd | Differential privacy ε-δ tradeoff | On-prem vs cloud |
| 14 | lab_14_robust_ai.py | robust_ai.qmd | Adversarial robustness vs accuracy | Production vs hardened |
| 15 | lab_15_sustainable_ai.py | sustainable_ai.qmd | Jevons Paradox, carbon-aware scheduling | Coal region vs renewable |
| 16 | lab_16_responsible_ai.py | responsible_ai.qmd | Fairness metrics incompatibility | Accuracy vs equity |
| 17 | lab_17_ml_conclusion.py | conclusion.qmd | Synthesis: Vol1+Vol2 invariant audit | Full fleet |
The Design Ledger Schema
Each lab saves exactly one chNN key. Downstream labs read prior keys.
# Vol1 schema
ledger.save(chapter=N, design={
"context": "cloud" | "edge" | "mobile" | "tiny",
"act1_prediction": str, # the radio/number value student chose
"act1_correct": bool,
"act2_result": float, # key quantitative outcome
"act2_decision": str, # e.g. "quantize" | "prune" | "increase_batch"
"constraint_hit": bool, # did student trigger the failure state?
})
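Before calling `ledger.save`, a lab could check its `design` dict against the schema above. The validator below is a hypothetical helper, not part of `labs.core.state`; the field names and allowed context values are taken from the schema, and the strict type check (for example, rejecting an `int` for `act2_result`) is an assumption.

```python
ALLOWED_CONTEXTS = {"cloud", "edge", "mobile", "tiny"}
FIELD_TYPES = {
    "context": str,
    "act1_prediction": str,
    "act1_correct": bool,
    "act2_result": float,
    "act2_decision": str,
    "constraint_hit": bool,
}


def validate_design(design: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict matches the schema."""
    problems = []
    for key in sorted(set(FIELD_TYPES) - set(design)):
        problems.append(f"missing key: {key}")
    for key in sorted(set(design) - set(FIELD_TYPES)):
        problems.append(f"unexpected key: {key}")
    for key, expected in FIELD_TYPES.items():
        if key in design and not isinstance(design[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    context = design.get("context")
    if isinstance(context, str) and context not in ALLOWED_CONTEXTS:
        problems.append(f"context: unknown value {context!r}")
    return problems


example = {
    "context": "cloud",
    "act1_prediction": "option_b",
    "act1_correct": False,
    "act2_result": 42.5,
    "act2_decision": "quantize",
    "constraint_hit": True,
}
print(validate_design(example))  # []
```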
What Good Looks Like — The Standard
Study labs/vol1/lab_00_introduction.py for:
- Header structure (dark gradient, constraint badges, time estimate)
- mo.stop() gating pattern
- mo.callout(mo.md(...)) for all feedback
- mo.ui.tabs() for multi-section navigation
- Design Ledger HUD footer
The bar: if a student at Stanford in a graduate ML Systems course opened this lab, they should feel that it is the most intellectually rigorous and well-crafted interactive lab they have ever seen. Every slider range is justified by physics. Every question is designed to produce productive failure. Every chart updates live.
Import Reference (working paths, verified)
from labs.core.state import DesignLedger # ✓ verified
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme # ✓ verified
from labs.core.components import MathPeek, MetricRow, ComparisonRow # ✓ verified
from mlsysim.core.hardware import Hardware # Cloud.H100, Edge.JetsonOrinNX, etc.
from mlsysim.core.models import Models # Language.Llama3_8B, Vision.ResNet50, etc.
from mlsysim.core.constants import ( # raw constants with units
H100_MEM_BW, H100_FLOPS_FP16_TENSOR, H100_TDP,
A100_MEM_BW, MOBILE_NPU_MEM_BW, ESP32_RAM,
)
Hardware constants for inline use (no pint units — plain floats):
# Cloud
H100_BW_GBS = 3350 # GB/s
H100_TFLOPS_FP16 = 1979 # TFLOPS
H100_RAM_GB = 80 # GB HBM
H100_TDP_W = 700 # Watts
# Edge
ORIN_BW_GBS = 102 # GB/s
ORIN_TFLOPS = 100 # TOPS (INT8), despite the TFLOPS name
ORIN_RAM_GB = 16 # GB
ORIN_TDP_W = 25 # Watts
# Mobile
MOBILE_BW_GBS = 68 # GB/s (Apple A17 class)
MOBILE_TOPS_INT8 = 35 # TOPS
MOBILE_RAM_GB = 8 # GB
MOBILE_TDP_W = 5 # Watts sustained
# TinyML
MCU_BW_GBS = 0.05 # GB/s
MCU_MFLOPS = 1 # MFLOPS (Cortex-M7)
MCU_SRAM_KB = 256 # KB
MCU_TDP_MW = 100 # milliwatts
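As an example of deriving an instrument threshold from these constants rather than inventing one, the Roofline ridge point is peak compute divided by memory bandwidth. This sketch uses the standard Roofline definition; the function names are illustrative assumptions.

```python
H100_BW_GBS = 3350       # GB/s, from the constants above
H100_TFLOPS_FP16 = 1979  # TFLOPS, from the constants above


def ridge_point(peak_tflops: float, bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the compute and memory limits meet."""
    return (peak_tflops * 1e12) / (bw_gbs * 1e9)


def attainable_tflops(intensity: float, peak_tflops: float, bw_gbs: float) -> float:
    """Roofline ceiling: the lower of peak compute and bandwidth times intensity."""
    return min(peak_tflops, bw_gbs * 1e9 * intensity / 1e12)


print(f"{ridge_point(H100_TFLOPS_FP16, H100_BW_GBS):.1f} FLOP/byte")  # 590.7 FLOP/byte
print(attainable_tflops(10, H100_TFLOPS_FP16, H100_BW_GBS))           # 33.5
```

A kernel at 10 FLOP/byte is far left of the ridge, so it is capped by bandwidth at 33.5 TFLOPS regardless of the 1979 TFLOPS peak.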
Syntax Verification
Before returning your output, mentally verify:
- All f"""...""" strings with {variable} are proper f-strings (not """ without f)
- No markdown **text** inside mo.Html(...); use mo.callout(mo.md(...)) instead
- mo.stop(condition, fallback_ui): condition is True when you WANT to stop
- Every @app.cell function has return at the end (even if return returns nothing useful)
- All widget variables returned from their defining cell are used in dependent cells
Run mentally: python3 -c "import ast; ast.parse(open('your_file.py').read())" — should be clean.
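The same check can be written as a small function that reports the failing line instead of a raw traceback. This is a sketch using only the standard `ast` module; the name `syntax_errors` is hypothetical.

```python
import ast


def syntax_errors(source: str, filename: str = "<lab>") -> list[str]:
    """Parse source with ast and return readable error strings (empty if clean)."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return [f"{filename}:{exc.lineno}: {exc.msg}"]
    return []


print(syntax_errors("x = 1\n"))               # []
print(len(syntax_errors("def f(:\n    pass\n")))  # 1
```

`ast.parse` verifies syntax only. It does not catch undefined names, missing imports, or widget variables that are never returned, so the other checklist items still need a manual pass.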