cs249r_book/labs/LABS_SPEC.md
Vijay Janapa Reddi 6f5732558f feat: add complete first-draft labs for both volumes (33 Marimo labs)
Add all Vol1 (labs 01-16) and Vol2 (labs 01-17) interactive Marimo labs
as the first full first-pass implementation of the ML Systems curriculum labs.

Each lab follows the PROTOCOL 2-Act structure (35-40 min):
- Act I: Calibration with prediction lock → instruments → overlay
- Act II: Design challenge with failure states and reflection

Key pedagogical instruments introduced progressively:
- Vol1: D·A·M Triad, Iron Law, Memory Ledger, Roofline, Amdahl's Law,
  Little's Law, P99 Histogram, Compression Frontier, Chouldechova theorem
- Vol2: NVLink vs PCIe cliff, Bisection BW, Young-Daly T*, Parallelism Paradox,
  AllReduce ring vs tree, KV-cache model, Jevons Paradox, DP ε-δ tradeoff,
  SLO composition, Adversarial Pareto, two-volume synthesis capstone

All 35 staged files pass AST syntax verification (36/36 including lab_00).

Also includes:
- labs/LABS_SPEC.md: authoritative sub-agent brief for all lab conventions
- labs/core/style.py: expanded unified design system with semantic color tokens
2026-03-01 19:59:04 -05:00


MLSys Labs — Sub-Agent Build Specification

Gold Standard: Every Lab, Both Volumes

READ THIS ENTIRE DOCUMENT BEFORE WRITING A SINGLE LINE OF CODE.

This spec overrides all earlier plan documents.


Who You Are

You are a specialist lab developer for the Machine Learning Systems two-volume textbook. Your job: write ONE complete, runnable Marimo lab (.py file) that is the gold standard of pedagogical interactive content. Think: the best CS lab you ever encountered, combined with a real engineering cockpit.

You are NOT writing a demo. You are writing a structured confrontation with physics.


The Non-Negotiable Rules (PROTOCOL invariants)

Rule 1: 2-Act structure, 35-40 minutes total

Act I  — Calibration (12-15 min)
  One prediction lock → instruments reveal → structured reflection
Act II — Design Challenge (20-25 min)
  One numeric/radio prediction → full instrument set → failure state → reflection

No 3-KAT format. No 45-minute labs. If you write 3 acts, you have failed.

Rule 2: Structured predictions only — never free text

  • Use mo.ui.radio(options={...}) — exactly 4 options, one correct
  • Or mo.ui.number(start=X, stop=Y, step=Z) — bounded numeric entry
  • Gate with mo.stop(prediction.value is None, mo.callout(mo.md("Select your prediction to continue."), kind="warn"))
  • AFTER the act: always show the prediction-vs-reality overlay with exact gap
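The option constraints above can be checked mechanically before a lab ships. The following is a minimal sketch in plain Python. The helper name `validate_prediction_options` is hypothetical and not part of `labs.core`; it only assumes the `{label: value}` dictionary shape used in the radio examples of this spec.

```python
def validate_prediction_options(options: dict[str, str], correct_value: str) -> None:
    """Raise ValueError unless `options` satisfies Rule 2.

    `options` maps the displayed label to the stored value.
    Hypothetical helper, for illustration only.
    """
    if len(options) != 4:
        raise ValueError(f"Rule 2 requires exactly 4 options, got {len(options)}")
    values = list(options.values())
    if len(set(values)) != len(values):
        raise ValueError("option values must be distinct")
    if values.count(correct_value) != 1:
        raise ValueError("exactly one option must carry the correct value")


# Placeholder labels and values, not taken from any chapter.
example_options = {
    "A) ~10x": "10x",
    "B) ~100x": "100x",
    "C) ~1,000x": "1000x",
    "D) ~10,000x": "10000x",
}
validate_prediction_options(example_options, correct_value="1000x")
```

Running a check like this in a test suite catches a three-option or duplicate-value radio before a student ever sees it.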

Rule 3: Every check feedback uses mo.callout(mo.md(...))

NEVER inject markdown text into raw HTML strings. This renders bold as asterisks. Correct pattern:

mo.callout(mo.md("**Correct.** The explanation here with *italic* and **bold**."), kind="success")
mo.callout(mo.md("**Not quite.** The explanation here."), kind="warn")

Rule 4: At least one failure state in Act II

Every Act II must have an instrument that turns red / shows a banner when the student's design violates a physical constraint. The failure must be reversible.

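_oom = memory_gb > device_ram_gb
# A callout renders only as the cell's final expression, so build it first.
_banner = mo.callout(mo.md(f"🔴 **OOM — infeasible.** Required: {memory_gb:.1f} GB | Available: {device_ram_gb:.1f} GB"), kind="danger") if _oom else None
_banner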
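Reversibility follows naturally when the failure state is a pure function of the current inputs. A minimal sketch, with no Marimo dependency, assuming a hypothetical helper `memory_failure_state` that returns the callout kind and message for the cell to render:

```python
def memory_failure_state(memory_gb: float, device_ram_gb: float) -> tuple[str, str]:
    """Return (callout kind, message) for a memory budget check.

    Pure function of the current inputs, so the state is reversible:
    lowering `memory_gb` back under the limit clears the failure.
    """
    if memory_gb > device_ram_gb:
        return (
            "danger",
            f"OOM, infeasible. Required: {memory_gb:.1f} GB | "
            f"Available: {device_ram_gb:.1f} GB",
        )
    headroom_gb = device_ram_gb - memory_gb
    return ("success", f"Fits. Headroom: {headroom_gb:.1f} GB")


H100_RAM_GB = 80  # GB HBM, from the hardware constants in this spec

print(memory_failure_state(96.0, H100_RAM_GB))  # over budget
print(memory_failure_state(64.0, H100_RAM_GB))  # back under budget
```

Because nothing is stored between calls, moving a slider back below the limit removes the banner without any reset logic.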

Rule 5: 2 deployment contexts as comparison toggle, NOT 4 narrative tracks

Each lab picks the 2 contexts most relevant to its chapter invariant:

  • Cloud: H100 (80 GB HBM, 3.35 TB/s BW, 700W TDP)
  • Edge: Jetson Orin NX (16 GB, 102 GB/s BW, 25W TDP)
  • Mobile: Smartphone NPU (8 GB, 68 GB/s BW, 5W sustained)
  • TinyML: Cortex-M7 (256 KB SRAM, 0.05 GB/s BW, 0.1W)

Toggle pattern:

context_toggle = mo.ui.radio(
    options={"☁️ Cloud (H100)": "cloud", "🤖 Edge (Jetson Orin NX)": "edge"},
    label="Deployment context:", inline=True
)
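The toggle value can index a small lookup table so that every instrument reads the same context parameters. A sketch under stated assumptions: the `DeploymentContext` dataclass and `compare` helper are hypothetical names, and the numbers are copied from the Rule 5 list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentContext:
    name: str
    ram_gb: float   # device memory, decimal GB
    bw_gbs: float   # memory bandwidth, GB/s
    tdp_w: float    # power budget, watts


# Values copied from the Rule 5 list in this spec.
CONTEXTS = {
    "cloud": DeploymentContext("H100", 80, 3350, 700),
    "edge": DeploymentContext("Jetson Orin NX", 16, 102, 25),
    "mobile": DeploymentContext("Smartphone NPU", 8, 68, 5),
    "tiny": DeploymentContext("Cortex-M7", 256e-6, 0.05, 0.1),  # 256 KB SRAM
}


def compare(a_key: str, b_key: str) -> dict[str, float]:
    """Ratios of context `a` over context `b` for each resource."""
    a, b = CONTEXTS[a_key], CONTEXTS[b_key]
    return {
        "ram_ratio": a.ram_gb / b.ram_gb,
        "bw_ratio": a.bw_gbs / b.bw_gbs,
        "tdp_ratio": a.tdp_w / b.tdp_w,
    }


ratios = compare("cloud", "edge")
print({k: round(v, 2) for k, v in ratios.items()})
```

In a lab, `context_toggle.value` would supply one of the keys; the comparison ratios are what the side-by-side view would surface.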

Rule 6: Zero instruments before their chapter introduction

| Lab | First new instrument |
|-----|----------------------|
| 01 | Magnitude Gap slider, D·A·M comparison |
| 02 | Latency Waterfall |
| 05 | Memory Ledger, Activation Comparator |
| 09 | Pareto Curve |
| 10 | Compression Trade-off Frontier |
| 11 | Roofline Model |
| 13 | P99 Latency Histogram |
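The introduction schedule above lends itself to an automated check. A minimal sketch, assuming a hypothetical `premature_instruments` helper and assuming that instruments absent from the table are unrestricted:

```python
# Lab number at which each instrument first appears (from the Rule 6 table).
FIRST_INTRODUCED = {
    "Magnitude Gap slider": 1,
    "D·A·M comparison": 1,
    "Latency Waterfall": 2,
    "Memory Ledger": 5,
    "Activation Comparator": 5,
    "Pareto Curve": 9,
    "Compression Trade-off Frontier": 10,
    "Roofline Model": 11,
    "P99 Latency Histogram": 13,
}


def premature_instruments(lab_number: int, instruments_used: list[str]) -> list[str]:
    """Return the instruments used before the lab that introduces them.

    Instruments missing from FIRST_INTRODUCED are treated as unrestricted,
    which is an assumption of this sketch.
    """
    return sorted(
        name
        for name in instruments_used
        if FIRST_INTRODUCED.get(name, 0) > lab_number
    )


print(premature_instruments(5, ["Memory Ledger", "Roofline Model"]))
```

An empty list means the lab respects Rule 6; anything else names the instruments to remove or defer.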

Rule 7: Every number traces to a chapter claim

Never invent thresholds or slider ranges. Every value must come from the chapter text. Comment each constant with its source:

H100_BW_GBS = 3350  # H100 SXM5 HBM3, NVIDIA spec
SRAM_WALL_KB = 256  # Cortex-M7 typical on-chip SRAM ceiling

Rule 8: hide_code=True on all cells except the setup cell

Students see outputs, not implementation. Every @app.cell decorator becomes @app.cell(hide_code=True). Exception: the first imports cell stays visible so instructors can inspect it.

Rule 9: All markdown feedback via mo.md(), all text in mo.callout()

The pattern for every concept explanation:

mo.callout(mo.md("**Key insight:** explanation with *emphasis* and `code` notation."), kind="info")

Rule 10: MathPeek accordion on every act

mo.accordion({
    "📐 The governing equation": mo.md("""
    **Formula:** `T = D/BW + O/R + L`
    - **T** — total latency ...
    """)
})
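A worked instance of the governing equation helps calibrate what the accordion should convey. The sketch below assumes the usual reading of the symbols (D: bytes moved, BW: memory bandwidth, O: operations, R: peak compute rate, L: fixed overhead); only T is defined in this spec, so the other meanings are assumptions. The workload (one decode step of an 8B-parameter FP16 model) is illustrative and not a chapter value.

```python
def iron_law_latency_s(data_bytes, bw_bytes_per_s, ops, rate_ops_per_s, overhead_s=0.0):
    """T = D/BW + O/R + L, all terms in seconds."""
    return data_bytes / bw_bytes_per_s + ops / rate_ops_per_s + overhead_s


H100_BW_GBS = 3350       # GB/s, from the hardware constants in this spec
H100_TFLOPS_FP16 = 1979  # TFLOPS, from the hardware constants in this spec

# Illustrative workload (assumption): read every weight once per token.
params = 8e9
data_bytes = params * 2  # 2 bytes per FP16 weight
ops = params * 2         # roughly 2 FLOPs per parameter per token

t_mem = data_bytes / (H100_BW_GBS * 1e9)
t_compute = ops / (H100_TFLOPS_FP16 * 1e12)
t_total = iron_law_latency_s(
    data_bytes, H100_BW_GBS * 1e9, ops, H100_TFLOPS_FP16 * 1e12
)

print(f"memory term:  {t_mem * 1e3:.3f} ms")
print(f"compute term: {t_compute * 1e3:.4f} ms")
print(f"total:        {t_total * 1e3:.3f} ms")
```

Under these assumptions the D/BW term dominates by more than two orders of magnitude, which is the kind of gap a prediction lock is meant to expose.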

File Structure Template

import marimo
__generated_with = "0.19.6"
app = marimo.App(width="full")

# ─── CELL 0: SETUP (hide_code=False — leave visible) ───────────────────────
@app.cell
def _():
    import marimo as mo
    import sys
    from pathlib import Path
    import plotly.graph_objects as go
    import numpy as np

    _root = Path(__file__).resolve().parents[2]
    if str(_root) not in sys.path:
        sys.path.insert(0, str(_root))

    from labs.core.state import DesignLedger
    from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
    from mlsysim.core.hardware import Hardware
    from mlsysim.core.models import Models

    ledger = DesignLedger()
    return mo, ledger, COLORS, LAB_CSS, apply_plotly_theme, Hardware, Models, go, np

# ─── CELL 1: HEADER (hide_code=True) ────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, LAB_CSS, ledger):
    # Dark gradient header with constraint badges
    # See lab_00_introduction.py for reference
    return

# ─── CELL 2: RECOMMENDED READING (hide_code=True) ───────────────────────────
@app.cell(hide_code=True)
def _(mo):
    mo.callout(mo.md("""
    📖 **Recommended Reading** — Complete the following chapter sections before this lab:
    - Section X: [Topic] — [one-line description of what to read]
    - Section Y: [Topic] — [one-line description]
    """), kind="info")
    return

# ─── CELL 3: CONTEXT TOGGLE + LOAD LEDGER (hide_code=True) ─────────────────
@app.cell(hide_code=True)
def _(mo, ledger):
    # 2-context comparison toggle
    # Load deployment context from Design Ledger
    return

# ─── ACT I CELLS ─────────────────────────────────────────────────────────────
# Concept intro → prediction lock → instruments → reveal → reflection → MathPeek

# ─── ACT II CELLS ────────────────────────────────────────────────────────────
# Design challenge intro → prediction → instruments → failure state → reflection

# ─── LEDGER SAVE + HUD (hide_code=True) ─────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, ledger, COLORS):
    # Save chapter results to Design Ledger
    # Render HUD footer
    return

if __name__ == "__main__":
    app.run()

Design Language (CSS Classes from labs/core/style.py)

# Import once in setup cell:
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme

# Color tokens:
COLORS['BlueLine']   # #006395  primary data
COLORS['GreenLine']  # #008F45  success / target met
COLORS['RedLine']    # #CB202D  failure / violation
COLORS['OrangeLine'] # #CC5500  warning / caution

# Deployment regime accent colors:
COLORS['Cloud']  # #6366f1  indigo
COLORS['Edge']   # #CB202D  red
COLORS['Mobile'] # #CC5500  orange
COLORS['Tiny']   # #008F45  green

Constraint badge HTML pattern (use in header):

<span class="badge badge-ok">✅ Latency &lt; 100ms</span>
<span class="badge badge-fail">❌ Power &gt; Budget</span>

The Stakeholder Message Pattern

Every lab opens Act I with a stakeholder message that sets the scenario:

_color = COLORS["BlueLine"]  # or regime-specific color
mo.Html(f"""
<div style="border-left:4px solid {_color}; background:{COLORS['BlueL']};
            border-radius:0 10px 10px 0; padding:16px 22px; margin:12px 0;">
    <div style="font-size:0.72rem; font-weight:700; color:{_color};
                text-transform:uppercase; letter-spacing:0.1em; margin-bottom:6px;">
        Incoming Message · [Persona Title]
    </div>
    <div style="font-style:italic; font-size:1.0rem; color:#1e293b; line-height:1.65;">
        "[Specific, quantified, urgent message from a named stakeholder]"
    </div>
</div>
""")

The Prediction-vs-Reality Overlay Pattern

After Act I instruments run, always show:

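# Rule 2 requires four radio options, so the lookup needs four entries (placeholder values).
_predicted = {"option_a": 10, "option_b": 100, "option_c": 1000, "option_d": 10000}[act1_pred.value]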
_actual = computed_value  # from physics engine
_ratio = _actual / _predicted if _predicted > 0 else float('inf')
mo.callout(mo.md(
    f"**You predicted {_predicted:,}. The actual value is {_actual:,.0f}. "
    f"You were off by {_ratio:.1f}×.** [One sentence explaining the gap.]"
), kind="success" if abs(_ratio - 1) < 0.3 else "warn")
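One caveat with a raw actual/predicted ratio is that an over-estimate reads as "off by 0.1×". A hedged alternative is a symmetric miss factor that is always at least 1. The helper names `miss_factor` and `overlay_kind` are hypothetical, and the 1.3 tolerance is an assumption that mirrors the 0.3 band in the pattern above.

```python
def miss_factor(predicted: float, actual: float) -> float:
    """Symmetric ratio >= 1: how many times off the prediction was."""
    if predicted <= 0 or actual <= 0:
        return float("inf")
    return max(predicted / actual, actual / predicted)


def overlay_kind(predicted: float, actual: float, tolerance: float = 1.3) -> str:
    """Callout kind for the overlay; `tolerance` is an assumed threshold."""
    return "success" if miss_factor(predicted, actual) < tolerance else "warn"


print(miss_factor(10, 1000), overlay_kind(10, 1000))    # under-estimate
print(miss_factor(1000, 10), overlay_kind(1000, 10))    # over-estimate, same factor
print(overlay_kind(100, 110))                           # within tolerance
```

With this form the overlay sentence reads the same way whether the student guessed high or low.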

Volume 1 Lab Assignments

| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|-----|----------------|---------|----------------|------------|
| 01 | lab_01_ml_intro.py | introduction.qmd | D·A·M Triad, 9-order magnitude gap | Cloud vs TinyML |
| 02 | lab_02_ml_systems.py | ml_systems.qmd | Iron Law T=D/BW+O/R+L, Memory Wall | Cloud vs Edge |
| 03 | lab_03_ml_workflow.py | ml_workflow.qmd | MLOps feedback loop, silent degradation | Cloud vs Mobile |
| 04 | lab_04_data_engr.py | data_engineering.qmd | Data gravity, pipeline bottlenecks | Cloud vs Edge |
| 05 | lab_05_nn_compute.py | nn_computation.qmd | Activation cost, memory hierarchy | Cloud vs Mobile |
| 06 | lab_06_nn_arch.py | nn_architectures.qmd | Transformer attention O(n²), depth vs width | Cloud vs Edge |
| 07 | lab_07_ml_frameworks.py | frameworks.qmd | Kernel fusion, dispatch overhead | Cloud vs Edge |
| 08 | lab_08_model_train.py | training.qmd | Memory = weights+grads+optimizer+activations | Cloud vs Mobile |
| 09 | lab_09_data_selection.py | data_selection.qmd | Curriculum learning, selection cost | Cloud vs Edge |
| 10 | lab_10_model_compress.py | optimizations.qmd (model_compression) | Quantization/pruning Pareto frontier | Cloud vs Mobile |
| 11 | lab_11_hw_accel.py | hw_acceleration.qmd | Roofline Model, ridge point, MFU | Cloud vs Edge |
| 12 | lab_12_perf_bench.py | benchmarking.qmd | Benchmark validity, Amdahl's Law | Cloud vs Edge |
| 13 | lab_13_model_serving.py | model_serving.qmd | Little's Law, P99 vs avg latency | Cloud vs Mobile |
| 14 | lab_14_ml_ops.py | ml_ops.qmd | Drift detection, retraining cost | Cloud vs Edge |
| 15 | lab_15_responsible_engr.py | responsible_engr.qmd | Fairness-accuracy tradeoff, audit cost | Cloud vs Mobile |
| 16 | lab_16_ml_conclusion.py | conclusion.qmd | Synthesis: all invariants, cross-lab ledger | All 4 |

Volume 2 Lab Assignments

| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|-----|----------------|---------|----------------|------------|
| 01 | lab_01_introduction.py | introduction.qmd | Scale laws: single-node → fleet | Cloud vs Fleet |
| 02 | lab_02_compute_infra.py | compute_infrastructure.qmd | NVLink vs PCIe BW, interconnect wall | Single-node vs Multi-node |
| 03 | lab_03_network_fabrics.py | network_fabrics.qmd | Bisection BW, fat-tree topology | 8-GPU vs 1024-GPU |
| 04 | lab_04_data_storage.py | data_storage.qmd | Data gravity, I/O bottleneck | NVMe vs distributed FS |
| 05 | lab_05_dist_train.py | distributed_training.qmd | Parallelism Paradox, MFU at scale | DP vs 3D-Parallel |
| 06 | lab_06_collective_comms.py | collective_communication.qmd | AllReduce bandwidth, ring vs tree | Ring vs Tree topology |
| 07 | lab_07_fault_tolerance.py | fault_tolerance.qmd | Young-Daly optimal checkpoint interval | 8-GPU vs 16k-GPU |
| 08 | lab_08_fleet_orch.py | fleet_orchestration.qmd | Utilization vs queue latency | FIFO vs priority sched |
| 09 | lab_09_perf_engr.py | performance_engineering.qmd | Profile-guided optimization, Amdahl | Batch vs streaming |
| 10 | lab_10_dist_inference.py | inference.qmd | KV-cache memory, continuous batching | Latency vs throughput |
| 11 | lab_11_edge_intelligence.py | edge_intelligence.qmd | Federated learning communication cost | Centralized vs federated |
| 12 | lab_12_ops_scale.py | ops_scale.qmd | SLO budget allocation, cascading failure | K8s vs bare metal |
| 13 | lab_13_security_privacy.py | security_privacy.qmd | Differential privacy ε-δ tradeoff | On-prem vs cloud |
| 14 | lab_14_robust_ai.py | robust_ai.qmd | Adversarial robustness vs accuracy | Production vs hardened |
| 15 | lab_15_sustainable_ai.py | sustainable_ai.qmd | Jevons Paradox, carbon-aware scheduling | Coal region vs renewable |
| 16 | lab_16_responsible_ai.py | responsible_ai.qmd | Fairness metrics incompatibility | Accuracy vs equity |
| 17 | lab_17_ml_conclusion.py | conclusion.qmd | Synthesis: Vol1+Vol2 invariant audit | Full fleet |

The Design Ledger Schema

Each lab saves exactly one chNN key. Downstream labs read prior keys.

# Vol1 schema
ledger.save(chapter=N, design={
    "context":        "cloud" | "edge" | "mobile" | "tiny",
    "act1_prediction": str,    # the radio/number value student chose
    "act1_correct":   bool,
    "act2_result":    float,   # key quantitative outcome
    "act2_decision":  str,     # e.g. "quantize" | "prune" | "increase_batch"
    "constraint_hit": bool,    # did student trigger the failure state?
})
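The schema above is written as notation rather than executable Python. One way to make it checkable is a `TypedDict` plus a small runtime validator. This is a sketch only: `LabDesign` and `validate_design` are hypothetical names, not part of `labs.core.state`, and the example values are placeholders.

```python
from typing import Literal, TypedDict, get_args

Context = Literal["cloud", "edge", "mobile", "tiny"]


class LabDesign(TypedDict):
    context: Context
    act1_prediction: str   # the radio/number value the student chose
    act1_correct: bool
    act2_result: float     # key quantitative outcome
    act2_decision: str     # e.g. "quantize", "prune", "increase_batch"
    constraint_hit: bool   # did the student trigger the failure state?


def validate_design(design: dict) -> dict:
    """Check one ledger entry against the schema; return it unchanged if valid."""
    required = set(LabDesign.__required_keys__)
    if set(design) != required:
        raise ValueError(f"keys must be exactly {sorted(required)}")
    if design["context"] not in get_args(Context):
        raise ValueError(f"unknown context: {design['context']!r}")
    for key in ("act1_prediction", "act2_decision"):
        if not isinstance(design[key], str):
            raise TypeError(f"{key} must be str")
    for key in ("act1_correct", "constraint_hit"):
        if not isinstance(design[key], bool):
            raise TypeError(f"{key} must be bool")
    result = design["act2_result"]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(result, bool) or not isinstance(result, (int, float)):
        raise TypeError("act2_result must be a number")
    return design


example = validate_design({
    "context": "edge",
    "act1_prediction": "option_b",
    "act1_correct": False,
    "act2_result": 12.5,
    "act2_decision": "quantize",
    "constraint_hit": True,
})
```

Validating before `ledger.save(...)` would keep downstream labs from reading a malformed entry.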

What Good Looks Like — The Standard

Study labs/vol1/lab_00_introduction.py for:

  • Header structure (dark gradient, constraint badges, time estimate)
  • mo.stop() gating pattern
  • mo.callout(mo.md(...)) for all feedback
  • mo.ui.tabs() for multi-section navigation
  • Design Ledger HUD footer

The bar: if a student at Stanford in a graduate ML Systems course opened this lab, they should feel that it is the most intellectually rigorous and well-crafted interactive lab they have ever seen. Every slider range is justified by physics. Every question is designed to produce productive failure. Every chart updates live.


Import Reference (working paths, verified)

from labs.core.state import DesignLedger       # ✓ verified
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme  # ✓ verified
from labs.core.components import MathPeek, MetricRow, ComparisonRow  # ✓ verified
from mlsysim.core.hardware import Hardware     # Cloud.H100, Edge.JetsonOrinNX, etc.
from mlsysim.core.models import Models         # Language.Llama3_8B, Vision.ResNet50, etc.
from mlsysim.core.constants import (           # raw constants with units
    H100_MEM_BW, H100_FLOPS_FP16_TENSOR, H100_TDP,
    A100_MEM_BW, MOBILE_NPU_MEM_BW, ESP32_RAM,
)

Hardware constants for inline use (no pint units — plain floats):

# Cloud
H100_BW_GBS      = 3350   # GB/s
H100_TFLOPS_FP16 = 1979   # TFLOPS
H100_RAM_GB      = 80     # GB HBM
H100_TDP_W       = 700    # Watts

# Edge
ORIN_BW_GBS      = 102    # GB/s
ORIN_TFLOPS      = 100    # TOPS (INT8); integer ops, not FP16 TFLOPS, despite the name
ORIN_RAM_GB      = 16     # GB
ORIN_TDP_W       = 25     # Watts

# Mobile
MOBILE_BW_GBS    = 68     # GB/s (Apple A17 class)
MOBILE_TOPS_INT8 = 35     # TOPS
MOBILE_RAM_GB    = 8      # GB
MOBILE_TDP_W     = 5      # Watts sustained

# TinyML
MCU_BW_GBS       = 0.05   # GB/s
MCU_MFLOPS       = 1      # MFLOPS (Cortex-M7)
MCU_SRAM_KB      = 256    # KB
MCU_TDP_MW       = 100    # milliwatts
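These plain-float constants are enough to derive a roofline ridge point, the arithmetic intensity at which a device shifts from bandwidth-bound to compute-bound. A minimal sketch; the function names are hypothetical, and note that the Orin figure is an INT8 number, so the two ridge points are not in identical units.

```python
H100_BW_GBS = 3350       # GB/s
H100_TFLOPS_FP16 = 1979  # TFLOPS
ORIN_BW_GBS = 102        # GB/s
ORIN_TFLOPS = 100        # INT8 tera-ops per second


def ridge_point(peak_tera_ops: float, bw_gbs: float) -> float:
    """Ops per byte at which compute and bandwidth limits meet."""
    return (peak_tera_ops * 1e12) / (bw_gbs * 1e9)


def attainable_tera_ops(intensity: float, peak_tera_ops: float, bw_gbs: float) -> float:
    """Roofline: min(peak, intensity * bandwidth), in tera-ops per second."""
    return min(peak_tera_ops, intensity * bw_gbs * 1e9 / 1e12)


h100_ridge = ridge_point(H100_TFLOPS_FP16, H100_BW_GBS)
orin_ridge = ridge_point(ORIN_TFLOPS, ORIN_BW_GBS)
print(f"H100 ridge: {h100_ridge:.1f} ops/byte")
print(f"Orin ridge: {orin_ridge:.1f} ops/byte")
print(f"H100 at 1 op/byte: {attainable_tera_ops(1.0, H100_TFLOPS_FP16, H100_BW_GBS):.2f} tera-ops/s")
```

A kernel whose intensity sits below the ridge point cannot reach peak throughput on that device, whatever the compute budget.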

Syntax Verification

Before returning your output, mentally verify:

  1. All f"""...""" strings with {variable} are proper f-strings (not """ without f)
  2. No markdown **text** inside mo.Html(...) — use mo.callout(mo.md(...)) instead
  3. mo.stop(condition, fallback_ui) — condition is True when you WANT to stop
  4. Every @app.cell function ends with a return statement (a bare return is fine when the cell exports nothing)
  5. All widget variables returned from their defining cell are used in dependent cells

Run mentally: python3 -c "import ast; ast.parse(open('your_file.py').read())" — should be clean.
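Parts of this checklist can be run rather than imagined. The sketch below uses only the standard `ast` module to cover the parse check and item 4 (every `@app.cell` function ends with `return`). `check_lab_source` is a hypothetical helper, and it assumes cells are decorated as `@app.cell` or `@app.cell(...)` as in the template.

```python
import ast


def _is_app_cell(decorator: ast.expr) -> bool:
    """True for `@app.cell` and `@app.cell(...)` decorators."""
    target = decorator.func if isinstance(decorator, ast.Call) else decorator
    return (
        isinstance(target, ast.Attribute)
        and target.attr == "cell"
        and isinstance(target.value, ast.Name)
        and target.value.id == "app"
    )


def check_lab_source(source: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    problems = []
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        if not any(_is_app_cell(d) for d in node.decorator_list):
            continue
        if not isinstance(node.body[-1], ast.Return):
            problems.append(f"cell at line {node.lineno} does not end with return")
    return problems


GOOD = "@app.cell(hide_code=True)\ndef _(mo):\n    x = 1\n    return (x,)\n"
MISSING_RETURN = "@app.cell\ndef _():\n    x = 1\n"
COMMENT_ONLY = "@app.cell\ndef _(mo):\n    # only a comment\n"

for label, src in [("good", GOOD), ("missing", MISSING_RETURN), ("comment", COMMENT_ONLY)]:
    print(label, check_lab_source(src))
```

The third case shows why a cell body made only of comments fails to parse; adding a bare `return` is enough to fix it.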