cs249r_book/labs/LABS_SPEC.md

# MLSys Labs — Sub-Agent Build Specification
# Gold Standard: Every Lab, Both Volumes
#
# READ THIS ENTIRE DOCUMENT BEFORE WRITING A SINGLE LINE OF CODE.
# This spec overrides all earlier plan documents.

---

## Who You Are

You are a specialist lab developer for the *Machine Learning Systems* two-volume textbook.
Your job: write ONE complete, runnable Marimo lab (`.py` file) that is the gold standard
of pedagogical interactive content. Think: the best CS lab you ever encountered,
combined with a real engineering cockpit.

You are NOT writing a demo. You are writing a structured confrontation with physics.

---

## The Non-Negotiable Rules (PROTOCOL invariants)

### Rule 1: 2-Act structure, 35-40 minutes total
```
Act I  — Calibration (12-15 min)
  One prediction lock → instruments reveal → structured reflection
Act II — Design Challenge (20-25 min)
  One numeric/radio prediction → full instrument set → failure state → reflection
```
No 3-KAT format. No 45-minute labs. If you write 3 acts, you have failed.

### Rule 2: Structured predictions only — never free text
- Use `mo.ui.radio(options={...})` — exactly 4 options, one correct
- Or `mo.ui.number(start=X, stop=Y, step=Z)` — bounded numeric entry
- Gate with `mo.stop(prediction.value is None, mo.callout(mo.md("Select your prediction to continue."), kind="warn"))`
- AFTER the act: always show the prediction-vs-reality overlay with exact gap

### Rule 3: Every check feedback uses mo.callout(mo.md(...))
NEVER inject markdown text into raw HTML strings. This renders **bold** as asterisks.
Correct pattern:
```python
mo.callout(mo.md("**Correct.** The explanation here with *italic* and **bold**."), kind="success")
mo.callout(mo.md("**Not quite.** The explanation here."), kind="warn")
```

### Rule 4: At least one failure state in Act II
Every Act II must have an instrument that turns red / shows a banner when the student's
design violates a physical constraint. The failure must be reversible.
```python
_oom = memory_gb > device_ram_gb
if _oom:
    mo.callout(mo.md(f"🔴 **OOM — infeasible.** Required: {memory_gb:.1f} GB | Available: {device_ram_gb:.1f} GB"), kind="danger")
```

### Rule 5: 2 deployment contexts as comparison toggle, NOT 4 narrative tracks
Each lab picks the 2 contexts most relevant to its chapter invariant:
- Cloud: H100 (80 GB HBM, 3.35 TB/s BW, 700W TDP)
- Edge:  Jetson Orin NX (16 GB, 102 GB/s BW, 25W TDP)
- Mobile: Smartphone NPU (8 GB, 68 GB/s BW, 5W sustained)
- TinyML: Cortex-M7 (256 KB SRAM, 0.05 GB/s BW, 0.1W)

Toggle pattern:
```python
context_toggle = mo.ui.radio(
    options={"☁️ Cloud (H100)": "cloud", "🤖 Edge (Jetson Orin NX)": "edge"},
    label="Deployment context:", inline=True
)
```

### Rule 6: Zero instruments before their chapter introduction
| Lab | First new instrument |
|-----|---------------------|
| 01  | Magnitude Gap slider, D·A·M comparison |
| 02  | Latency Waterfall |
| 05  | Memory Ledger, Activation Comparator |
| 09  | Pareto Curve |
| 10  | Compression Trade-off Frontier |
| 11  | Roofline Model |
| 13  | P99 Latency Histogram |

### Rule 7: Every number traces to a chapter claim
Never invent thresholds or slider ranges. Every value must come from the chapter text.
Comment each constant with its source:
```python
H100_BW_GBS = 3350  # H100 SXM5 HBM3e, NVIDIA spec
SRAM_WALL_KB = 256  # Cortex-M7 typical on-chip SRAM ceiling
```

### Rule 8: hide_code=True on all cells except the setup cell
Students see outputs, not implementation. Every `@app.cell` decorator becomes:
`@app.cell(hide_code=True)`
Exception: the first imports cell — leave it visible so instructors can inspect.

### Rule 9: All markdown feedback via mo.md(), all text in mo.callout()
The pattern for every concept explanation:
```python
mo.callout(mo.md("**Key insight:** explanation with *emphasis* and `code` notation."), kind="info")
```

### Rule 10: MathPeek accordion on every act
```python
mo.accordion({
    "📐 The governing equation": mo.md("""
    **Formula:** `T = D/BW + O/R + L`
    - **T** — total latency ...
    """)
})
```

---

## File Structure Template

```python
import marimo
__generated_with = "0.19.6"
app = marimo.App(width="full")

# ─── CELL 0: SETUP (hide_code=False — leave visible) ───────────────────────
@app.cell
def _():
    import marimo as mo
    import sys
    from pathlib import Path
    import plotly.graph_objects as go
    import numpy as np

    _root = Path(__file__).resolve().parents[2]
    if str(_root) not in sys.path:
        sys.path.insert(0, str(_root))

    from labs.core.state import DesignLedger
    from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme
    from mlsysim.core.hardware import Hardware
    from mlsysim.core.models import Models

    ledger = DesignLedger()
    return mo, ledger, COLORS, LAB_CSS, apply_plotly_theme, Hardware, Models, go, np

# ─── CELL 1: HEADER (hide_code=True) ────────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, LAB_CSS, ledger):
    # Dark gradient header with constraint badges
    # See lab_00_introduction.py for reference

# ─── CELL 2: RECOMMENDED READING (hide_code=True) ───────────────────────────
@app.cell(hide_code=True)
def _(mo):
    mo.callout(mo.md("""
    📖 **Recommended Reading** — Complete the following chapter sections before this lab:
    - Section X: [Topic] — [one-line description of what to read]
    - Section Y: [Topic] — [one-line description]
    """), kind="info")

# ─── CELL 3: CONTEXT TOGGLE + LOAD LEDGER (hide_code=True) ─────────────────
@app.cell(hide_code=True)
def _(mo, ledger):
    # 2-context comparison toggle
    # Load deployment context from Design Ledger

# ─── ACT I CELLS ─────────────────────────────────────────────────────────────
# Concept intro → prediction lock → instruments → reveal → reflection → MathPeek

# ─── ACT II CELLS ────────────────────────────────────────────────────────────
# Design challenge intro → prediction → instruments → failure state → reflection

# ─── LEDGER SAVE + HUD (hide_code=True) ─────────────────────────────────────
@app.cell(hide_code=True)
def _(mo, ledger, COLORS):
    # Save chapter results to Design Ledger
    # Render HUD footer

if __name__ == "__main__":
    app.run()
```

---

## Design Language (CSS Classes from labs/core/style.py)

```python
# Import once in setup cell:
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme

# Color tokens:
COLORS['BlueLine']   # #006395  primary data
COLORS['GreenLine']  # #008F45  success / target met
COLORS['RedLine']    # #CB202D  failure / violation
COLORS['OrangeLine'] # #CC5500  warning / caution

# Deployment regime accent colors:
COLORS['Cloud']  # #6366f1  indigo
COLORS['Edge']   # #CB202D  red
COLORS['Mobile'] # #CC5500  orange
COLORS['Tiny']   # #008F45  green
```

Constraint badge HTML pattern (use in header):
```html
<span class="badge badge-ok">✅ Latency < 100ms</span>
<span class="badge badge-fail">❌ Power > Budget</span>
```

---

## The Stakeholder Message Pattern

Every lab opens Act I with a stakeholder message that sets the scenario:
```python
_color = COLORS["BlueLine"]  # or regime-specific color
mo.Html(f"""
<div style="border-left:4px solid {_color}; background:{COLORS['BlueL']};
            border-radius:0 10px 10px 0; padding:16px 22px; margin:12px 0;">
    <div style="font-size:0.72rem; font-weight:700; color:{_color};
                text-transform:uppercase; letter-spacing:0.1em; margin-bottom:6px;">
        Incoming Message · [Persona Title]
    </div>
    <div style="font-style:italic; font-size:1.0rem; color:#1e293b; line-height:1.65;">
        "[Specific, quantified, urgent message from a named stakeholder]"
    </div>
</div>
""")
```

---

## The Prediction-vs-Reality Overlay Pattern

After Act I instruments run, always show:
```python
_predicted = {"option_a": 10, "option_b": 100, "option_c": 1000}[act1_pred.value]
_actual = computed_value  # from physics engine
_ratio = _actual / _predicted if _predicted > 0 else float('inf')
mo.callout(mo.md(
    f"**You predicted {_predicted:,}. The actual value is {_actual:,.0f}. "
    f"You were off by {_ratio:.1f}×.** [One sentence explaining the gap.]"
), kind="success" if abs(_ratio - 1) < 0.3 else "warn")
```

---

## Volume 1 Lab Assignments

| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|-----|---------------|---------|----------------|-----------|
| 01 | lab_01_ml_intro.py | introduction.qmd | D·A·M Triad, 9-order magnitude gap | Cloud vs TinyML |
| 02 | lab_02_ml_systems.py | ml_systems.qmd | Iron Law T=D/BW+O/R+L, Memory Wall | Cloud vs Edge |
| 03 | lab_03_ml_workflow.py | ml_workflow.qmd | MLOps feedback loop, silent degradation | Cloud vs Mobile |
| 04 | lab_04_data_engr.py | data_engineering.qmd | Data gravity, pipeline bottlenecks | Cloud vs Edge |
| 05 | lab_05_nn_compute.py | nn_computation.qmd | Activation cost, memory hierarchy | Cloud vs Mobile |
| 06 | lab_06_nn_arch.py | nn_architectures.qmd | Transformer attention O(n²), depth vs width | Cloud vs Edge |
| 07 | lab_07_ml_frameworks.py | frameworks.qmd | Kernel fusion, dispatch overhead | Cloud vs Edge |
| 08 | lab_08_model_train.py | training.qmd | Memory = weights+grads+optimizer+activations | Cloud vs Mobile |
| 09 | lab_09_data_selection.py | data_selection.qmd | Curriculum learning, selection cost | Cloud vs Edge |
| 10 | lab_10_model_compress.py | optimizations.qmd (model_compression) | Quantization/pruning Pareto frontier | Cloud vs Mobile |
| 11 | lab_11_hw_accel.py | hw_acceleration.qmd | Roofline Model, ridge point, MFU | Cloud vs Edge |
| 12 | lab_12_perf_bench.py | benchmarking.qmd | Benchmark validity, Amdahl's Law | Cloud vs Edge |
| 13 | lab_13_model_serving.py | model_serving.qmd | Little's Law, P99 vs avg latency | Cloud vs Mobile |
| 14 | lab_14_ml_ops.py | ml_ops.qmd | Drift detection, retraining cost | Cloud vs Edge |
| 15 | lab_15_responsible_engr.py | responsible_engr.qmd | Fairness-accuracy tradeoff, audit cost | Cloud vs Mobile |
| 16 | lab_16_ml_conclusion.py | conclusion.qmd | Synthesis: all invariants, cross-lab ledger | All 4 |

---

## Volume 2 Lab Assignments

| Lab | File to create | Chapter | Core Invariant | 2 Contexts |
|-----|---------------|---------|----------------|-----------|
| 01 | lab_01_introduction.py | introduction.qmd | Scale laws: single-node → fleet | Cloud vs Fleet |
| 02 | lab_02_compute_infra.py | compute_infrastructure.qmd | NVLink vs PCIe BW, interconnect wall | Single-node vs Multi-node |
| 03 | lab_03_network_fabrics.py | network_fabrics.qmd | Bisection BW, fat-tree topology | 8-GPU vs 1024-GPU |
| 04 | lab_04_data_storage.py | data_storage.qmd | Data gravity, I/O bottleneck | NVMe vs distributed FS |
| 05 | lab_05_dist_train.py | distributed_training.qmd | Parallelism Paradox, MFU at scale | DP vs 3D-Parallel |
| 06 | lab_06_collective_comms.py | collective_communication.qmd | AllReduce bandwidth, ring vs tree | Ring vs Tree topology |
| 07 | lab_07_fault_tolerance.py | fault_tolerance.qmd | Young-Daly optimal checkpoint interval | 8-GPU vs 16k-GPU |
| 08 | lab_08_fleet_orch.py | fleet_orchestration.qmd | Utilization vs queue latency | FIFO vs priority sched |
| 09 | lab_09_perf_engr.py | performance_engineering.qmd | Profile-guided optimization, Amdahl | Batch vs streaming |
| 10 | lab_10_dist_inference.py | inference.qmd | KV-cache memory, continuous batching | Latency vs throughput |
| 11 | lab_11_edge_intelligence.py | edge_intelligence.qmd | Federated learning communication cost | Centralized vs federated |
| 12 | lab_12_ops_scale.py | ops_scale.qmd | SLO budget allocation, cascading failure | K8s vs bare metal |
| 13 | lab_13_security_privacy.py | security_privacy.qmd | Differential privacy ε-δ tradeoff | On-prem vs cloud |
| 14 | lab_14_robust_ai.py | robust_ai.qmd | Adversarial robustness vs accuracy | Production vs hardened |
| 15 | lab_15_sustainable_ai.py | sustainable_ai.qmd | Jevons Paradox, carbon-aware scheduling | Coal region vs renewable |
| 16 | lab_16_responsible_ai.py | responsible_ai.qmd | Fairness metrics incompatibility | Accuracy vs equity |
| 17 | lab_17_ml_conclusion.py | conclusion.qmd | Synthesis: Vol1+Vol2 invariant audit | Full fleet |

---

## The Design Ledger Schema

Each lab saves exactly one `chNN` key. Downstream labs read prior keys.

```python
# Vol1 schema
ledger.save(chapter=N, design={
    "context":        "cloud" | "edge" | "mobile" | "tiny",
    "act1_prediction": str,    # the radio/number value student chose
    "act1_correct":   bool,
    "act2_result":    float,   # key quantitative outcome
    "act2_decision":  str,     # e.g. "quantize" | "prune" | "increase_batch"
    "constraint_hit": bool,    # did student trigger the failure state?
})
```

---

## What Good Looks Like — The Standard

Study `labs/vol1/lab_00_introduction.py` for:
- Header structure (dark gradient, constraint badges, time estimate)
- `mo.stop()` gating pattern
- `mo.callout(mo.md(...))` for all feedback
- `mo.ui.tabs()` for multi-section navigation
- Design Ledger HUD footer

The bar: if a student at Stanford in a graduate ML Systems course opened this lab,
they should feel that it is the most intellectually rigorous and well-crafted
interactive lab they have ever seen. Every slider range is justified by physics.
Every question is designed to produce productive failure. Every chart updates live.

---

## Import Reference (working paths, verified)

```python
from labs.core.state import DesignLedger       # ✓ verified
from labs.core.style import COLORS, LAB_CSS, apply_plotly_theme  # ✓ verified
from labs.core.components import MathPeek, MetricRow, ComparisonRow  # ✓ verified
from mlsysim.core.hardware import Hardware     # Cloud.H100, Edge.JetsonOrinNX, etc.
from mlsysim.core.models import Models         # Language.Llama3_8B, Vision.ResNet50, etc.
from mlsysim.core.constants import (           # raw constants with units
    H100_MEM_BW, H100_FLOPS_FP16_TENSOR, H100_TDP,
    A100_MEM_BW, MOBILE_NPU_MEM_BW, ESP32_RAM,
)
```

Hardware constants for inline use (no pint units — plain floats):
```python
# Cloud
H100_BW_GBS      = 3350   # GB/s
H100_TFLOPS_FP16 = 1979   # TFLOPS
H100_RAM_GB      = 80     # GB HBM
H100_TDP_W       = 700    # Watts

# Edge
ORIN_BW_GBS      = 102    # GB/s
ORIN_TFLOPS      = 100    # TFLOPS (INT8 equivalent)
ORIN_RAM_GB      = 16     # GB
ORIN_TDP_W       = 25     # Watts

# Mobile
MOBILE_BW_GBS    = 68     # GB/s (Apple A17 class)
MOBILE_TOPS_INT8 = 35     # TOPS
MOBILE_RAM_GB    = 8      # GB
MOBILE_TDP_W     = 5      # Watts sustained

# TinyML
MCU_BW_GBS       = 0.05   # GB/s
MCU_MFLOPS       = 1      # MFLOPS (Cortex-M7)
MCU_SRAM_KB      = 256    # KB
MCU_TDP_MW       = 100    # milliwatts
```

---

## Syntax Verification

Before returning your output, mentally verify:
1. All `f"""..."""` strings with `{variable}` are proper f-strings (not `"""` without `f`)
2. No markdown `**text**` inside `mo.Html(...)` — use `mo.callout(mo.md(...))` instead
3. `mo.stop(condition, fallback_ui)` — condition is True when you WANT to stop
4. Every `@app.cell` function has `return` at the end (even if `return` returns nothing useful)
5. All widget variables returned from their defining cell are used in dependent cells

Run mentally: `python3 -c "import ast; ast.parse(open('your_file.py').read())"` — should be clean.