cs249r_book/labs/plans/vol2/lab_17_ml_conclusion.md
Vijay Janapa Reddi 533cfa6e99 fix: pre-commit hooks — all 48 checks now pass
- book/quarto/mlsys/__init__.py: add repo-root sys.path injection so
  mlsysim is importable when scripts run from book/quarto/ context
- book/quarto/mlsys/{constants,formulas,formatting,hardware}.py: new
  compatibility shims that re-export from mlsysim.core.* and mlsysim.fmt
- mlsysim/viz/__init__.py: remove try/except for dashboard import; use
  explicit "import from mlsysim.viz.dashboard" pattern instead
- .codespell-ignore-words.txt: add "covert" (legitimate security term)
- book/tools/scripts/reference_check_log.txt: delete generated artifact
- Various QMD, bib, md files: auto-formatted by pre-commit hooks
  (trailing whitespace, bibtex-tidy, pipe table alignment)
2026-03-01 17:30:24 -05:00

📐 Mission Plan: 17_ml_conclusion (Volume 2: Fleet Scale)

1. Chapter Context

  • Chapter Title: ML Conclusion: The Fleet Architect's Synthesis.
  • Core Invariant: Fleet Synthesis (The C³ Convergence: Compute, Communication, Coordination) and the Compound Capability Law.
  • The Struggle: Synthesizing all fleet-scale principles to solve a final, global engineering crisis. Understanding that individual model scaling is saturated, and the future belongs to Compound AI Systems—orchestrating meshes of models, data, and machines to achieve superhuman reliability.
  • Target Duration: 45 Minutes.

2. The 4-Track Storyboard (Fleet Scale Finales)

| Track | Persona | Fixed North Star Mission | The "Scale" Finale |
|-------|---------|--------------------------|--------------------|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The Sovereign Cluster. You are scaling to 100,000 GPUs across three continents. You must balance Bisection Bandwidth (Communication) with Carbon Intensity (Sustainability) to achieve 1 zettaFLOP of total training Goodput. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Global Safety Mesh. 10 million autonomous taxis are sharing real-time 'Hazard Embeddings'. You must maintain Fleet Determinism while resisting a massive-scale Sybil Attack on the network. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The Billion-User Metaverse. Your AR translation model is now part of a global, federated mesh. You must synchronize 1 billion Ray-Bans without exceeding the Grid Stability of local power utilities. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Smart Dust Fleet. You have deployed 1 million hearables. The fleet has become a 'Distributed Brain' for environmental monitoring. You must manage Mesh Entropy to prevent global signal decay. |

3. The 3-Part Mission (The KATs)

Part 1: The Scaling Wall Audit (Exploration - 15 Mins)

  • Objective: Identify the ultimate "Bottleneck hop" in a global-scale architecture.
  • The "Lock" (Prediction): "In a cluster spanning three continents, which term of the Fleet Iron Law will dominate your latency: Intra-node NVLink, Inter-rack InfiniBand, or Inter-continental Fiber?"
  • The Workbench:
    • Action: Scale the Fleet Size (N) from 1 to 1,000,000. Adjust the Geographic Dispersion (km).
    • Observation: The Global Bandwidth Cliff. Watch the "Coordination Overhead" explode as the fleet crosses the Light Barrier between regions.
  • Reflect: "Patterson asks: 'Identify the exact node count where your bisection bandwidth saturates.' Use the CI Ratio to justify your choice of Torus vs. Fat-Tree."
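
A rough way to sanity-check the Part 1 prediction is to compare pure propagation delay across the three hop types. The sketch below is illustrative only: the hop lengths are hypothetical, the refractive index is an assumed typical value for silica fiber, and switching, serialization, and queueing delays are ignored, so the numbers are lower bounds rather than measurements of any real fabric.

```python
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.468    # assumption: typical silica fiber

# Hypothetical hop lengths in km, chosen only for illustration.
HOP_LENGTH_KM = {
    "intra-node": 0.001,           # about 1 m inside a chassis
    "inter-rack": 0.1,             # about 100 m across a hall
    "inter-continental": 10_000.0, # long-haul fiber
}


def propagation_delay_s(distance_km, refractive_index=FIBER_REFRACTIVE_INDEX):
    """One-way propagation delay in seconds (lower bound on hop latency)."""
    return distance_km * refractive_index / C_VACUUM_KM_PER_S


def dominant_hop(hop_length_km):
    """Return the name of the slowest hop and the per-hop delays."""
    delays = {name: propagation_delay_s(km) for name, km in hop_length_km.items()}
    return max(delays, key=delays.get), delays


if __name__ == "__main__":
    name, delays = dominant_hop(HOP_LENGTH_KM)
    for hop, delay in delays.items():
        print(f"{hop:>18}: {delay * 1e6:,.3f} us one-way")
    print("dominant hop:", name)
```

Under these assumptions the long-haul hop costs tens of milliseconds one-way, several orders of magnitude above the in-building hops, which is the intuition behind the "Light Barrier" observation.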

Part 2: The Reliability-Sustainability Frontier (Trade-off - 15 Mins)

  • Objective: Balance "Fleet Uptime" vs. "Carbon Footprint" using the Young-Daly and CUE invariants.
  • The "Lock" (Prediction): "Does doubling the checkpoint frequency (1/\tau) increase or decrease the total 'Carbon-per-Trained-Weight' (C/W) of your fleet?"
  • The Workbench:
    • Sliders: Checkpoint Frequency, Regional Grid Mix, Node Power Efficiency (\eta).
    • Instruments: Reliability-Sustainability Radar. Shows Uptime, Wasted Work, Carbon, and TCO.
    • The 10-Iteration Rule: Students must find the "Optimal Operating Window" that maximizes Fleet Goodput while staying under the 2030 Net-Zero target.
  • Reflect: "Jeff Dean observes: 'Your reliability is 99.9%, but you are burning 20% of your energy on Wasted Work between failures.' Propose an Asynchronous Checkpoint strategy."
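
The Part 2 prediction can be explored with the first-order Young-Daly approximation, where the optimal checkpoint interval is τ_opt = sqrt(2·δ·M) for checkpoint cost δ and mean time between failures M. The sketch below is a minimal model, not the lab's instrument: the function names and example numbers are hypothetical, and using the wasted-time fraction as a proxy for wasted energy (and therefore carbon) is an assumption.

```python
import math


def young_daly_interval(checkpoint_cost_s, mtbf_s):
    """First-order optimal interval between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of time lost to checkpoint writes plus rework.

    First term: time spent writing checkpoints.
    Second term: expected recomputation after a failure (half an interval
    on average), amortized over the mean time between failures.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    delta, mtbf = 60.0, 4 * 3600.0  # hypothetical: 60 s checkpoint, 4 h MTBF
    tau = young_daly_interval(delta, mtbf)
    for label, interval in [("2x tau_opt", 2 * tau), ("tau_opt", tau), ("tau_opt / 2", tau / 2)]:
        waste = wasted_fraction(interval, delta, mtbf)
        print(f"{label:>12}: interval={interval:8.1f} s  wasted={waste:.2%}")
```

In this model the answer to the prediction depends on the starting point: doubling the frequency reduces waste when the fleet checkpoints less often than τ_opt, and increases it once the interval is already at or below τ_opt.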

Part 3: Compound System Synthesis (Synthesis - 15 Mins)

  • Objective: Architect a "Compound AI System" using the Compound Capability Law.
  • The "Lock" (Prediction): "To achieve a 100x leap in system reliability, should you focus on building a 100x larger model, or a 10x more resilient ensemble of 10 models?"
  • The Workbench:
    • Interaction: Model Mesh Selector (Router, Verifier, Reasoner). Ensemble Size Slider. Voter Consensus Level.
    • The "Stakeholder" Challenge: The Board of Directors demands "Superhuman Safety" (Five Nines). You must use the Compound Probability Plot to prove that an ensemble of 5 'Small' models beats a single 'Giant' model in both accuracy and cost.
  • Reflect (The Ledger): "Defend your final 'Fleet Legacy.' How did you bridge the gap from single-node physics to global-scale orchestration? Justify why the future belongs to Composition, not just Scaling."
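
One way to reason about the Ensemble Size and Voter Consensus controls is a simple binomial voting model: if each of K models is correct independently with probability p, the ensemble is correct when at least m of them agree on the right answer. This is a sketch under a strong independence assumption, with hypothetical function names; real models share training data and failure modes, so correlated errors will shrink the gain shown here.

```python
import math


def consensus_accuracy(p_model, ensemble_size, votes_required):
    """P(at least `votes_required` of `ensemble_size` independent models are correct)."""
    return sum(
        math.comb(ensemble_size, k)
        * p_model ** k
        * (1.0 - p_model) ** (ensemble_size - k)
        for k in range(votes_required, ensemble_size + 1)
    )


def majority(ensemble_size):
    """Smallest vote count that is a strict majority."""
    return ensemble_size // 2 + 1


if __name__ == "__main__":
    p = 0.9  # hypothetical per-model accuracy
    for k in (1, 3, 5):
        acc = consensus_accuracy(p, k, majority(k))
        print(f"K={k}: majority-vote accuracy = {acc:.5f}")
```

Raising the consensus level toward unanimity lowers this probability, so the slider trades false accepts against availability rather than improving both at once.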

4. Visual Layout Specification

  • Primary: FleetSynthesisRadar (Covering the C³: Compute, Communication, Coordination).
  • Secondary: GlobalLatencyMap (Visualizing the 'Speed of Light' delays across the fleet).
  • Math Peek: Toggle for the Compound Capability Law (P_{sys} = 1 - (1-P_{node})^K) and the Six Principles of Distributed ML.
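
The Compound Capability Law in the Math Peek, P_{sys} = 1 - (1-P_{node})^K, can be evaluated directly to see how many components a reliability target implies. The sketch below reads the law as "the system succeeds if at least one of K independent components succeeds"; that reading, the independence of components, and the helper names are assumptions rather than part of the lab specification.

```python
import math


def system_reliability(p_node, k):
    """Compound Capability Law: P_sys = 1 - (1 - P_node)^K."""
    return 1.0 - (1.0 - p_node) ** k


def components_needed(p_node, target):
    """Smallest K with P_sys >= target, from solving the law for K."""
    if not (0.0 < p_node < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_node and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_node))


if __name__ == "__main__":
    five_nines = 0.99999
    for p in (0.99, 0.999):  # hypothetical per-component reliabilities
        k = components_needed(p, five_nines)
        print(f"P_node={p}: K={k}, P_sys={system_reliability(p, k):.7f}")
```

Because the failure term shrinks geometrically in K, a handful of moderately reliable components reaches Five Nines in this model, provided their failures really are independent.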