📐 Mission Plan: 17_ml_conclusion (Volume 2: Fleet Scale)
1. Chapter Context
- Chapter Title: ML Conclusion: The Fleet Architect's Synthesis.
- Core Invariant: Fleet Synthesis (The C³ Convergence: Compute, Communication, Coordination) and the Compound Capability Law.
- The Struggle: Synthesizing all fleet-scale principles to solve a final, global engineering crisis. Understanding that individual model scaling is saturated, and the future belongs to Compound AI Systems—orchestrating meshes of models, data, and machines to achieve superhuman reliability.
- Target Duration: 45 Minutes.
2. The 4-Track Storyboard (Fleet Scale Finales)
| Track | Persona | Fixed North Star Mission | The "Scale" Finale |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The Sovereign Cluster. You are scaling to 100,000 GPUs across three continents. You must balance Bisection Bandwidth (Communication) with Carbon Intensity (Sustainability) to achieve 1 zettaFLOP of total training Goodput. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Global Safety Mesh. 10 million autonomous taxis are sharing real-time 'Hazard Embeddings'. You must maintain Fleet Determinism while resisting a massive-scale Sybil Attack on the network. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The Billion-User Metaverse. Your AR translation model is now part of a global, federated mesh. You must synchronize 1 billion Ray-Bans without exceeding the Grid Stability of local power utilities. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Smart Dust Fleet. You have deployed 1 million hearables. The fleet has become a 'Distributed Brain' for environmental monitoring. You must manage Mesh Entropy to prevent global signal decay. |
3. The 3-Part Mission (The KATs)
Part 1: The Scaling Wall Audit (Exploration - 15 Mins)
- Objective: Identify the ultimate "Bottleneck hop" in a global-scale architecture.
- The "Lock" (Prediction): "In a cluster spanning three continents, which term of the Fleet Iron Law will dominate your latency: Intra-node NVLink, Inter-rack InfiniBand, or Inter-continental Fiber?"
- The Workbench:
- Action: Scale the Fleet Size ($N$) from 1 to 1,000,000. Adjust the Geographic Dispersion (km).
- Observation: The Global Bandwidth Cliff. Watch the "Coordination Overhead" explode as the fleet crosses the Light Barrier between regions.
- Reflect: "Patterson asks: 'Identify the exact node count where your bisection bandwidth saturates.' Use the CI Ratio to justify your choice of Torus vs. Fat-Tree."
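
The Part 1 prediction can be sanity-checked with a rough propagation-delay estimate. The sketch below is illustrative only: the hop distances are assumed values, not figures from the lab, and every hop is treated as if the signal travelled through optical fiber at about two thirds of the vacuum speed of light. It gives a lower bound on one-way latency and ignores switching, serialization, and queuing delays.

```python
# Rough one-way propagation delay per hop type.
# All distances below are illustrative assumptions, not lab parameters.
C_VACUUM_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # light in silica fiber moves at roughly 2/3 c


def propagation_delay_ms(distance_km: float) -> float:
    """Lower-bound one-way delay in milliseconds for a given path length."""
    speed_km_per_s = C_VACUUM_KM_PER_S * FIBER_VELOCITY_FACTOR
    return distance_km / speed_km_per_s * 1_000


# Hypothetical path lengths for the three candidate bottlenecks.
HOPS_KM = {
    "intra-node NVLink": 0.001,          # about 1 m (assumed)
    "inter-rack InfiniBand": 0.1,        # about 100 m (assumed)
    "inter-continental fiber": 10_000,   # assumed
}

delays_ms = {name: propagation_delay_ms(km) for name, km in HOPS_KM.items()}
dominant_hop = max(delays_ms, key=delays_ms.get)

for name, delay in delays_ms.items():
    print(f"{name:>26}: {delay:.6f} ms")
print("dominant hop:", dominant_hop)
```

Under these assumed distances the inter-continental hop is several orders of magnitude slower than the other two, which is the "Light Barrier" the Workbench observation refers to.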
Part 2: The Reliability-Sustainability Frontier (Trade-off - 15 Mins)
- Objective: Balance "Fleet Uptime" vs. "Carbon Footprint" using the Young-Daly and CUE invariants.
- The "Lock" (Prediction): "Does doubling the checkpoint frequency ($1/\tau$) increase or decrease the total 'Carbon-per-Trained-Weight' (C/W) of your fleet?"
- The Workbench:
- Sliders: Checkpoint Frequency, Regional Grid Mix, Node Power Efficiency ($\eta$).
- Instruments: Reliability-Sustainability Radar. Shows Uptime, Wasted Work, Carbon, and TCO.
- The 10-Iteration Rule: Students must find the "Optimal Operating Window" that maximizes Fleet Goodput while staying under the 2030 Net-Zero target.
- Reflect: "Jeff Dean observes: 'Your reliability is 99.9%, but you are burning 20% of your energy on Wasted Work between failures.' Propose an Asynchronous Checkpoint strategy."
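
The Part 2 prediction can be explored with the first-order Young-Daly model. This is a minimal sketch under stated assumptions: the checkpoint cost and mean time between failures (MTBF) are hypothetical numbers, the wasted-work formula is the usual first-order approximation (valid when checkpoint cost is much smaller than the interval, which is much smaller than the MTBF), and carbon is assumed to scale with wall-clock time so that wasted work maps directly to wasted carbon.

```python
import math


def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpointing plus rework.

    delta / tau   : time spent writing checkpoints
    tau / (2 * M) : expected recomputation after a failure
    """
    return checkpoint_cost_s / interval_s + interval_s / (2 * mtbf_s)


# Hypothetical fleet parameters (assumptions for illustration only).
CHECKPOINT_COST_S = 60.0   # time to write one checkpoint
MTBF_S = 3 * 3600.0        # fleet-wide mean time between failures

tau_opt = young_daly_interval(CHECKPOINT_COST_S, MTBF_S)
waste_at_opt = wasted_fraction(tau_opt, CHECKPOINT_COST_S, MTBF_S)
# Doubling the checkpoint frequency halves the interval.
waste_at_double_freq = wasted_fraction(tau_opt / 2, CHECKPOINT_COST_S, MTBF_S)

print(f"optimal interval: {tau_opt:.1f} s")
print(f"wasted fraction at optimum:      {waste_at_opt:.4f}")
print(f"wasted fraction at 2x frequency: {waste_at_double_freq:.4f}")
```

In this model the answer depends on the starting point: doubling the frequency from the Young-Daly optimum raises wasted work, while doubling it from a much longer interval can lower it. That is why the Workbench asks for an operating window rather than a single direction.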
Part 3: Compound System Synthesis (Synthesis - 15 Mins)
- Objective: Architect a "Compound AI System" using the Compound Capability Law.
- The "Lock" (Prediction): "To achieve a 100x leap in system reliability, should you focus on building a 100x larger model, or a 10x more resilient ensemble of 10 models?"
- The Workbench:
- Interaction: Model Mesh Selector (Router, Verifier, Reasoner). Ensemble Size Slider. Voter Consensus Level.
- The "Stakeholder" Challenge: The Board of Directors demands "Superhuman Safety" (Five Nines). You must use the Compound Probability Plot to prove that an ensemble of 5 'Small' models beats a single 'Giant' model in both accuracy and cost.
- Reflect (The Ledger): "Defend your final 'Fleet Legacy.' How did you bridge the gap from single-node physics to global-scale orchestration? Justify why the future belongs to Composition, not just Scaling."
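
The Part 3 claim can be checked numerically with the Compound Capability Law, $P_{sys} = 1 - (1-P_{node})^K$. The sketch below rests on two strong assumptions that should be stated in any defense to the "Board": the $K$ models fail independently, and a perfect verifier always picks a correct answer when at least one model produces it. The per-model success probabilities are hypothetical.

```python
def system_success(p_node: float, k: int) -> float:
    """Compound Capability Law: P_sys = 1 - (1 - P_node)^K.

    Assumes independent failures and a perfect selector/verifier.
    """
    return 1 - (1 - p_node) ** k


def members_needed(p_node: float, target: float, max_k: int = 1_000) -> int:
    """Smallest ensemble size K whose P_sys meets the target."""
    for k in range(1, max_k + 1):
        if system_success(p_node, k) >= target:
            return k
    raise ValueError("target not reachable within max_k members")


# Hypothetical 'Small' model that succeeds 90% of the time.
p_small = 0.9
p_ensemble_of_5 = system_success(p_small, 5)
print(f"5 x small model: P_sys = {p_ensemble_of_5:.6f}")

# Hypothetical stronger model at 95%: how many are needed for five nines?
k_for_five_nines = members_needed(0.95, 0.99999)
print("members needed at P_node = 0.95:", k_for_five_nines)
```

With correlated failures (shared training data, shared infrastructure) the effective gain is smaller than this formula suggests, so the Compound Probability Plot should be read as an upper bound under independence.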
4. Visual Layout Specification
- Primary: `FleetSynthesisRadar` (Covering the C³: Compute, Communication, Coordination).
- Secondary: `GlobalLatencyMap` (Visualizing the 'Speed of Light' delays across the fleet).
- Math Peek: Toggle for the Compound Capability Law ($P_{sys} = 1 - (1-P_{node})^K$) and the Six Principles of Distributed ML.