mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-03 16:18:49 -05:00

Vijay Janapa Reddi 533cfa6e99 fix: pre-commit hooks — all 48 checks now pass

- book/quarto/mlsys/__init__.py: add repo-root sys.path injection so
  mlsysim is importable when scripts run from book/quarto/ context
- book/quarto/mlsys/{constants,formulas,formatting,hardware}.py: new
  compatibility shims that re-export from mlsysim.core.* and mlsysim.fmt
- mlsysim/viz/__init__.py: remove try/except for dashboard import; use
  explicit "import from mlsysim.viz.dashboard" pattern instead
- .codespell-ignore-words.txt: add "covert" (legitimate security term)
- book/tools/scripts/reference_check_log.txt: delete generated artifact
- Various QMD, bib, md files: auto-formatted by pre-commit hooks
  (trailing whitespace, bibtex-tidy, pipe table alignment)

2026-03-01 17:30:24 -05:00


📐 Mission Plan: 02_compute_infra (Volume 2: Fleet Scale)

1. Chapter Context

Chapter Title: Compute Infrastructure: The Warehouse-Scale Engine.
Core Invariant: Power Density Walls and PUE (Power Usage Effectiveness).
The Struggle: Understanding that at scale, "The Machine" is not a chip, but a Warehouse. Students must navigate the trade-off between Compute Density (TFLOPS per Rack) and Cooling Efficiency (PUE), specifically focusing on the transition from Air to Liquid cooling.
Target Duration: 45 Minutes.

2. The 4-Track Storyboard (Infrastructure Missions)

| Track | Persona | Fixed North Star Mission | The "Infrastructure" Crisis |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The 100 kW Rack Wall. Your new B200 nodes pull 120 kW per rack. The datacenter's air cooling is failing. You must decide whether to de-clock or move to liquid cooling. |
| Edge Guardian | AV Systems Lead | Deterministic 10 ms safety loop. | The Field-Pod Transient. Your 'Fleet Hub' (a local compute node) is overheating in a hot garage. You must balance perception accuracy against thermal throttling to prevent 'Safety Voids'. |
| Mobile Nomad | AR Glasses Dev | 60 FPS AR translation. | The Multi-Tenant Latency. You are offloading AR compute to a shared Edge Server. A noisy-neighbor app on the same server is evicting your cache, causing FPS drops. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10 ms under 1 mW. | The Gateway Jam. You are scaling from 1 to 1,000 hearables. The 'Machine' is now the gateway processor that must aggregate 1,000 streams without exceeding its mW budget. |

3. The 3-Part Mission (The KATs)

Part 1: The Silicon Contract (Selection - 15 Mins)

Objective: Select the optimal "Accelerator Class" (GPU, TPU, ASIC) for your fleet based on Arithmetic Intensity.
The "Lock" (Prediction): "Will a specialized ASIC always outperform a general-purpose GPU if the model's layers have highly irregular access patterns?"
The Workbench:
- Action: Toggle between GPU, TPU, and NPU. Apply your track's specific workload.
- Observation: The Accelerator Efficiency Radar. Compare 'Throughput-per-Watt' across the three architectures.
Reflect: "Patterson asks: 'Why is the Generality Tax higher for the Cloud Titan than for the Tiny Pioneer?' (Reference the silicon area spent on control logic)."
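
The selection step above can be sketched with a roofline bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. This is a minimal sketch, not the lab's actual model. The `Accelerator` class, the `gpu_like` and `asic_like` entries, and every number in them are hypothetical placeholders chosen only to show the crossover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    peak_tflops: float  # peak compute in TFLOP/s (hypothetical value)
    mem_bw_tbs: float   # memory bandwidth in TB/s (hypothetical value)
    power_w: float      # board power in watts (hypothetical value)


def attainable_tflops(acc: Accelerator, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    intensity is in FLOP per byte, so TB/s * FLOP/byte gives TFLOP/s.
    """
    return min(acc.peak_tflops, acc.mem_bw_tbs * intensity)


def tflops_per_watt(acc: Accelerator, intensity: float) -> float:
    """Throughput-per-watt at a given arithmetic intensity."""
    return attainable_tflops(acc, intensity) / acc.power_w


# Illustrative fleet: a bandwidth-rich general-purpose part and a
# compute-rich specialized part. These are assumptions, not vendor specs.
FLEET = [
    Accelerator("gpu_like", peak_tflops=1000.0, mem_bw_tbs=3.0, power_w=700.0),
    Accelerator("asic_like", peak_tflops=2000.0, mem_bw_tbs=1.0, power_w=400.0),
]


def best_for(intensity: float) -> Accelerator:
    """Pick the accelerator with the highest throughput-per-watt."""
    return max(FLEET, key=lambda acc: tflops_per_watt(acc, intensity))


low_intensity_pick = best_for(50.0)     # bandwidth-bound workload
high_intensity_pick = best_for(2000.0)  # compute-bound workload
```

With these placeholder numbers the specialized part wins only once the workload is compute-bound. At low arithmetic intensity, which is one way irregular access patterns show up, the bandwidth-rich part comes out ahead, which is the point the "Lock" question probes.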

Part 2: The Power Density Wall (Thermal Design - 15 Mins)

Objective: Optimize Rack Density while minimizing the "Cooling Tax" (PUE).
The "Lock" (Prediction): "If you increase rack density by 2x using Air cooling, does the PUE (Total Facility Power / IT Equipment Power) improve, stay the same, or degrade?"
The Workbench:
- Sliders: Nodes per Rack, Ambient Temperature, Fan Speed (Air) vs Pump Flow (Liquid).
- Instruments: PUE Gauge. Thermal Heatmap.
- The 10-Iteration Rule: Students must find the "Knee of the Thermal Curve" where the marginal energy cost of cooling exceeds the marginal gain in compute density.
Reflect: "Jeff Dean observes: 'Your PUE is 1.8. For every watt of IT load you spend another 0.8 W on non-compute overhead, about 44% of total facility energy.' Propose a transition to Liquid Cooling and prove the TCO reduction."
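
The density-versus-PUE question in this part can be explored with a toy air-cooling model. The sketch below assumes fan power grows with the cube of required airflow (the fan affinity law) and that airflow scales linearly with rack heat load. The reference point (`ref_it_kw`, `ref_fan_kw`), the distribution loss fraction, the per-node power, and the `max_marginal` threshold are all hypothetical values, not figures from the lab.

```python
def facility_kw(it_kw, ref_it_kw=20.0, ref_fan_kw=2.0, dist_loss_frac=0.08):
    """Total facility power for one air-cooled rack, in kW.

    Assumptions (hypothetical): fans draw ref_fan_kw at ref_it_kw of IT
    load and scale with the cube of the load; distribution loss is a
    fixed fraction of IT power.
    """
    fan_kw = ref_fan_kw * (it_kw / ref_it_kw) ** 3
    dist_kw = dist_loss_frac * it_kw
    return it_kw + fan_kw + dist_kw


def air_cooled_pue(it_kw, **model):
    """PUE = total facility power / IT equipment power."""
    return facility_kw(it_kw, **model) / it_kw


def find_knee(node_kw=5.0, max_nodes=16, max_marginal=2.0):
    """Return the first node count whose addition costs more than
    max_marginal kW of facility power per kW of added IT power,
    or None if the threshold is never crossed."""
    for n in range(1, max_nodes + 1):
        added = facility_kw(n * node_kw) - facility_kw((n - 1) * node_kw)
        if added / node_kw > max_marginal:
            return n
    return None


pue_base = air_cooled_pue(20.0)     # reference density
pue_doubled = air_cooled_pue(40.0)  # 2x density, same air cooling
knee = find_knee()
```

Under these assumptions doubling density raises PUE, because fan power grows faster than IT power. A liquid-cooled variant would need a different pump model; this sketch covers only the air-cooled case.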

Part 3: The Scaling Cliff (Fleet Economics - 15 Mins)

Objective: Perform a "Build-vs-Buy" audit for your track's infrastructure.
The "Lock" (Prediction): "Over a 3-year period, is it cheaper to rent cloud TFLOPS or own a private pod, factoring in hardware failure (MTBF)?"
The Workbench:
- Interaction: Ownership Toggle. MTBF Slider (Hours between failures). Electricity Price ($/kWh).
- The "Stakeholder" Challenge: The CFO demands a 20% reduction in OpEx. You must use the TCO Waterfall to prove that investing in "Efficient Power Delivery" saves more than "Cheaper Nodes."
Reflect (The Ledger): "Defend your final 'Fleet Infrastructure' choice. Did you prioritize 'Density' or 'Efficiency'? Justify why 'Availability' was your primary constraint."
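
The build-vs-buy audit above can be sketched as two cost functions over the same horizon. This is a simplified model under stated assumptions: failures arrive at a constant rate of one per `mtbf_hours` per node, each failure costs a flat `repair_cost`, and cooling overhead enters only through PUE. All parameter names and example values are hypothetical.

```python
HOURS_PER_YEAR = 8760


def own_tco(n_nodes, node_capex, node_kw, pue, price_per_kwh,
            mtbf_hours, repair_cost, years=3):
    """Cost of owning a private pod: hardware + energy + expected repairs."""
    hours = years * HOURS_PER_YEAR
    capex = n_nodes * node_capex
    # Facility energy = IT energy scaled by PUE.
    energy = n_nodes * node_kw * pue * hours * price_per_kwh
    # Expected failure count across the fleet over the horizon.
    expected_failures = n_nodes * hours / mtbf_hours
    return capex + energy + expected_failures * repair_cost


def rent_tco(n_nodes, rate_per_node_hour, years=3):
    """Cost of renting the same node count continuously."""
    return n_nodes * rate_per_node_hour * years * HOURS_PER_YEAR


# Hypothetical scenario; every number below is a placeholder.
scenario = dict(n_nodes=8, node_capex=250_000.0, node_kw=10.0,
                price_per_kwh=0.10, mtbf_hours=50_000.0,
                repair_cost=20_000.0)

own_efficient = own_tco(pue=1.3, **scenario)
own_inefficient = own_tco(pue=1.8, **scenario)
rent = rent_tco(n_nodes=8, rate_per_node_hour=15.0)
pue_savings = own_inefficient - own_efficient
```

Varying `mtbf_hours`, `price_per_kwh`, and `pue` mirrors the workbench sliders. The model ignores utilization, depreciation, and staffing, so it is a starting point for the audit rather than a verdict.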

4. Visual Layout Specification

Primary: PUE_Waterfall (Compute Power vs. Cooling Power vs. Distribution Loss).
Secondary: ThermalSafeZonePlot (Density vs. Temperature).
Math Peek: Toggle for $PUE = \frac{\text{Total Facility Power}}{\text{IT Equipment Power}}$.
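
As a worked reading of this formula, the share of total facility energy that goes to non-compute overhead follows directly from the definition. The example uses the PUE of 1.8 quoted in Part 2. The symbols $P_{\text{total}}$ and $P_{\text{IT}}$ are shorthand introduced here for the two quantities in the formula.

$$
\text{Overhead share} = \frac{P_{\text{total}} - P_{\text{IT}}}{P_{\text{total}}} = 1 - \frac{1}{PUE},
\qquad
PUE = 1.8 \;\Rightarrow\; 1 - \frac{1}{1.8} \approx 0.44
$$

So a PUE of 1.8 means overhead equals 80% of IT power but about 44% of total facility power.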
