📐 Mission Plan: 10_dist_inference (Volume 2: Fleet Scale)
1. Chapter Context
- Chapter Title: Distributed Inference: Fleet-Scale Serving.
- Core Invariant: The Serving Invariant (P99 Latency vs. Throughput Efficiency) and the Serving Cost Dominance Law (OpEx >> CapEx).
- The Struggle: Understanding that at scale, "The Queue is the Model." Students must navigate the trade-off between Request Isolation (low latency) and Batch Saturation (low cost), focusing on how Continuous Batching and PagedAttention bypass the KV-Cache Wall (a sizing sketch follows this list).
- Target Duration: 45 Minutes.
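
To make the KV-Cache Wall concrete before the missions begin, here is a minimal back-of-envelope sizing sketch. The architecture numbers (80 layers, 8 grouped-query KV heads, head dimension 128, fp16) are assumed Llama-3-70B-style values, and the 40 GiB HBM budget is hypothetical:

```python
# Back-of-envelope KV-cache sizing. Architecture numbers are assumed
# Llama-3-70B-style values (80 layers, 8 GQA KV heads, head_dim 128, fp16);
# the 40 GiB HBM budget is hypothetical.

LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 80, 8, 128, 2

def kv_bytes_per_token() -> int:
    # 2x covers the separate K and V tensors cached at every layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16

def concurrent_users(hbm_budget_gib: float, context_len: int) -> int:
    per_sequence = kv_bytes_per_token() * context_len
    return int(hbm_budget_gib * 2**30 // per_sequence)

print(f"KV per token: {kv_bytes_per_token() / 1024:.0f} KiB")       # ~320 KiB
print(f"8K-context users in 40 GiB: {concurrent_users(40, 8192)}")  # 16
```

At roughly 320 KiB of KV state per token, a single 8K-token conversation pins about 2.5 GiB of HBM, which is why concurrency, not FLOPs, is often the binding constraint.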
2. The 4-Track Storyboard (Inference Missions)
| Track | Persona | Fixed North Star Mission | The "Serving" Crisis |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving throughput. | The KV-Cache Wall. Your H100s are only 20% utilized because fragmentation in the KV-cache is causing premature out-of-memory (OOM) failures. You must implement 'PagedAttention' to reclaim 40% of your VRAM. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Fan-out Tail. Your perception loop now queries 10 parallel sub-models, and the slowest sub-model's jitter pushes the total response time past the 10ms SLA. You must use 'Speculative Execution' (see the hedging sketch after this table). |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The Offload Jitter. You are offloading AR reasoning to a fleet of Edge nodes. The variable 'Alpha' (start-up latency) of the WiFi-6 mesh is causing AR frame-stutter. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Power-Latency Seesaw. You are serving a noise-isolation fleet. Higher batching saves gateway power but adds 50ms of delay, causing 'Echo' for the user. |
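
The Edge Guardian crisis is Dean's "Tail at Scale" problem in miniature: the fan-out total is the max over sub-model latencies, so one jittery straggler dominates P99. Below is a minimal Monte-Carlo sketch of hedged (speculative) duplicates; the lognormal jitter parameters are invented for illustration:

```python
# Monte-Carlo sketch of the "Fan-out Tail": plain 10-way fan-out vs. hedged
# duplicates. The lognormal latency distribution is an invented stand-in.
import random

random.seed(0)
N_MODELS, TRIALS = 10, 100_000

def call_ms() -> float:
    # per-sub-model latency with a heavy right tail (assumed jitter)
    return random.lognormvariate(0.0, 0.6)

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

# plain fan-out: wait for all 10 sub-models, so the slowest one wins
plain = [max(call_ms() for _ in range(N_MODELS)) for _ in range(TRIALS)]

# hedged fan-out: issue each sub-model call twice, keep the faster reply
hedged = [max(min(call_ms(), call_ms()) for _ in range(N_MODELS))
          for _ in range(TRIALS)]

print(f"P99 plain : {p99(plain):.2f} ms")
print(f"P99 hedged: {p99(hedged):.2f} ms")
```

Taking the faster of two replicas per sub-model call trims the tail at the cost of roughly 2x issued work, which is exactly the trade the mission asks students to defend.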
3. The 3-Part Mission (The KATs)
Part 1: The Throughput Knee (Exploration - 15 Mins)
- Objective: Predict and measure the point of system collapse using Queuing Theory.
- The "Lock" (Prediction): "If you increase the request rate (
\lambda) to 90% of your maximum capacity, does the P99 latency increase linearly or exponentially?" - The Workbench:
- Action: Slide the Arrival Rate (
\lambda). Adjust the Batch Window. - Observation: The Latency-Throughput Pareto Curve. Watch the "Knee of the Curve" where latency explodes.
- Action: Slide the Arrival Rate (
- Reflect: "Patterson asks: 'Why is 80% utilization the practical ceiling for a responsive system?' (Reference the
M/M/1queue math)."
Part 2: Sharding the Heavyweight (Trade-off - 15 Mins)
- Objective: Balance Tensor Parallelism (TP) vs. Pipeline Parallelism (PP) for latency-sensitive serving.
- The "Lock" (Prediction): "Does 'Tensor Parallelism' (sharding weights) reduce the latency of a single request more than 'Pipeline Parallelism' (sharding layers)?"
- The Workbench:
- Interaction: Adjust TP Degree vs. PP Degree. Toggle Continuous Batching.
- Instruments: Latency Component Waterfall (Compute vs. Communication vs. Bubbles).
- The 10-Iteration Rule: Students must shard a 70B model across 8 GPUs to hit a 50ms 'Time-to-First-Token' (TTFT) target.
- Reflect: "Jeff Dean observes: 'Your sharding strategy is fast, but your bisection bandwidth is 100% saturated.' Propose a 'Weight-Gather' optimization to reduce the network tax."
Part 3: The Memory Wall (Synthesis - 15 Mins)
- Objective: Optimize KV-Cache management to maximize user concurrency.
- The "Lock" (Prediction): "If you use 'PagedAttention' to eliminate internal fragmentation, how many more concurrent users can you fit in 80GB of HBM?"
- The Workbench:
- Interaction: Fragmentation Slider. KV-Cache Eviction Policy. Request Preemption Budget.
- The "Stakeholder" Challenge: The CFO demands a 50% reduction in 'Cost-per-User'. You must implement Speculative Decoding to reduce the 'Tokens-per-Second' cost without regressing on P99 latency.
- Reflect (The Ledger): "Defend your final 'Fleet Serving Strategy.' Did you prioritize 'Throughput' (Continuous Batching) or 'Responsiveness' (Zero-Batching)? Justify how you solved the 'Tail at Scale' problem."
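
A minimal sketch of the Part 3 prediction, comparing a naive contiguous KV allocator (which reserves the worst-case context up front) against block-granular, PagedAttention-style allocation. It reuses the ~320 KiB-per-token figure assumed earlier; the mean live context length and the 40 GiB KV budget are invented:

```python
# Paged vs. contiguous KV-cache allocation: concurrent users per HBM budget.
# All sizes are illustrative assumptions consistent with the earlier sketch.

KV_PER_TOKEN = 320 * 1024   # bytes per token (assumed, from earlier sketch)
MAX_CONTEXT = 8192          # scheduler's worst-case context reservation
AVG_LIVE_TOKENS = 1800      # assumed mean live length across requests
BLOCK_TOKENS = 16           # PagedAttention-style block granularity
HBM_BUDGET = 40 * 2**30     # bytes set aside for KV cache (assumed)

def users_contiguous() -> int:
    # naive allocator reserves the full max context per request up front
    return HBM_BUDGET // (MAX_CONTEXT * KV_PER_TOKEN)

def users_paged() -> int:
    # paged allocator only holds blocks covering tokens actually live
    blocks = -(-AVG_LIVE_TOKENS // BLOCK_TOKENS)  # ceiling division
    return HBM_BUDGET // (blocks * BLOCK_TOKENS * KV_PER_TOKEN)

print(f"contiguous: {users_contiguous()} users")  # 16
print(f"paged     : {users_paged()} users")       # 72
```

Under these assumptions, paging lifts concurrency from 16 to 72 users in the same budget, because each reservation shrinks from the 8K worst case to the ~1.8K tokens actually live.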
4. Visual Layout Specification
- Primary: `LatencyThroughputFrontier` (X-axis: QPS, Y-axis: P99 Latency).
- Secondary: `KVCacheHeatmap` (visualizing memory occupancy and fragmentation).
- Math Peek: Toggle for the `Serving Cost Dominance Law` and `TTFT vs TPOT` metrics.