# 📐 Mission Plan: 10_dist_inference (Volume 2: Fleet Scale)

## 1. Chapter Context
*   **Chapter Title:** Distributed Inference: Fleet-Scale Serving.
*   **Core Invariant:** The Serving Invariant (P99 Latency vs. Throughput Efficiency) and the **Serving Cost Dominance Law** (OpEx >> CapEx).
*   **The Struggle:** Understanding that at scale, "The Queue is the Model." Students must navigate the trade-off between **Request Isolation** (low latency) and **Batch Saturation** (low cost), specifically focusing on how **Continuous Batching** and **PagedAttention** bypass the KV-Cache Wall.
*   **Target Duration:** 45 Minutes.

---

## 2. The 4-Track Storyboard (Inference Missions)

| Track | Persona | Fixed North Star Mission | The "Serving" Crisis |
| :--- | :--- | :--- | :--- |
| **Cloud Titan** | LLM Architect | Maximize Llama-3-70B serving. | **The KV-Cache Wall.** Your H100s are only 20% utilized because fragmentation in the KV-cache is causing premature OOM. You must implement 'PagedAttention' to reclaim 40% of your VRAM. |
| **Edge Guardian** | AV Systems Lead | Deterministic 10ms safety loop. | **The Fan-out Tail.** Your perception loop now queries 10 parallel sub-models. The slowest sub-model's jitter is causing the total response time to fail the 10ms SLA. You must use 'Speculative Execution'. |
| **Mobile Nomad** | AR Glasses Dev | 60FPS AR translation. | **The Offload Jitter.** You are offloading AR reasoning to a fleet of Edge nodes. The variable 'Alpha' (start-up latency) of the WiFi-6 mesh is causing AR frame-stutter. |
| **Tiny Pioneer** | Hearable Lead | Neural isolation in <10ms under 1mW. | **The Power-Latency Seesaw.** You are serving a noise-isolation fleet. Higher batching saves gateway power but adds 50ms of delay, causing 'Echo' for the user. |

---

## 3. The 3-Part Mission (The KATs)

### Part 1: The Throughput Knee (Exploration - 15 Mins)
*   **Objective:** Predict and measure the point of system collapse using Queuing Theory.
*   **The "Lock" (Prediction):** "If you increase the request rate ($\lambda$) to 90% of your maximum capacity, does the P99 latency increase linearly or exponentially?"
*   **The Workbench:**
    *   **Action:** Slide the **Arrival Rate** ($\lambda$). Adjust the **Batch Window**.
    *   **Observation:** The **Latency-Throughput Pareto Curve**. Watch the "Knee of the Curve" where latency explodes.
*   **Reflect:** "Patterson asks: 'Why is 80% utilization the practical ceiling for a responsive system?' (Reference the $M/M/1$ queue math)."

### Part 2: Sharding the Heavyweight (Trade-off - 15 Mins)
*   **Objective:** Balance Tensor Parallelism (TP) vs. Pipeline Parallelism (PP) for latency-sensitive serving.
*   **The "Lock" (Prediction):** "Does 'Tensor Parallelism' (sharding weights) reduce the latency of a single request more than 'Pipeline Parallelism' (sharding layers)?"
*   **The Workbench:**
    *   **Interaction:** Adjust **TP Degree** vs. **PP Degree**. Toggle **Continuous Batching**.
    *   **Instruments:** **Latency Component Waterfall** (Compute vs. Communication vs. Bubbles).
    *   **The 10-Iteration Rule:** Students must shard a 70B model across 8 GPUs to hit a 50ms 'Time-to-First-Token' (TTFT) target.
*   **Reflect:** "Jeff Dean observes: 'Your sharding strategy is fast, but your bisection bandwidth is 100% saturated.' Propose a 'Weight-Gather' optimization to reduce the network tax."

### Part 3: The Memory Wall (Synthesis - 15 Mins)
*   **Objective:** Optimize KV-Cache management to maximize user concurrency.
*   **The "Lock" (Prediction):** "If you use 'PagedAttention' to eliminate internal fragmentation, how many more concurrent users can you fit in 80GB of HBM?"
*   **The Workbench:**
    *   **Interaction:** **Fragmentation Slider**. **KV-Cache Eviction Policy**. **Request Preemption Budget**.
    *   **The "Stakeholder" Challenge:** The **CFO** demands a 50% reduction in 'Cost-per-User'. You must implement **Speculative Decoding** to reduce the 'Tokens-per-Second' cost without regressing on P99 latency.
*   **Reflect (The Ledger):** "Defend your final 'Fleet Serving Strategy.' Did you prioritize 'Throughput' (Continuous Batching) or 'Responsiveness' (Zero-Batching)? Justify how you solved the 'Tail at Scale' problem."

---

## 4. Visual Layout Specification
*   **Primary:** `LatencyThroughputFrontier` (X-axis: QPS, Y-axis: P99 Latency).
*   **Secondary:** `KVCacheHeatmap` (Visualizing memory occupancy and fragmentation).
*   **Math Peek:** Toggle for `Serving Cost Dominance Law` and `TTFT vs TPOT` metrics.