# 📐 Mission Plan: 04_data_engr (Deep Analysis)

## 1. Chapter Context

* **Chapter Title:** Data Engineering: Dataset Compilation.
* **Core Invariant:** Data Gravity ($T = D_{vol}/BW$) and the Energy-Movement Invariant ($E_{move} \gg E_{comp}$).
* **The Struggle:** Balancing the "Feeding Tax": ensuring the data pipeline can keep up with the GPU's consumption rate without destroying the energy budget.
* **Target Duration:** 45 Minutes.

---

## 2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data Gravity" |
| :--- | :--- | :--- | :--- |
| **Cloud Titan** | LLM Architect | Maximize Llama-3-70B serving on a single H100. | **The Feeding Tax.** Disk I/O cannot keep up with HBM speeds. |
| **Edge Guardian** | AV Systems Lead | Deterministic 10ms safety loop on NVIDIA Orin. | **The Ingestion Choke.** 8 raw 4K vision streams flood the bus. |
| **Mobile Nomad** | AR Glasses Dev | 60FPS AR translation on Meta Ray-Bans. | **Transmission Energy.** Moving bits over Bluetooth drains glasses. |
| **Tiny Pioneer** | Hearable Lead | Neural isolation in <10ms under 1mW. | **SRAM Budget.** Buffering audio consumes 50% of total memory. |

---

## 3. The 3-Part Mission (The KATs)

### Part 1: The Data Gravity Audit (Exploration - 15 Mins)

* **Objective:** Dimension the physical and economic cost of moving the mission's dataset.
* **The "Lock" (Prediction):** "Will it be cheaper to stream your data over Fiber or ship a physical hard drive across the country?"
* **The Workbench:**
    * **Sliders:** Dataset Size (10GB -> 10PB), Distance (km), Link Bandwidth (10G -> 100G).
    * **Instruments:** `TransferTimeRadar`, `SneakernetCrossoverPlot` (Time vs Distance).
* **The 5-Move Rule:** Students must analyze 5 different scale tiers to identify the "Distance Invariant" where each path wins.
* **Reflect:** "Reconcile the transfer time with the 'Physics of Data Gravity' from the text. When does bit-volume become a physical barrier?"

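
The Part 1 prediction (stream over fiber versus ship a drive) can be explored with a minimal sketch of the Data Gravity relation $T = D_{vol}/BW$. Everything beyond that formula is an assumption made for illustration: the function names, the courier speed, and the aggregate drive copy bandwidth are hypothetical placeholders, not values from the lab or from any real instrument such as `SneakernetCrossoverPlot`.

```python
from typing import Optional


def network_transfer_seconds(dataset_bytes: float, link_bits_per_s: float) -> float:
    """Data Gravity: T = D_vol / BW, with the dataset converted from bytes to bits."""
    return dataset_bytes * 8 / link_bits_per_s


def sneakernet_seconds(
    dataset_bytes: float,
    distance_km: float,
    courier_km_per_h: float = 80.0,   # assumed average courier speed
    copy_bytes_per_s: float = 10e9,   # assumed aggregate drive copy bandwidth
) -> float:
    """Copy onto drives, physically move them, then copy off again."""
    copy_time = 2 * dataset_bytes / copy_bytes_per_s
    travel_time = distance_km / courier_km_per_h * 3600
    return copy_time + travel_time


def crossover_bytes(
    distance_km: float,
    link_bits_per_s: float,
    courier_km_per_h: float = 80.0,
    copy_bytes_per_s: float = 10e9,
) -> Optional[float]:
    """Dataset size above which shipping drives beats the link.

    Returns None when the link moves each byte faster than the drives can be
    copied twice, so the network wins at every dataset size.
    """
    per_byte_network = 8 / link_bits_per_s
    per_byte_copy = 2 / copy_bytes_per_s
    if per_byte_network <= per_byte_copy:
        return None
    travel_time = distance_km / courier_km_per_h * 3600
    return travel_time / (per_byte_network - per_byte_copy)


if __name__ == "__main__":
    for size in (10e9, 1e12, 1e15):  # 10 GB, 1 TB, 1 PB
        net = network_transfer_seconds(size, 10e9)
        ship = sneakernet_seconds(size, distance_km=4000)
        winner = "network" if net < ship else "sneakernet"
        print(f"{size:.0e} bytes: network {net:.0f} s, sneakernet {ship:.0f} s -> {winner}")
```

Under these assumed parameters the crossover depends on both distance and link speed, and a fast enough link has no crossover at all. That is the kind of boundary the 5-Move Rule asks students to locate.
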
### Part 2: The Feeding Tax Solver (Trade-off - 20 Mins)

* **Objective:** Maximize GPU Model FLOPS Utilization (MFU) by optimizing the serialization pipeline.
* **The "Lock" (Prediction):** "If you switch from JSON to Protobuf, will your GPU utilization increase more than if you upgrade to a faster SSD?"
* **The Workbench:**
    * **Sliders:** Serialization Format (CSV, JSON, Parquet, Protobuf), Worker Count (1-32), Disk Type (HDD -> NVMe).
    * **Instruments:** `FeedingTaxGauge` (% GPU Idle), `MFU_vs_Ingestion_Plot`.
* **The 15-Iteration Rule:** Students must find the exact "Flow Equilibrium" where the CPU's pre-processing rate matches the GPU's consumption rate.
* **Reflect:** "Your GPU is 80% idle. Prove whether the bottleneck is in the 'Blueprint' (Algorithm) or the 'Fuel' (Data pipeline) using the MFU plot."

### Part 3: The Zero-Waste Audit (Synthesis - 10 Mins)

* **Objective:** Maximize 'Data Selection Gain' to hit accuracy targets within a carbon/energy budget.
* **The "Lock" (Prediction):** "Is it more energy-efficient to use 1 million noisy samples or 10,000 curated 'Gold Standard' samples?"
* **The Workbench:**
    * **Sliders:** Filtering Ratio (0-90%), Label Quality (Low -> Expert), Processing Location (Local vs Cloud).
* **The "Stakeholder" Challenge:** The **Sustainability Lead** demands a 50% reduction in transmission energy. The student must use the **Energy-Movement Invariant** to propose an architectural change (e.g. local pre-processing).
* **Reflect (The Ledger):** Justify your final Data/Compute energy ratio. Explain why "Signal-to-Noise Engineering" is more effective than raw scaling for this mission.

---

## 4. Visual Layout Specification

* **Primary:** `IngestionWaterfall` (Storage BW vs. Network BW vs. Compute rate).
* **Secondary:** `EnergyRadar` (MAC pJ vs. DRAM pJ vs. Network pJ).
* **Transparency:** Toggle for $\text{Data Selection Gain} \propto \frac{\text{Entropy}}{\text{Gravity}}$.
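
The stage comparison in `IngestionWaterfall` and the idle percentage shown by `FeedingTaxGauge` can be approximated with a simple bottleneck model. This is a sketch under stated assumptions, not the lab's implementation: it assumes the stages overlap perfectly so the slowest one sets the supply rate, that pre-processing scales linearly with worker count, and that all rates share one unit (samples per second). The function names and the example numbers are hypothetical.

```python
import math
from typing import Optional


def pipeline_rate(storage_rate: float, network_rate: float,
                  per_worker_rate: float, workers: int) -> float:
    """Supply rate of the pipeline: the slowest stage sets the pace."""
    return min(storage_rate, network_rate, per_worker_rate * workers)


def gpu_idle_fraction(supply_rate: float, gpu_demand_rate: float) -> float:
    """Feeding Tax: share of GPU time spent waiting for data."""
    if gpu_demand_rate <= 0:
        raise ValueError("gpu_demand_rate must be positive")
    return max(0.0, 1.0 - supply_rate / gpu_demand_rate)


def workers_for_equilibrium(gpu_demand_rate: float, per_worker_rate: float,
                            max_workers: int = 32) -> Optional[int]:
    """Smallest worker count whose pre-processing rate meets GPU demand.

    Returns None if even max_workers (the slider's upper bound) falls short.
    """
    needed = math.ceil(gpu_demand_rate / per_worker_rate)
    return needed if needed <= max_workers else None


if __name__ == "__main__":
    demand = 5000.0  # hypothetical GPU consumption rate, samples/s
    supply = pipeline_rate(storage_rate=5000.0, network_rate=8000.0,
                           per_worker_rate=250.0, workers=4)
    print(f"supply {supply:.0f}/s, GPU idle {gpu_idle_fraction(supply, demand):.0%}")
    print("workers needed:", workers_for_equilibrium(demand, per_worker_rate=250.0))
```

In this hypothetical setup four workers leave the GPU mostly idle even though storage and network have headroom, which points at the "Fuel" side of the Part 2 reflection. Adding workers helps only until storage or network becomes the minimum.
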