cs249r_book/labs/plans/vol1/04_data_engr.md
2026-02-26 15:23:17 -05:00


📐 Mission Plan: 04_data_engr (Deep Analysis)

1. Chapter Context

  • Chapter Title: Data Engineering: Dataset Compilation.
  • Core Invariant: Data Gravity (T = D_{vol}/BW) and the Energy-Movement Invariant (E_{move} \gg E_{comp}).
  • The Struggle: Balancing the "Feeding Tax": ensuring the data pipeline can keep up with the GPU's consumption rate without blowing the energy budget.
  • Target Duration: 45 Minutes.
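The two invariants above can be made concrete with a short arithmetic sketch. The per-operation energy figures are commonly cited approximations (after Horowitz's ISSCC 2014 numbers for ~45 nm CMOS) and the link speed is an assumption for illustration, not a measurement:

```python
# Illustrative sketch of the chapter's two invariants. Energy figures are
# commonly cited approximations (~45 nm CMOS, after Horowitz, ISSCC 2014);
# they are assumptions for illustration, not measurements.

PJ_PER_MAC = 3.7          # ~FP32 multiply-accumulate, on-chip
PJ_PER_DRAM_WORD = 640.0  # ~32-bit off-chip DRAM access

# Energy-Movement Invariant: E_move >> E_comp.
ratio = PJ_PER_DRAM_WORD / PJ_PER_MAC
print(f"Moving one word from DRAM costs ~{ratio:.0f}x one MAC")

# Data Gravity: T = D_vol / BW.
def transfer_time_s(d_vol_bytes: float, bw_bytes_per_s: float) -> float:
    """Time to move a dataset over a link: T = D_vol / BW."""
    return d_vol_bytes / bw_bytes_per_s

PB = 1e15
ten_gbe = 10e9 / 8  # 10 Gb/s link, in bytes/s
days = transfer_time_s(PB, ten_gbe) / 86400
print(f"1 PB over 10 GbE: {days:.1f} days")
```

At petabyte scale the transfer time is measured in days, which is why bit-volume becomes a physical barrier long before it becomes a storage problem.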

2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data Gravity" |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving on a single H100. | The Feeding Tax. Disk I/O cannot keep up with HBM speeds. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop on NVIDIA Orin. | The Ingestion Choke. 8 raw 4K vision streams flood the bus. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation on Meta Ray-Bans. | Transmission Energy. Moving bits over Bluetooth drains glasses. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | SRAM Budget. Buffering audio consumes 50% of total memory. |

3. The 3-Part Mission (The KATs)

Part 1: The Data Gravity Audit (Exploration - 15 Mins)

  • Objective: Quantify the physical and economic cost of moving the mission's dataset.
  • The "Lock" (Prediction): "Will it be cheaper to stream your data over Fiber or ship a physical hard drive across the country?"
  • The Workbench:
    • Sliders: Dataset Size (10GB -> 10PB), Distance (km), Link Bandwidth (10G -> 100G).
    • Instruments: TransferTimeRadar, SneakernetCrossoverPlot (Time vs Distance).
    • The 5-Move Rule: Students must analyze 5 different scale tiers to identify the "Distance Invariant" where each path wins.
  • Reflect: "Reconcile the transfer time with the 'Physics of Data Gravity' from the text. When does bit-volume become a physical barrier?"
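A minimal sketch of the sneakernet crossover the workbench explores. The link speed, shipping time, and scale tiers below are hypothetical workbench parameters, not benchmarks:

```python
# Sneakernet crossover sketch: at what dataset size does shipping a drive
# beat streaming over fiber? All parameters are illustrative assumptions.

def stream_time_h(size_tb: float, link_gbps: float) -> float:
    """Hours to stream size_tb terabytes over a link_gbps Gb/s link."""
    bits = size_tb * 1e12 * 8
    return bits / (link_gbps * 1e9) / 3600

def sneakernet_time_h(shipping_h: float = 48.0) -> float:
    """Shipping time is roughly constant in volume (until the dataset
    no longer fits in one crate of drives)."""
    return shipping_h

# The 5-Move Rule: sweep five scale tiers to find where each path wins.
for size_tb in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    t_fiber = stream_time_h(size_tb, link_gbps=10)
    winner = "fiber" if t_fiber < sneakernet_time_h() else "sneakernet"
    print(f"{size_tb:8.1f} TB: stream {t_fiber:10.1f} h -> {winner}")
```

With these assumed numbers the crossover sits between 100 TB and 1 PB: below it, fiber wins; above it, the courier does.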

Part 2: The Feeding Tax Solver (Trade-off - 20 Mins)

  • Objective: Maximize the GPU's Model FLOPS Utilization (MFU) by optimizing the serialization pipeline.
  • The "Lock" (Prediction): "If you switch from JSON to Protobuf, will your GPU utilization increase more than if you upgrade to a faster SSD?"
  • The Workbench:
    • Sliders: Serialization Format (CSV, JSON, Parquet, Protobuf), Worker Count (1-32), Disk Type (HDD -> NVMe).
    • Instruments: FeedingTaxGauge (% GPU Idle), MFU_vs_Ingestion_Plot.
    • The 15-Iteration Rule: Students must find the exact "Flow Equilibrium" where the CPU's pre-processing rate matches the GPU's consumption rate.
  • Reflect: "Your GPU is 80% idle. Prove whether the bottleneck is in the 'Blueprint' (Algorithm) or the 'Fuel' (Data pipeline) using the MFU plot."
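The equilibrium the solver searches for can be sketched as a throughput-matching model. The decode rates, disk rate, and GPU consumption rate below are hypothetical slider values, chosen only to show the mechanism:

```python
# Feeding-tax sketch: the GPU idles whenever the input pipeline's delivery
# rate falls below its consumption rate. All rates are hypothetical
# workbench values, not benchmarks.

def pipeline_rate(decode_rate_per_worker: float, workers: int,
                  disk_rate: float) -> float:
    """Samples/s delivered to the GPU: bounded by disk I/O and by CPU
    decode (the serialization format sets decode_rate_per_worker)."""
    return min(disk_rate, decode_rate_per_worker * workers)

def gpu_idle_fraction(pipeline: float, gpu_consume_rate: float) -> float:
    """The 'Feeding Tax': fraction of time the GPU waits for data."""
    return max(0.0, 1.0 - pipeline / gpu_consume_rate)

GPU_RATE = 10_000  # samples/s at 100% MFU (assumed)

# JSON decodes slowly; Protobuf decodes fast, at which point the disk
# (not the CPU) may become the binding constraint.
for fmt, rate in [("json", 400), ("parquet", 1500), ("protobuf", 4000)]:
    feed = pipeline_rate(rate, workers=8, disk_rate=20_000)
    print(f"{fmt:9s}: GPU idle {gpu_idle_fraction(feed, GPU_RATE):5.1%}")
```

"Flow Equilibrium" is the point where `pipeline_rate` first meets `GPU_RATE`; adding workers or switching formats past that point buys nothing, because the bottleneck has moved.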

Part 3: The Zero-Waste Audit (Synthesis - 10 Mins)

  • Objective: Maximize 'Data Selection Gain' to hit accuracy targets within a carbon/energy budget.
  • The "Lock" (Prediction): "Is it more energy-efficient to use 1 million noisy samples or 10,000 curated 'Gold Standard' samples?"
  • The Workbench:
    • Sliders: Filtering Ratio (0-90%), Label Quality (Low -> Expert), Processing Location (Local vs Cloud).
    • The "Stakeholder" Challenge: The Sustainability Lead demands a 50% reduction in transmission energy. The student must use the Energy-Movement Invariant to propose an architectural change (e.g. local pre-processing).
  • Reflect (The Ledger): Justify your final Data/Compute energy ratio. Explain why "Signal-to-Noise Engineering" is more effective than raw scaling for this mission.
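The Lock's noisy-vs-curated question reduces to a per-sample energy ledger. The per-sample costs and epoch counts below are labeled assumptions that fold transmission and training energy into one number; they illustrate the accounting, not real measurements:

```python
# Zero-waste sketch: energy to hit an accuracy target with many noisy
# samples vs. few curated ones. Per-sample joule costs and epoch counts
# are illustrative assumptions (transmission + training folded together).

def mission_energy_j(n_samples: int, joules_per_sample: float,
                     epochs: int) -> float:
    """Total pipeline energy: every sample is moved and trained each epoch."""
    return n_samples * joules_per_sample * epochs

# Noisy web-scale corpus: cheap per sample, but label noise demands
# far more samples and more passes to average out.
noisy = mission_energy_j(1_000_000, joules_per_sample=0.05, epochs=3)

# Curated 'Gold Standard' set: expensive labels, tiny volume.
gold = mission_energy_j(10_000, joules_per_sample=0.50, epochs=10)

print(f"noisy: {noisy/1e3:.0f} kJ, curated: {gold/1e3:.0f} kJ "
      f"({noisy/gold:.0f}x)")
```

Under these assumptions the curated set wins by a wide margin, which is the "Signal-to-Noise Engineering" argument in miniature: selection gain shrinks the volume term that data gravity taxes.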

4. Visual Layout Specification

  • Primary: IngestionWaterfall (Storage BW vs. Network BW vs. Compute rate).
  • Secondary: EnergyRadar (MAC pJ vs. DRAM pJ vs. Network pJ).
  • Transparency: Toggle for Data Selection Gain \propto \frac{\text{Entropy}}{\text{Gravity}}.