cs249r_book/labs/plans/vol1/04_data_engr.md
2026-02-26 15:23:17 -05:00


📐 Mission Plan: 04_data_engr (Deep Analysis)

1. Chapter Context

  • Chapter Title: Data Engineering: Dataset Compilation.
  • Core Invariant: Data Gravity (T = D_{vol}/BW) and the Energy-Movement Invariant (E_{move} \gg E_{comp}).
  • The Struggle: Balancing the "Feeding Tax": ensuring the data pipeline can keep up with the GPU's consumption rate without blowing the energy budget.
  • Target Duration: 45 Minutes.
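The two invariants above can be made concrete with a short arithmetic sketch. The per-operation energy figures are commonly cited approximations (after Horowitz's ISSCC 2014 numbers for ~45 nm CMOS) and the link speed is an assumption for illustration, not a measurement:

```python
# Illustrative sketch of the chapter's two invariants. Energy figures are
# commonly cited approximations (~45 nm CMOS, after Horowitz, ISSCC 2014);
# they are assumptions for illustration, not measurements.

PJ_PER_MAC = 3.7          # ~FP32 multiply-accumulate, on-chip
PJ_PER_DRAM_WORD = 640.0  # ~32-bit off-chip DRAM access

# Energy-Movement Invariant: E_move >> E_comp.
ratio = PJ_PER_DRAM_WORD / PJ_PER_MAC
print(f"Moving one word from DRAM costs ~{ratio:.0f}x one MAC")

# Data Gravity: T = D_vol / BW.
def transfer_time_s(d_vol_bytes: float, bw_bytes_per_s: float) -> float:
    """Time to move a dataset over a link: T = D_vol / BW."""
    return d_vol_bytes / bw_bytes_per_s

PB = 1e15
ten_gbe = 10e9 / 8  # 10 Gb/s link, in bytes/s
days = transfer_time_s(PB, ten_gbe) / 86400
print(f"1 PB over 10 GbE: {days:.1f} days")
```

At petabyte scale the transfer time is measured in days, which is why bit-volume becomes a physical barrier long before it becomes a storage problem.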

2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data Gravity" |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving on a single H100. | The Feeding Tax. Disk I/O cannot keep up with HBM speeds. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop on NVIDIA Orin. | The Ingestion Choke. 8 raw 4K vision streams flood the bus. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation on Meta Ray-Bans. | Transmission Energy. Moving bits over Bluetooth drains glasses. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | SRAM Budget. Buffering audio consumes 50% of total memory. |

3. The 3-Part Mission (The KATs)

Part 1: The Data Gravity Audit (Exploration - 15 Mins)

  • Objective: Quantify the physical and economic cost of moving the mission's dataset.
  • The "Lock" (Prediction): "Will it be cheaper to stream your data over Fiber or ship a physical hard drive across the country?"
  • The Workbench:
    • Sliders: Dataset Size (10GB -> 10PB), Distance (km), Link Bandwidth (10G -> 100G).
    • Instruments: TransferTimeRadar, SneakernetCrossoverPlot (Time vs Distance).
    • The 5-Move Rule: Students must analyze 5 different scale tiers to identify the "Distance Invariant" where each path wins.
  • Reflect: "Reconcile the transfer time with the 'Physics of Data Gravity' from the text. When does bit-volume become a physical barrier?"
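A minimal sketch of the sneakernet crossover the workbench explores. The link speed, shipping time, and scale tiers below are hypothetical workbench parameters, not benchmarks:

```python
# Sneakernet crossover sketch: at what dataset size does shipping a drive
# beat streaming over fiber? All parameters are illustrative assumptions.

def stream_time_h(size_tb: float, link_gbps: float) -> float:
    """Hours to stream size_tb terabytes over a link_gbps Gb/s link."""
    bits = size_tb * 1e12 * 8
    return bits / (link_gbps * 1e9) / 3600

def sneakernet_time_h(shipping_h: float = 48.0) -> float:
    """Shipping time is roughly constant in volume (until the dataset
    no longer fits in one crate of drives)."""
    return shipping_h

# The 5-Move Rule: sweep five scale tiers to find where each path wins.
for size_tb in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    t_fiber = stream_time_h(size_tb, link_gbps=10)
    winner = "fiber" if t_fiber < sneakernet_time_h() else "sneakernet"
    print(f"{size_tb:8.1f} TB: stream {t_fiber:10.1f} h -> {winner}")
```

With these assumed numbers the crossover sits between 100 TB and 1 PB: below it, fiber wins; above it, the courier does.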

Part 2: The Feeding Tax Solver (Trade-off - 20 Mins)

  • Objective: Maximize the GPU's Model FLOPS Utilization (MFU) by optimizing the serialization pipeline.
  • The "Lock" (Prediction): "If you switch from JSON to Protobuf, will your GPU utilization increase more than if you upgrade to a faster SSD?"
  • The Workbench:
    • Sliders: Serialization Format (CSV, JSON, Parquet, Protobuf), Worker Count (1-32), Disk Type (HDD -> NVMe).
    • Instruments: FeedingTaxGauge (% GPU Idle), MFU_vs_Ingestion_Plot.
    • The 15-Iteration Rule: Students must find the exact "Flow Equilibrium" where the CPU's pre-processing rate matches the GPU's consumption rate.
  • Reflect: "Your GPU is 80% idle. Prove whether the bottleneck is in the 'Blueprint' (Algorithm) or the 'Fuel' (Data pipeline) using the MFU plot."
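The equilibrium the solver searches for can be sketched as a throughput-matching model. The decode rates, disk rate, and GPU consumption rate below are hypothetical slider values, chosen only to show the mechanism:

```python
# Feeding-tax sketch: the GPU idles whenever the input pipeline's delivery
# rate falls below its consumption rate. All rates are hypothetical
# workbench values, not benchmarks.

def pipeline_rate(decode_rate_per_worker: float, workers: int,
                  disk_rate: float) -> float:
    """Samples/s delivered to the GPU: bounded by disk I/O and by CPU
    decode (the serialization format sets decode_rate_per_worker)."""
    return min(disk_rate, decode_rate_per_worker * workers)

def gpu_idle_fraction(pipeline: float, gpu_consume_rate: float) -> float:
    """The 'Feeding Tax': fraction of time the GPU waits for data."""
    return max(0.0, 1.0 - pipeline / gpu_consume_rate)

GPU_RATE = 10_000  # samples/s at 100% MFU (assumed)

# JSON decodes slowly; Protobuf decodes fast, at which point the disk
# (not the CPU) may become the binding constraint.
for fmt, rate in [("json", 400), ("parquet", 1500), ("protobuf", 4000)]:
    feed = pipeline_rate(rate, workers=8, disk_rate=20_000)
    print(f"{fmt:9s}: GPU idle {gpu_idle_fraction(feed, GPU_RATE):5.1%}")
```

"Flow Equilibrium" is the point where `pipeline_rate` first meets `GPU_RATE`; adding workers or switching formats past that point buys nothing, because the bottleneck has moved.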

Part 3: The Zero-Waste Audit (Synthesis - 10 Mins)

  • Objective: Maximize 'Data Selection Gain' to hit accuracy targets within a carbon/energy budget.
  • The "Lock" (Prediction): "Is it more energy-efficient to use 1 million noisy samples or 10,000 curated 'Gold Standard' samples?"
  • The Workbench:
    • Sliders: Filtering Ratio (0-90%), Label Quality (Low -> Expert), Processing Location (Local vs Cloud).
    • The "Stakeholder" Challenge: The Sustainability Lead demands a 50% reduction in transmission energy. The student must use the Energy-Movement Invariant to propose an architectural change (e.g. local pre-processing).
  • Reflect (The Ledger): Justify your final Data/Compute energy ratio. Explain why "Signal-to-Noise Engineering" is more effective than raw scaling for this mission.
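The Lock's noisy-vs-curated question reduces to a per-sample energy ledger. The per-sample costs and epoch counts below are labeled assumptions that fold transmission and training energy into one number; they illustrate the accounting, not real measurements:

```python
# Zero-waste sketch: energy to hit an accuracy target with many noisy
# samples vs. few curated ones. Per-sample joule costs and epoch counts
# are illustrative assumptions (transmission + training folded together).

def mission_energy_j(n_samples: int, joules_per_sample: float,
                     epochs: int) -> float:
    """Total pipeline energy: every sample is moved and trained each epoch."""
    return n_samples * joules_per_sample * epochs

# Noisy web-scale corpus: cheap per sample, but label noise demands
# far more samples and more passes to average out.
noisy = mission_energy_j(1_000_000, joules_per_sample=0.05, epochs=3)

# Curated 'Gold Standard' set: expensive labels, tiny volume.
gold = mission_energy_j(10_000, joules_per_sample=0.50, epochs=10)

print(f"noisy: {noisy/1e3:.0f} kJ, curated: {gold/1e3:.0f} kJ "
      f"({noisy/gold:.0f}x)")
```

Under these assumptions the curated set wins by a wide margin, which is the "Signal-to-Noise Engineering" argument in miniature: selection gain shrinks the volume term that data gravity taxes.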

4. Visual Layout Specification

  • Primary: IngestionWaterfall (Storage BW vs. Network BW vs. Compute rate).
  • Secondary: EnergyRadar (MAC pJ vs. DRAM pJ vs. Network pJ).
  • Transparency: Toggle for Data Selection Gain \propto \frac{\text{Entropy}}{\text{Gravity}}.