mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-02 18:50:17 -05:00
# 📐 Mission Plan: 04_data_engr (Deep Analysis)

## 1. Chapter Context
* **Chapter Title:** Data Engineering: Dataset Compilation.
* **Core Invariant:** Data Gravity ($T = D_{vol}/BW$) and the Energy-Movement Invariant ($E_{move} \gg E_{comp}$).
* **The Struggle:** Balancing the "Feeding Tax": keeping the data pipeline in step with the GPU's consumption rate without blowing the energy budget.
* **Target Duration:** 45 Minutes.
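The two invariants can be made concrete with a little arithmetic. The sketch below is only illustrative: the dataset size, link speed, and per-operation energies are assumptions (the energies are order-of-magnitude figures in line with widely quoted 45 nm estimates), not values taken from the chapter.

```python
# Back-of-the-envelope numbers for the two invariants.
# Every input below is an illustrative assumption.

def transfer_time_s(d_vol_bytes: float, bw_bits_per_s: float) -> float:
    """Data Gravity: T = D_vol / BW, with D_vol converted from bytes to bits."""
    return d_vol_bytes * 8 / bw_bits_per_s


# Assumed dataset: 1 PB (10^15 bytes) over a fully utilized 10 Gb/s link.
t = transfer_time_s(1e15, 10e9)
print(f"1 PB over 10 Gb/s: {t / 86_400:.1f} days")

# Assumed per-operation energies in picojoules (order of magnitude only).
E_DRAM_READ_32BIT_PJ = 640.0  # moving one 32-bit word in from DRAM
E_FP32_MULT_PJ = 3.7          # one 32-bit floating-point multiply

ratio = E_DRAM_READ_32BIT_PJ / E_FP32_MULT_PJ
print(f"E_move / E_comp is roughly {ratio:.0f}x")
```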
---

## 2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data Gravity" |
| :--- | :--- | :--- | :--- |
| **Cloud Titan** | LLM Architect | Maximize Llama-3-70B serving on a single H100. | **The Feeding Tax.** Disk I/O cannot keep up with HBM speeds. |
| **Edge Guardian** | AV Systems Lead | Deterministic 10ms safety loop on NVIDIA Orin. | **The Ingestion Choke.** 8 raw 4K vision streams flood the bus. |
| **Mobile Nomad** | AR Glasses Dev | 60FPS AR translation on Meta Ray-Bans. | **Transmission Energy.** Moving bits over Bluetooth drains glasses. |
| **Tiny Pioneer** | Hearable Lead | Neural isolation in <10ms under 1mW. | **SRAM Budget.** Buffering audio consumes 50% of total memory. |
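To get a feel for the Edge Guardian's "Ingestion Choke", the raw bandwidth of the camera streams can be estimated directly. This is a rough sketch: the table only specifies 8 raw 4K streams, so the frame rate (30 FPS) and pixel depth (3 bytes per pixel) below are assumptions, and `raw_stream_bandwidth_gbps` is a hypothetical helper name.

```python
# Raw (uncompressed) bandwidth of N camera streams.
# fps and bytes_per_pixel are assumptions; the plan does not specify them.

def raw_stream_bandwidth_gbps(n_streams: int, width: int, height: int,
                              bytes_per_pixel: int, fps: int) -> float:
    """Aggregate bandwidth in gigabits per second."""
    bytes_per_s = n_streams * width * height * bytes_per_pixel * fps
    return bytes_per_s * 8 / 1e9


bw_gbps = raw_stream_bandwidth_gbps(8, 3840, 2160, bytes_per_pixel=3, fps=30)
print(f"8 x 4K raw streams: {bw_gbps:.1f} Gb/s")

# Data arriving during one 10 ms safety-loop period, in megabytes.
mb_per_loop = bw_gbps * 1e9 / 8 * 0.010 / 1e6
print(f"per 10 ms loop: {mb_per_loop:.1f} MB")
```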
---

## 3. The 3-Part Mission (The KATs)

### Part 1: The Data Gravity Audit (Exploration - 15 Mins)
* **Objective:** Quantify the physical and economic cost of moving the mission's dataset.
* **The "Lock" (Prediction):** "Will it be cheaper to stream your data over fiber or ship a physical hard drive across the country?"
* **The Workbench:**
    * **Sliders:** Dataset Size (10 GB to 10 PB), Distance (km), Link Bandwidth (10 Gb/s to 100 Gb/s).
    * **Instruments:** `TransferTimeRadar`, `SneakernetCrossoverPlot` (Time vs Distance).
* **The 5-Move Rule:** Students must analyze 5 different scale tiers to identify the "Distance Invariant": the regime in which each path (network link or physical shipment) wins.
* **Reflect:** "Reconcile the transfer time with the 'Physics of Data Gravity' from the text. When does bit-volume become a physical barrier?"
### Part 2: The Feeding Tax Solver (Trade-off - 20 Mins)
* **Objective:** Maximize GPU Model FLOPs Utilization (MFU) by optimizing the serialization pipeline.
* **The "Lock" (Prediction):** "If you switch from JSON to Protobuf, will your GPU utilization increase more than if you upgrade to a faster SSD?"
* **The Workbench:**
    * **Sliders:** Serialization Format (CSV, JSON, Parquet, Protobuf), Worker Count (1 to 32), Disk Type (HDD to NVMe).
    * **Instruments:** `FeedingTaxGauge` (% GPU Idle), `MFU_vs_Ingestion_Plot`.
* **The 15-Iteration Rule:** Students must find the exact "Flow Equilibrium" where the CPU's pre-processing rate matches the GPU's consumption rate.
* **Reflect:** "Your GPU is 80% idle. Prove whether the bottleneck is in the 'Blueprint' (Algorithm) or the 'Fuel' (Data pipeline) using the MFU plot."
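One way to reason about the "Flow Equilibrium" is to treat the pipeline as the slower of two stages (disk read and CPU deserialization) feeding a GPU with a fixed consumption rate. The sketch below is a simplified model, not the lab's actual `FeedingTaxGauge` logic. The `Pipeline` class, the helper names, and every throughput figure are hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Pipeline:
    disk_mb_s: float          # read bandwidth of the storage tier
    sample_mb: float          # on-disk size of one sample in the chosen format
    decode_per_worker: float  # samples/s one CPU worker can deserialize
    workers: int

    def supply_rate(self) -> float:
        """Samples/s delivered to the GPU: the slower of disk and CPU stages."""
        disk_rate = self.disk_mb_s / self.sample_mb
        cpu_rate = self.decode_per_worker * self.workers
        return min(disk_rate, cpu_rate)


def gpu_idle_fraction(p: Pipeline, gpu_rate: float) -> float:
    """Fraction of time the GPU waits for data (the 'Feeding Tax')."""
    return max(0.0, 1.0 - p.supply_rate() / gpu_rate)


def equilibrium_workers(p: Pipeline, gpu_rate: float):
    """Smallest worker count that matches gpu_rate, or None if disk-bound."""
    if p.disk_mb_s / p.sample_mb < gpu_rate:
        return None
    return math.ceil(gpu_rate / p.decode_per_worker)


GPU_RATE = 2000.0  # assumed samples/s the GPU can consume

# Hypothetical configurations: a text format vs. a compact binary format.
json_nvme = Pipeline(disk_mb_s=3000, sample_mb=0.5, decode_per_worker=50, workers=8)
binary_nvme = Pipeline(disk_mb_s=3000, sample_mb=0.2, decode_per_worker=250, workers=8)
json_hdd = Pipeline(disk_mb_s=150, sample_mb=0.5, decode_per_worker=50, workers=8)

for name, p in [("JSON/NVMe", json_nvme), ("binary/NVMe", binary_nvme),
                ("JSON/HDD", json_hdd)]:
    idle = gpu_idle_fraction(p, GPU_RATE)
    print(f"{name:>11}: idle {idle:.0%}, equilibrium workers: "
          f"{equilibrium_workers(p, GPU_RATE)}")
```

With these made-up numbers the JSON pipeline on NVMe is CPU-bound, so a faster disk would not help, while the cheaper-to-decode format reaches equilibrium at the same worker count. Real formats and hardware will shift the numbers, which is exactly what the sliders are for.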
### Part 3: The Zero-Waste Audit (Synthesis - 10 Mins)
* **Objective:** Maximize "Data Selection Gain" to hit accuracy targets within a carbon/energy budget.
* **The "Lock" (Prediction):** "Is it more energy-efficient to use 1 million noisy samples or 10,000 curated 'Gold Standard' samples?"
* **The Workbench:**
    * **Sliders:** Filtering Ratio (0% to 90%), Label Quality (Low to Expert), Processing Location (Local vs Cloud).
* **The "Stakeholder" Challenge:** The **Sustainability Lead** demands a 50% reduction in transmission energy. The student must use the **Energy-Movement Invariant** to propose an architectural change (e.g., local pre-processing).
* **Reflect (The Ledger):** Justify your final Data/Compute energy ratio. Explain why "Signal-to-Noise Engineering" is more effective than raw scaling for this mission.
---

## 4. Visual Layout Specification
* **Primary:** `IngestionWaterfall` (Storage BW vs. Network BW vs. Compute rate).
* **Secondary:** `EnergyRadar` (MAC pJ vs. DRAM pJ vs. Network pJ).
* **Transparency:** Toggle for $\text{Data Selection Gain} \propto \frac{\text{Entropy}}{\text{Gravity}}$.