# 📐 Mission Plan: 09_data_selection (Data Selection) ## 1. Chapter Context * **Chapter Title:** Data Selection: Signal-to-Noise Engineering. * **Core Invariant:** The Data Quality Multiplier ($N_{noisy} \propto 1/\epsilon^2$ vs $N_{clean} \propto 1/\epsilon$). * **The Struggle:** Understanding that "more data" is not always better. Students must navigate the **Data Wall**—the point where compute abundance meets high-quality data exhaustion—and learn to maximize the **Information-Compute Ratio (ICR)**. * **Target Duration:** 45 Minutes. --- ## 2. The 4-Track Storyboard | Track | Persona | Fixed North Star Mission | The "Data" Crisis | | :--- | :--- | :--- | :--- | | **Cloud Titan** | LLM Architect | Maximize Llama-3-70B serving. | **The Deduplication Tax.** Your web-scraped corpus is 50% redundant. You are wasting $5M in GPU hours training on identical tokens. | | **Edge Guardian** | AV Systems Lead | Deterministic 10ms safety loop. | **The Hard-Negative Crisis.** The model keeps missing 'statue' edge cases. You have 1PB of raw video but only a $50k labeling budget. | | **Mobile Nomad** | AR Glasses Dev | 60FPS AR translation. | **The Noise Penalty.** Your training data has 5% label noise, requiring 10x more training steps to converge, which exceeds your project deadline. | | **Tiny Pioneer** | Hearable Lead | Neural isolation in <10ms under 1mW. | **The Synthetic Bridge.** You have only 500 real samples. You must use synthetic augmentation without creating a 'Domain Gap' that kills field accuracy. | --- ## 3. The 3-Part Mission (The KATs) ### Part 1: The Deduplication Audit (Exploration - 15 Mins) * **Objective:** Quantify the speedup of 'Static Pruning' (removing redundant data) on total training time. * **The "Lock" (Prediction):** "If you remove 30% of the most redundant samples using LSH/MinHash, what is the expected reduction in total training FLOPs?" * **The Workbench:** * **Action:** Adjust the **Deduplication Threshold** (MinHash Similarity). * **Observation:** The **ICR Curve (Information-Compute Ratio)**. Watch the learning signal per compute unit rise as redundant mass is removed. * **Reflect:** "Why does training on duplicate data decrease the efficiency ($\eta$) of your training system? Reconcile this with the Iron Law." ### Part 2: Active Learning ROI (Trade-off - 15 Mins) * **Objective:** Optimize the labeling budget using Uncertainty Sampling. * **The "Lock" (Prediction):** "Will uncertainty sampling reach 90% accuracy with more or fewer samples than random sampling?" * **The Workbench:** * **Action:** Toggle between **Random Sampling** and **Active Learning**. Adjust the 'Selection Batch Size'. * **Observation:** **Accuracy vs. Labeling Cost ($) Plot**. A Pareto frontier showing the ROI of expert labels. * **The 10-Iteration Rule:** Students must find the exact 'Knee of the Curve' where the cost of running the active-learning model exceeds the savings in labeling fees. * **Reflect:** "Jeff Dean asks: 'Is the CPU cost of indexing the entire 1PB dataset higher than the GPU savings from training on fewer samples?' Prove your answer using the dashboard." ### Part 3: The Domain Gap Synthesis (Synthesis - 15 Mins) * **Objective:** Balance Synthetic and Real data to maximize generalization. * **The "Lock" (Prediction):** "What happens to your 'Safety Metric' if you move from 10% Synthetic data to 90% Synthetic data?" * **The Workbench:** * **Interaction:** **Data Mix Ratio Slider** (Synthetic vs. Real). **Domain Randomization Intensity**. * **The "Stakeholder" Challenge:** The **Safety Lead** warns that the synthetic simulator doesn't model 'Rain' correctly. You must find a mix that hits the accuracy target while maintaining a 'FID Score' (Domain Gap) below the safety threshold. * **Reflect (The Ledger):** "Defend your final Data Acquisition Strategy. Did you prioritize 'Quantity' (Synthetic) or 'Quality' (Expert-Labeled Real)? Explain how the $1/\epsilon^2$ noise penalty influenced your choice." --- ## 4. Visual Layout Specification * **Primary:** `ICR_Curve` (Learning Progress vs. Compute FLOPs). * **Secondary:** `LabelingROIPlot` (Accuracy vs. Total Project Cost). * **Math Peek:** Toggle for the `Data Quality Multiplier` and `MinHash Probability` formulas.