cs249r_book/labs/plans/vol1/lab_09_data_selection.md
Vijay Janapa Reddi 533cfa6e99 fix: pre-commit hooks — all 48 checks now pass
2026-03-01 17:30:24 -05:00


📐 Mission Plan: 09_data_selection (Data Selection)

1. Chapter Context

  • Chapter Title: Data Selection: Signal-to-Noise Engineering.
  • Core Invariant: The Data Quality Multiplier ($N_{noisy} \propto 1/\epsilon^2$ vs. $N_{clean} \propto 1/\epsilon$).
  • The Struggle: Understanding that "more data" is not always better. Students must navigate the Data Wall—the point where compute abundance meets high-quality data exhaustion—and learn to maximize the Information-Compute Ratio (ICR).
  • Target Duration: 45 Minutes.
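
To make the Core Invariant concrete, the sketch below compares the two scaling relations for a few target errors. It assumes both proportionality constants equal 1 (the plan does not state them), so only the ratio between the two sample counts is meaningful. The function names `samples_needed` and `quality_multiplier` are illustrative, not part of the lab code.

```python
def samples_needed(epsilon: float, noisy: bool, c: float = 1.0) -> float:
    """Samples required to reach error epsilon.

    Noisy data scales as c / epsilon^2, clean data as c / epsilon.
    The constant c is an assumption (set to 1 for illustration).
    """
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    return c / epsilon**2 if noisy else c / epsilon


def quality_multiplier(epsilon: float) -> float:
    """Ratio of noisy to clean sample requirements; equals 1 / epsilon when c matches."""
    return samples_needed(epsilon, noisy=True) / samples_needed(epsilon, noisy=False)


for eps in (0.1, 0.05, 0.01):
    print(f"epsilon={eps}: noisy data needs {quality_multiplier(eps):.0f}x more samples")
```

Under these assumptions the penalty for noisy data grows as the error target tightens, which is the tension the four tracks below are built around.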

2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data" Crisis |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The Deduplication Tax. Your web-scraped corpus is 50% redundant. You are wasting $5M in GPU hours training on identical tokens. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Hard-Negative Crisis. The model keeps missing 'statue' edge cases. You have 1PB of raw video but only a $50k labeling budget. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The Noise Penalty. Your training data has 5% label noise, requiring 10x more training steps to converge, which exceeds your project deadline. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Synthetic Bridge. You have only 500 real samples. You must use synthetic augmentation without creating a 'Domain Gap' that kills field accuracy. |

3. The 3-Part Mission (The KATs)

Part 1: The Deduplication Audit (Exploration - 15 Mins)

  • Objective: Quantify the speedup of 'Static Pruning' (removing redundant data) on total training time.
  • The "Lock" (Prediction): "If you remove 30% of the most redundant samples using LSH/MinHash, what is the expected reduction in total training FLOPs?"
  • The Workbench:
    • Action: Adjust the Deduplication Threshold (MinHash Similarity).
    • Observation: The ICR Curve (Information-Compute Ratio). Watch the learning signal per compute unit rise as redundant mass is removed.
  • Reflect: "Why does training on duplicate data decrease the efficiency ($\eta$) of your training system? Reconcile this with the Iron Law."
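
The deduplication threshold in the Workbench can be illustrated with a minimal MinHash sketch. Everything here is an assumption made for illustration: the 3-word shingles, the 64 salted `blake2b` hashes, the 0.8 threshold, and the helper names. It compares every document against every kept document, which is quadratic; a production pipeline would add LSH banding to avoid that.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value under each of num_hashes salted hash functions."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of documents kept after greedy near-duplicate removal."""
    kept, kept_sigs = [], []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc))
        if all(estimated_similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(idx)
            kept_sigs.append(sig)
    return kept


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely different sentence about gpu training costs",
]
kept = deduplicate(corpus, threshold=0.8)
print(f"kept {len(kept)} of {len(corpus)} documents")
```

If training FLOPs are assumed to scale linearly with the number of tokens processed over a fixed number of epochs, the fraction of tokens removed translates directly into the fraction of FLOPs saved, which gives students a baseline to check their prediction against.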

Part 2: Active Learning ROI (Trade-off - 15 Mins)

  • Objective: Optimize the labeling budget using Uncertainty Sampling.
  • The "Lock" (Prediction): "Will uncertainty sampling reach 90% accuracy with more or fewer samples than random sampling?"
  • The Workbench:
    • Action: Toggle between Random Sampling and Active Learning. Adjust the 'Selection Batch Size'.
    • Observation: Accuracy vs. Labeling Cost ($) Plot. A Pareto frontier showing the ROI of expert labels.
    • The 10-Iteration Rule: Students must find the exact 'Knee of the Curve' where the cost of running the active-learning model exceeds the savings in labeling fees.
  • Reflect: "Jeff Dean asks: 'Is the CPU cost of indexing the entire 1PB dataset higher than the GPU savings from training on fewer samples?' Prove your answer using the dashboard."
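
A rough sketch of the two quantities Part 2 trades off: which samples uncertainty sampling would pick, and whether the labeling savings outweigh the compute spent on selection. It uses predictive entropy as the uncertainty score, which is one common choice and not necessarily the one the dashboard implements. All dollar figures and label counts are hypothetical placeholders.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs: list[list[float]], batch_size: int) -> list[int]:
    """Return indices of the batch_size pool samples with the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]


def net_savings(labels_random: int, labels_active: int, cost_per_label: float,
                rounds: int, compute_cost_per_round: float) -> float:
    """Labeling fees saved minus the compute spent scoring the pool each round."""
    label_savings = (labels_random - labels_active) * cost_per_label
    selection_cost = rounds * compute_cost_per_round
    return label_savings - selection_cost


# Hypothetical model outputs for three unlabeled samples.
pool = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2]]
print(select_most_uncertain(pool, batch_size=2))

# Hypothetical budget numbers: a negative result means selection cost more than it saved.
print(net_savings(labels_random=10_000, labels_active=4_000, cost_per_label=5.0,
                  rounds=10, compute_cost_per_round=1_000.0))
```

The 'Knee of the Curve' corresponds to the point where `net_savings` crosses zero as the number of selection rounds, or the per-round scoring cost, increases.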

Part 3: The Domain Gap Synthesis (Synthesis - 15 Mins)

  • Objective: Balance Synthetic and Real data to maximize generalization.
  • The "Lock" (Prediction): "What happens to your 'Safety Metric' if you move from 10% Synthetic data to 90% Synthetic data?"
  • The Workbench:
    • Interaction: Data Mix Ratio Slider (Synthetic vs. Real). Domain Randomization Intensity.
    • The "Stakeholder" Challenge: The Safety Lead warns that the synthetic simulator doesn't model 'Rain' correctly. You must find a mix that hits the accuracy target while maintaining a 'FID Score' (Domain Gap) below the safety threshold.
  • Reflect (The Ledger): "Defend your final Data Acquisition Strategy. Did you prioritize 'Quantity' (Synthetic) or 'Quality' (Expert-Labeled Real)? Explain how the $1/\epsilon^2$ noise penalty influenced your choice."
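
The mix-ratio slider can be approximated with a deliberately simplified domain-gap measure. Real FID compares multivariate Gaussians fitted to Inception-network features; the sketch below substitutes a single scalar feature, for which the Fréchet distance between two fitted Gaussians reduces to $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$. The synthetic data is simulated as a shifted copy of the real distribution (a stand-in for a simulator that mis-models some condition such as rain). The shift of 2.0, the sample counts, and the helper names are all assumptions.

```python
import random
from statistics import fmean, pstdev


def frechet_distance_1d(a: list[float], b: list[float]) -> float:
    """Fréchet distance between 1-D Gaussians fitted to samples a and b."""
    return (fmean(a) - fmean(b)) ** 2 + (pstdev(a) - pstdev(b)) ** 2


def mixed_features(real: list[float], synthetic: list[float],
                   synthetic_fraction: float, total: int) -> list[float]:
    """Build a training set of `total` samples with the given synthetic share."""
    n_syn = round(total * synthetic_fraction)
    return real[:total - n_syn] + synthetic[:n_syn]


rng = random.Random(0)  # fixed seed so the sweep is repeatable
real_train = [rng.gauss(0.0, 1.0) for _ in range(1000)]
real_holdout = [rng.gauss(0.0, 1.0) for _ in range(1000)]
synthetic = [rng.gauss(2.0, 1.0) for _ in range(1000)]  # assumed simulator bias of +2.0

gaps = {}
for fraction in (0.1, 0.5, 0.9):
    train = mixed_features(real_train, synthetic, fraction, total=1000)
    gaps[fraction] = frechet_distance_1d(train, real_holdout)
    print(f"synthetic share {fraction:.0%}: domain gap {gaps[fraction]:.3f}")
```

In this toy setup the gap widens as the synthetic share grows, which is the behavior the Safety Lead's threshold is meant to catch; the real dashboard metric may respond differently.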

4. Visual Layout Specification

  • Primary: ICR_Curve (Learning Progress vs. Compute FLOPs).
  • Secondary: LabelingROIPlot (Accuracy vs. Total Project Cost).
  • Math Peek: Toggle for the Data Quality Multiplier and MinHash Probability formulas.
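
The plan does not spell out the two Math Peek formulas. One plausible rendering, assuming matching proportionality constants for the multiplier and a uniformly random hash function $h$ for MinHash, is:

$$
\frac{N_{noisy}}{N_{clean}} \propto \frac{1/\epsilon^2}{1/\epsilon} = \frac{1}{\epsilon}
\qquad\qquad
\Pr\left[h_{\min}(A) = h_{\min}(B)\right] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

Here $A$ and $B$ are the shingle sets of two documents and $J$ is their Jaccard similarity, so the fraction of matching MinHash values across many hash functions estimates how redundant the pair is.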