mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-04-30 01:29:07 -05:00

Vijay Janapa Reddi 533cfa6e99 fix: pre-commit hooks — all 48 checks now pass

- book/quarto/mlsys/__init__.py: add repo-root sys.path injection so
  mlsysim is importable when scripts run from book/quarto/ context
- book/quarto/mlsys/{constants,formulas,formatting,hardware}.py: new
  compatibility shims that re-export from mlsysim.core.* and mlsysim.fmt
- mlsysim/viz/__init__.py: remove try/except for dashboard import; use
  explicit "import from mlsysim.viz.dashboard" pattern instead
- .codespell-ignore-words.txt: add "covert" (legitimate security term)
- book/tools/scripts/reference_check_log.txt: delete generated artifact
- Various QMD, bib, md files: auto-formatted by pre-commit hooks
  (trailing whitespace, bibtex-tidy, pipe table alignment)

2026-03-01 17:30:24 -05:00


📐 Mission Plan: 09_data_selection (Data Selection)

1. Chapter Context

Chapter Title: Data Selection: Signal-to-Noise Engineering.
Core Invariant: The Data Quality Multiplier ($N_{noisy} \propto 1/\epsilon^2$ vs. $N_{clean} \propto 1/\epsilon$).
The Struggle: Understanding that "more data" is not always better. Students must navigate the Data Wall—the point where compute abundance meets high-quality data exhaustion—and learn to maximize the Information-Compute Ratio (ICR).
Target Duration: 45 Minutes.
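
To make the Core Invariant concrete, here is a minimal sketch of the two scaling laws. The proportionality constant `c` and the function names are hypothetical (the plan gives only the proportionalities), so the absolute sample counts are illustrative; only the ratio between the noisy and clean cases carries meaning.

```python
import math


def samples_needed(epsilon: float, noisy: bool, c: float = 1.0) -> float:
    """Samples required to reach target error epsilon under the stated scaling.

    c is a hypothetical, task-dependent constant that the plan does not specify.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must be in (0, 1)")
    exponent = 2 if noisy else 1  # noisy: 1/eps^2, clean: 1/eps
    return c / epsilon ** exponent


def quality_multiplier(epsilon: float) -> float:
    """Ratio N_noisy / N_clean, assuming the same constant c for both regimes."""
    return samples_needed(epsilon, noisy=True) / samples_needed(epsilon, noisy=False)


for eps in (0.1, 0.05, 0.01):
    print(f"epsilon={eps}: noisy data needs ~{quality_multiplier(eps):.0f}x more samples")
```

Under these assumptions the multiplier is simply 1/ε, so tightening the error target makes noisy data progressively more expensive relative to clean data.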

2. The 4-Track Storyboard

| Track | Persona | Fixed North Star Mission | The "Data" Crisis |
|---|---|---|---|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The Deduplication Tax. Your web-scraped corpus is 50% redundant. You are wasting $5M in GPU hours training on identical tokens. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Hard-Negative Crisis. The model keeps missing 'statue' edge cases. You have 1PB of raw video but only a $50k labeling budget. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The Noise Penalty. Your training data has 5% label noise, requiring 10x more training steps to converge, which exceeds your project deadline. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Synthetic Bridge. You have only 500 real samples. You must use synthetic augmentation without creating a 'Domain Gap' that kills field accuracy. |

3. The 3-Part Mission (The KATs)

Part 1: The Deduplication Audit (Exploration - 15 Mins)

Objective: Quantify the speedup of 'Static Pruning' (removing redundant data) on total training time.
The "Lock" (Prediction): "If you remove 30% of the most redundant samples using LSH/MinHash, what is the expected reduction in total training FLOPs?"
The Workbench:
- Action: Adjust the Deduplication Threshold (MinHash Similarity).
- Observation: The ICR Curve (Information-Compute Ratio). Watch the learning signal per compute unit rise as redundant mass is removed.
Reflect: "Why does training on duplicate data decrease the efficiency ($\eta$) of your training system? Reconcile this with the Iron Law."
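
The thresholded MinHash deduplication in this part can be sketched in plain Python. This is an illustration under stated assumptions, not the lab's implementation: the word-level 3-shingles, the 128 hash functions, the salted BLAKE2b hashing, and every function name are choices made here, and the greedy all-pairs comparison stands in for the LSH banding a real pipeline would use at scale.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-shingles of a document (k=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, token: str) -> int:
    """One member of a hash family, selected by salting BLAKE2b with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """Minimum hash value of the set under each of num_perm hash functions."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is below the threshold against all kept ones.

    Greedy O(n^2) comparison, fine for a demo; not how a web-scale corpus is handled.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "gradient descent updates the weights using the loss gradient each step",
]
kept = deduplicate(docs, threshold=0.8)
print(f"kept {len(kept)} of {len(docs)} documents ({len(kept) / len(docs):.0%})")
```

Lowering `threshold` removes more near-duplicates at the risk of discarding distinct documents. If training cost is assumed to scale linearly with the number of tokens processed, the kept fraction is also the fraction of per-epoch compute that remains.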

Part 2: Active Learning ROI (Trade-off - 15 Mins)

Objective: Optimize the labeling budget using Uncertainty Sampling.
The "Lock" (Prediction): "Will uncertainty sampling reach 90% accuracy with more or fewer samples than random sampling?"
The Workbench:
- Action: Toggle between Random Sampling and Active Learning. Adjust the 'Selection Batch Size'.
- Observation: Accuracy vs. Labeling Cost ($) Plot. A Pareto frontier showing the ROI of expert labels.
- The 10-Iteration Rule: Students must find the exact 'Knee of the Curve' where the cost of running the active-learning model exceeds the savings in labeling fees.
Reflect: "Jeff Dean asks: 'Is the CPU cost of indexing the entire 1PB dataset higher than the GPU savings from training on fewer samples?' Prove your answer using the dashboard."
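
A minimal sketch of the selection step behind Uncertainty Sampling, plus the break-even arithmetic the 'Knee of the Curve' asks about. Entropy is one common uncertainty score and is an assumption here (the plan does not say which score the dashboard uses); the sample ids, probabilities, and dollar figures are hypothetical.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain(pool_probs: dict[str, list[float]], batch_size: int) -> list[str]:
    """Ids of the batch_size unlabeled samples the model is least sure about."""
    ranked = sorted(pool_probs, key=lambda sid: entropy(pool_probs[sid]), reverse=True)
    return ranked[:batch_size]


def net_savings(labels_saved: int, cost_per_label: float,
                rounds: int, cost_per_round: float) -> float:
    """Labeling fees avoided minus compute spent scoring the pool each round."""
    return labels_saved * cost_per_label - rounds * cost_per_round


# Hypothetical model outputs for four unlabeled clips (three classes each).
pool = {
    "clip_a": [0.98, 0.01, 0.01],
    "clip_b": [0.40, 0.35, 0.25],
    "clip_c": [0.70, 0.20, 0.10],
    "clip_d": [0.34, 0.33, 0.33],
}
print(select_uncertain(pool, batch_size=2))

# Hypothetical costs: active learning pays off only while this stays positive.
print(net_savings(labels_saved=1000, cost_per_label=2.0, rounds=10, cost_per_round=50.0))
```

The knee sits where `net_savings` stops growing: each additional selection round adds scoring cost, while the number of labels it saves shrinks.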

Part 3: The Domain Gap Synthesis (Synthesis - 15 Mins)

Objective: Balance Synthetic and Real data to maximize generalization.
The "Lock" (Prediction): "What happens to your 'Safety Metric' if you move from 10% Synthetic data to 90% Synthetic data?"
The Workbench:
- Interaction: Data Mix Ratio Slider (Synthetic vs. Real). Domain Randomization Intensity.
- The "Stakeholder" Challenge: The Safety Lead warns that the synthetic simulator doesn't model 'Rain' correctly. You must find a mix that hits the accuracy target while maintaining a 'FID Score' (Domain Gap) below the safety threshold.
Reflect (The Ledger): "Defend your final Data Acquisition Strategy. Did you prioritize 'Quantity' (Synthetic) or 'Quality' (Expert-Labeled Real)? Explain how the $1/\epsilon^2$ noise penalty influenced your choice."
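
The interaction between the mix ratio and the domain gap can be sketched with a toy stand-in for FID. The real Fréchet Inception Distance compares mean vectors and covariance matrices of Inception features; the version below fits Gaussians to a single scalar feature, where the Fréchet distance reduces to (μ₁ − μ₂)² + (σ₁ − σ₂)². The distributions, sample sizes, and function names are all assumptions chosen for illustration.

```python
import random
import statistics


def frechet_distance_1d(a: list[float], b: list[float]) -> float:
    """Squared Frechet distance between Gaussians fitted to two scalar samples."""
    mu_a, mu_b = statistics.fmean(a), statistics.fmean(b)
    sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


def mix(real: list[float], synthetic: list[float], synthetic_fraction: float,
        size: int, rng: random.Random) -> list[float]:
    """Draw a training set with the requested share of synthetic samples."""
    n_syn = round(synthetic_fraction * size)
    return rng.sample(synthetic, n_syn) + rng.sample(real, size - n_syn)


rng = random.Random(0)  # fixed seed so the sweep is reproducible
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Hypothetical simulator whose feature distribution is shifted and too narrow.
synthetic = [rng.gauss(2.0, 0.5) for _ in range(1000)]
# Stand-in for held-out field data drawn from the real distribution.
field = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for fraction in (0.1, 0.5, 0.9):
    train = mix(real, synthetic, fraction, size=500, rng=rng)
    gap = frechet_distance_1d(train, field)
    print(f"synthetic={fraction:.0%}  gap={gap:.3f}")
```

In this toy setup the gap grows as the synthetic share rises, because the simulator's distribution differs from the field data. How quickly it grows in practice depends on how faithful the simulator is, which is the point of the Safety Lead's 'Rain' warning.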

4. Visual Layout Specification

Primary: ICR_Curve (Learning Progress vs. Compute FLOPs).
Secondary: LabelingROIPlot (Accuracy vs. Total Project Cost).
Math Peek: Toggle for the Data Quality Multiplier and MinHash Probability formulas.
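
For reference, a compact statement of the formulas the Math Peek toggle names. The first line restates the Data Quality Multiplier from the Chapter Context. The MinHash identities are the standard textbook forms; the symbols J (Jaccard similarity), s (pairwise similarity), r (rows per band), and b (number of bands) are notation assumed here, not taken from the plan.

```latex
% Data Quality Multiplier (from the Chapter Context)
N_{noisy} \propto \frac{1}{\epsilon^{2}}
\qquad \text{vs.} \qquad
N_{clean} \propto \frac{1}{\epsilon}

% MinHash: the collision probability of the minimum hash equals Jaccard similarity
\Pr\left[ h_{\min}(A) = h_{\min}(B) \right] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Banded LSH: probability that a pair with similarity s becomes a candidate
P_{\text{candidate}}(s) = 1 - \left( 1 - s^{r} \right)^{b}
```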
