cs249r_book/book/quarto/contents/vol1/data_selection/data_selection.qmd
Vijay Janapa Reddi 87ffaf288d Refines content for Volume 1 conclusion
Enhances the conclusion of Volume 1, improving clarity and flow by:

- Refining wording and structure for better readability
- Clarifying the connection between theoretical invariants and practical applications
- Adding information for clarity and context
2026-02-21 07:59:34 -05:00

---
quiz: data_selection_quizzes.json
concepts: data_selection_concepts.yml
glossary: data_selection_glossary.json
crossrefs: data_selection_xrefs.json
engine: jupyter
---
# Data Selection {#sec-data-selection}
```{python}
#| echo: false
#| label: chapter-start
from mlsys.registry import start_chapter
start_chapter("vol1:data_selection")
```
::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::
\noindent
![A futuristic digital illustration of data selection in machine learning, showing a sleek computing unit on one side with streams of binary code flowing in, where valuable data elements glow golden against a high-tech digital background.](images/png/cover_data_efficiency.png){fig-alt="A futuristic digital illustration of data selection in machine learning, showing a sleek computing unit on one side with streams of binary code flowing in, where valuable data elements glow golden against a high-tech digital background."}
:::
## Purpose {.unnumbered}
\begin{marginfigure}
\mlsysstack{0}{0}{25}{50}{0}{0}{0}{90}
\end{marginfigure}
_Why can a carefully selected 10% of your data match the accuracy of 100%?_
\index{Data Selection!systems optimization rationale}
The highest-impact optimization in machine learning operates upstream, before a single gradient is computed: on the data itself. Naive scaling assumes data is homogeneous, that every sample contributes equally to learning. Reality differs dramatically: in large-scale datasets, a tiny fraction of examples provides the majority of the gradient signal while the vast majority are redundant, noisy, or misaligned with the target distribution. This heterogeneity is not a statistical artifact but a systems optimization opportunity. Data engineering established that data is the source code of ML systems; data selection recognizes that not all source code is equally valuable and asks which lines of that code actually matter.
The practical consequences are enormous: compressing models and accelerating hardware speed up the execution of work, but data selection reduces the *core workload itself*. A training run that takes a week on the full dataset might take a day on a strategically selected subset, and that six-day savings compounds through every iteration of the development cycle: faster experimentation, more hyperparameter searches, quicker response to distribution drift, and lower barriers for teams with limited compute budgets. The shift is paradigmatic: from accumulating data as a massive liability to curating it as a precise resource, where every sample earns its place in the training set by contributing learning signal that no other sample provides.
::: {.content-visible when-format="pdf"}
\newpage
:::
::: {.callout-tip title="Learning Objectives"}
- Explain data selection as a systems optimization that reduces the Total Operations ($O$) term in the **Iron Law**, following the **D·A·M taxonomy**
- Apply the **Information-Compute Ratio (ICR)** framework to evaluate dataset value and diagnose whether training is data-starved or compute-starved
- Compare coreset selection, deduplication, and quality pruning techniques for pre-training data reduction
- Apply the **decision framework** to determine which combination of static pruning, dynamic selection, and synthetic generation fits a given workload
- Design **curriculum learning** and **active learning** strategies that adapt the training data diet as the model learns
- Evaluate how self-supervised pre-training and the **foundation model paradigm** transform data economics through cost amortization
- Analyze the **Selection Inequality**, cost-benefit trade-offs, and engineering challenges in production data selection systems
:::
## Data Selection Fundamentals {#sec-data-selection-data-selection-fundamentals-e839}
\index{Iron Law!Total Operations reduction via data selection}
Data selection asks a deceptively simple question with profound engineering consequences: given a clean, well-engineered dataset, which examples contribute the most learning per unit of compute cost? The preceding chapter on data engineering (@sec-data-engineering) established the infrastructure for collecting, cleaning, and preparing data, producing pipelines that ingest raw signals and yield well-governed, versioned datasets ready for training. That chapter ensured data *quality* through correct labels, consistent schemas, and clean records. Data selection optimizes data *value* by extracting maximum learning from minimum samples, directly shrinking the **Total Operations ($O$)** term in the **Iron Law** (@sec-introduction-iron-law-ml-systems-c32a). The distinction matters: quality asks *whether* data is correct, while value asks *whether* correct data is worth the compute spent processing it.
\index{Scaling Laws!data-compute asymmetry}
For decades, the dominant strategy was straightforward: more data, better models. Scaling laws [@kaplan2020scaling; @hoffmann2022training] confirmed that model performance improves predictably with dataset size, and teams responded rationally by scraping more web pages, labeling more images, and generating more synthetic examples. A critical asymmetry has since emerged. Hardware acceleration (@sec-hardware-acceleration) has outpaced the growth of high-quality data. GPU compute capacity has increased faster than traditional Moore's Law projections: architectural innovations like Tensor Cores and reduced-precision arithmetic stack on top of process-node gains, so effective throughput doubles in less than two years, while the supply of novel, high-quality human-generated text and images grows at roughly 2 $\times$ every five years (@tbl-scaling-asymmetry). The internet has already been scraped. Domain experts cannot label faster. This asymmetry, which researchers call the **Data Wall**\index{Data Wall!definition}[^fn-data-wall] [@villalobos2022will], has inverted the optimization priority from "get more data" to "get more from existing data."
```{python}
#| label: scaling-asymmetry-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SCALING ASYMMETRY TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-scaling-asymmetry in "Data Selection Fundamentals" section
# │
# │ Goal: Quantify the growth rate gap between compute and data.
# │ Show: That compute grows 10×/3yr while data only grows 2×/5yr.
# │ How: Contrast historical growth factors for TFLOPS and high-quality tokens.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: gpu_growth_str, gpu_period_str, web_data_growth_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class SelectionEconomicsAnchor:
    """
    Namespace for coreset selection overhead anchor.
    """
    dataset_size_m = 1
    scoring_time_hrs = 2.8
    coreset_pct = 10
    scoring_time_str = f"{scoring_time_hrs} hours"
    coreset_pct_str = f"{coreset_pct}%"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
coreset_scoring_time_str = SelectionEconomicsAnchor.scoring_time_str
coreset_pct_str = SelectionEconomicsAnchor.coreset_pct_str
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class ScalingAsymmetry:
    """
    Namespace for Scaling Asymmetry Table.
    Scenario: Comparing growth rates of Compute vs Data.
    """
    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # Hardware: 10x every 3 years (approx 2.15x/year)
    gpu_growth_factor = 10.0
    gpu_period_years = 3.0
    # Data: 2x every 5 years (approx 1.15x/year)
    web_growth_factor = 2.0
    web_period_years = 5.0
    # Labels: 1.5x every 5 years (approx 1.08x/year)
    label_growth_factor = 1.5
    label_period_years = 5.0
    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Annualized growth rates: Rate = Factor^(1/Period)
    gpu_annual = gpu_growth_factor ** (1.0 / gpu_period_years)
    web_annual = web_growth_factor ** (1.0 / web_period_years)
    # Divergence
    gap_ratio = gpu_annual / web_annual
    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(gap_ratio >= 1.5, f"GPU growth ({gpu_annual:.2f}x/yr) isn't fast enough vs Data ({web_annual:.2f}x/yr). Gap: {gap_ratio:.2f}x")
    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    gpu_growth_str = fmt(gpu_growth_factor, precision=0, commas=False) + "×"
    gpu_period_str = f"{int(gpu_period_years)} years"
    web_data_growth_str = fmt(web_growth_factor, precision=0, commas=False) + "×"
    web_data_period_str = f"{int(web_period_years)} years"
    label_data_growth_str = fmt(label_growth_factor, precision=1, commas=False) + "×"
    label_data_period_str = f"{int(label_period_years)} years"
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
gpu_growth_str = ScalingAsymmetry.gpu_growth_str
gpu_period_str = ScalingAsymmetry.gpu_period_str
web_data_growth_str = ScalingAsymmetry.web_data_growth_str
web_data_period_str = ScalingAsymmetry.web_data_period_str
label_data_growth_str = ScalingAsymmetry.label_data_growth_str
label_data_period_str = ScalingAsymmetry.label_data_period_str
```
@tbl-scaling-asymmetry quantifies the growth rates underlying this data-compute imbalance:
| **Resource** | **Growth Rate** | **Implication** |
|:------------------------|---------------------------------------------------------------------:|:------------------------------------------------------|
| **GPU Compute** | ~`{python} gpu_growth_str` / `{python} gpu_period_str` | Hardware vendors deliver reliable exponential gains |
| **Training Data (Web)** | ~`{python} web_data_growth_str` / `{python} web_data_period_str` | High-quality web text is finite; much already scraped |
| **Labeled Data** | ~`{python} label_data_growth_str` / `{python} label_data_period_str` | Human annotation throughput is inherently bounded |
| **Synthetic Data** | Unbounded | Bounded by generator quality (risk of model collapse) |
: **Scaling Asymmetry in ML Resources.** Compute grows far faster than the supply of high-quality data, creating an increasing compute-to-data imbalance that makes data selection essential. {#tbl-scaling-asymmetry .striped .hover}
[^fn-data-wall]: **Data Wall**: A term popularized by Epoch AI researchers in 2022. Their analysis projected that high-quality language data (books, academic papers, filtered web text) could be exhausted within one to two decades at then-current scaling rates. The "wall" metaphor emphasizes that unlike compute (which can be purchased) or algorithms (which can be improved), the stock of human-generated training data grows slowly and may represent a hard constraint on scaling.
\index{Data Wall!etymology}
\index{Foundation Models!data exhaustion projections}
Trace the trend line in @fig-running-out-of-human-data: foundation models are consuming the stock of human-generated text at an accelerating rate, with projections suggesting exhaustion of high-quality public data on a timeline measured in years, not decades. This is not a distant concern. It shapes training strategies today.
![**Dataset Growth Approaching Limits**: Foundation models are increasingly trained on vast datasets, approaching the total stock of human-generated text. Current projections suggest that high-quality public text data faces exhaustion on a near-term horizon, forcing a shift toward data selection, synthetic generation, and multimodal learning.](images/png/running_out_of_data.png){#fig-running-out-of-human-data fig-alt="Line chart showing dataset size in tokens on y-axis from 10^10 to 10^14 versus year on x-axis from 2010 to 2030. Blue line shows training data growth with markers for models like GPT-2, GPT-3, and Chinchilla. Orange shaded region shows projected high-quality text exhaustion in the near term."}
The gap between what compute can process and what quality data exists is widening, necessitating the *Continuous Training* loops of MLOps (@sec-ml-operations) to maintain relevance and making intelligent data selection increasingly critical.
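The widening gap can be made concrete by annualizing the growth factors in @tbl-scaling-asymmetry. The following is a minimal sketch, not part of the book's `mlsys` package: it assumes the table's approximate rates (10 $\times$ every 3 years for GPU compute, 2 $\times$ every 5 years for web data) stay constant, and the helper names `annual_rate` and `compute_to_data_gap` are hypothetical.

```python
def annual_rate(growth_factor: float, period_years: float) -> float:
    """Convert 'growth_factor every period_years' into a per-year multiplier."""
    return growth_factor ** (1.0 / period_years)


def compute_to_data_gap(years: float,
                        gpu=(10.0, 3.0),
                        web=(2.0, 5.0)) -> float:
    """Factor by which compute growth outpaces data growth after `years`.

    Assumes both growth rates stay constant, which is a simplification.
    """
    gpu_annual = annual_rate(*gpu)
    web_annual = annual_rate(*web)
    return (gpu_annual / web_annual) ** years


gpu_annual = annual_rate(10.0, 3.0)   # roughly 2.15x per year
web_annual = annual_rate(2.0, 5.0)    # roughly 1.15x per year
gap_after_decade = compute_to_data_gap(10.0)

print(f"GPU: {gpu_annual:.2f}x/yr, web data: {web_annual:.2f}x/yr")
print(f"Compute-to-data gap after 10 years: ~{gap_after_decade:.0f}x")
```

Under these assumed constant rates, a per-year difference that looks modest compounds into a gap of several hundred-fold within a decade, which is why the imbalance cannot be closed by waiting for more data to appear.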
::: {.callout-perspective title="The Scaling Asymmetry"}
**The Problem**: Compute scales exponentially. Data does not (@tbl-scaling-asymmetry).
**The Consequence**: Compute budgets now support training runs that far exceed what available high-quality data can fill. The field has become *compute-rich and data-poor*.
:::
This asymmetry inverts the optimization priority. When data was abundant and compute was scarce, the right strategy was algorithmic efficiency: squeeze more accuracy from limited GPU cycles. Now that compute is abundant and *quality data* is scarce, the winning strategy is **data selection**: squeeze more learning from each sample. The technique operates upstream of all other optimizations. By pruning redundancy and selecting high-value samples, we reduce the workload before it ever enters the model or hits the hardware, directly shrinking the Total Operations ($O$) term in the Iron Law (see the callout "Data Selection and the Iron Law" below for a detailed analysis). For companies training frontier models, the bottleneck has shifted from GPU access to the quality and diversity of their training corpora.
This chapter provides the engineering toolkit for intelligent data selection, organized around Part III's **D·A·M taxonomy**, which establishes a deliberate optimization ordering: Data first, then Algorithm, then Machine. Data selection puts the "highest leverage first" principle into practice by addressing whether work is necessary before asking how to simplify or accelerate it. The chapter follows a three-stage optimization pipeline that structures the practical response to the Data Wall:
\index{Static Pruning!definition}
\index{Dynamic Selection!definition}
\index{Synthetic Data Generation!definition}
1. **Static Pruning**: Removing low-value samples before training begins (coresets, deduplication).
2. **Dynamic Selection**: Selecting high-value samples during training (curriculum learning, active learning).
3. **Synthetic Generation**: Creating high-value samples on demand (augmentation, distillation).
Each stage increases the *information density* of the data that reaches the model, and together they form a complementary toolkit: pruning reduces *what* you have, selection focuses *how* you use it, and synthesis expands *what* you can access. Before examining these techniques, we must formalize *what* "data selection" means, *why* it is inherently a systems problem, and *how* to measure its effectiveness.
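The three stages above can be sketched as a toy pipeline. This is only an illustration of how the stages compose, under loud assumptions: exact-match hashing stands in for real deduplication, string length stands in for a model-derived informativeness score, and word reversal stands in for a real augmentation. All function names are hypothetical.

```python
import hashlib


def static_prune(samples):
    """Stage 1: drop exact duplicates after case/whitespace normalization."""
    seen, kept = set(), []
    for text in samples:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def dynamic_select(samples, score_fn, k):
    """Stage 2: keep the k samples with the highest score."""
    return sorted(samples, key=score_fn, reverse=True)[:k]


def synthesize(samples, augment_fn):
    """Stage 3: add one augmented variant per selected sample."""
    return samples + [augment_fn(s) for s in samples]


raw = [
    "The cat sat.",
    "the  cat sat.",                      # duplicate up to case/whitespace
    "Dogs bark loudly at night.",
    "Dogs bark loudly at night.",         # exact duplicate
    "Quantum error correction uses stabilizer codes.",
]

# Hypothetical stand-ins: length as a proxy score, word reversal as augmentation.
score_fn = len
augment_fn = lambda s: " ".join(reversed(s.split()))

pruned = static_prune(raw)                        # pruning reduces what you have
selected = dynamic_select(pruned, score_fn, k=2)  # selection focuses how you use it
final = synthesize(selected, augment_fn)          # synthesis expands what you can access

print(len(raw), len(pruned), len(selected), len(final))
```

In a real system each stand-in would be replaced by the techniques this chapter develops; the point of the sketch is only the ordering, with cheap static filtering first so that later, more expensive stages see fewer samples.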
### Defining Data Selection {#sec-data-selection-defining-data-selection-ef2f}
::: {.callout-definition title="Data Selection"}
***Data Selection***\index{Data Selection!definition} is the process of maximizing the **Information-Compute Ratio**\index{Data Selection!Information-Compute Ratio (ICR)}. It operates upstream of training to identify the smallest subset of data sufficient to define the decision boundary, reducing the **Total Operations** ($O$) term of the Iron Law by eliminating redundant, noisy, or non-informative samples before they consume GPU cycles.
\index{Selection Efficiency!formula}
$$
\text{Selection Efficiency} = \frac{\Delta \text{Model Capability}}{\Delta \text{Data Cost}}
$$
where Data Cost encompasses:
- **Acquisition cost**: Time and money to collect or generate samples
- **Labeling cost**: Human expert annotation effort
- **Storage cost**: Bytes required to persist the dataset
- **Compute cost**: FLOPs to process samples during training
A perfectly efficient dataset would contain only samples that contribute unique information to the model's decision boundary: no redundancy, no noise, no "easy" examples already mastered. In practice, this chapter operationalizes Selection Efficiency through its compute component, the ICR, formalized in @sec-data-selection-informationcompute-ratio-8c0b.
:::
To make this concrete, consider training a 70B-parameter language model from the **GPT-2/Llama Lighthouse** family (@sec-network-architectures), which spans autoregressive LLMs from GPT-2's 1.5B parameters to Llama's 7B--70B range:
```{python}
#| label: gpt-llama-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COMPUTE-DATA GAP EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Defining Data Selection" section - 70B Llama model example
# │
# │ Goal: Demonstrate the "Data Wall" with a concrete scale example.
# │ Show: That cluster capacity (10T tokens) already exceeds available quality data (5T).
# │ How: Contrast H100 compute throughput with public dataset token counts.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: llama_params_str, h100_count_str, tokens_capacity_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import Bparam, BILLION, TRILLION, SEC_PER_HOUR, MILLION, THOUSAND
from mlsys import Models
from mlsys.formatting import fmt, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class ComputeDataGap:
    """
    Namespace for Compute-Data Gap calculation.
    Scenario: 10k H100s vs Available Quality Tokens.
    """
    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    h100_count = 10000
    months = 3
    model = Models.Language.Llama2_70B
    tokens_available = 5e12  # 5T tokens (RedPajama/RefinedWeb scale)
    tokens_capacity = 10e12  # Capacity of the cluster
    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    gap_ratio = tokens_capacity / tokens_available
    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(gap_ratio >= 1.0, f"Compute ({tokens_capacity:.1e}) is less than Data ({tokens_available:.1e}). No Data Wall.")
    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    llama_params_str = fmt(model.parameters.to(Bparam).magnitude, precision=0, commas=False) + "B"
    h100_count_str = fmt(h100_count, precision=0, commas=True)
    tokens_capacity_str = fmt(tokens_capacity / TRILLION, precision=0, commas=False) + "T"
    tokens_available_str = fmt(tokens_available / TRILLION, precision=0, commas=False) + "T"
    compute_gap_str = fmt(gap_ratio, precision=0, commas=False)
    training_months_str = fmt(months, precision=0, commas=False)
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
llama_params_str = ComputeDataGap.llama_params_str
h100_count_str = ComputeDataGap.h100_count_str
tokens_capacity_str = ComputeDataGap.tokens_capacity_str
tokens_available_str = ComputeDataGap.tokens_available_str
compute_gap_str = ComputeDataGap.compute_gap_str
training_months_str = ComputeDataGap.training_months_str
```
The compute budget (`{python} h100_count_str` H100 GPUs for `{python} training_months_str` months) represents tens of millions of dollars and can process over `{python} tokens_capacity_str` tokens. Yet only ~`{python} tokens_available_str` tokens of deduplicated, filtered web text exist, leaving a `{python} compute_gap_str` $\times$ gap between what compute can process and what quality data can fill. The team faces three options: train on the same data for multiple epochs (diminishing returns after epochs 2--3), lower quality thresholds to include more data (degrades model quality), or invest in data selection through better filtering, curriculum design, and synthetic augmentation to extract more learning from each token. The third option is increasingly the dominant approach.
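The arithmetic behind this gap is simple enough to check by hand. The sketch below uses the token counts quoted above (10T of processing capacity, 5T of available quality tokens); the function names are hypothetical and the calculation deliberately ignores the diminishing returns of repeated epochs.

```python
TRILLION = 1e12

tokens_capacity = 10 * TRILLION   # tokens the cluster can process within the budget
tokens_available = 5 * TRILLION   # deduplicated, filtered web tokens


def epochs_to_fill(capacity_tokens: float, available_tokens: float) -> float:
    """Passes over the available data needed to consume the full compute budget."""
    return capacity_tokens / available_tokens


def idle_compute_fraction(capacity_tokens: float, available_tokens: float) -> float:
    """Share of the compute budget left unused by a single pass over the data."""
    return max(0.0, 1.0 - available_tokens / capacity_tokens)


gap = epochs_to_fill(tokens_capacity, tokens_available)
idle = idle_compute_fraction(tokens_capacity, tokens_available)

print(f"Compute-to-data gap: {gap:.0f}x")
print(f"Budget unused after one epoch: {idle:.0%}")
```

A single pass over the available data would leave half of the budget idle, so the team must either repeat data, relax quality thresholds, or raise the value of each token through selection.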
This data selection imperative applies across model architectures, though the bottlenecks differ. Unlike our compute-bound ResNet-50 Lighthouse, GPT-2/Llama models are **memory bandwidth-bound** during inference (though often compute-bound during training), yet they still benefit enormously from data selection during training. Each token processed incurs the same forward/backward pass cost regardless of which resource is the bottleneck, so fewer tokens means fewer FLOPs. This universality (data selection benefits every architecture, regardless of its dominant bottleneck) motivates a broader framing: data selection as a *systems* problem rather than a purely statistical one.
### Systems Perspective {#sec-data-selection-systems-perspective-bd61}
The Data Wall establishes *why* data selection matters; the systems perspective reveals *how* to approach it effectively. The conventional ML framing asks: *how do I achieve the same accuracy with fewer samples?* This focuses on statistical sample complexity and generalization theory. While valid, it misses the larger picture.
\index{Data Selection!systems vs ML framing}
In this textbook, we adopt a *systems framing*: *how do I reduce the total cost of achieving target performance across the entire ML lifecycle?* This shifts attention from accuracy curves to resource consumption, as @tbl-ml-vs-systems-framing illustrates.
| **ML Framing** | **Systems Framing** |
|:--------------------------------------|:--------------------------------------------|
| **"Fewer samples for same accuracy"** | "Fewer FLOPs for same accuracy" |
| **"Better generalization"** | "Lower training cost (time, money, energy)" |
| **"Sample complexity bounds"** | "End-to-end resource efficiency" |
| **"Learning theory"** | "Cost engineering" |
: **ML vs. Systems Perspectives on Data Selection.** The ML framing optimizes sample complexity; the systems framing optimizes total resource cost across the pipeline. {#tbl-ml-vs-systems-framing .striped .hover}
The systems framing reveals optimization opportunities invisible to the ML framing. To see *why*, consider *how* data selection interacts with the Iron Law introduced in @sec-introduction-iron-law-ml-systems-c32a.
```{python}
#| label: data-selection-savings-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ IRON LAW SAVINGS CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Data Selection and the Iron Law" callout
# │
# │ Goal: Demonstrate the multiplicative impact of data selection.
# │ Show: That dataset reduction compounds with hardware and model optimizations.
# │ How: Contrast additive vs. multiplicative speedup factors for an 8x total gain.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: training_cost_m_str, dataset_reduction_pct_str, combined_factor_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class IronLawSavings:
    """
    Namespace for Iron Law Multiplicative Savings.
    Scenario: 2x Data Selection * 2x Compression * 2x Hardware = 8x Total.
    """
    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    budget_m = 100  # $100M training run
    # Optimization factors
    factor_data = 2.0
    factor_model = 2.0
    factor_hw = 2.0
    # Derived
    data_pruning_pct = (1 - (1/factor_data)) * 100
    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Multiplicative effect
    total_speedup = factor_data * factor_model * factor_hw
    # Savings
    compute_savings_m = budget_m * (data_pruning_pct / 100.0)
    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    additive_sum = factor_data + factor_model + factor_hw
    check(total_speedup > additive_sum, f"Multiplicative speedup ({total_speedup}x) should exceed additive sum ({additive_sum}).")
    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    training_cost_m_str = fmt(budget_m, precision=0, commas=False)
    dataset_reduction_pct_str = fmt(data_pruning_pct, precision=0, commas=False)
    compute_savings_m_str = fmt(compute_savings_m, precision=0, commas=False)
    combined_factor_str = fmt(total_speedup, precision=0, commas=False)
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
training_cost_m_str = IronLawSavings.training_cost_m_str
dataset_reduction_pct_str = IronLawSavings.dataset_reduction_pct_str
compute_savings_m_str = IronLawSavings.compute_savings_m_str
combined_factor_str = IronLawSavings.combined_factor_str
```
::: {.callout-perspective title="Data Selection and the Iron Law"}
In the **Iron Law of ML Systems** ($T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot \eta} + L_{lat}$), data selection is the only technique that reduces the *Total Operations* term at its source. Model compression reduces operations per sample; hardware acceleration increases throughput per operation. Data selection, by contrast, reduces the number of samples processed entirely.
- **Model compression**: Reduces $O$ per forward/backward pass
- **Hardware acceleration**: Increases $R_{peak}$ (peak throughput) and $\eta$ (utilization)
- **Data selection**: Reduces the number of passes through the entire equation
\index{Iron Law!multiplicative savings from data selection}
This makes data selection multiplicatively valuable: when all three optimizations act on the same bottleneck, a 2 $\times$ reduction in dataset size with 2 $\times$ model compression and 2 $\times$ hardware acceleration yields `{python} combined_factor_str` $\times$ total cost reduction, not 6 $\times$.
:::
Consider training cost reduction: a `{python} dataset_reduction_pct_str`% reduction in dataset size does not merely improve sample efficiency; it directly halves the number of forward passes, backward passes, and gradient updates. For a USD `{python} training_cost_m_str` M training run, this translates to USD `{python} compute_savings_m_str` M in compute savings. The relationship is linear and immediate.
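To see why the three optimizations multiply rather than add, it helps to write the Iron Law's compute term as $O / (R_{peak} \cdot \eta)$ with $O$ expanded into samples times operations per sample. The sketch below assumes a purely compute-bound run and uses hypothetical baseline values chosen only for illustration; only the 2 $\times$ factors, the USD 100 M budget, and the 50% reduction come from the text.

```python
def compute_time(num_samples, ops_per_sample, r_peak, eta):
    """Compute term of the Iron Law: O / (R_peak * eta), with O = samples * ops/sample."""
    total_ops = num_samples * ops_per_sample
    return total_ops / (r_peak * eta)


# Hypothetical baseline (illustrative values, not measurements).
baseline = dict(num_samples=1e9, ops_per_sample=1e12, r_peak=1e15, eta=0.5)

optimized = dict(
    num_samples=baseline["num_samples"] / 2,        # data selection: fewer samples
    ops_per_sample=baseline["ops_per_sample"] / 2,  # model compression: fewer ops each
    r_peak=baseline["r_peak"] * 2,                  # hardware acceleration: faster ops
    eta=baseline["eta"],                            # utilization held fixed
)

speedup = compute_time(**baseline) / compute_time(**optimized)

# Savings from data selection alone on a USD 100 M run with a 50% smaller dataset.
budget_musd = 100
dataset_reduction = 0.5
data_only_savings_musd = budget_musd * dataset_reduction

print(f"Combined speedup: {speedup:.0f}x (additive intuition would say 6x)")
print(f"Data selection alone saves USD {data_only_savings_musd:.0f} M")
```

Because each factor scales a different variable in the same quotient, the gains compose as a product. The sketch holds $\eta$ fixed; in practice a smaller dataset or compressed model can change utilization, so the 8 $\times$ figure is an upper bound on what the compute term alone delivers under these assumptions.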
\index{Deduplication!storage and I/O cost reduction}
\index{Active Learning!labeling cost reduction}
\index{Green AI!data selection as energy reduction}
These compute savings cascade through the entire infrastructure stack. Large datasets consume petabytes of storage and saturate network bandwidth during distributed training; deduplication and coreset selection reduce storage costs while eliminating I/O bottlenecks that can idle expensive GPU clusters. The savings extend to labeling economics: expert labeling costs (\$5--100+ per sample in domains like medical imaging) often exceed compute costs, and active learning and semi-supervised methods reduce labeling budgets by 10--100 $\times$. The environmental implications compound further: training a large language model can emit hundreds of tons of CO₂, making data selection the most direct lever for Green AI, since halving the dataset halves training energy with no accuracy trade-off if done correctly. Smaller curated datasets also enable faster iteration velocity. A team that can iterate in hours rather than days has a compounding advantage in model development.
These cascading benefits illustrate a broader point: where the ML researcher asks "what is the sample complexity of this learning problem?", the systems engineer asks "what is the cost-per-accuracy-point across the entire pipeline, from data acquisition through deployment?" This chapter equips you with the systems engineer's toolkit for that question: techniques to minimize total cost, metrics to quantify efficiency gains, and architectural patterns to implement data selection at scale.
### Information-Compute Ratio {#sec-data-selection-informationcompute-ratio-8c0b}
\index{Information-Compute Ratio!definition}
\index{Pareto Frontier!data selection context}
The systems framing established above calls for a quantitative metric. The Optimize Principles (Part III) introduced the *Pareto Frontier* as the boundary where improving one metric necessarily degrades another, and identified three pillars of efficiency following the D·A·M taxonomy: Data, Algorithm (model compression, @sec-model-compression), and Machine (hardware acceleration, @sec-hardware-acceleration). As the first pillar in the D·A·M ordering, data selection addresses the most critical question: *how much information does each sample contribute to the model's learning per unit of computation?* We formalize this with a central metric: the Information-Compute Ratio.
\index{Roofline Model!data selection interaction}
In the optimization triad (@fig-optimization-triad), data selection plays the role of *Input Optimization*, reducing total workload before it enters the model or hardware. Model compression minimizes the math per parameter; hardware acceleration maximizes the math per second; data selection minimizes the total math required to reach convergence. The three edges of the triad capture the dominant bottlenecks: *Compute Bound* describes systems limited by arithmetic throughput, *I/O Bound* describes systems limited by data movement, and *Sample Efficiency* describes systems limited by the information content of training data.
```{python}
#| label: fig-optimization-triad
#| echo: false
#| out-width: "70%"
#| fig-cap: "**The Optimization Triad**: Machine learning performance relies on three pillars: Algorithms (models), Machine (hardware/software), and Data Selection. While algorithms and machines have traditionally received the most attention, optimizing data selection (Input Optimization) offers a third, powerful lever for scaling performance."
#| fig-alt: "A triangular diagram with three nodes: Algorithms (Model), Machine (Hardware), and Data Selection. Bidirectional arrows connect all three with edge labels: Compute Bound between Algorithms and Machine, I/O Bound between Machine and Data Selection, and Sample Efficiency between Data Selection and Algorithms. Data Selection is highlighted with a bold border. ML Performance appears at the center."
import numpy as np
import matplotlib.patches as mpatches
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot(figsize=(5, 4.5))
ax.set_xlim(-2.75, 2.75)
ax.set_ylim(-2.45, 2.8)
ax.set_aspect('equal')
ax.axis('off')
ax.grid(False)
# Triangle vertices (top, bottom-left, bottom-right)
r = 2.0
top = (0, r)
bl = (r * np.cos(7*np.pi/6), r * np.sin(7*np.pi/6))
br = (r * np.cos(11*np.pi/6), r * np.sin(11*np.pi/6))
# Draw circles
for center, color, lw, label in [
    (top, '#D6EAF8', 1.0, 'Algorithms\n(Model)'),
    (bl, '#D5F5E3', 1.0, 'Machine\n(Hardware)'),
    (br, '#FCE4CC', 2.0, 'Data\nSelection'),
]:
    circle = plt.Circle(center, 0.85, facecolor=color, edgecolor=COLORS['primary'], linewidth=lw, zorder=2)
    ax.add_patch(circle)
    fw = 'bold' if 'Data' in label else 'normal'
    ax.text(center[0], center[1], label, ha='center', va='center', fontsize=9, fontweight=fw, zorder=3)
# Draw bidirectional arrows between circle edges
def draw_arrow(p1, p2, label, label_side, offset=(0, 0)):
    d = np.array(p2) - np.array(p1)
    d_norm = d / np.linalg.norm(d)
    start = np.array(p1) + d_norm * 0.9
    end = np.array(p2) - d_norm * 0.9
    ax.annotate("", xy=end, xytext=start, arrowprops=dict(arrowstyle="<->", color=COLORS['primary'], lw=1.5), zorder=1)
    mid = (start + end) / 2 + np.array(offset)
    ax.text(mid[0], mid[1], label, ha='center', va='center', fontsize=8, color=COLORS['primary'],
            bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
draw_arrow(top, bl, 'Compute\nBound', 'left', offset=(-0.45, 0.05))
draw_arrow(bl, br, 'I/O Bound', 'below', offset=(0, -0.3))
draw_arrow(br, top, 'Sample\nEfficiency', 'right', offset=(0.45, 0.05))
# Center label
ax.text(0, -0.15, 'ML\nPerformance', ha='center', va='center', fontsize=10, fontweight='bold', color=COLORS['primary'])
fig.tight_layout(pad=0.1)
plt.show()
```
We can formalize this as the ICR:
$$
\text{ICR} = \frac{\Delta \text{Model Performance}}{\Delta \text{FLOPs}}
$$
\index{Information-Compute Ratio!equivalence to hardware speedup}
As detailed in the "Data Selection and the Iron Law" callout above, data selection turns the Total Operations ($O$) term from a fixed constant into a variable. By maximizing ICR, we reduce the total FLOPs required to reach a target performance level. A 2 $\times$ improvement in ICR is mathematically equivalent to a 2 $\times$ improvement in hardware Peak Throughput ($R_{peak}$), but often much cheaper to achieve. Note that ICR focuses specifically on the compute component of the broader Selection Efficiency metric defined earlier, which also accounts for acquisition, labeling, and storage costs.
A random batch of raw data often has low ICR: it contains redundant examples, noisy samples, or "easy" examples the model has already mastered, wasting GPU cycles on zero-information updates. High-efficiency data pipelines (@fig-data-selection-pipeline) filter, order, and synthesize data to maximize ICR, ensuring that every FLOP contributes to learning. To illustrate, consider *computing ICR* on a concrete coreset selection task. Later in this chapter, @sec-data-selection-measurement-framework-733b provides the complete measurement framework for evaluating these efficiency gains, including the compute-optimal frontier diagnostic that determines whether training is data-starved or compute-starved.
::: {#fig-data-selection-pipeline fig-cap="**The Data Selection Pipeline**: A structured approach to increasing data value. Raw data is first pruned to remove redundancy (Static Pruning), then dynamically selected during training (Dynamic Selection), and finally augmented to increase diversity (Synthetic Generation). Each stage increases the Information-Compute Ratio (ICR)." fig-alt="A flow diagram showing the progression of data: Raw Data -> Static Pruning -> Dynamic Selection -> Synthetic Generation -> High Value Model. Arrows indicate the flow."}
```{=latex}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D6EAF8}
\definecolor{OrangeLine}{HTML}{E37222}
\definecolor{OrangeL}{HTML}{FDEBD0}
\definecolor{RedLine}{HTML}{DA291C}
\definecolor{RedL}{HTML}{FADBD8}
\definecolor{GreyFill}{HTML}{E8E8E8}
\definecolor{GreyLine}{HTML}{888888}
\tikzset{
Box/.style={
draw,
rounded corners,
minimum width=2cm,
minimum height=1cm,
align=center,
line width=1pt
},
Arrow/.style={->, line width=1.2pt, color=black!70}
}
% Nodes
\node[Box, fill=GreyFill, draw=GreyLine] (Raw) at (0,0) {Raw Data};
\node[Box, fill=BlueL, draw=BlueLine, right=1.5cm of Raw] (Static) {1. Static\\Pruning};
\node[below=0.1cm of Static, text=gray, font=\scriptsize] {Pre-training};
\node[Box, fill=GreenL, draw=GreenLine, right=1.5cm of Static] (Dynamic) {2. Dynamic\\Selection};
\node[below=0.1cm of Dynamic, text=gray, font=\scriptsize] {During Training};
\node[Box, fill=OrangeL, draw=OrangeLine, right=1.5cm of Dynamic] (Synth) {3. Synthetic\\Gen};
\node[below=0.1cm of Synth, text=gray, font=\scriptsize] {On-Demand};
\node[circle, draw=RedLine, fill=RedL, line width=1pt, minimum size=1.2cm, right=1.5cm of Synth, align=center] (Model) {Model};
% Arrows
\draw[Arrow] (Raw) -- (Static);
\draw[Arrow] (Static) -- (Dynamic);
\draw[Arrow] (Dynamic) -- (Synth);
\draw[Arrow] (Synth) -- node[above, font=\scriptsize, text=black!70] {High ICR} (Model);
\end{tikzpicture}
```
:::
The following checkpoint verifies understanding of the core ICR concept before the calculation examples.
::: {.callout-checkpoint title="Data Selection Efficiency" collapse="false"}
The goal of data selection is to maximize the ICR.
**Metrics**
- [ ] **ICR Application**: Given two training runs with identical accuracy gains but different compute budgets, can you determine which had higher ICR?
- [ ] **Data Efficiency**: Do you understand why a 50% smaller dataset with 2 $\times$ higher ICR yields the same model for half the training cost?
**The Pipeline**
- [ ] **The Three Stages**: Can you map Static Pruning, Dynamic Selection, and Synthetic Generation to the training lifecycle?
:::
To make the Information-Compute Ratio concrete, consider how coreset selection improves training efficiency on a real workload.
```{python}
#| label: data-selection-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ICR CORESET COMPARISON
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Computing ICR: Coresets" callout example
# │
# │ Goal: Make the Information-Compute Ratio (ICR) concrete.
# │ Show: That coresets achieve 1.8× higher ICR by focusing on difficult samples.
# │ How: Compare learning-per-FLOP for random sampling vs. coreset selection.
# │
# │ Imports: mlsys.constants (RESNET50_FLOPs, GFLOPs, IMAGENET_IMAGES)
# │ Exports: imagenet_size_str, icr_ratio_str, coreset_pct_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import RESNET50_FLOPs, GFLOPs, IMAGENET_IMAGES
from mlsys.formatting import fmt, check
# --- Inputs (ImageNet/ResNet-50 benchmark scenario) ---
imagenet_size_value = IMAGENET_IMAGES.magnitude  # number of ImageNet training images
acc_gain_random_value = 5.0 # % accuracy per epoch
acc_gain_coreset_value = 4.5 # % with 50% coreset
coreset_fraction_value = 0.5 # keep 50% of data
# --- Process (compute ICR for both strategies using Models Twin) ---
m_resnet = Models.ResNet50
resnet50_fwd_gflops_value = m_resnet.inference_flops.to(GFLOPs).magnitude
resnet50_fwdbwd_gflops_value = (m_resnet.inference_flops * 2).to(GFLOPs).magnitude
full_epoch_flops_value = imagenet_size_value * resnet50_fwdbwd_gflops_value * BILLION
icr_random_value = acc_gain_random_value / full_epoch_flops_value
coreset_size_value = int(imagenet_size_value * coreset_fraction_value)
coreset_flops_value = coreset_size_value * resnet50_fwdbwd_gflops_value * BILLION
icr_coreset_value = acc_gain_coreset_value / coreset_flops_value
icr_ratio_value = icr_coreset_value / icr_random_value
acc_diff_value = acc_gain_random_value - acc_gain_coreset_value
# --- Outputs (formatted strings for prose) ---
resnet50_fwd_gflops_str = fmt(m_resnet.inference_flops.to(GFLOPs), precision=1) # e.g. "8.2"
resnet50_fwdbwd_gflops_str = fmt((m_resnet.inference_flops * 2).to(GFLOPs), precision=1) # e.g. "16.4"
full_epoch_flops_str = f"{full_epoch_flops_value:.2e}" # e.g. "2.10e+16"
icr_random_str = f"{icr_random_value:.1e}" # e.g. "2.4e-16"
imagenet_size_str = fmt(imagenet_size_value / MILLION, precision=2) + "M" # e.g. "1.28M"
coreset_size_str = f"{coreset_size_value / 1000:.0f}K" # e.g. "640K"
coreset_flops_str = f"{coreset_flops_value:.1e}" # e.g. "1.0e+16"
icr_coreset_str = f"{icr_coreset_value:.1e}" # e.g. "4.3e-16"
icr_ratio_str = fmt(icr_ratio_value, precision=1, commas=False) # e.g. "1.8"
acc_gain_random_str = fmt(acc_gain_random_value, precision=1, commas=False) # e.g. "5.0"
acc_gain_coreset_str = fmt(acc_gain_coreset_value, precision=1, commas=False) # e.g. "4.5"
acc_diff_str = fmt(acc_diff_value, precision=1, commas=False) # e.g. "0.5"
coreset_pct_str = fmt(coreset_fraction_value * 100, precision=0, commas=False) # e.g. "50"
```
::: {.callout-example title="Computing ICR: Coresets"}
**Scenario**: Training our **ResNet-50 Lighthouse model** (@sec-network-architectures) on ImageNet for one epoch. We compare random batch selection versus EL2N-based coreset selection (EL2N, or Error L2-Norm, scores each sample by how uncertain the model's prediction is; it is defined formally in @sec-data-selection-coreset-selection-algorithms-2c74). ResNet-50's compute-bound nature (high **arithmetic intensity**; see @sec-machine-foundations-roofline-model-2529 for how the Roofline Model determines this classification) makes it an ideal candidate for data selection optimization: reducing dataset size directly reduces training FLOPs with minimal I/O impact.
**Setup**:
- Dataset: ImageNet (`{python} imagenet_size_str` images)
- Model: ResNet-50 Lighthouse (~`{python} resnet50_fwd_gflops_str` GFLOPs per forward pass, ~`{python} resnet50_fwdbwd_gflops_str` GFLOPs forward + backward)
- One epoch: `{python} imagenet_size_str` $\times$ `{python} resnet50_fwdbwd_gflops_str` GFLOPs = **`{python} full_epoch_flops_str` FLOPs**
- Accuracy improvement per epoch (early training): ~`{python} acc_gain_random_str` percentage points
**Random Selection (baseline)**:
- Process all `{python} imagenet_size_str` samples uniformly
- Accuracy gain: `{python} acc_gain_random_str` percentage points
- ICR_random = `{python} acc_gain_random_str` / (`{python} full_epoch_flops_str`) = **`{python} icr_random_str` per FLOP**
**EL2N Coreset (`{python} coreset_pct_str`% of data)**:
- Process `{python} coreset_size_str` high-uncertainty samples selected by EL2N scoring
- Coreset focuses on decision boundary samples
- Accuracy gain: `{python} acc_gain_coreset_str` percentage points (90% of full data performance)
- Compute: `{python} coreset_size_str` $\times$ `{python} resnet50_fwdbwd_gflops_str` GFLOPs = **`{python} coreset_flops_str` FLOPs**
- ICR_coreset = `{python} acc_gain_coreset_str` / (`{python} coreset_flops_str`) = **`{python} icr_coreset_str` per FLOP**
**Result**: The coreset achieves **`{python} icr_ratio_str` $\times$ higher ICR**, nearly twice the learning per FLOP, by eliminating low-information "easy" samples that contribute little to the decision boundary. The `{python} acc_diff_str` percentage point accuracy difference is often acceptable given the `{python} coreset_pct_str`% compute savings.
:::
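The arithmetic in the callout above can be reproduced in a few lines. This is a minimal sketch, not the chapter's own computation: the dataset size (1.28 million images) and the per-sample cost (16.4 GFLOPs forward plus backward) are assumed round figures, and the accuracy gains are the illustrative values from the scenario rather than measured results.

```python
def icr(accuracy_gain_points: float, flops: float) -> float:
    """Information-Compute Ratio: accuracy gain (percentage points) per FLOP."""
    return accuracy_gain_points / flops


# Assumed illustrative inputs (not measured results)
NUM_IMAGES = 1_280_000        # approximate ImageNet-1K training set size
FLOPS_PER_SAMPLE = 16.4e9     # assumed forward + backward cost per image

full_flops = NUM_IMAGES * FLOPS_PER_SAMPLE
coreset_flops = 0.5 * NUM_IMAGES * FLOPS_PER_SAMPLE   # keep 50% of the data

icr_random = icr(5.0, full_flops)       # 5.0 points gained on the full epoch
icr_coreset = icr(4.5, coreset_flops)   # 4.5 points gained on the coreset
ratio = icr_coreset / icr_random

print(f"ICR (random):  {icr_random:.1e} points per FLOP")
print(f"ICR (coreset): {icr_coreset:.1e} points per FLOP")
print(f"Improvement:   {ratio:.2f}x")
```

Note that the per-sample cost cancels in the ratio, so the improvement depends only on the accuracy retained (90%) and the fraction of data kept (50%).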
The remainder of this chapter explores each stage of the three-stage optimization pipeline introduced above (static pruning, dynamic selection, and synthetic generation) in depth. We begin with static pruning, the techniques that can reduce a dataset by 30 to 50 percent before training even begins.
## Static Pruning {#sec-data-selection-static-pruning-a390}
\index{Static Pruning!pre-training filtration}
Before a single gradient is computed, significant efficiency gains are available by removing low-value samples from the dataset. This pre-training filtration reduces total computation without affecting, and sometimes improving, final model accuracy—all without modifying the training loop or model architecture.
### The Case for Smaller Datasets {#sec-data-selection-case-smaller-datasets-215e}
\index{Data Redundancy!empirical evidence}
The most counterintuitive finding in data selection is that training on *less* data often produces models just as accurate as training on the full dataset. Practitioners have long assumed that more data yields better performance, and while this holds in many scenarios, it obscures a critical reality: typical large-scale datasets contain massive redundancy. Empirical studies on coreset selection and data pruning have consistently demonstrated this redundancy across standard benchmarks.
On CIFAR-10, gradient-based selection methods (EL2N, GraNd) [@paul2021deep] have shown that training on 50% of carefully selected samples matches the accuracy of the full dataset, with aggressive pruning reaching 10--30% of samples while retaining 90%+ of original performance. ImageNet-1K presents a harder challenge because it is less redundant, yet researchers have demonstrated that 20--30% of ImageNet can be pruned with negligible loss, and up to 50% reduction is possible with a small accuracy trade-off (~1 percentage point), yielding 2 $\times$ fewer training FLOPs [@paul2021deep; @sorscher2022beyond]. The pattern extends to language modeling: web-scraped corpora like The Pile[^fn-the-pile] and C4[^fn-c4] contain substantial exact and near-duplicate content, and deduplication studies [@lee2022deduplicating] report 10--30% redundancy ratios, with deduplicated training yielding *better* downstream performance through less memorization and more generalization.
[^fn-the-pile]: **The Pile**: An 825 GB English text dataset created by EleutherAI (an open-source AI research lab) and released in December 2020. The name evokes a deliberately enormous "pile" of diverse text. The Pile aggregates 22 sub-datasets spanning academic papers (PubMed, ArXiv), books (Project Gutenberg, Books3), code (GitHub), web text (CommonCrawl), and specialized sources (Stack Exchange, Wikipedia, USPTO patents). Its design philosophy prioritizes *diversity* over scale: rather than simply crawling more web pages, The Pile combines curated sources covering different domains, writing styles, and knowledge areas. The dataset became widely used for training open-source language models (GPT-Neo, GPT-J, Pythia) and established that data quality and diversity matter as much as raw volume.
[^fn-c4]: **C4 (Colossal Clean Crawled Corpus)**: Created by Google Research as part of the T5 project (Raffel et al., 2020). C4 applies aggressive filtering to Common Crawl web data: removing pages with fewer than 5 sentences, deduplicating at the three-sentence level, filtering "naughty words," and removing non-English content, boilerplate text, and JavaScript code. The result is ~750 GB of cleaned English text. The "Colossal" in the name is deliberate—C4 demonstrated that *cleaning* web data at scale could match the quality of curated datasets, establishing the "large-scale-with-filters" paradigm that subsequent datasets (RefinedWeb, RedPajama, FineWeb) have refined.
These numbers are benchmark-specific. Gains from pruning depend on the dataset's intrinsic redundancy, the selection algorithm, and the model architecture; always validate on your specific task before deploying aggressive pruning in production. The key insight remains: not all data points provide equal value for training.
\index{Data Quality!noise penalty on convergence}
\index{Convergence Rate!clean vs noisy data}
Why does this heterogeneity exist? The answer lies in how neural networks learn decision boundaries. Most samples fall far from any class boundary: a picture of a dog in good lighting is obviously a dog. These "easy" examples provide diminishing returns after the first few epochs because the model has already mastered them. The informative samples cluster near boundaries where classes become ambiguous. Beyond sample redundancy, label quality also dramatically affects data requirements. The following analysis quantifies *the data quality multiplier*: how label noise penalizes convergence.
```{python}
#| label: data-quality-multiplier-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DATA QUALITY MULTIPLIER
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Data Quality Multiplier" callout (Case for Smaller Datasets)
# │
# │ Goal: Demonstrate the quadratic penalty of label noise.
# │ Show: That noisy data requires 100× more samples to reach 1% error.
# │ How: Contrast sample requirements for clean vs. noisy datasets.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: epsilon_str, epsilon_pct_str, n_clean_str, n_noisy_str, ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class QualityMultiplier:
    """
    Namespace for Data Quality Multiplier.
    Scenario: Comparing sample complexity for Clean (1/N) vs Noisy (1/sqrt(N)) data.
    """
    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    epsilon = 0.01 # 1% Target Error
    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Clean: Error ~ 1/N => N ~ 1/Error
    n_clean = 1.0 / epsilon
    # Noisy: Error ~ 1/sqrt(N) => N ~ 1/Error^2
    n_noisy = 1.0 / (epsilon ** 2)
    ratio = n_noisy / n_clean
    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(ratio >= 50, f"Noisy penalty ({ratio:.1f}x) is too small to justify cleaning investment.")
    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    epsilon_str = fmt(epsilon, precision=2, commas=False)
    epsilon_pct_str = fmt(epsilon * 100, precision=0, commas=False)
    n_clean_str = fmt(n_clean, precision=0, commas=False)
    n_noisy_str = fmt(n_noisy, precision=0, commas=True)
    ratio_str = fmt(ratio, precision=0, commas=False)
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
epsilon_str = QualityMultiplier.epsilon_str
epsilon_pct_str = QualityMultiplier.epsilon_pct_str
n_clean_str = QualityMultiplier.n_clean_str
n_noisy_str = QualityMultiplier.n_noisy_str
ratio_str = QualityMultiplier.ratio_str
```
::: {.callout-notebook title="The Data Quality Multiplier"}
**The Physics of Noise**: Why is one clean sample worth 100 noisy ones?
**The Math**: Classical learning theory (for convex optimization with SGD) tells us that convergence rates depend on label noise. While deep learning operates in a non-convex regime, the qualitative relationship holds broadly.
1. **Clean Data**: Convergence rate is typically $O(1/N)$. To halve the error, you need **2 $\times$** data.
2. **Noisy Data**: Convergence rate drops to $O(1/\sqrt{N})$. To halve the error, you need **4 $\times$** data.
**The Multiplier**:
To reach a target error $\epsilon$:
* $N_{clean} \propto 1/\epsilon$
* $N_{noisy} \propto 1/\epsilon^2$
**Example**: For target error $\epsilon$ = `{python} epsilon_str` (`{python} epsilon_pct_str`%):
* $N_{clean}$ ≈ `{python} n_clean_str`
* $N_{noisy}$ ≈ `{python} n_noisy_str`
* **Ratio**: `{python} ratio_str` $\times$ more data required if noisy.
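**The Systems Conclusion**: At a `{python} epsilon_pct_str`% target error, cleaning your data (removing label noise) acts as a **`{python} ratio_str` $\times$ compute accelerator**, and the multiplier grows as the error target tightens.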
:::
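The scaling relations in the callout above can be tabulated for several error targets. The sketch below assumes the idealized rates from the callout (error proportional to $1/N$ for clean data and $1/\sqrt{N}$ for noisy data) with all constant factors set to one, so the sample counts are relative, not absolute.

```python
def samples_needed(target_error: float, noisy: bool) -> float:
    """Relative sample count to reach target_error (constant factors dropped).

    Clean: error ~ 1/N       => N ~ 1/error
    Noisy: error ~ 1/sqrt(N) => N ~ 1/error^2
    """
    return 1.0 / target_error**2 if noisy else 1.0 / target_error


for eps in (0.10, 0.05, 0.01):
    n_clean = samples_needed(eps, noisy=False)
    n_noisy = samples_needed(eps, noisy=True)
    print(f"target={eps:.2f}  clean~{n_clean:,.0f}  noisy~{n_noisy:,.0f}  "
          f"ratio~{n_noisy / n_clean:,.0f}x")
```

Under these assumptions the penalty ratio equals $1/\epsilon$, so the cost of label noise grows as the accuracy target becomes more demanding.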
The practical question then becomes: *how* do we identify which samples to keep?
### Coreset Selection Algorithms {#sec-data-selection-coreset-selection-algorithms-2c74}
\index{Coreset!definition}
\index{Coreset!etymology}
Coreset selection\index{Static Pruning!coreset selection}[^fn-coreset] answers this question by identifying a small subset of data that preserves the statistical properties of the entire dataset.
[^fn-coreset]: **Coreset**: The term "coreset" combines "core" and "set," reflecting its purpose as a core representative subset. The concept emerged from computational geometry in the early 2000s, where researchers sought provably small subsets that approximate solutions to geometric optimization problems. For ML applications, coresets provide theoretical guarantees: a well-constructed coreset of size independent of the original dataset can approximate the full dataset's loss function within a factor of $(1 + \delta)$.
The goal is to find a compact set of examples that allows a model to generalize as well as it would if trained on the full dataset. Several algorithmic families have proven effective, each with distinct computational trade-offs.
\index{k-Center Algorithm!coreset selection}
Geometry-based methods select samples that cover the data distribution without requiring any model training. The k-Center algorithm[^fn-k-center] (the minimax variant of the facility location problem) selects samples that minimize the maximum distance from any point to its nearest selected center, ensuring coverage of the entire data manifold.
[^fn-k-center]: **k-Center Algorithm**: Dorit Hochbaum and David Shmoys established the modern approach to this problem in 1985 [@hochbaum1985best], proving that their 2-approximation algorithm is "best possible": no polynomial-time algorithm can achieve a better approximation factor unless P=NP. The algorithm's origin in facility location (placing warehouses to minimize maximum customer distance) explains why it transfers well to coreset selection: both seek coverage of a space with minimal representatives.
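One common way to approximate the k-Center objective is greedy farthest-first traversal (Gonzalez's heuristic), which is also a 2-approximation and runs in O(NK) time. The sketch below uses that greedy variant, not necessarily the Hochbaum and Shmoys procedure cited above; the function name `greedy_k_center` and the synthetic three-cluster features are illustrative assumptions.

```python
import numpy as np


def greedy_k_center(features: np.ndarray, k: int, first: int = 0) -> list[int]:
    """Farthest-first traversal: repeatedly add the point farthest from all centers."""
    selected = [first]
    # Distance from every point to its nearest selected center so far
    min_dist = np.linalg.norm(features - features[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # point least covered by current centers
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


# Three synthetic clusters standing in for feature embeddings (assumed data)
rng = np.random.default_rng(0)
features = np.concatenate(
    [rng.normal(center, 0.1, size=(50, 2)) for center in (0.0, 5.0, 10.0)]
)
centers = greedy_k_center(features, k=3)
print("Selected indices:", centers)
```

With a budget of three, the traversal picks one representative from each cluster, which illustrates the coverage property: no region of the feature space is left far from a selected sample.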
\index{Herding!coreset selection}
Herding takes a different approach, iteratively selecting samples whose features best approximate the mean of the full dataset, thereby maintaining distributional fidelity. These methods are computationally attractive because they operate purely on feature representations, but they ignore label information entirely.
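The mean-matching idea behind herding can be sketched as a greedy loop. This is a simplified variant operating directly on raw feature vectors (kernel herding generalizes it to a feature map); the function name `herding_select` and the Gaussian test data are assumptions for illustration.

```python
import numpy as np


def herding_select(features: np.ndarray, k: int) -> list[int]:
    """Greedily pick k samples whose running mean tracks the full-dataset mean."""
    mu = features.mean(axis=0)
    selected: list[int] = []
    running_sum = np.zeros_like(mu)
    available = np.ones(len(features), dtype=bool)
    for t in range(1, k + 1):
        # Mean of the selected set if each candidate were added next
        candidate_means = (running_sum + features) / t
        err = np.linalg.norm(candidate_means - mu, axis=1)
        err[~available] = np.inf  # select without replacement
        best = int(np.argmin(err))
        selected.append(best)
        available[best] = False
        running_sum += features[best]
    return selected


rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4))  # assumed stand-in for embeddings
subset = herding_select(features, k=20)

full_mean = features.mean(axis=0)
herded_error = np.linalg.norm(features[subset].mean(axis=0) - full_mean)
random_error = np.linalg.norm(features[:20].mean(axis=0) - full_mean)
print(f"mean error: herded={herded_error:.4f}, first-20 baseline={random_error:.4f}")
```

Each step costs O(N), giving O(NK) overall. Because the loop never inspects labels, it can match the feature mean closely while still missing class-specific structure.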
\index{GraNd!gradient-based coreset scoring}
\index{EL2N!definition}
\index{Forgetting Events!coreset selection}
Gradient-based methods offer higher selection quality by using training dynamics to identify important samples, though they require training a proxy model first. GraNd (Gradient Normed) and EL2N (Error L2-Norm)[^fn-el2n-grand] score samples by gradient magnitude or prediction error early in training; high-scoring samples lie near the decision boundary and are most informative for learning. \index{Proxy Model!coreset score transfer}
Crucially, these scores transfer across architectures: scores computed on a smaller model like ResNet-18 predict importance for larger models like ResNet-50, enabling inexpensive proxy-based selection. Forgetting Events[^fn-forgetting] tracks how often a sample is "forgotten" (correctly classified, then later misclassified) during training, identifying harder and more valuable examples.
[^fn-el2n-grand]: **EL2N and GraNd**: Introduced by Mansheej Paul and colleagues at NeurIPS 2021 in their paper "Deep Learning on a Data Diet." These scores identify important examples using only information from the first few training epochs, unlike forgetting-based methods that require full training. The key insight: samples the model finds uncertain early in training remain important throughout, and these scores transfer across architectures. Scores computed on ResNet-18 predict importance for ResNet-50.
[^fn-forgetting]: **Forgetting Events**: Coined by Mariya Toneva and colleagues at ICLR 2019. A "forgetting event" occurs when a sample transitions from correctly to incorrectly classified during training (the opposite of a learning event). The surprising finding: a large fraction of samples are never forgotten once learned, and these "unforgettable" examples can be safely pruned with minimal accuracy impact.
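Counting forgetting events requires only a per-epoch record of whether each sample was classified correctly. The sketch below assumes that record is available as a boolean matrix; the matrix values are hypothetical, and the handling of never-learned samples (kept separate from "unforgettable" ones) is one reasonable reading of the definition above.

```python
import numpy as np


def count_forgetting_events(correct: np.ndarray) -> np.ndarray:
    """Count correct -> incorrect transitions per sample.

    correct[e, i] is True if sample i was classified correctly at epoch e.
    """
    prev, curr = correct[:-1], correct[1:]
    return np.sum(prev & ~curr, axis=0)


# Hypothetical correctness history: 5 epochs (rows) x 4 samples (columns)
history = np.array(
    [
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
    ],
    dtype=bool,
)

events = count_forgetting_events(history)
ever_learned = history.any(axis=0)
# Unforgettable: learned at some point and never forgotten afterwards
unforgettable = ever_learned & (events == 0)

print("forgetting events:", events.tolist())        # [0, 1, 2, 0]
print("unforgettable:    ", unforgettable.tolist())  # [True, False, False, False]
```

Sample 0 is a pruning candidate because it is learned immediately and never forgotten. Sample 3 also has zero events, but only because it is never learned, which is why the `ever_learned` mask matters.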
These gradient-based approaches generally outperform geometry-based methods in selection quality but incur the overhead of proxy model training. For many production workloads the quality advantage justifies that overhead; @tbl-coreset-comparison summarizes the trade-offs:
| **Method** | **Compute Cost** | **Requires Training** | **Best For** | **Limitation** |
|:---------------|:---------------------|:----------------------|:----------------------|:--------------------------|
| **k-Center** | O(N²) or O(NK) | No | Coverage, exploration | Ignores label information |
| **Herding** | O(NK) | No | Distribution matching | Assumes Gaussian-like |
| **GraNd** | O(epochs $\times$ N) | Yes (few epochs) | Decision boundaries | Requires proxy training |
| **Forgetting** | O(full training) | Yes (full) | Hard examples | Expensive to compute |
| **EL2N** | O(epochs $\times$ N) | Yes (few epochs) | Uncertainty sampling | Best with proxy model |
: **Coreset Selection Algorithm Comparison.** N = dataset size, K = coreset size. The fundamental trade-off is selection quality versus computational cost: gradient-based methods (GraNd, EL2N, Forgetting) outperform geometry-based methods (k-Center, Herding) because they use training dynamics to identify decision-boundary samples, but this advantage requires proxy model training as an upfront investment. {#tbl-coreset-comparison .striped .hover}
Each algorithm in @tbl-coreset-comparison represents a different answer to the ICR framework's central question: where in the compute-versus-information trade-off should the selection budget be spent to maximize learning signal per FLOP?
@fig-coreset-selection makes the core insight behind coreset methods concrete. Compare the two panels: random sampling (left) selects points uniformly across the feature space, capturing many samples deep within class regions where the model is already confident. Coreset selection (right) concentrates the selection budget on samples near the decision boundary (the yellow uncertainty band) where the model's predictions are most uncertain. These boundary samples are precisely where additional training provides the most learning signal.
::: {#fig-coreset-selection fig-env="figure" fig-pos="htb" fig-cap="**Coreset Selection Strategy**: Random sampling (left) selects uniformly, wasting budget on easy samples far from the decision boundary. Coreset selection (right) prioritizes samples near the boundary where the model is uncertain, capturing more information per sample." fig-alt="Two scatter plots with a diagonal decision boundary. Left plot shows random dots selected. Right plot highlights dots near the boundary as selected."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
% Left plot: Random Sampling
\begin{scope}
\node[font=\bfseries\usefont{T1}{phv}{m}{n}] at (2.5, 4.2) {Random Sampling};
% Decision boundary
\draw[thick, dashed, gray] (0, 0) -- (5, 5);
% Class A points (below line) - circles
\foreach \x/\y in {0.5/0.2, 1.0/0.5, 0.8/1.2, 1.5/0.8, 2.0/1.0,
2.5/1.5, 1.2/0.3, 0.3/0.8, 1.8/1.5, 2.2/0.5,
3.0/2.0, 3.5/2.5, 2.8/1.8, 3.2/1.2, 4.0/2.8} {
\fill[blue!60] (\x, \y) circle (2pt);
}
% Class B points (above line) - red circles
\foreach \x/\y in {0.5/1.5, 1.0/2.0, 0.3/2.5, 1.5/2.5, 2.0/3.0,
2.5/3.5, 1.2/3.2, 0.8/3.8, 1.8/3.5, 2.2/4.0,
3.0/4.0, 3.5/4.5, 2.8/3.8, 3.2/4.2, 4.0/4.5} {
\fill[red!60] (\x, \y) circle (2pt);
}
% Randomly selected (circled) - some easy, some hard
\foreach \x/\y in {0.5/0.2, 1.5/2.5, 3.0/2.0, 0.8/3.8, 2.2/0.5} {
\draw[thick, orange] (\x, \y) circle (5pt);
}
% Axis
\draw[->] (0, 0) -- (5.2, 0) node[right, font=\tiny\usefont{T1}{phv}{m}{n}] {$x_1$};
\draw[->] (0, 0) -- (0, 5.2) node[above, font=\tiny\usefont{T1}{phv}{m}{n}] {$x_2$};
% Label
\node[font=\footnotesize\usefont{T1}{phv}{m}{n}, orange] at (2.5, -0.5) {Selected (random)};
\end{scope}
% Right plot: Coreset Selection
\begin{scope}[xshift=7cm]
\node[font=\bfseries\usefont{T1}{phv}{m}{n}] at (2.5, 4.2) {Coreset Selection};
% Decision boundary
\draw[thick, dashed, gray] (0, 0) -- (5, 5);
% Uncertainty band near boundary
\fill[yellow!20] (0, 0) -- (0, 1) -- (4, 5) -- (5, 5) -- (5, 4) -- (1, 0) -- cycle;
\node[font=\tiny\usefont{T1}{phv}{m}{n}, fill=white, inner sep=1pt] at (3.5, 3.0) {High uncertainty};
% Class A points (below line) - circles
\foreach \x/\y in {0.5/0.2, 1.0/0.5, 0.8/1.2, 1.5/0.8, 2.0/1.0,
2.5/1.5, 1.2/0.3, 0.3/0.8, 1.8/1.5, 2.2/0.5,
3.0/2.0, 3.5/2.5, 2.8/1.8, 3.2/1.2, 4.0/2.8} {
\fill[blue!60] (\x, \y) circle (2pt);
}
% Class B points (above line) - red circles
\foreach \x/\y in {0.5/1.5, 1.0/2.0, 0.3/2.5, 1.5/2.5, 2.0/3.0,
2.5/3.5, 1.2/3.2, 0.8/3.8, 1.8/3.5, 2.2/4.0,
3.0/4.0, 3.5/4.5, 2.8/3.8, 3.2/4.2, 4.0/4.5} {
\fill[red!60] (\x, \y) circle (2pt);
}
% Coreset selected (near boundary) - circled
\foreach \x/\y in {0.8/1.2, 2.5/1.5, 3.0/2.0, 1.0/2.0, 2.0/3.0} {
\draw[thick, green!60!black] (\x, \y) circle (5pt);
}
% Axis
\draw[->] (0, 0) -- (5.2, 0) node[right, font=\tiny\usefont{T1}{phv}{m}{n}] {$x_1$};
\draw[->] (0, 0) -- (0, 5.2) node[above, font=\tiny\usefont{T1}{phv}{m}{n}] {$x_2$};
% Label
\node[font=\footnotesize\usefont{T1}{phv}{m}{n}, green!60!black] at (2.5, -0.5) {Selected (boundary)};
\end{scope}
\end{tikzpicture}
```
:::
\index{Proxy Model!practical workflow}
Given these trade-offs, most practitioners find that EL2N with a small proxy model offers the best balance of selection quality and computational cost. The approach is straightforward: train a lightweight model (for example, ResNet-18 instead of ResNet-50) for 5 to 10 epochs, compute EL2N scores for all samples, then select the highest-scoring subset. The proxy does not need to be accurate; it only needs to identify which samples are hard. This upfront investment in proxy training typically yields substantial returns when the coreset reduces subsequent training by 50% or more. The following example illustrates this workflow in a concrete scenario.
```{python}
#| label: coreset-practice-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CORESET PRACTICE EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Coreset Selection in Practice" callout
# │
# │ Goal: Outline a practical workflow for 10× data reduction.
# │ Show: How a 5-epoch proxy model can identify the most informative 10% of a dataset.
# │ How: Model the selection of 100K high-uncertainty samples from a 1M image pool.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: n_train_images_str, coreset_fraction_pct_str, n_coreset_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (practical coreset scenario) ---
n_train_images_value = 1_000_000 # 1M training images
coreset_fraction_value = 0.1 # keep 10%
n_epochs_proxy_value = 5 # proxy training epochs
# --- Process ---
n_coreset_value = int(n_train_images_value * coreset_fraction_value)
# --- Outputs (formatted strings for prose) ---
n_train_images_str = fmt(n_train_images_value / MILLION, precision=0) + " million" # e.g. "1 million"
coreset_fraction_pct_str = fmt(coreset_fraction_value * 100, precision=0, commas=False) # e.g. "10"
n_coreset_str = fmt(n_coreset_value, precision=0, commas=True) # e.g. "100,000"
n_epochs_proxy_str = fmt(n_epochs_proxy_value, precision=0, commas=False) # e.g. "5"
```
::: {.callout-example title="Coreset Selection in Practice"}
**Scenario**: You have `{python} n_train_images_str` training images and want to reduce to `{python} n_coreset_str` (`{python} coreset_fraction_pct_str`%) for faster experimentation.
**Naive Approach**: Uniform random sampling tends to under-sample rare classes and edge cases.
**Coreset Approach**:
1. Train a small proxy model for `{python} n_epochs_proxy_str` epochs
2. Compute EL2N scores for all samples
3. Select the `{python} n_coreset_str` samples with highest uncertainty
4. Train your full model on this coreset
**Result**: The coreset often achieves **higher accuracy** than a random subset of the same size because it focuses on the decision boundary rather than redundant "easy" examples.
:::
@lst-el2n-coreset demonstrates how to compute EL2N scores and select a coreset using a lightweight proxy model. The code shows two functions: `compute_el2n_scores` trains a proxy model briefly and measures prediction confidence via L2 distance from one-hot labels, while `select_coreset` retains only the highest-uncertainty samples.
::: {#lst-el2n-coreset lst-cap="**EL2N-Based Coreset Selection**: Computing uncertainty scores with a proxy model enables 10 $\times$ data reduction while preserving accuracy. The `compute_el2n_scores` function trains a small model for a few epochs, then measures prediction confidence via L2 distance from one-hot labels. High scores indicate uncertain samples near decision boundaries. The `select_coreset` function retains only these informative samples, discarding redundant easy examples."}
```{.python}
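import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset


def compute_el2n_scores(model, dataset, num_epochs=5, batch_size=256):
    """Compute EL2N scores.

    Returns a tensor with the L2 norm of (prediction - one_hot_label)
    for each sample, in dataset index order.
    `train_proxy` is a training routine assumed to be defined elsewhere.
    """
    # Train proxy model for a few epochs to get meaningful predictions
    train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    train_proxy(model, train_loader, num_epochs)

    # Score in a fixed order so that scores[i] corresponds to dataset[i]
    score_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    scores = []
    model.eval()
    with torch.no_grad():  # no gradients needed for scoring
        for x, y in score_loader:
            probs = F.softmax(model(x), dim=1)
            # One-hot encode labels
            one_hot = F.one_hot(y, num_classes=probs.shape[1]).to(probs.dtype)
            # EL2N score = L2 distance between prediction and one-hot label
            el2n = (probs - one_hot).norm(dim=1)  # High = uncertain
            scores.append(el2n)
    return torch.cat(scores)


def select_coreset(scores, dataset, fraction=0.1):
    """Select top-k highest-scoring (most uncertain) samples."""
    k = int(len(dataset) * fraction)
    # Sort by score descending (highest uncertainty first)
    indices = torch.argsort(scores, descending=True)[:k]
    return Subset(dataset, indices.tolist())


# Usage: keep the 10% highest-scoring samples.
# proxy_model, model, full_dataset, train_proxy, and train_full_model
# are assumed to be defined elsewhere.
scores = compute_el2n_scores(proxy_model, full_dataset)
coreset = select_coreset(scores, full_dataset, fraction=0.1)
train_full_model(model, coreset)  # ~10x fewer samples per epoch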
```
:::
### Data Deduplication {#sec-data-selection-data-deduplication-6c20}
\index{Deduplication!definition}
While coreset selection identifies which samples to keep based on their informativeness, a complementary approach targets what to remove: exact and near-duplicates. Deduplication provides immediate efficiency gains, typically with no accuracy penalty, and requires no model training. This makes it the most accessible optimization in data selection: exact-match deduplication in particular saves compute in proportion to the duplicate rate with minimal risk of degrading model quality.
\index{Hash-based Deduplication!exact matching}
The simplest form of deduplication (introduced as a data engineering pipeline stage in @sec-data-engineering-systematic-data-processing-aebc, and here elevated to an optimization lever) uses hash-based methods for exact matches. By computing a cryptographic hash (MD5 or SHA-256) for each sample and removing those with identical hashes, practitioners can eliminate byte-for-byte duplicates that inevitably accumulate in large web-scraped corpora. This process is computationally cheap, scaling linearly with dataset size, and can be parallelized trivially.
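A minimal sketch of exact-match deduplication is shown below, assuming samples are available as raw bytes. The function name `deduplicate_exact` and the toy documents are illustrative, not part of any pipeline described in this book.

```python
import hashlib


def deduplicate_exact(samples):
    """Keep the first occurrence of each byte-identical sample."""
    seen = set()
    unique = []
    for sample in samples:
        # SHA-256 digest acts as a fixed-size fingerprint of the raw bytes
        digest = hashlib.sha256(sample).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


# Toy corpus (hypothetical): the third item repeats the first exactly,
# while the fourth differs by a single trailing space.
docs = [b"the cat sat", b"a dog ran", b"the cat sat", b"the cat sat "]
unique_docs = deduplicate_exact(docs)
```

Only the byte-for-byte repeat is removed; the variant with a trailing space survives, which is precisely the gap that near-duplicate detection is meant to close.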
\index{MinHash!near-duplicate detection}
\index{Locality-Sensitive Hashing!near-duplicate detection}
Near-duplicate detection addresses the more subtle problem of redundant content that differs at the byte level. For text, MinHash[^fn-minhash] with *Locality-Sensitive Hashing*[^fn-lsh] (LSH) approximates Jaccard similarity[^fn-jaccard] efficiently, detecting lightly edited or templated content that shares most of its wording. The core idea is to create compact "fingerprints" of each document such that similar documents produce similar fingerprints with high probability, enabling fast approximate similarity detection without comparing every document pair.
[^fn-jaccard]: **Jaccard Similarity**\index{Jaccard Similarity!etymology}: Named after Paul Jaccard (1868--1944), a Swiss botanist who introduced the coefficient in 1901 to compare plant species distributions. Defined as $|A \cap B| / |A \cup B|$, it ranges from 0 (no overlap) to 1 (identical sets). It became a foundational tool of information retrieval because it naturally handles "bag of words" comparisons regardless of document length. MinHash [@broder1997resemblance] provides an efficient probabilistic approximation, enabling web-scale deduplication.
[^fn-lsh]: **Locality-Sensitive Hashing (LSH)**\index{LSH!etymology}: Introduced by Indyk and Motwani [@indyk1998approximate] at Stanford (1998). Unlike cryptographic hash functions (where similar inputs produce *dissimilar* outputs), LSH functions produce *similar* outputs for similar inputs with high probability. This enables approximate nearest-neighbor search in sublinear time, making it feasible to find near-duplicates in billion-document corpora without exhaustive pairwise comparison.
[^fn-minhash]: **MinHash**: Invented by Andrei Broder in 1997 [@broder1997resemblance], originally to detect duplicate web pages for the AltaVista search engine. The algorithm uses random hash functions to create compact "signatures" that preserve set similarity: two documents with similar content produce similar signatures with high probability. Broder received the 2012 ACM Kanellakis Award for this work, recognizing its foundational impact on web-scale similarity detection.
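The fingerprinting idea can be sketched with the standard library alone. The code below is an illustrative implementation under stated assumptions: documents are split into 3-word shingles, salted BLAKE2b hashes stand in for the random hash functions, and signatures are split into 32 bands for LSH bucketing. All function names and parameter values are hypothetical choices, and production systems typically rely on tuned libraries.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash64(item, seed):
    """Deterministic 64-bit hash of a string, salted by an integer seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """One minimum per salted hash function; similar sets share many minima."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Bucket documents by signature bands; only bucket-mates are compared."""
    rows = len(signatures[0]) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates


# Toy documents (hypothetical): the first two differ only in the last word.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank yesterday",
    "completely unrelated text about training neural networks on large gpu clusters",
]
signatures = [minhash_signature(shingles(d)) for d in docs]
candidates = lsh_candidate_pairs(signatures)
```

The first two documents share 11 of 13 distinct shingles, so their true Jaccard similarity is 11/13; the signature-based estimate should land close to that value, while the unrelated third document shares no shingles with either. Banding trades recall against the number of candidate pairs: more bands with fewer rows each catch lower-similarity pairs at the cost of more comparisons.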
\index{Perceptual Hashing!image deduplication}
\index{Embedding-based Deduplication!semantic similarity}
For images, perceptual hashing produces signatures robust to minor transformations like resizing and compression, identifying visually identical images stored in different formats. Embedding-based similarity offers the most powerful detection by computing dense representations (CLIP[^fn-clip] for images, sentence transformers for text) and clustering similar items, though this approach incurs higher computational overhead.
[^fn-clip]: **CLIP (Contrastive Language-Image Pre-training)**: Introduced by Alec Radford et al. at OpenAI in January 2021. CLIP trains a visual encoder and text encoder jointly on 400 million image-text pairs from the internet, learning to align images and their natural language descriptions in a shared embedding space. The name reflects its training objective: contrastive learning between language and image representations. For data selection, CLIP embeddings serve as a universal similarity metric: images that are semantically similar (even if visually different) produce close embeddings, enabling semantic deduplication far more powerful than pixel-level hashing. CLIP's zero-shot transfer capability also enables text-based filtering of image datasets without task-specific training.
For foundation model pre-training, deduplication has become essential rather than optional. Studies on GPT-3 and LLaMA training demonstrate that deduplicated data improves both training efficiency and downstream performance by preventing memorization of repeated content. The benefit is twofold: fewer wasted FLOPs on redundant samples, and better generalization because the model sees more diverse examples per training token.
Deduplication benefits extend beyond text corpora. The DLRM lighthouse presents a unique variant of this challenge centered on *embedding deduplication*.
::: {.callout-lighthouse title="DLRM and Embedding Deduplication"}
Our **DLRM Lighthouse model** (@sec-network-architectures) presents a unique deduplication challenge. Recommendation systems are memory capacity-bound, with embedding tables consuming terabytes of storage for billions of user/item IDs. Much of this capacity is wasted on *cold embeddings*, IDs that appear rarely in training data.
Data selection for DLRM focuses on **interaction deduplication** (removing redundant user-item pairs) and **embedding pruning** (removing or sharing cold embeddings). A 20% reduction in unique interactions can reduce embedding table size by 30--40%, directly addressing DLRM's primary bottleneck: memory capacity rather than compute.
:::
### Data Pruning by Quality {#sec-data-selection-data-pruning-quality-3d69}
Deduplication removes redundant samples, but a third category of problematic data remains: samples that actively harm learning. Quality-based pruning eliminates samples that either contribute no meaningful signal or introduce contradictory information that confuses the optimization process.
\index{Quality Pruning!label error detection}
Label error detection represents the most impactful form of quality pruning. Tools like Cleanlab identify samples where the assigned label is likely incorrect based on model confidence patterns across training. A sample that the model consistently predicts as class A but is labeled class B either represents a hard case near the decision boundary or, more commonly, an annotation mistake. Removing or correcting these mislabeled samples prevents the model from learning contradictory signals that degrade its decision boundary.
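The confidence-pattern idea can be illustrated with a simplified sketch. This is not Cleanlab's API; it is a reduced version of the underlying reasoning, assuming out-of-sample predicted probabilities (for example, from cross-validation) are already available and that every class has at least one labeled sample. The function name and toy data are hypothetical.

```python
import numpy as np


def flag_likely_label_errors(pred_probs, labels):
    """Flag samples confidently predicted as a class other than their label."""
    n_samples, n_classes = pred_probs.shape
    rows = np.arange(n_samples)
    # Per-class threshold: mean predicted probability of class k among
    # samples labeled k (assumes each class appears at least once)
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    confident = pred_probs >= thresholds
    given_label_confident = confident[rows, labels]
    # Ignore the given label's column when looking for a confident alternative
    confident_other = confident.copy()
    confident_other[rows, labels] = False
    return np.where(confident_other.any(axis=1) & ~given_label_confident)[0]


# Toy data (hypothetical): sample 4 is labeled class 0, but the model assigns
# class 1 a probability of 0.95.
pred_probs = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
)
labels = np.array([0, 0, 1, 1, 0])
suspects = flag_likely_label_errors(pred_probs, labels)
```

Flagged samples are candidates for review rather than automatic deletion, since a confident disagreement can also mark a hard but correctly labeled example.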
Outlier removal addresses a different pathology: samples far from any cluster center in feature space. While outliers might represent valuable edge cases, they more often indicate noise, annotation errors, or data corruption. The key is distinguishing between informative outliers (rare but valid examples of a class) and noise (samples that do not belong to any class). Conservative thresholds help avoid discarding genuinely rare examples.
\index{Perplexity Filtering!low-information text removal}
Low-information filtering applies domain-specific heuristics to remove samples that lack sufficient signal for learning. For text corpora, this means scoring documents with a reference language model and removing those whose perplexity falls outside an acceptable range or whose semantic coherence is low: very high perplexity typically indicates garbled content, while very low perplexity often indicates repetitive boilerplate or machine-generated spam. For image datasets, filtering targets blurry, corrupted, or near-uniform samples that provide little visual information.
Together, these three static pruning techniques (coreset selection, deduplication, and quality filtering) show that careful curation before training yields significant efficiency gains. The compute savings recur in every epoch of training: a 50% dataset reduction means 50% fewer forward passes, backward passes, and gradient updates per epoch. For a model trained for 100 epochs, this translates to 50 epochs' worth of saved compute, yielding substantial reductions in both training time and energy consumption.
Static pruning answers a question about *what* to keep, but it treats the answer as fixed. Once the pruned dataset is determined, every epoch trains on the same subset. What if the optimal training samples change as the model learns? The next section explores techniques that adapt the training data dynamically, selecting different samples at different stages of training based on what the model has already mastered.
## Dynamic Selection {#sec-data-selection-dynamic-selection-edaa}
\index{Dynamic Selection!training-time optimization}
The optimal training samples do change as the model learns. Early in training, the model benefits from diverse coverage to build broad feature representations; later, it benefits from focusing on hard examples near the decision boundary to refine its predictions. Dynamic selection exploits this insight by optimizing which samples to use *during* training, adapting the data diet based on the model's evolving state.
### Curriculum Learning: Easy to Hard {#sec-data-selection-curriculum-learning-easy-hard-2c4e}
\index{Bengio, Yoshua!curriculum learning}
\index{Curriculum Learning!difficulty scorer and pacing function}
The first dynamic selection technique, **curriculum learning**\index{Dynamic Selection!curriculum learning}[^fn-curriculum] [@bengio2009curriculum; @soviany2022curriculum], structures the order in which data is presented to the model. Instead of random shuffling, it starts with simpler examples and gradually introduces more complex ones, mirroring how humans learn by mastering basics before advancing to harder material.
[^fn-curriculum]: **Curriculum Learning**: Formalized by Yoshua Bengio and colleagues at ICML 2009, drawing explicit inspiration from human education where students master basics before advanced topics. The paper's key insight was that curriculum learning acts as a "continuation method" for non-convex optimization: starting with easy examples smooths the loss landscape, helping the optimizer find better local minima. The paper has accumulated thousands of citations, reflecting its influence on training methodology.
The effectiveness of curriculum learning stems from how neural networks respond to gradient signals at different training stages. Easy examples provide clear, consistent gradients that establish strong feature representations early in training, when the loss landscape is highly irregular. Hard examples introduced too early produce noisy gradient signals that slow convergence or cause the model to memorize outliers rather than learn general patterns. By sequencing examples from easy to hard, curriculum learning smooths the optimization trajectory.
\index{Pacing Function!linear warmup}
Implementing a curriculum requires two components: a difficulty scorer that ranks samples, and a pacing function that controls how quickly hard samples are introduced. A common choice is linear pacing:
$$
\text{samples}_t = \texttt{sort\_by\_difficulty}[:N \cdot \min(1, t/T_{\text{warmup}})]
$$
where $t$ is the current epoch, $N$ is the number of training samples, and $T_{\text{warmup}}$ is the epoch at which the full dataset becomes available. Early epochs train on the easiest $N \cdot (t/T_{\text{warmup}})$ samples; after warmup, training proceeds on the full dataset.
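The linear pacing rule translates directly into a few lines of NumPy. The sketch below assumes per-sample difficulty scores are already available; the function name `curriculum_indices` is illustrative, and the floor of one sample at epoch 0 is an added assumption so that no epoch is empty.

```python
import numpy as np


def curriculum_indices(difficulty, epoch, warmup_epochs):
    """Return indices of the samples available at a given epoch."""
    order = np.argsort(difficulty, kind="stable")  # easiest first
    fraction = min(1.0, epoch / warmup_epochs)     # min(1, t / T_warmup)
    # Assumption: always keep at least one sample, even at epoch 0
    k = max(1, int(len(order) * fraction))
    return order[:k]


# Toy difficulty scores (hypothetical), e.g. losses from a probe model
difficulty = np.array([0.9, 0.1, 0.5, 0.3])
early = curriculum_indices(difficulty, epoch=1, warmup_epochs=4)
middle = curriculum_indices(difficulty, epoch=2, warmup_epochs=4)
late = curriculum_indices(difficulty, epoch=10, warmup_epochs=4)
```

With four samples and a four-epoch warmup, the pool grows by one sample per epoch in order of increasing difficulty, and any epoch past the warmup sees the full dataset. In a training loop, the returned indices would typically feed a subset sampler that is rebuilt at the start of each epoch.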
The difficulty scorer can be designed in several ways, each with different computational requirements and applicability (@tbl-difficulty-scoring).
| **Strategy** | **Difficulty Score** | **Best For** |
|:----------------------|:------------------------------------------|:--------------------------------------------|
| **Loss-Based** | Loss from probe model (low = easy) | General-purpose; requires probe training |
| **Confidence-Based** | Teacher model confidence (high = easy) | When teacher available; distillation setups |
| **Domain Heuristics** | Sentence length, image complexity | No extra compute; domain knowledge required |
| **Self-Paced** | Current model's loss (updated each epoch) | Adaptive; no probe needed |
: **Difficulty Scoring Strategies for Curriculum Learning.** Loss-based and confidence-based methods require additional model inference; domain heuristics are free but require expertise; self-paced methods adapt dynamically during training. {#tbl-difficulty-scoring .striped .hover}
From a systems perspective, curriculum learning improves convergence by reducing wasted gradient updates on samples the model cannot yet learn from. The Information-Compute Ratio is higher in early training because easy samples provide strong learning signal relative to their compute cost. The efficiency gains manifest as faster convergence to target accuracy, not higher final accuracy. @tbl-curriculum-benchmarks summarizes measured speedups from curriculum learning across standard benchmarks:
```{python}
#| label: curriculum-benchmarks-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CURRICULUM LEARNING BENCHMARKS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-curriculum-benchmarks (Curriculum Learning section)
# │
# │ Goal: Quantify speedups from curriculum learning across benchmarks.
# │ Show: That speedups depend on dataset redundancy (23% for CIFAR vs. 11% for ImageNet).
# │ How: List reported speedup percentages for standard vision datasets.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: cifar10_speedup_str, imagenet_speedup_str, mentornet_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (benchmark results from literature) ---
cifar10_baseline_epochs = 150 # standard training
cifar10_curriculum_epochs = 115 # with curriculum
cifar100_baseline_epochs = 220 # standard training
cifar100_curriculum_epochs = 180 # with curriculum
imagenet_baseline_epochs = 90 # standard training
imagenet_curriculum_epochs = 80 # with curriculum
mentornet_baseline_epochs = 90 # noisy labels
mentornet_curriculum_epochs = 70 # with MentorNet
# --- Process (compute speedup percentages) ---
cifar10_speedup_pct = (cifar10_baseline_epochs - cifar10_curriculum_epochs) / cifar10_baseline_epochs * 100
cifar100_speedup_pct = (cifar100_baseline_epochs - cifar100_curriculum_epochs) / cifar100_baseline_epochs * 100
imagenet_speedup_pct = (imagenet_baseline_epochs - imagenet_curriculum_epochs) / imagenet_baseline_epochs * 100
mentornet_speedup_pct = (mentornet_baseline_epochs - mentornet_curriculum_epochs) / mentornet_baseline_epochs * 100
# --- Outputs (formatted strings for table) ---
cifar10_speedup_str = fmt(cifar10_speedup_pct, precision=0, commas=False) # e.g. "23"
cifar100_speedup_str = fmt(cifar100_speedup_pct, precision=0, commas=False) # e.g. "18"
imagenet_speedup_str = fmt(imagenet_speedup_pct, precision=0, commas=False) # e.g. "11"
mentornet_speedup_str = fmt(mentornet_speedup_pct, precision=0, commas=False) # e.g. "22"
```
Observe the varying convergence gains in @tbl-curriculum-benchmarks:
| **Dataset** | **Model** | **Pacing Strategy** | **Epochs to Target Acc.** | **Speedup** |
|:--------------|----------:|:--------------------|-----------------------------------------------------------------------------------------:|---------------------------------------------:|
| **CIFAR-10** | ResNet-18 | Linear warmup | `{python} cifar10_curriculum_epochs` vs. `{python} cifar10_baseline_epochs` baseline | **`{python} cifar10_speedup_str`%** faster |
| **CIFAR-100** | ResNet-32 | Self-paced | `{python} cifar100_curriculum_epochs` vs. `{python} cifar100_baseline_epochs` baseline | **`{python} cifar100_speedup_str`%** faster |
| **ImageNet** | ResNet-50 | Loss-based | `{python} imagenet_curriculum_epochs` vs. `{python} imagenet_baseline_epochs` baseline | **`{python} imagenet_speedup_str`%** faster |
| **ImageNet** | ResNet-50 | MentorNet (noisy) | `{python} mentornet_curriculum_epochs` vs. `{python} mentornet_baseline_epochs` baseline | **`{python} mentornet_speedup_str`%** faster |
: **Curriculum Learning Convergence Speedups.** Target accuracy is 95% of final baseline performance. Gains are larger on redundant datasets (CIFAR-10) and noisy datasets (MentorNet removes approximately 40% noise). ImageNet shows smaller gains because the dataset is less redundant. {#tbl-curriculum-benchmarks .striped .hover}
\index{Self-Paced Learning!adaptive difficulty}
\index{Anti-Curriculum!hard examples first}
The table reveals an important pattern: curriculum learning gains *tend to shrink as dataset quality improves*. On highly curated datasets like ImageNet, the `{python} imagenet_speedup_str`% speedup is modest. On noisy or redundant data, gains can exceed 20%. The optimal ordering is also task-dependent: *anti-curriculum* (hard examples first) can work when the decision boundary is complex and easy examples contribute little to defining it, while *self-paced learning* lets the model dynamically adjust difficulty based on its current loss, eliminating the need to pre-define a curriculum. Empirically, self-paced methods often match or exceed hand-designed curricula.
### Active Learning: Human-in-the-Loop {#sec-data-selection-active-learning-humanintheloop-6932}
Curriculum learning optimizes the order in which samples are presented but assumes all samples are already labeled. This assumption breaks down in specialized fields such as medical diagnosis, autonomous driving, and scientific research, where labeling requires domain expertise and can cost \$5--\$100 or more per sample. Rather than labeling everything upfront, **active learning**\index{Dynamic Selection!active learning}[^fn-active-learning-theory] [@settles2009active; @ren2021survey] shifts the optimization target: instead of choosing which labeled samples to train on, it chooses which unlabeled samples are worth labeling at all.
[^fn-active-learning-theory]: **Active Learning**: The concept traces to statistical experimental design, but Dana Angluin's work on learning from queries [@angluin1988queries] established theoretical foundations for machine learning. The term "active" contrasts with "passive" learning from pre-labeled data: the learner actively queries an oracle (the human annotator or labeling source) rather than passively receiving examples. Early work in the 1990s demonstrated that active selection could achieve the same accuracy as passive learning with exponentially fewer labels in favorable cases.
Unlike static pruning, which discards samples permanently, active learning maintains an unlabeled pool and queries it strategically over time. Follow the cycle in @fig-active-learning-loop: the model's current uncertainty determines what gets labeled next, creating a feedback loop where each labeling round improves the model's ability to identify what it still needs to learn.
::: {#fig-active-learning-loop fig-cap="**Active Learning Loop**: Instead of labeling all data, the model selects the most 'confusing' or informative samples from an unlabeled pool. These samples are sent to an Oracle (human annotator) and added to the training set. The model is retrained, and the cycle repeats, creating a feedback loop that maximizes information gain per label." fig-alt="A cycle diagram: Unlabeled Pool -> Selection Strategy -> Oracle (Human) -> Training Set -> Model -> back to Unlabeled Pool."}
```{=latex}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth]
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D6EAF8}
\definecolor{OrangeLine}{HTML}{E37222}
\definecolor{OrangeL}{HTML}{FDEBD0}
\definecolor{RedLine}{HTML}{DA291C}
\definecolor{RedL}{HTML}{FADBD8}
\definecolor{GreyFill}{HTML}{E8E8E8}
\definecolor{GreyLine}{HTML}{888888}
\tikzset{
Box/.style={
draw,
rounded corners,
minimum width=2.2cm,
minimum height=1.2cm,
align=center,
line width=1pt
},
Arrow/.style={->, line width=1.2pt, color=black!70}
}
% Nodes
% Unlabeled Pool (Top Left)
\node[Box, fill=GreyFill, draw=GreyLine, dashed] (Pool) at (0, 3) {Unlabeled\\Pool};
% Selection Strategy (Top Right)
\node[Box, fill=BlueL, draw=BlueLine] (Select) at (5, 3) {Selection\\Strategy};
% Oracle (Middle Right)
\node[circle, draw=OrangeLine, fill=OrangeL, line width=1pt, minimum size=1.5cm, align=center] (Oracle) at (5, 0) {Oracle\\(Human)};
% Training Set (Bottom Right)
\node[Box, fill=GreenL, draw=GreenLine] (TrainSet) at (5, -3) {Training\\Set};
% Model (Bottom Left)
\node[Box, fill=RedL, draw=RedLine] (Model) at (0, -3) {Model};
% Arrows
\draw[Arrow] (Pool) -- node[above, font=\scriptsize] {Query} (Select);
\draw[Arrow] (Select) -- node[right, font=\scriptsize] {Uncertainty} (Oracle);
\draw[Arrow] (Oracle) -- node[right, font=\scriptsize] {Labels} (TrainSet);
\draw[Arrow] (TrainSet) -- node[below, font=\scriptsize] {Train} (Model);
\draw[Arrow, dashed] (Model) -- node[left, font=\scriptsize] {Update} (Pool);
\end{tikzpicture}
```
:::
\index{Uncertainty Sampling!active learning query strategy}
\index{Query-by-Committee!active learning strategy}
The effectiveness of active learning depends critically on the query strategy used to select samples for annotation. The simplest approach, uncertainty sampling, selects samples where the model is least confident, such as predictions near 0.5 probability for binary classification. This strategy is computationally cheap and effective in practice. Query-by-committee extends this idea by training multiple models and selecting samples where they disagree most, capturing epistemic uncertainty that a single model might miss.
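Both query strategies reduce to scoring the unlabeled pool and taking the top-scoring samples. The sketch below uses predictive entropy as the uncertainty measure and vote entropy as the committee disagreement measure; these are common choices but assumptions here, as are the function names and the toy probabilities.

```python
import numpy as np


def uncertainty_scores(probs):
    """Predictive entropy per sample; higher means less confident."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def committee_disagreement(member_probs):
    """Vote entropy; member_probs has shape (members, samples, classes)."""
    votes = member_probs.argmax(axis=2)  # each member's predicted class
    n_classes = member_probs.shape[2]
    # Fraction of committee members voting for each class, per sample
    vote_frac = np.stack(
        [(votes == k).mean(axis=0) for k in range(n_classes)], axis=1
    )
    return uncertainty_scores(vote_frac)


def select_queries(scores, budget):
    """Indices of the `budget` highest-scoring samples to send for labeling."""
    return np.argsort(-scores, kind="stable")[:budget]


# Single-model example (hypothetical predictions on three unlabeled samples)
probs = np.array([[0.99, 0.01], [0.5, 0.5], [0.7, 0.3]])
queries = select_queries(uncertainty_scores(probs), budget=2)

# Committee example: three models agree on sample 0 and split 2-1 on sample 1
member_probs = np.array([
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.8, 0.2], [0.6, 0.4]],
    [[0.7, 0.3], [0.3, 0.7]],
])
disagreement = committee_disagreement(member_probs)
```

In the single-model case, the sample predicted at 0.5 is queried first, followed by the 0.7 prediction. In the committee case, only the sample with a split vote receives a nonzero disagreement score, even though every individual member is fairly confident about it.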
For practitioners willing to invest more compute, expected model change selects samples that would cause the largest gradient update if labeled. This approach provides a theoretically grounded but expensive alternative. Diversity sampling complements uncertainty-based methods by selecting samples dissimilar from currently labeled data, ensuring the labeled set covers the full input space rather than clustering around ambiguous regions.
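Diversity sampling is often implemented as a greedy farthest-point search, sometimes called k-center greedy. The sketch below is one such instance, assuming samples are represented as feature vectors and using Euclidean distance; the function name and toy coordinates are hypothetical.

```python
import numpy as np


def k_center_greedy(pool, labeled, budget):
    """Pick pool points that are farthest from everything chosen so far."""
    # Distance from each pool point to its nearest labeled point
    dists = np.linalg.norm(
        pool[:, None, :] - labeled[None, :, :], axis=2
    ).min(axis=1)
    picked = []
    for _ in range(budget):
        idx = int(np.argmax(dists))  # the least-covered point
        picked.append(idx)
        # The newly picked point now covers its own neighborhood
        dists = np.minimum(dists, np.linalg.norm(pool - pool[idx], axis=1))
    return picked


# Toy features (hypothetical): two pool points sit almost on top of each other
labeled = np.array([[0.0, 0.0]])
pool = np.array([[0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [-4.0, 0.0]])
picked = k_center_greedy(pool, labeled, budget=2)
```

After the first pick near (5, 5), its close neighbor is treated as already covered, so the second pick goes to the isolated point at (-4, 0) instead. This is the behavior that keeps the labeled set from clustering in one region.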
Active learning is particularly valuable in domains where labeling requires expertise. In medical imaging, for instance, an AI system diagnosing diseases from X-rays may be confident on common conditions but uncertain about rarer cases. By focusing human annotation on these ambiguous cases, active learning optimizes the use of expensive expert time while accelerating model improvement.
The economic implications are substantial. In production settings, labeling costs often dwarf compute costs because a specialist's time is far more expensive than GPU hours. These query strategies drive each iteration of the active learning loop in @fig-active-learning-loop, and the *active learning ROI* can exceed 10 $\times$, as the following example demonstrates.
```{python}
#| label: active-learning-roi-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ACTIVE LEARNING ROI CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Active Learning ROI" callout (medical imaging scenario)
# │
# │ Goal: Quantify the economic return of active learning.
# │ $5/label, naive labeling costs $5M. Active learning achieves the same
# │ accuracy with 50K labels ($250K), saving $4.75M and enabling 20x faster
# │ training.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: n_unlabeled_str, cost_saving_str, speedup_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (medical imaging scenario) ---
n_unlabeled_value = 1_000_000 # scans in pool
cost_per_label_value = 5.00 # $/label (specialist)
budget_value = 500_000 # $ available
deadline_months_value = 1 # time constraint
# --- Process (compare naive vs active learning) ---
cost_all_value = n_unlabeled_value * cost_per_label_value
n_random_value = int(budget_value / cost_per_label_value)
n_random_pct_value = n_random_value / n_unlabeled_value * 100
n_active_value = 50_000 # samples needed with AL
cost_active_value = n_active_value * cost_per_label_value
cost_active_pct_value = (budget_value - cost_active_value) / budget_value * 100
speedup_value = n_unlabeled_value / n_active_value
cost_saving_value = cost_all_value - cost_active_value
# --- Outputs (formatted strings for prose) ---
n_unlabeled_str = fmt(n_unlabeled_value / MILLION, precision=0) + " Million" # e.g. "1 Million"
cost_per_label_str = fmt(cost_per_label_value, precision=2, commas=False) # e.g. "5.00"
budget_str = fmt(budget_value, precision=0, commas=True) # e.g. "500,000"
cost_all_str = fmt(cost_all_value, precision=0, commas=True) # e.g. "5,000,000"
n_random_str = fmt(n_random_value, precision=0, commas=True) # e.g. "100,000"
n_random_pct_str = fmt(n_random_pct_value, precision=0, commas=False) # e.g. "10"
n_active_str = fmt(n_active_value, precision=0, commas=True) # e.g. "50,000"
cost_active_str = fmt(cost_active_value, precision=0, commas=True) # e.g. "250,000"
cost_active_pct_str = fmt(cost_active_pct_value, precision=0, commas=False) # e.g. "50"
speedup_str = fmt(speedup_value, precision=0, commas=False) # e.g. "20"
cost_saving_str = fmt(cost_saving_value / MILLION, precision=2) + " Million" # e.g. "4.75 Million"
deadline_months_str = fmt(deadline_months_value, precision=0, commas=False) # e.g. "1"
```
::: {.callout-notebook title="The Active Learning ROI"}
**Problem**: You are building a medical diagnostic AI. You have a pool of **`{python} n_unlabeled_str` unlabeled scans**. A specialist doctor charges **\$`{python} cost_per_label_str`** to label one scan. You have a budget of **\$`{python} budget_str`** and a deadline of **`{python} deadline_months_str` month**.
**Scenario A: Naive Labeling**
1. **Cost**: Labeling all 1M scans would cost **\$`{python} cost_all_str`** (10 $\times$ over budget).
2. **Coverage**: Within budget, you can only afford to label `{python} n_random_str` random scans.
3. **Result**: Your model misses rare pathologies because they weren't in the random `{python} n_random_pct_str`%.
**Scenario B: Active Learning**
1. **Strategy**: Use an uncertainty-based selection to pick the **`{python} n_active_str`** "hardest" scans for the doctor to label.
2. **Cost**: `{python} n_active_str` $\times$ `{python} cost_per_label_str` = **\$`{python} cost_active_str`**. (`{python} cost_active_pct_str`% under budget).
3. **Training Speed**: With `{python} speedup_str` $\times$ less data than the full pool, each training epoch is roughly **`{python} speedup_str` $\times$ faster** than training on all 1M scans.
4. **Result**: Empirical studies suggest that these `{python} n_active_str` "high-information" samples often achieve higher accuracy than `{python} n_random_str` random samples.
**The Systems Conclusion**: Data Selection is not just a "data trick"; it is a **`{python} speedup_str` $\times$ compute accelerator** and a **\$`{python} cost_saving_str`** cost-saving measure.
:::
Compare the two curves in @fig-active-learning-multiplier: Active Learning shifts the learning curve to the left, achieving target accuracy with far fewer samples than random selection. The curves are illustrative to highlight the qualitative gap.
```{python}
#| label: fig-active-learning-multiplier
#| echo: false
#| fig-cap: "**The Active Learning Multiplier**: Model Accuracy vs. Number of Labeled Samples (Log Scale). Random sampling (gray dashed) improves slowly, gaining a roughly fixed increment for each multiplicative increase in labels, and often requires massive datasets to capture rare edge cases. Active Learning (green solid) targets informative samples, reaching the same accuracy with fewer labels. Curves are illustrative to show the qualitative advantage."
#| fig-alt: "Line chart of Accuracy vs Labeled Samples (log scale). Green line (Active Learning) rises much faster than gray line (Random Sampling). Shaded area between them shows cost savings."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ACTIVE LEARNING MULTIPLIER FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-active-learning-multiplier (Active Learning section)
# │
# │ Goal: Visualize the data efficiency advantage of active learning.
# │ Show: That active learning reaches 90% accuracy 4× faster than random sampling.
# │ How: Plot learning curves showing accuracy vs. labeled sample count.
# │
# │ Imports: numpy, mlsys.viz
# │ Exports: (figure output only)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot()
# --- Plot: Active Learning vs Random Sampling ---
samples = np.logspace(2, 4, 100)
acc_random = 50 + 40 * np.log10(samples/100 + 1) / np.log10(101)
acc_active = 50 + 45 * (1 - np.exp(-samples/1000))
acc_active = np.minimum(acc_active, 95)
acc_random = np.minimum(acc_random, 95)
ax.plot(samples, acc_random, '--', color=COLORS['grid'], label='Random Sampling', linewidth=2)
ax.plot(samples, acc_active, '-', color=COLORS['GreenLine'], label='Active Learning', linewidth=2.5)
ax.fill_between(samples, acc_random, acc_active, color=COLORS['GreenL'], alpha=0.3)
ax.set_xscale('log')
ax.set_xlabel('Labeled Samples')
ax.set_ylabel('Accuracy (%)')
ax.annotate("", xy=(2500, 90), xytext=(9000, 90), arrowprops=dict(arrowstyle="->", color='black'))
ax.text(5000, 91, "4x Data Efficiency", ha='center', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.legend(loc='lower right', fontsize=8)
plt.show()
```
Active learning yields more than cost savings: it directs the model toward precisely the examples that matter most. The Smart Doorbell Lighthouse illustrates this principle in the context of hard negative mining.
::: {.callout-lighthouse title="Mining for Hard Negatives"}
**The "Hard Negative" Problem**: Our **Smart Doorbell** faces a classic data selection challenge. The vast majority of its video feed is empty (easy negatives) or clearly people (easy positives). The model fails on the 0.01% of "Hard Negatives": statues, posters of people, or laundry piles that cast human-like shadows.
Random sampling will miss these rare failures. Instead, the Wake Vision team uses **Active Learning** to specifically query the Oracle (human reviewers) on low-confidence predictions. If the model sees a "statue" and predicts "Person (51%)", that sample is flagged for labeling. This turns the feedback loop from a random walk into a guided search for the decision boundary, reducing the data required to solve the "statue problem" by orders of magnitude compared to random collection.
:::
### Semi-Supervised Learning: Using Unlabeled Data {#sec-data-selection-semisupervised-learning-using-unlabeled-data-51fc}
\index{Semi-supervised Learning!definition}
Consider a medical imaging dataset: a hospital has 50,000 chest X-rays, but only 500 have been reviewed and labeled by radiologists—a labeling rate of 1%. Training a supervised model on 500 examples yields poor accuracy, but the structural patterns in the remaining 49,500 unlabeled images contain information about what healthy and abnormal lungs look like. Semi-supervised learning exploits this abundant unlabeled data to improve the model trained on the scarce labeled examples.
Active learning optimizes which samples to label but still requires human annotation for every selected example. **Semi-supervised learning** takes a more aggressive approach: rather than asking *which* samples to label, it asks whether we can extract learning signal from unlabeled data directly. It uses a small set of labeled examples to guide learning on a much larger unlabeled pool, typically achieving 80–95% of fully supervised accuracy with only 10–20% of the labels.
The core insight behind semi-supervised learning is that unlabeled data, while it cannot directly teach the mapping from inputs to outputs, contains structural information about the input distribution $P(X)$ that constrains the hypothesis space. A decision boundary that cuts through dense regions of $P(X)$ is unlikely to generalize well because it would assign different labels to similar inputs. Semi-supervised methods use unlabeled data to push decision boundaries toward low-density regions, where class transitions are more likely to occur naturally.
\index{Pseudo-labeling!semi-supervised technique}
Three main techniques implement this insight. *Pseudo-labeling*[^fn-pseudo-labeling] takes the most direct approach: train on labeled data, use the model to generate "pseudo-labels" for high-confidence unlabeled predictions, then retrain on both. The confidence threshold is critical: setting it too low introduces label noise that degrades learning, while setting it too high wastes potentially useful data.
[^fn-pseudo-labeling]: **Pseudo-labeling**: First proposed by Dong-Hyun Lee at the ICML 2013 Workshop on Challenges in Representation Learning. The idea's simplicity belies its power: use a trained model's own confident predictions as ground-truth labels for unlabeled data. The term "pseudo" (from Greek *pseudēs*, false) acknowledges that these labels are not verified by humans—they are the model's best guesses. The technique's effectiveness depends on a virtuous cycle: accurate predictions on easy unlabeled examples expand the training set, which improves the model, which enables accurate predictions on harder examples. The risk is a vicious cycle: incorrect pseudo-labels reinforce errors, a phenomenon called *confirmation bias* in semi-supervised learning.
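The thresholding step at the heart of pseudo-labeling can be sketched as follows. This is a simplified illustration under stated assumptions: `pseudo_label` is a hypothetical helper, the probabilities are invented, and the default threshold of 0.95 is one common choice rather than a universal setting.

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Keep unlabeled samples whose top predicted probability clears the threshold.

    probs: (n_samples, n_classes) softmax outputs on unlabeled data.
    Returns (indices kept, hard pseudo-labels for those indices).
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probs[keep].argmax(axis=1)

unlabeled_probs = np.array([
    [0.97, 0.03],   # confident: receives pseudo-label 0
    [0.55, 0.45],   # ambiguous: left out of the next training round
    [0.02, 0.98],   # confident: receives pseudo-label 1
])
kept, labels = pseudo_label(unlabeled_probs, threshold=0.95)
```

Lowering `threshold` to 0.5 would admit the ambiguous second sample, which is exactly how label noise enters the training set.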
\index{Consistency Regularization!semi-supervised technique}
\index{Semi-supervised Learning!consistency plus pseudo-labeling}
Consistency regularization[^fn-consistency-reg] takes a different angle by enforcing that the model produces similar predictions for augmented versions of the same input. A robust classifier should be invariant to realistic perturbations like cropping, rotation, or color shifts. Methods like FixMatch[^fn-fixmatch] combine both approaches: they assign pseudo-labels only to samples where the prediction on a weakly augmented view is confident, then train the model to predict those labels on strongly augmented versions of the same images.
[^fn-consistency-reg]: **Consistency regularization**: Rooted in the *smoothness assumption* in semi-supervised learning: if two inputs $x_1$ and $x_2$ are close in input space, their labels should also be close. Sajjadi et al. (2016) and Laine and Aila (2017, "Temporal Ensembling") formalized this as a training objective: minimize the divergence between a model's predictions on an input and its augmented version. The approach draws on the broader principle of *invariance*: if a perturbation should not change the label, the model should learn to ignore it. This is conceptually distinct from data augmentation (which creates more training examples) because it explicitly enforces prediction consistency as a loss term, even for unlabeled data where the "correct" label is unknown.
[^fn-fixmatch]: **FixMatch**: Introduced by Kihyuk Sohn et al. at NeurIPS 2020, FixMatch unifies pseudo-labeling and consistency regularization in a single framework. The algorithm applies weak augmentation (flipping, cropping) to generate pseudo-labels, then requires the model to predict those same labels on strongly augmented versions (RandAugment, CTAugment) of the same images. Only high-confidence pseudo-labels (above a threshold, typically 0.95) are used, filtering out unreliable labels. FixMatch achieved 94.93% accuracy on CIFAR-10 with just 250 labels, coming within roughly 1.2 points of fully supervised performance with 200 $\times$ fewer annotations, which established it as a landmark in label-efficient learning.
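The unlabeled-data term that FixMatch adds to the loss can be sketched in NumPy. This is an illustrative forward computation only, not the reference implementation: the logits are invented, `fixmatch_unlabeled_loss` is a hypothetical name, and a real training loop would compute gradients through the strong-view branch while treating the pseudo-labels as constants.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Masked cross-entropy: weak-view pseudo-labels vs. strong-view predictions."""
    weak_probs = np.exp(log_softmax(weak_logits))
    pseudo = weak_probs.argmax(axis=1)               # hard pseudo-labels
    mask = weak_probs.max(axis=1) >= threshold       # confident samples only
    ce = -log_softmax(strong_logits)[np.arange(len(pseudo)), pseudo]
    # Average over the whole unlabeled batch; masked-out samples contribute zero.
    return float((ce * mask).mean()), mask

weak = np.array([[6.0, 0.0],     # confident on the weak view
                 [0.2, 0.0]])    # not confident: excluded by the mask
strong = np.array([[2.0, 0.0],
                   [0.0, 3.0]])
loss, mask = fixmatch_unlabeled_loss(weak, strong)
```

The second sample contributes nothing to the loss even though its strong-view prediction is sharp, because its weak-view prediction never cleared the confidence threshold.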
\index{Label Propagation!graph-based semi-supervised}
Label propagation offers a third paradigm through graph-based reasoning: construct a similarity graph over all samples and propagate labels from labeled nodes to their neighbors. This approach works particularly well when the feature space exhibits clear cluster structure.
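A minimal sketch of iterative label propagation on a dense similarity graph follows. It assumes a Gaussian (RBF) similarity with a hand-picked `sigma` and a tiny two-cluster dataset; the function name and all parameters are illustrative, and production systems typically use sparse nearest-neighbor graphs instead of a dense matrix.

```python
import numpy as np

def propagate_labels(X, y, n_classes, sigma=1.0, n_iter=50):
    """Spread labels over a similarity graph. y uses -1 for unlabeled samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2 * sigma ** 2))     # RBF similarity between samples
    np.fill_diagonal(W, 0.0)                    # no self-loops
    T = W / W.sum(axis=1, keepdims=True)        # row-normalized transition matrix

    labeled = np.flatnonzero(y >= 0)
    F = np.zeros((len(y), n_classes))
    F[labeled, y[labeled]] = 1.0                # one-hot rows for labeled nodes
    clamp = F[labeled].copy()
    for _ in range(n_iter):
        F = T @ F                               # each node averages its neighbors
        F[labeled] = clamp                      # labeled nodes keep their labels
    return F.argmax(axis=1)

# Two well-separated clusters with one labeled point each
X = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]]
y = [0, -1, -1, 1, -1, -1]
pred = propagate_labels(X, y, n_classes=2)
```

With clear cluster structure, a single label per cluster is enough to classify every point; when clusters overlap, labels leak across the boundary and accuracy degrades.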
The systems trade-off in semi-supervised learning is straightforward: it typically approaches the accuracy of fully supervised training with 5–10 $\times$ fewer labels but requires more compute because training processes both labeled and unlabeled samples. Since labeling costs often dominate compute costs in production settings, this trade-off is usually favorable. The results of *FixMatch on CIFAR-10* illustrate this label efficiency concretely.
```{python}
#| label: fixmatch-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FIXMATCH LABEL EFFICIENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "FixMatch on CIFAR-10" callout and @tbl-fixmatch-cifar10
# │
# │ Goal: Quantify the trade-off between labels and compute in SSL.
# │ Show: That trading 5× more compute for 16× fewer labels (4,000 → 250) yields ~8× total savings.
# │ How: Compare labeling and compute costs for FixMatch vs. supervised baseline.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: cifar10_fixmatch_*_str, acc_loss_str, cost_reduction_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (FixMatch benchmark results) ---
cifar10_full_labels = 50000 # CIFAR-10 full size
cifar10_full_acc = 96.1 # % supervised accuracy
cifar10_fixmatch_4k_labels = 4000 # 8% of data
cifar10_fixmatch_4k_acc = 95.7 # % FixMatch accuracy
cifar10_fixmatch_250_labels = 250 # 0.5% of data
cifar10_fixmatch_250_acc = 94.9 # % FixMatch accuracy
cifar10_fixmatch_40_labels = 40 # 0.08% of data
cifar10_fixmatch_40_acc = 88.6 # % FixMatch accuracy
# --- Inputs (cost assumptions) ---
cost_label = 1 # $/label
cost_gpu_hr = 0.50 # $/GPU-hour
supervised_labels = 4000 # baseline comparison
supervised_compute_cost = 50 # $ compute cost
fixmatch_labels = 250 # semi-supervised
fixmatch_compute_cost = 250 # $ (5x more training)
# --- Process (compute efficiencies and costs) ---
cifar10_fixmatch_4k_eff = cifar10_full_labels / cifar10_fixmatch_4k_labels
cifar10_fixmatch_250_eff = cifar10_full_labels / cifar10_fixmatch_250_labels
cifar10_fixmatch_40_eff = cifar10_full_labels / cifar10_fixmatch_40_labels
supervised_label_cost = supervised_labels * cost_label
supervised_total = supervised_label_cost + supervised_compute_cost
fixmatch_label_cost = fixmatch_labels * cost_label
fixmatch_total = fixmatch_label_cost + fixmatch_compute_cost
fixmatch_compute_multiplier_value = fixmatch_compute_cost / supervised_compute_cost
cost_reduction = supervised_total / fixmatch_total
acc_loss = cifar10_full_acc - cifar10_fixmatch_250_acc
# --- Outputs (formatted strings for table and prose) ---
supervised_label_cost_str = fmt(supervised_label_cost, precision=0, commas=True) # e.g. "4,000"
supervised_total_str = fmt(supervised_total, precision=0, commas=True) # e.g. "4,050"
fixmatch_label_cost_str = fmt(fixmatch_label_cost, precision=0, commas=True) # e.g. "250"
fixmatch_total_str = fmt(fixmatch_total, precision=0, commas=True) # e.g. "500"
cost_reduction_str = fmt(cost_reduction, precision=0, commas=False) # e.g. "8"
acc_loss_str = fmt(acc_loss, precision=1, commas=False) # e.g. "1.2"
fixmatch_compute_multiplier_str = fmt(fixmatch_compute_multiplier_value, precision=0, commas=False) # e.g. "5"
cifar10_full_labels_str = fmt(cifar10_full_labels, precision=0, commas=True) # e.g. "50,000"
cifar10_fixmatch_4k_labels_str = fmt(cifar10_fixmatch_4k_labels, precision=0, commas=True) # e.g. "4,000"
cifar10_fixmatch_250_labels_str = fmt(cifar10_fixmatch_250_labels, precision=0, commas=True) # e.g. "250"
cifar10_fixmatch_40_labels_str = fmt(cifar10_fixmatch_40_labels, precision=0, commas=True) # e.g. "40"
```
::: {.callout-example title="FixMatch on CIFAR-10"}
**FixMatch** [@sohn2020fixmatch] combines pseudo-labeling with consistency regularization to achieve high label efficiency (@tbl-fixmatch-cifar10).
| **Label Budget** | **Method** | **Accuracy** | **Label Efficiency** |
|:------------------------------------------------------|:-----------------|-------------------------------------:|----------------------------------------------------------------:|
| **`{python} cifar10_full_labels_str` (100%)** | Fully Supervised | `{python} cifar10_full_acc`% | Baseline |
| **`{python} cifar10_fixmatch_4k_labels_str` (8%)** | FixMatch | `{python} cifar10_fixmatch_4k_acc`% | **`{python} cifar10_fixmatch_4k_eff` $\times$ more efficient** |
| **`{python} cifar10_fixmatch_250_labels_str` (0.5%)** | FixMatch | `{python} cifar10_fixmatch_250_acc`% | **`{python} cifar10_fixmatch_250_eff` $\times$ more efficient** |
| **`{python} cifar10_fixmatch_40_labels_str` (0.08%)** | FixMatch | `{python} cifar10_fixmatch_40_acc`% | `{python} cifar10_fixmatch_40_eff` $\times$ more efficient |
: **FixMatch Label Efficiency on CIFAR-10.** With 250 labels (0.5% of the dataset), FixMatch achieves within 1.2 points of full supervision, demonstrating 200 $\times$ label efficiency. {#tbl-fixmatch-cifar10 .striped .hover}
With only `{python} cifar10_fixmatch_250_labels_str` labeled samples (25 per class), FixMatch achieves `{python} cifar10_fixmatch_250_acc`% accuracy, within `{python} acc_loss_str` points of full supervision using `{python} cifar10_fixmatch_250_eff` $\times$ fewer labels. The technique works by generating pseudo-labels on weakly augmented unlabeled images (only when model confidence exceeds 0.95), then training to predict these labels on strongly augmented versions of the same images.
**The Systems Insight**: Semi-supervised learning trades labeled data for unlabeled data and compute. On CIFAR-10, training FixMatch requires ~`{python} fixmatch_compute_multiplier_str` $\times$ more compute than supervised training (processing 50K unlabeled samples per epoch). When labels cost \$1 each and GPU hours cost \$0.50, the math favors semi-supervised:
- Supervised (`{python} cifar10_fixmatch_4k_labels_str` labels): \$`{python} supervised_label_cost_str` labeling + \$`{python} supervised_compute_cost` compute = **\$`{python} supervised_total_str`**
- FixMatch (`{python} cifar10_fixmatch_250_labels_str` labels): \$`{python} fixmatch_label_cost_str` labeling + \$`{python} fixmatch_compute_cost` compute = **\$`{python} fixmatch_total_str`**
An `{python} cost_reduction_str` $\times$ cost reduction for ~`{python} acc_loss_str` points of accuracy loss.
:::
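The cost comparison in the callout generalizes to a break-even question: at what label price does the semi-supervised pipeline start to win? The sketch below reuses the callout's illustrative figures (4,000 vs. 250 labels, \$50 vs. \$250 compute); the function names are hypothetical, and the model deliberately ignores engineering time and accuracy differences.

```python
def total_cost(n_labels, cost_per_label, compute_cost):
    """Total dollars spent on labeling plus compute."""
    return n_labels * cost_per_label + compute_cost

def break_even_label_price(sup_labels, sup_compute, ssl_labels, ssl_compute):
    """Label price above which the semi-supervised pipeline is cheaper."""
    return (ssl_compute - sup_compute) / (sup_labels - ssl_labels)

supervised = total_cost(4000, 1.0, 50)      # labeling dominates
fixmatch = total_cost(250, 1.0, 250)        # compute and labeling are balanced
price = break_even_label_price(4000, 50, 250, 250)
```

Under these assumptions the break-even price is about five cents per label, so the semi-supervised approach is cheaper for essentially any task that requires human annotation.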
These gains are substantial, but semi-supervised learning is not universally applicable. The technique assumes that unlabeled data comes from the same distribution as labeled data, and it struggles when unlabeled data contains out-of-distribution samples (the model confidently mislabels them), when class imbalance is severe (pseudo-labels amplify majority class bias), or when the labeled set does not cover all classes (preventing label propagation for unseen classes). Always validate on a held-out set with true labels to catch distribution mismatch.
Despite these limitations, semi-supervised learning reduces label requirements by 5–10 $\times$ while maintaining accuracy. We have now progressively reduced labeling demands through a clear trajectory: coreset selection and deduplication prune low-value samples before training; curriculum learning optimizes the order of presentation during training; active learning queries only the most informative samples for human annotation; and semi-supervised learning exploits unlabeled data to stretch those annotations further. Each technique has pushed the label requirement lower, but none has eliminated it. This raises a deeper question: do we need *any* task-specific labels at all? What if the structure of data itself (the fact that cat images resemble other cat images, that coherent sentences follow grammatical patterns) could provide the supervision signal?
## Self-Supervised Learning {#sec-data-selection-selfsupervised-learning-7518}
\index{Self-supervised Learning!definition}
\index{Masked Modeling!self-supervised breakthrough}
\index{Self-supervised Learning!etymology}
GPT was trained to predict the next word in a sentence. BERT was trained to fill in masked words. Neither task required a single human label. **Self-supervised learning**[^fn-self-supervised] (SSL) generalizes this insight: by designing *pretext tasks* that derive supervision from the data's inherent structure, models learn powerful representations from unlabeled data at scale. Where the progression from active learning to semi-supervised learning drove required labels asymptotically toward zero, SSL breaks through that asymptote entirely. It represents the field's most powerful response to the Data Wall introduced in @sec-data-selection-data-selection-fundamentals-e839: rather than searching for more high-quality labeled data in a finite pool, SSL redefines what counts as training data by extracting supervision from the structure of unlabeled corpora that exist at web scale.
[^fn-self-supervised]: **Self-Supervised Learning**: While self-supervision ideas existed earlier, 2018 marked the paradigm's breakthrough year. BERT (Google, October 2018) demonstrated that masked language modeling could produce representations achieving state-of-the-art results on 11 NLP tasks. GPT (OpenAI, June 2018) showed that next-token prediction at scale yielded surprisingly general language understanding. Together, they established pre-training on unlabeled data as the dominant paradigm for NLP, later extended to vision and multimodal domains.
\index{Pretext Tasks!self-supervised learning}
The key insight is that labels represent just one form of supervision. Data structure itself provides rich learning signals that require no human annotation, as @tbl-self-supervised-tasks summarizes.
| **Modality** | **Self-Supervised Task** | **Supervision Signal** |
|:----------------|:-------------------------|:--------------------------------------------|
| **Text** | Masked language modeling | Predict [MASK] from context |
| **Text** | Next-token prediction | Predict next word in sequence |
| **Images** | Contrastive learning | Same image (augmented) vs. different images |
| **Images** | Masked autoencoding | Reconstruct masked patches |
| **Multi-modal** | CLIP-style alignment | Match image-text pairs |
: **Self-Supervised Pretext Tasks by Modality.** Each task extracts supervision from data structure rather than human labels, enabling pre-training on unlimited unlabeled corpora. {#tbl-self-supervised-tasks .striped .hover}
\index{Masked Language Modeling!pretext task}
\index{Next-Token Prediction!pretext task}
\index{Contrastive Learning!pretext task}
\index{Multimodal Alignment!contrastive pre-training}
These pretext tasks generate supervision signals automatically. A model trained to predict masked words necessarily learns grammar, semantics, and world knowledge; a model trained to distinguish augmented views of the same image learns visual features invariant to transformations. The architectural details of these approaches, from contrastive methods like SimCLR and MoCo to masked modeling and generative pre-training, are examined in @sec-network-architectures. From a data selection perspective, the systems implication is what matters: self-supervised pre-training moves the data cost off the critical path. Instead of waiting for labels before training begins, pre-training starts immediately on unlabeled data, often web-scale corpora of billions of samples. This separation of pre-training from task-specific labeling restructures the economics of machine learning.
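To make "automatic supervision" concrete, the sketch below derives training targets for the two text pretext tasks directly from a raw sentence. It is a toy illustration: whitespace tokenization, the `mask_prob` value, and the helper names are assumptions, and production systems use subword tokenizers and more elaborate masking schemes.

```python
import random

MASK = "[MASK]"

def next_token_pairs(tokens):
    """Next-token prediction: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Masked modeling: hide random tokens; the hidden tokens become the targets."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok          # position -> original token
        else:
            inputs.append(tok)
    return inputs, targets

sentence = "the cat sat on the mat".split()
pairs = next_token_pairs(sentence)                       # 5 (context, target) pairs
masked, targets = masked_lm_example(sentence, mask_prob=0.3)
```

Every target comes from the text itself, so the number of training examples scales with the size of the corpus rather than with an annotation budget.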
### The Economics of Amortization {#sec-data-selection-economics-amortization-98bb}
\index{Cost Amortization!self-supervised pre-training}
Understanding *why* self-supervised learning dominates modern ML practice requires examining its economic structure. The shift translates into concrete cost savings through *cost amortization*, where expensive pre-training is performed once and reused across many applications (@tbl-cost-amortization).
| **Approach** | **Labels per Task** | **Compute per Task** | **Data Acquisition** |
|:-------------------------------|--------------------:|----------------------:|:--------------------------|
| **Train from scratch**         |     100K–1M labeled |    100% full training | Task-specific collection  |
| **Fine-tune foundation model** |      100–1K labeled | 1–5% of full training | Reuse pre-training corpus |
: **Cost Amortization in Foundation Model Fine-Tuning.** Pre-training costs are paid once and amortized across all downstream tasks; fine-tuning costs scale with task count but remain small per task. This asymmetry explains why fine-tuning dominates modern ML: the marginal cost of each new task drops by 20 $\times$ or more compared to training from scratch. {#tbl-cost-amortization .striped .hover}
To illustrate this economic transformation, consider a company building ten specialized classifiers for tasks such as fraud detection, content moderation, and medical diagnosis.
```{python}
#| label: foundation-cost-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FOUNDATION MODEL COST AMORTIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Economics of Amortization" section and @tbl-cost-amortization
# │
# │ Goal: Demonstrate the economic shift toward foundation models.
# │ Show: That fine-tuning yields a 100× label reduction and 20× marginal compute drop.
# │ How: Contrast total costs for 10 tasks trained from scratch vs. fine-tuned.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: label_cost_drop_str, marginal_compute_reduction_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (cost scenario: 10 classification tasks) ---
cost_scratch_per_task_value = 1000 # GPU-hrs per task
n_tasks_value = 10 # number of tasks
cost_pretrain_value = 10000 # GPU-hrs (one-time)
cost_finetune_value = 50 # GPU-hrs per task
labels_per_task_scratch = 100_000 # labels needed from scratch
cost_per_label = 1 # $/label
labels_per_task_finetune = 1_000 # labels for fine-tuning
# --- Process (compute total costs and reductions) ---
cost_scratch_total_value = cost_scratch_per_task_value * n_tasks_value
cost_foundation_total_value = cost_pretrain_value + (cost_finetune_value * n_tasks_value)
label_cost_scratch_total = labels_per_task_scratch * cost_per_label * n_tasks_value
label_cost_finetune_total = labels_per_task_finetune * cost_per_label * n_tasks_value
label_cost_reduction = label_cost_scratch_total / label_cost_finetune_total
marginal_compute_reduction = cost_scratch_per_task_value / cost_finetune_value
crossover_tasks_value = cost_pretrain_value / (cost_scratch_per_task_value - cost_finetune_value)
# --- Outputs (formatted strings for prose) ---
total_a_hrs_str = f"{cost_scratch_total_value:,}" # e.g. "10,000"
total_b_hrs_str = f"{cost_foundation_total_value:,}" # e.g. "10,500"
labels_per_task_scratch_str = fmt(labels_per_task_scratch, precision=0, commas=True) # e.g. "100,000"
label_cost_scratch_total_str = f"${label_cost_scratch_total / 1_000_000:.0f}M" # e.g. "$1M"
cost_scratch_per_task_str = fmt(cost_scratch_per_task_value, precision=0, commas=True) # e.g. "1,000"
cost_scratch_total_str = fmt(cost_scratch_total_value, precision=0, commas=True) # e.g. "10,000"
labels_per_task_finetune_str = fmt(labels_per_task_finetune, precision=0, commas=True) # e.g. "1,000"
label_cost_finetune_total_str = f"${label_cost_finetune_total / 1_000:.0f}K" # e.g. "$10K"
cost_finetune_value_str = fmt(cost_finetune_value, precision=0, commas=True) # e.g. "50"
cost_pretrain_value_str = fmt(cost_pretrain_value, precision=0, commas=True) # e.g. "10,000"
marginal_compute_reduction_str = fmt(marginal_compute_reduction, precision=0, commas=False) # e.g. "20"
crossover_tasks_str = fmt(crossover_tasks_value, precision=0, commas=False) # e.g. "11"
label_cost_drop_str = fmt(label_cost_reduction, precision=0, commas=False) # e.g. "100"
```
Training each classifier from scratch would require substantial investment in both labeling and compute. With ten tasks each needing `{python} labels_per_task_scratch_str` labels at \$1 per label, the total labeling cost reaches **`{python} label_cost_scratch_total_str`**. The compute burden amounts to `{python} cost_scratch_total_str` GPU-hours across all tasks, with each requiring its own data collection effort. From start to finish, each task takes 6–12 months to complete.
The fine-tuning approach restructures these costs. Pre-training requires a one-time investment of `{python} cost_pretrain_value_str` GPU-hours on unlabeled data, but this cost is paid only once. Fine-tuning each task then requires just `{python} labels_per_task_finetune_str` labels (`{python} label_cost_finetune_total_str` total across all ten tasks) and only `{python} cost_finetune_value_str` GPU-hours of compute. Each task reaches deployment in 1–2 weeks after pre-training completes.
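The return on investment is substantial: labeling costs drop by **`{python} label_cost_drop_str` $\times$** (from `{python} label_cost_scratch_total_str` to `{python} label_cost_finetune_total_str`), per-task marginal compute decreases by **`{python} marginal_compute_reduction_str` $\times$**, and time to deployment accelerates by **20–50 $\times$** per task. Total compute, by contrast, is roughly break-even at ten tasks (`{python} total_a_hrs_str` GPU-hours from scratch versus `{python} total_b_hrs_str` with a foundation model), so the compute advantage materializes only as further tasks are added.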
This explains *why* the fine-tuning paradigm dominates production ML. The pre-training cost is high but amortized across many downstream applications, while fine-tuning cost remains low on a per-task basis.
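The amortization argument reduces to a simple crossover calculation. The sketch below uses the scenario's illustrative GPU-hour figures (1,000 per task from scratch, 10,000 for pre-training, 50 per fine-tune); the function names are hypothetical and the model counts compute only, ignoring labeling cost.

```python
import math

def scratch_total(n_tasks, per_task=1000):
    """GPU-hours to train every task independently."""
    return n_tasks * per_task

def foundation_total(n_tasks, pretrain=10_000, finetune=50):
    """GPU-hours for one pre-training run plus a fine-tune per task."""
    return pretrain + n_tasks * finetune

def crossover_tasks(per_task=1000, pretrain=10_000, finetune=50):
    """Smallest task count at which the foundation approach uses less compute."""
    return math.floor(pretrain / (per_task - finetune)) + 1

n_star = crossover_tasks()
```

Below the crossover, training from scratch uses less total compute; above it, every additional task widens the gap in favor of the foundation model.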
```{python}
#| label: foundation-amortization-data
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FOUNDATION MODEL AMORTIZATION (FIGURE DATA)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-amortization-comparison (TikZ bar chart)
# │
# │ Goal: Provide data for visualizing training cost amortization.
# │ Show: The comparable total compute for scratch vs. foundation models across 10 tasks.
# │ How: Re-calculate total GPU-hours for use in the TikZ bar chart.
# │
# │ Imports: (none)
# │ Exports: total_a_hrs_str, total_b_hrs_str
# └─────────────────────────────────────────────────────────────────────────────
# --- Inputs (same as foundation-cost-calc) ---
cost_scratch_per_task_value = 1000 # GPU-hrs per task
n_tasks_value = 10 # number of tasks
cost_pretrain_value = 10000 # GPU-hrs (one-time)
cost_finetune_value = 50 # GPU-hrs per task
# --- Process ---
cost_scratch_total_value = cost_scratch_per_task_value * n_tasks_value
cost_foundation_total_value = cost_pretrain_value + (cost_finetune_value * n_tasks_value)
# --- Outputs (formatted strings for figure annotations) ---
total_a_hrs_str = f"{cost_scratch_total_value:,}" # e.g. "10,000"
total_b_hrs_str = f"{cost_foundation_total_value:,}" # e.g. "10,500"
```
Contrast the two bar charts in @fig-amortization-comparison to see this cost structure in action. Training from scratch (left) incurs the full cost for each task independently. The foundation model approach (right) pays a large upfront pre-training cost but then fine-tunes each task at a fraction of the per-task cost.
```{python}
#| label: fig-amortization-comparison
#| echo: false
#| fig-cap: "**Cost Amortization in Foundation Models**: Training from scratch (left) requires 1,000 GPU-hours per task (10,000 total for 10 tasks). The foundation model approach (right) pays 10,000 GPU-hours upfront for pre-training but reduces each subsequent task to just 50 GPU-hours. At 10 tasks the totals are comparable (10,000 vs 10,500), but the per-task marginal cost drops by 20×, and the crossover favoring the foundation model occurs around 11 tasks."
#| fig-alt: "Two bar charts side by side. Left (Train from Scratch) shows 10 equal bars of 1,000 GPU-hours each, totaling 10,000 hours. Right (Foundation Model) shows one tall pre-training bar of 10,000 GPU-hours followed by 10 short fine-tuning bars of 50 GPU-hours each, totaling 10,500 hours."
import numpy as np
import matplotlib.pyplot as plt
from mlsys import viz
viz.set_book_style()
COLORS = viz.COLORS
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
tasks = np.arange(1, 11)
# Left: Train from Scratch
ax1.bar(tasks, [1]*10, color=COLORS['RedL'], edgecolor=COLORS['RedLine'], linewidth=0.8, width=0.6)
ax1.set_title('Train from Scratch', fontweight='bold', fontsize=11)
ax1.set_xlabel('Task Number')
ax1.set_ylabel('Cost (1000 GPU-hours)')
ax1.set_ylim(0, 12)
ax1.set_xticks(tasks)
ax1.set_yticks([0, 2, 4, 6, 8, 10])
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.grid(False)
ax1.text(5.5, 11, 'Total: 10,000 hrs', ha='center', fontsize=9, fontweight='bold', color=COLORS['RedLine'])
# Right: Foundation Model
fm_tasks = np.arange(0, 11)
fm_labels = ['Pre'] + [str(i) for i in range(1, 11)]
pre_train = [10] + [0]*10
fine_tune = [0] + [0.05]*10
ax2.bar(fm_tasks, pre_train, color=COLORS['BlueL'], edgecolor=COLORS['BlueLine'], linewidth=0.8, width=0.6, label='Pre-training')
ax2.bar(fm_tasks, fine_tune, bottom=pre_train, color=COLORS['GreenL'], edgecolor=COLORS['GreenLine'], linewidth=0.8, width=0.6, label='Fine-tuning')
ax2.set_title('Foundation Model', fontweight='bold', fontsize=11)
ax2.set_xlabel('Task Number')
ax2.set_ylim(0, 12)
ax2.set_xticks(fm_tasks)
ax2.set_xticklabels(fm_labels)
ax2.set_yticks([0, 2, 4, 6, 8, 10])
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.grid(False)
ax2.legend(loc='upper right', fontsize=8, frameon=False)
ax2.text(5.5, 11, 'Total: 10,500 hrs', ha='center', fontsize=9, fontweight='bold', color=COLORS['BlueLine'])
plt.tight_layout()
plt.show()
```
### Foundation Model Paradigm {#sec-data-selection-foundation-model-paradigm-f0f1}
\index{Foundation Models!definition}
\index{Contrastive Learning!batch-based methods}
The amortization economics favor self-supervised learning broadly, though different SSL methods occupy different points on the cost-efficiency frontier. Contrastive approaches [@chen2020simclr; @he2020momentum] need many negative examples per update, supplied either by large batches (4,096+ samples in SimCLR) or by a queue of past embeddings (MoCo), but yield excellent downstream performance with minimal labeled data. Masked modeling methods work with smaller batches at the cost of more training iterations. Generative pre-training scales log-linearly with data volume, making it the preferred approach for foundation models where pre-training cost is amortized across thousands of tasks. The architectural distinctions between these families are examined in @sec-network-architectures; what matters for data selection is the shared conclusion: self-supervised pre-training represents a **1,000 $\times$ or greater multiplier** on the value of labeled data. Instead of labeling millions of task-specific examples, practitioners fine-tune on hundreds or thousands of labeled samples while inheriting knowledge distilled from billions of unlabeled tokens.
This multiplicative advantage creates the *foundation model paradigm*\index{Self-supervised learning!foundation model paradigm}[^fn-foundation-model] [@bommasani2021opportunities] that defines modern ML systems. The data selection principles discussed throughout this chapter (coreset selection, curriculum learning, active learning) remain relevant within the foundation model paradigm. Pre-training corpus curation applies the same deduplication and quality filtering techniques at web scale, and fine-tuning data selection determines which labeled examples maximize downstream task performance.
[^fn-foundation-model]: **Foundation Model**: Term coined by Stanford's Center for Research on Foundation Models in 2021 to describe models like BERT, GPT-3, and DALL-E. The name emphasizes a critical property: these models serve as a "foundation" for many downstream tasks, but this creates dangerous homogenization. Defects in the foundation model propagate to all applications built upon it, making them single points of failure that can "radiate harms" across an ecosystem.
Self-supervised learning addresses the label bottleneck by learning from data structure rather than human annotation. But what happens when the data itself is scarce? When rare classes have too few examples, when edge cases never appear in the wild, or when privacy constraints prevent collecting real samples? The third stage of our data selection pipeline addresses this gap: rather than selecting or curating existing data, we create new data on demand.
## Synthetic Data Generation {#sec-data-selection-synthetic-data-generation-415c}
\index{Synthetic Data Generation!three-stage pipeline}
Static pruning removed redundancy before training. Dynamic selection focused compute on the most informative samples during training. The third and final stage of the data selection pipeline takes the opposite approach: rather than subtracting or selecting from existing data, it creates new high-value samples when real data is scarce, expensive, or lacks diversity. The strategy shifts from curation to **creation**.
### Data Augmentation: Transformation-Based Synthesis {#sec-data-selection-data-augmentation-transformationbased-synthesis-3aa7}
\index{Data Augmentation!definition}
Data augmentation expands a dataset by applying transformations to existing samples. Because many transformations preserve label semantics while creating novel inputs, augmentation effectively multiplies the diversity of a training set without requiring additional data collection.
\index{Cutout!image augmentation}
\index{MixUp!image augmentation}
\index{CutMix!image augmentation}
For image data, augmentation techniques span a range of complexity. Geometric transformations such as rotation, flipping, cropping, and scaling introduce spatial variation that makes models robust to viewpoint changes. Photometric transformations adjust brightness, contrast, saturation, and hue to simulate different lighting conditions and camera characteristics. More advanced techniques like Cutout[^fn-cutout] (which applies random rectangular masks), MixUp[^fn-mixup] [@zhang2018mixup] (which blends two images and their labels), and CutMix[^fn-cutmix] (which pastes patches between images) push augmentation further by creating entirely synthetic training examples that regularize learning.
[^fn-cutout]: **Cutout**: Introduced by Terrance DeVries and Graham Taylor at the University of Guelph in 2017. The technique randomly masks square regions of input images during training, forcing the model to recognize objects from partial information rather than relying on any single discriminative region. The name describes the operation literally: "cutting out" a rectangular patch and replacing it with zeros. Cutout is related to dropout (which randomly zeroes *neurons*), but operates in input space rather than feature space. The simplicity is striking: a single hyperparameter (patch size) yields consistent 1--2% accuracy improvements on CIFAR-10/100 and ImageNet, with negligible computational overhead.
[^fn-cutmix]: **CutMix**: Introduced by Sangdoo Yun et al. at ICCV 2019. CutMix replaces Cutout's zeroed-out region with a patch from a different training image, and proportionally mixes the labels according to patch area. For example, if 30% of image A is replaced by a patch from image B, the label becomes 70% class-A and 30% class-B. This addresses Cutout's weakness: zeroed regions waste information and can confuse the model. CutMix instead forces the model to simultaneously recognize two objects in a single image, providing stronger regularization than either Cutout or MixUp alone. CutMix improves ImageNet top-1 accuracy by ~1% over the baseline while also improving robustness to occlusion.
[^fn-mixup]: **MixUp**: Introduced by Hongyi Zhang and colleagues at ICLR 2018. The elegantly simple idea (train on linear interpolations of image pairs with correspondingly interpolated labels) produces surprisingly strong regularization. The paper showed MixUp reduces memorization of corrupt labels, improves adversarial robustness, and stabilizes GAN training, all from a technique requiring just two lines of code to implement.
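As a minimal sketch of the interpolation MixUp performs, the NumPy snippet below blends each sample in a batch with a randomly chosen partner and applies the same weights to the one-hot labels. The function name `mixup_batch`, the toy array shapes, and `alpha=0.2` are illustrative assumptions rather than values from a specific training recipe.

```python
import numpy as np

def mixup_batch(images, labels_onehot, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)             # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))      # partner index for each sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y, lam

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8, 3))                 # 4 toy "images"
y = np.eye(3)[[0, 1, 2, 0]]                  # one-hot labels for 3 classes
mx, my, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Because images and labels share the same coefficient, every mixed label row still sums to one, so the output can feed a standard soft-label cross-entropy loss.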
\index{MixUp!etymology}
\index{Back-translation!text augmentation}
Text augmentation presents different challenges because language is discrete rather than continuous. Back-translation[^fn-back-translation] offers one solution: translating text to another language and back generates paraphrases that preserve meaning while varying surface form. Simpler approaches include synonym replacement, which swaps words while preserving semantics, and random insertion or deletion, which adds noise that makes models robust to typos and informal input.
[^fn-back-translation]: **Back-translation**: A data augmentation technique borrowed from machine translation research. First proposed by Rico Sennrich et al. at ACL 2016, the method translates a sentence into a foreign language (e.g., English → French) and then back to the original language (French → English). The round-trip produces paraphrases that preserve meaning while varying syntax, vocabulary, and sentence structure. For low-resource NLP tasks, back-translation can effectively double or triple the training set. The technique exploits the inherent ambiguity of translation: "The cat sat on the mat" might become "Le chat s'est assis sur le tapis" in French, which back-translates to "The cat was sitting on the carpet"—semantically equivalent but lexically distinct.
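The simpler token-level approaches can be sketched in a few lines of standard-library Python. The synonym table, the probabilities, and the function names below are hypothetical placeholders; a real pipeline would draw synonyms from a thesaurus or embedding neighbors.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def synonym_replace(tokens, rng, p=0.3):
    """Swap a token for a listed synonym with probability p."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out

def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, never returning an empty sentence."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)
sentence = "the quick dog looks happy today".split()
augmented = random_delete(synonym_replace(sentence, rng), rng)
```

Keeping `p` small matters here: unlike image transformations, deleting or swapping the wrong word can change the label, so aggressive settings risk introducing label noise.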
\index{Learned Augmentation Policy!automated search}
\index{Random Augmentation!simplified policy}
Rather than hand-designing these augmentation policies, AutoAugment[^fn-autoaugment] uses reinforcement learning to discover optimal augmentation strategies for specific datasets, while RandAugment[^fn-randaugment] simplifies this by randomly sampling from a fixed set of transformations, achieving similar performance with less computation.
[^fn-autoaugment]: **AutoAugment**: Introduced by Ekin Cubuk et al. at Google Brain (CVPR 2019). AutoAugment treats augmentation policy design as a search problem: a reinforcement learning controller selects augmentation operations (rotate, translate, shear, equalize, etc.), their magnitudes, and application probabilities. The controller is trained to maximize validation accuracy on a proxy task. The key finding was that learned policies transfer across datasets and architectures—policies optimized on a small proxy model and subset of ImageNet improve performance when applied to the full dataset with larger models. The main limitation is search cost: the original paper required 15,000 GPU-hours to find a single policy.
[^fn-randaugment]: **RandAugment**: Introduced by Ekin Cubuk et al. at NeurIPS 2020 as a drastically simpler alternative to AutoAugment. RandAugment eliminates the costly search phase entirely by using just two hyperparameters: *N* (number of transformations to apply) and *M* (magnitude of each transformation). At each training step, *N* transformations are randomly selected from a fixed set and applied at magnitude *M*. Despite its simplicity, RandAugment matches or exceeds AutoAugment's performance across multiple benchmarks, suggesting that the search overhead of learned augmentation policies may be unnecessary when the augmentation space is well-designed.
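The two-hyperparameter structure of RandAugment can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the three operations, the nested-list image representation, and the way magnitude `m` scales each operation are all hypothetical.

```python
import random

def brightness(img, m):
    """Add a magnitude-scaled offset, clipped to [0, 1]."""
    d = 0.05 * m
    return [[min(1.0, max(0.0, p + d)) for p in row] for row in img]

def contrast(img, m):
    """Scale each pixel's distance from mid-gray, clipped to [0, 1]."""
    f = 1.0 + 0.1 * m
    return [[min(1.0, max(0.0, 0.5 + f * (p - 0.5))) for p in row] for row in img]

def hflip(img, m):
    """Mirror each row; magnitude is ignored for flips."""
    return [row[::-1] for row in img]

OPS = [brightness, contrast, hflip]  # fixed transformation set (assumed)

def rand_augment(img, n, m, rng):
    """Apply n operations sampled uniformly with replacement, each at magnitude m."""
    for op in rng.choices(OPS, k=n):
        img = op(img, m)
    return img

augmented = rand_augment([[0.1, 0.5], [0.9, 0.3]], n=2, m=5, rng=random.Random(0))
```

With only `n` and `m` to tune, a small grid search replaces the reinforcement learning controller that AutoAugment requires.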
These learned augmentation policies are particularly effective for resource-constrained models, where overfitting risk is highest. The MobileNet lighthouse illustrates this principle: when model capacity is deliberately reduced for edge deployment, augmentation becomes the primary defense against overfitting.
::: {.callout-lighthouse title="MobileNet and Aggressive Augmentation"}
Our **MobileNet Lighthouse model** (@sec-network-architectures) exemplifies how data augmentation compensates for model capacity constraints. MobileNet's depthwise separable convolutions reduce parameters by 8--9 $\times$ compared to standard convolutions, but this efficiency comes at a cost: smaller models are more prone to overfitting on limited data.
The solution is **aggressive augmentation**. MobileNet training typically uses stronger augmentation than ResNet-50 training, including RandAugment with higher magnitude, more aggressive cropping, and longer training schedules. The augmentation effectively increases dataset diversity without increasing model capacity, allowing MobileNet to achieve near-ResNet accuracy at a fraction of the parameter count. For edge deployment where both data collection and model size are constrained, augmentation is essential rather than optional.
:::
### Generative Synthesis: Creating New Samples {#sec-data-selection-generative-synthesis-creating-new-samples-a43d}
Augmentation transforms existing samples; synthetic data generation goes further by creating entirely new examples using generative models. This capability becomes essential in three common scenarios: when real data is privacy-sensitive (as with medical records or financial transactions), when edge cases are rare (such as autonomous driving failure scenarios that must be covered but seldom occur), or when data collection is prohibitively expensive (as in robotics or scientific experiments where each sample requires physical resources).
\index{GAN!synthetic data generation}
\index{Diffusion Models!text-to-image synthesis}
Three classes of generative approaches address these needs, each with distinct cost and fidelity trade-offs. Generative Adversarial Networks (GANs) train a generator against a discriminator in an adversarial setup, producing realistic images through competition; StyleGAN, for instance, generates photorealistic faces that have augmented facial recognition datasets. Diffusion models use iterative denoising to produce high-quality images; systems like Stable Diffusion[^fn-stable-diffusion] enable text-to-image synthesis, allowing you to generate targeted training examples from natural language descriptions. Finally, simulation engines such as CARLA for autonomous driving or Unity and Unreal for robotics offer physics-based rendering that generates unlimited labeled data with perfect ground-truth annotations, making them particularly valuable for safety-critical applications where edge case coverage is essential.
### Bridging the Domain Gap {#sec-data-selection-bridging-domain-gap-100b}
[^fn-stable-diffusion]: **Stable Diffusion**: Released by Stability AI in August 2022, based on the latent diffusion model architecture from Robin Rombach et al. at LMU Munich (CVPR 2022). "Stable" refers to the model's training stability achieved by performing diffusion in a compressed latent space (via a variational autoencoder) rather than in pixel space, which reduces computational cost by 10--50 $\times$ compared to pixel-space diffusion models like DALL-E 2. For data synthesis, Stable Diffusion's text-to-image capability enables generating targeted training examples from natural language descriptions (e.g., "a dog on a snowy sidewalk at night"), providing fine-grained control over the synthetic data distribution. The open-source release democratized access to high-quality image generation, making synthetic data augmentation practical for teams without massive compute budgets.
\index{Domain Gap!synthetic vs real data}
Synthetic data's greatest limitation is the **domain gap**[^fn-domain-gap]: the statistical difference between generated and real-world data, as illustrated in @fig-domain-gap. A model trained only on synthetic data learns a decision boundary optimized for the wrong distribution, potentially performing well on synthetic test data while failing on real deployment data.
::: {#fig-domain-gap fig-env="figure" fig-pos="htb" fig-cap="**The Domain Gap Problem**: Synthetic data (blue) and real data (orange) have different distributions. A model trained on synthetic data alone learns a boundary that fails on real data. Domain adaptation techniques aim to align these distributions or learn domain-invariant features." fig-alt="Two overlapping bell curves representing synthetic and real data distributions, with a decision boundary that works for synthetic but misses real data."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\begin{axis}[
width=10cm, height=5cm,
xlabel={Feature Space},
ylabel={Density},
xmin=-3, xmax=7,
ymin=0, ymax=0.5,
axis lines=left,
xtick=\empty,
ytick=\empty,
legend style={at={(0.98,0.98)}, anchor=north east, font=\footnotesize\usefont{T1}{phv}{m}{n}, draw=none},
clip=false
]
% Synthetic data distribution (centered at 0)
\addplot[thick, blue, domain=-3:4, samples=100, fill=blue!20, fill opacity=0.5]
{0.4*exp(-0.5*(x)^2)};
\addlegendentry{Synthetic}
% Real data distribution (centered at 3, slightly different shape)
\addplot[thick, orange, domain=0:7, samples=100, fill=orange!20, fill opacity=0.5]
{0.35*exp(-0.4*(x-3)^2)};
\addlegendentry{Real}
% Decision boundary learned from synthetic
\draw[thick, dashed, red] (axis cs: 1.5, 0) -- (axis cs: 1.5, 0.45);
\node[red, font=\footnotesize\usefont{T1}{phv}{m}{n}, align=center] at (axis cs: 1.5, 0.48) {Synthetic\\boundary};
% Ideal boundary for real data
\draw[thick, dotted, green!60!black] (axis cs: 3, 0) -- (axis cs: 3, 0.45);
% Domain gap annotation
\draw[<->, thick, purple] (axis cs: 0, 0.42) -- (axis cs: 3, 0.42);
\node[purple, font=\footnotesize\usefont{T1}{phv}{m}{n}, fill=white, inner sep=1pt] at (axis cs: 1.5, 0.42) {Domain Gap};
\end{axis}
\end{tikzpicture}
```
:::
[^fn-domain-gap]: **Domain Gap**: Also called "domain shift" or "distribution shift," this concept formalizes the statistical divergence between two data sources. Formally, the domain gap between source $\mathcal{S}$ and target $\mathcal{T}$ can be measured as the divergence $d(\mathcal{S}, \mathcal{T})$ between their distributions, using metrics like Maximum Mean Discrepancy (MMD) or Fréchet Inception Distance (FID). The term gained prominence in computer vision through the work of Saenko et al. (2010) on visual domain adaptation, which showed that classifiers trained on one visual domain (e.g., webcam images) could lose 20--40% accuracy when applied to another (e.g., DSLR photos), even for the same object categories. For synthetic data, the domain gap arises because generative models and simulators inevitably introduce systematic biases absent from real-world data.
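To make the divergence $d(\mathcal{S}, \mathcal{T})$ concrete, the sketch below computes a biased estimate of squared MMD with an RBF kernel on toy two-dimensional samples. The kernel bandwidth `gamma=0.5`, the sample sizes, and the Gaussian toy data are assumptions chosen for illustration; in practice MMD is usually computed on learned feature embeddings rather than raw inputs.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel values between rows of a and rows of b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd_squared(x, y, gamma=0.5):
    """Biased estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(200, 2))
real_near = rng.normal(0.2, 1.0, size=(200, 2))   # small shift from synthetic
real_far = rng.normal(3.0, 1.0, size=(200, 2))    # large shift from synthetic
gap_near = mmd_squared(synthetic, real_near)
gap_far = mmd_squared(synthetic, real_far)
```

The estimate grows as the two sample sets move apart, which makes it usable as a monitoring signal when deciding whether a synthetic source needs domain gap mitigation.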
\index{Domain Randomization!bridging synthetic-real gap}
Two complementary strategies address this distribution mismatch. Domain randomization[^fn-domain-randomization] takes an aggressive approach: rather than trying to match the real world precisely, it trains on wildly varied synthetic data by randomizing lighting, textures, backgrounds, and camera parameters during generation.
[^fn-domain-randomization]: **Domain Randomization**: Introduced by Josh Tobin et al. at OpenAI (2017) in "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." The counterintuitive insight is that making synthetic data *less* realistic (by randomizing textures, colors, lighting, and physical properties) can actually improve transfer to the real world. By training on wildly varied synthetic environments, the model learns features robust to visual variation, treating the real world as just another point in the distribution of training domains. The technique has proven especially effective for robotic manipulation, where Tobin et al. showed that a model trained entirely in simulation with domain randomization could transfer to a real robot arm without any fine-tuning on real images.
If the model encounters sufficient variation during training, the real world becomes "just another variation" within its learned distribution. This strategy produces strong results for robotics and autonomous driving, where simulation technology is mature enough to generate physically plausible variations across a wide range.
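In code, domain randomization amounts to sampling fresh scene parameters for every generated example. The sketch below assumes a hypothetical simulator that accepts a dictionary of settings; the parameter names and ranges are invented for illustration and would be replaced by whatever knobs a real simulator exposes.

```python
import random

# Hypothetical parameter ranges; not taken from any specific simulator.
RANGES = {
    "light_intensity": (0.2, 2.0),
    "camera_fov_deg": (40.0, 100.0),
    "texture_id": (0, 499),
}

def sample_scene(rng):
    """Draw one randomized scene configuration."""
    return {
        "light_intensity": rng.uniform(*RANGES["light_intensity"]),
        "camera_fov_deg": rng.uniform(*RANGES["camera_fov_deg"]),
        "texture_id": rng.randint(*RANGES["texture_id"]),  # inclusive bounds
    }

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Wide ranges are the point of the technique: the goal is coverage broad enough that real-world conditions fall inside the sampled distribution, not fidelity to any single real scene.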
\index{Domain Adaptation!feature alignment}
Domain adaptation takes the opposite approach by explicitly aligning synthetic and real distributions. Feature alignment methods train on synthetic data while simultaneously minimizing the distance between synthetic and real feature distributions, often using adversarial training to learn domain-invariant representations. Fine-tuning offers a simpler path: pre-train on abundant synthetic data to learn general features, then fine-tune on a small real dataset to adapt to deployment conditions. Self-training combines these ideas by using a synthetic-trained model to pseudo-label real unlabeled data, then retraining on the combined labeled set.
In practice, the best results often come from mixing synthetic and real data rather than relying on either source alone. @tbl-synthetic-mix summarizes typical outcomes across different mixing ratios.
| **Synthetic Fraction** | **Typical Outcome** |
|:-----------------------------|:-------------------------------------------|
| **100% synthetic** | Poor real-world generalization |
| **80% synthetic + 20% real** | Good performance, significant cost savings |
| **50% synthetic + 50% real** | Best performance in many domains |
| **100% real** | Baseline (expensive) |
: **Synthetic-to-Real Data Mixing Ratios.** Pure synthetic data suffers from distribution shift; pure real data is expensive. The optimal ratio varies by domain but typically falls in the 50--80% synthetic range when simulation fidelity is high. {#tbl-synthetic-mix .striped .hover}
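A mixing ratio from the table can be applied with a small sampling helper. The sketch below is one possible approach, assuming both pools fit in memory as Python lists; the helper name `mix_datasets` and the toy pools are hypothetical.

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, total, rng):
    """Draw `total` samples with the requested share from the synthetic pool."""
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("pool too small for the requested mix")
    mixed = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
    rng.shuffle(mixed)  # avoid ordering the batch stream by source
    return mixed

real_pool = [("real", i) for i in range(100)]
synthetic_pool = [("syn", i) for i in range(400)]
train_set = mix_datasets(real_pool, synthetic_pool, 0.8, 100, random.Random(0))
```

Treating `synthetic_fraction` as a tunable hyperparameter and validating on held-out real data is what guards against overfitting to the synthetic distribution.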
\index{Model Collapse!recursive synthetic training}
The optimal mix depends on simulation fidelity, domain complexity, and the cost differential between synthetic and real data. When synthetic data comes from ML models rather than simulators, there is a risk of *model collapse*: training on model-generated data amplifies errors and reduces diversity over generations. This concern is particularly acute for foundation models, where synthetic data from earlier model generations may contaminate future training corpora. With appropriate safeguards, synthetic data generation remains a powerful tool. The following example illustrates how to combine multiple data selection techniques (augmentation, noise injection, and simulation) into a coherent strategy for a real deployment scenario.
::: {.callout-example title="KWS Data Selection"}
**Scenario**: Our **Keyword Spotting Lighthouse model** (@sec-network-architectures), a DS-CNN with **200 K** parameters, represents the extreme end of data selection challenges. You are building a wake-word detector ("Hey Device") for a microcontroller with 256 KB SRAM (see @sec-ml-systems-tinyml-ubiquitous-sensing-scale-a67b for hardware constraints). The model must be tiny (~50 KB quantized), but you need 10,000+ labeled audio samples to train it, samples that do not yet exist.
**The Data Collection Problem**:
- Recording 10,000 real utterances requires 500+ speakers for diversity
- Professional recording costs \$2--5 per sample (\$20--50K total)
- Target deployment environment (noisy kitchen, car interior) differs from recording studio
**Data Selection Solution Stack**:
1. **Seed Data (500 samples)**: Record 50 speakers $\times$ 10 utterances in controlled conditions
2. **Augmentation (5,000 samples)**: Apply pitch shift, time stretch, and speed variation to expand the seed data 10 $\times$
3. **Noise Injection (10,000 samples)**: Mix clean audio with environmental noise (kitchen appliances, HVAC, traffic) sampled from AudioSet
4. **Negative Mining**: Use acoustic similarity to find hard negatives ("Hey Siri", "Hey Google") from public datasets
5. **Simulation (optional)**: Text-to-speech synthesis with diverse voice models
**Result**: 500 real recordings → 10,000+ training samples at 5% of the cost. The noise injection serves as domain randomization, improving deployment robustness.
**Key Insight for TinyML**: When the target model is tiny, the data selection challenge shifts from "reduce terabytes to gigabytes" to "create a useful dataset from almost nothing." Augmentation and simulation become essential rather than optional.
:::
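The noise injection step in the example above can be sketched as mixing a clean utterance with a noise clip at a chosen signal-to-noise ratio. The sine-wave "utterance", the Gaussian noise clip, the 16 kHz sample rate, and the helper name `mix_at_snr` are stand-ins assumed for illustration; a real pipeline would load recorded audio and environmental noise.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)        # repeat or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)              # stand-in for a 1 s utterance
noise = rng.normal(size=8000)                    # stand-in for a kitchen noise clip
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Sampling `snr_db` from a range per example, rather than fixing it, is what turns this step into a form of domain randomization over acoustic conditions.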
### Knowledge Distillation: Compressing Information {#sec-data-selection-knowledge-distillation-compressing-information-ce68}
\index{Knowledge Distillation!as data selection technique}
\index{Hinton, Geoffrey!knowledge distillation}
The techniques above create new input samples, but another form of synthesis creates enhanced labels. In knowledge distillation[^fn-distillation] [@hinton2015distilling], a smaller "student" model learns from a larger "teacher" model's outputs rather than from raw labels. This section treats distillation as a *data selection* technique: the teacher's outputs serve as enriched training data that carries more information per sample than hard labels. @sec-model-compression examines the complementary perspective, distillation as a model compression technique for producing smaller, faster student models suitable for resource-constrained deployment.
[^fn-distillation]: **Knowledge Distillation**: Introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015. Hinton coined the evocative term "dark knowledge" for the information in soft probability distributions: the teacher reveals not just which class is correct but which incorrect classes are most plausible. The temperature parameter in the softmax function controls how much dark knowledge is exposed: higher temperatures produce softer distributions that transfer more nuanced inter-class relationships.
\index{Dark Knowledge!soft probability distributions}
\index{Temperature Parameter!softmax for distillation}
The key insight is that the teacher's soft predictions contain more information than hard labels: a teacher predicting [0.7, 0.2, 0.1] for three classes reveals inter-class relationships (classes 1 and 2 are more similar) that a hard label [1, 0, 0] obscures entirely.
This richer supervision signal enables student models to learn more efficiently from the same data. From a systems perspective, distillation is particularly powerful for creating synthetic labels at scale: run a large model (such as GPT-4) on unlabeled data to generate high-quality annotations, then train a smaller model on these synthetic labels. The smaller model inherits much of the teacher's capability at a fraction of the inference cost, amortizing the expensive teacher computation across many student deployments.
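The effect of temperature on the teacher's soft labels can be sketched in plain Python. The logits and the temperature of 4 below are arbitrary illustrative values, and the loss shown is only the soft-label term; full distillation recipes typically combine it with a hard-label term and rescale it by the squared temperature.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_probs, temperature):
    """Cross-entropy of the student's softened prediction against teacher probabilities."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student))

teacher_logits = [4.0, 2.0, 0.5]                 # hypothetical teacher outputs
hard = softmax(teacher_logits, temperature=1.0)  # peaked distribution
soft = softmax(teacher_logits, temperature=4.0)  # flatter, exposes class similarity
loss = soft_cross_entropy([3.0, 1.0, 1.0], soft, temperature=4.0)
```

Raising the temperature moves probability mass from the top class to the runners-up, which is exactly the inter-class information a hard label discards.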
Together, augmentation, generative synthesis, and distillation complete the third stage of our data selection pipeline. Where static pruning removes redundancy and dynamic selection focuses compute on high-value samples, synthetic generation fills gaps by creating samples that never existed. These three stages form a complementary toolkit: pruning reduces what you have, selection focuses how you use it, and synthesis expands what you can access.
The preceding sections examined selection techniques across all three pipeline stages. We now consolidate their characteristics into a decision framework for practical application.
## Decision Framework {#sec-data-selection-decision-framework-0261}
@tbl-data-selection summarizes the three-stage optimization pipeline introduced at the beginning of this chapter.
| **Stage**                   | **When Applied** | **Techniques**                                        |                       **Typical Gains** |
|:----------------------------|:-----------------|:------------------------------------------------------|----------------------------------------:|
| **1. Static Pruning**       | Before training  | Coreset Selection, Deduplication, Quality Filtering   |               30--50% dataset reduction |
| **2. Dynamic Selection**    | During training  | Curriculum Learning, Active Learning, Semi-Supervised |              10--30% faster convergence |
| **3. Synthetic Generation** | On-demand        | Augmentation, Generative Models, Distillation         | 2--10 $\times$ effective data expansion |
: **Three-Stage Data Selection Pipeline.** Each stage increases ICR by different mechanisms: pruning removes low-value samples, dynamic selection focuses compute on high-value samples, and synthesis creates new high-value samples. {#tbl-data-selection .striped .hover}
@tbl-technique-selection provides a decision guide for selecting techniques based on your specific constraints.
| **Constraint** | **Best Technique** | **Why** |
|:-------------------------------|:------------------------|:-----------------------------------------------------|
| **Limited labeling budget** | Active Learning | Maximizes label ROI by selecting informative samples |
| **High redundancy in data** | Deduplication + Coreset | Removes waste before training begins |
| **Rare classes or edge cases** | Synthetic Generation | Creates samples that do not exist in raw data |
| **Slow convergence** | Curriculum Learning | Improves gradient quality in early training |
| **Privacy requirements** | Synthetic Data | Train on generated data, not real user data |
| **Large model, small dataset** | Knowledge Distillation | Use teacher model's knowledge as "data" |
: **Technique Selection Guide by Primary Constraint.** Recommended data selection techniques mapped to the dominant resource constraint in the ML pipeline. {#tbl-technique-selection .striped .hover}
@tbl-technique-selection maps individual constraints to techniques, but real projects face multiple constraints simultaneously. The decision tree below (@fig-technique-decision-tree) structures the selection process hierarchically: start by identifying your *primary* bottleneck, then follow the branches to narrow the field.
::: {#fig-technique-decision-tree fig-cap="**Data Selection Technique Selection Tree**: Start at the top by identifying your primary bottleneck, then follow the branches to find the most appropriate technique. Leaf nodes show recommended methods. Multiple paths may apply; combine techniques as needed." fig-alt="A decision tree flowchart with diamond decision nodes and rectangular technique recommendations. Starts with bottleneck identification and branches to specific techniques."}
```{=latex}
\usetikzlibrary{shapes.geometric}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, >=stealth, node distance=1.5cm and 1.5cm]
\definecolor{GreenLine}{HTML}{008F45}
\definecolor{GreenL}{HTML}{D4EFDF}
\definecolor{BlueLine}{HTML}{006395}
\definecolor{BlueL}{HTML}{D6EAF8}
\definecolor{OrangeLine}{HTML}{E37222}
\definecolor{OrangeL}{HTML}{FDEBD0}
\definecolor{RedLine}{HTML}{DA291C}
\definecolor{RedL}{HTML}{FADBD8}
\tikzset{
Decision/.style={
diamond,
draw=BlueLine,
fill=BlueL,
align=center,
line width=1pt,
aspect=2,
inner sep=2pt,
font=\footnotesize\bfseries
},
Leaf/.style={
rectangle,
rounded corners,
draw=GreenLine,
fill=GreenL,
align=center,
line width=1pt,
minimum height=1cm,
font=\footnotesize\bfseries
},
Arrow/.style={->, line width=1pt, color=black!70, rounded corners}
}
% Root
\node[Decision] (Root) {Primary\\bottleneck?};
% Level 1 Decisions
\node[Decision, below left=1.5cm and 2.5cm of Root] (LabelCost) {Labeling\\cost};
\node[Decision, below=2cm of Root] (ComputeCost) {Compute\\cost};
\node[Decision, below right=1.5cm and 2.5cm of Root] (DataScarcity) {Data\\scarcity};
% Level 2 Decisions/Leaves
% Under LabelCost
\node[Decision, below left=1.5cm and 0.5cm of LabelCost] (OracleAvail) {Oracle\\available?};
\node[Leaf, below right=1.5cm and 0.5cm of LabelCost] (SelfSup) {Self-\\Supervised};
% Under ComputeCost
\node[Decision, below=1.5cm of ComputeCost] (Redundant) {Redundant\\data?};
% Under DataScarcity
\node[Decision, below left=1.5cm and 0.5cm of DataScarcity] (SimAvail) {Simulator\\available?};
\node[Leaf, below right=1.5cm and 0.5cm of DataScarcity] (Distill) {Knowledge\\Distillation};
% Leaves
% Under OracleAvail
\node[Leaf, below left=1.5cm and -0.5cm of OracleAvail] (Active) {Active\\Learning};
\node[Leaf, below right=1.5cm and -0.5cm of OracleAvail] (Semi) {Semi-\\Supervised};
% Under Redundant
\node[Leaf, below left=1.5cm and -0.5cm of Redundant] (Dedup) {Dedup +\\Coreset};
\node[Leaf, below right=1.5cm and -0.5cm of Redundant] (Curriculum) {Curriculum\\Learning};
% Under SimAvail
\node[Leaf, below left=1.5cm and -0.5cm of SimAvail] (Synthetic) {Synthetic\\Generation};
\node[Leaf, below right=1.5cm and -0.5cm of SimAvail] (Augment) {Data\\Augmentation};
% Connections
% Root -> Level 1
\draw[Arrow] (Root) -| node[above, pos=0.25, font=\scriptsize] {Labeling \$\$\$} (LabelCost);
\draw[Arrow] (Root) -- node[right, font=\scriptsize] {Compute \$\$\$} (ComputeCost);
\draw[Arrow] (Root) -| node[above, pos=0.25, font=\scriptsize] {Not enough data} (DataScarcity);
% LabelCost -> Level 2
\draw[Arrow] (LabelCost) -| node[above, pos=0.25, font=\scriptsize] {Yes} (OracleAvail);
\draw[Arrow] (LabelCost) -| node[above, pos=0.25, font=\scriptsize] {Large pool} (SelfSup);
% ComputeCost -> Level 2
\draw[Arrow] (ComputeCost) -- (Redundant);
% DataScarcity -> Level 2
\draw[Arrow] (DataScarcity) -| node[above, pos=0.25, font=\scriptsize] {Domain} (SimAvail);
\draw[Arrow] (DataScarcity) -| node[above, pos=0.25, font=\scriptsize] {Teacher} (Distill);
% OracleAvail -> Leaves
\draw[Arrow] (OracleAvail) -| node[above, pos=0.25, font=\scriptsize] {Yes} (Active);
\draw[Arrow] (OracleAvail) -| node[above, pos=0.25, font=\scriptsize] {No} (Semi);
% Redundant -> Leaves
\draw[Arrow] (Redundant) -| node[above, pos=0.25, font=\scriptsize] {High} (Dedup);
\draw[Arrow] (Redundant) -| node[above, pos=0.25, font=\scriptsize] {Low} (Curriculum);
% SimAvail -> Leaves
\draw[Arrow] (SimAvail) -| node[above, pos=0.25, font=\scriptsize] {Yes} (Synthetic);
\draw[Arrow] (SimAvail) -| node[above, pos=0.25, font=\scriptsize] {No} (Augment);
\end{tikzpicture}
```
:::
The following walkthrough elaborates on each path, guiding practitioners from initial bottleneck identification through implementation.
### Step 1: Assess Your Bottleneck {.unnumbered}
Identify which resource constraint most severely limits your training pipeline. If labeling cost dominates your budget, consider label efficiency techniques such as Active Learning, Semi-Supervised, or Self-Supervised learning. These methods maximize the value extracted from each human annotation. When compute cost is the primary concern, prioritize dataset reduction through Coreset selection, Deduplication, and Curriculum Learning, all of which reduce the number of training iterations required. If data scarcity is the primary problem, pursue data creation through Augmentation, Synthesis, and Distillation to expand your effective training set beyond what raw collection provides.
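The branching logic can also be written down as a small function. The sketch below encodes one possible reading of the decision tree; the function name, argument names, and the order in which conditions are checked are assumptions, and real projects will often follow several branches at once.

```python
def recommend(bottleneck, *, large_unlabeled_pool=False, oracle=False,
              high_redundancy=False, teacher=False, simulator=False):
    """Return a starting technique for the given primary bottleneck."""
    if bottleneck == "labeling":
        if large_unlabeled_pool:
            return "Self-Supervised"
        return "Active Learning" if oracle else "Semi-Supervised"
    if bottleneck == "compute":
        return "Dedup + Coreset" if high_redundancy else "Curriculum Learning"
    if bottleneck == "scarcity":
        if teacher:
            return "Knowledge Distillation"
        return "Synthetic Generation" if simulator else "Data Augmentation"
    raise ValueError(f"unknown bottleneck: {bottleneck!r}")

choice = recommend("compute", high_redundancy=True)
```

Writing the tree as code makes the assumptions explicit: each keyword argument corresponds to a question the team must be able to answer before choosing a technique.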
### Step 2: Check Prerequisites {.unnumbered}
With the bottleneck identified, verify that the corresponding techniques are feasible given your infrastructure and data. Each approach carries specific requirements that must be met before implementation can begin (@tbl-technique-prerequisites).
| **Technique** | **Prerequisites** |
|:-------------------------|:------------------------------------------------------------|
| **Active Learning** | Access to oracle, unlabeled pool, retraining infrastructure |
| **Coreset Selection** | Proxy model or embedding extractor, full dataset accessible |
| **Curriculum Learning** | Difficulty scoring method, pacing schedule |
| **Semi-Supervised** | Some labeled data, unlabeled data from same distribution |
| **Self-Supervised** | Large unlabeled corpus, pre-training compute budget |
| **Augmentation** | Domain knowledge of invariances, augmentation library |
| **Synthetic Generation** | Generative model or simulator, domain gap mitigation |
: **Technique Prerequisites.** Each technique carries specific infrastructure and data requirements that must be verified before implementation. A technique with excellent theoretical gains but unmet prerequisites will fail in practice, making this checklist the first step in technique selection. {#tbl-technique-prerequisites .striped .hover}
### Step 3: Estimate ROI {.unnumbered}
Meeting the prerequisites is necessary but not sufficient. Before committing engineering resources, estimate the return on investment for each candidate technique:
$$
\text{ROI} = \frac{\text{Baseline Cost} - (\text{Technique Cost} + \text{Implementation Cost})}{\text{Technique Cost} + \text{Implementation Cost}}
$$
A technique with high theoretical gains but high implementation cost may deliver lower ROI than a simpler approach. Deduplication, for example, often achieves the highest ROI because implementation cost is minimal and gains are immediate. Active Learning, by contrast, requires oracle access, retraining infrastructure, and selection algorithm development, so its ROI depends heavily on how many labeling cycles you expect to amortize that investment across.
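A short calculation shows how the formula separates the two cases. All dollar figures below are hypothetical, chosen only to illustrate the comparison; they are not measurements from any real project.

```python
def roi(baseline_cost, technique_cost, implementation_cost):
    """Return on investment relative to training without the technique."""
    invested = technique_cost + implementation_cost
    return (baseline_cost - invested) / invested

# Hypothetical costs in dollars.
baseline = 100_000
dedup_roi = roi(baseline, technique_cost=70_000, implementation_cost=2_000)
active_roi = roi(baseline, technique_cost=40_000, implementation_cost=45_000)
```

Under these assumed numbers, deduplication returns roughly 0.39 per dollar invested and active learning roughly 0.18, even though active learning cuts the technique cost far more. The implementation cost dominates until it is amortized over more labeling cycles.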
### Step 4: Combine Techniques {.unnumbered}
The techniques in this chapter are not mutually exclusive; in practice, the most effective pipelines combine multiple approaches. A typical production workflow begins by deduplicating the raw corpus for immediate gains at minimal cost. This cleaned dataset then undergoes coreset selection to identify the most informative samples. During training, curriculum learning orders these samples to optimize gradient quality, while data augmentation increases effective diversity at runtime. Finally, starting from a self-supervised foundation model rather than random initialization allows the pipeline to leverage knowledge learned from massive unlabeled corpora.
Each stage compounds the efficiency gains of previous stages, turning individual percentage improvements into multiplicative savings.
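The compounding effect is simple arithmetic. The per-stage multipliers below are hypothetical values in the spirit of the typical gains listed earlier, and the calculation assumes the stages act independently, which real pipelines only approximate.

```python
import math

# Hypothetical fraction of training cost remaining after each stage.
stages = {
    "deduplication": 0.70,        # 30% of samples removed
    "coreset selection": 0.50,    # half of the remainder kept
    "curriculum learning": 0.80,  # 20% fewer steps to converge
}
remaining = math.prod(stages.values())  # fraction of the original cost left
speedup = 1.0 / remaining
```

Under these assumptions the pipeline trains at 28% of the original cost, about a 3.6 $\times$ reduction that no single stage achieves alone.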
The decision framework above answers the *what* of data selection: which samples to prune, when to select dynamically, and how to synthesize new data. Understanding these algorithmic choices is essential, but algorithms alone do not translate into faster training. A perfectly designed coreset algorithm that takes 10 hours to select samples for a 2-hour training run yields no practical benefit. Similarly, a curriculum learning strategy that requires scanning the entire dataset to determine difficulty rankings may idle GPUs while CPUs compute scores. The *how* of implementation matters as much as the *what* of algorithm choice. Concretely, a 2 $\times$ improvement in your Information-Compute Ratio (ICR) is mathematically equivalent to doubling your hardware's peak throughput ($R_{peak}$) for that training run.
This gap between algorithmic elegance and practical value raises several systems questions. How do you avoid selection overhead negating your theoretical gains? How do you handle non-sequential I/O patterns that confuse prefetching logic? How do you coordinate selection decisions across distributed workers without introducing synchronization bottlenecks? The following sections address these engineering challenges, bridging the gap between data selection theory and production reality.
## Selection Engineering {#sec-data-selection-selection-engineering-a4eb}
\index{Data Selection Systems!engineering patterns}
Choosing the right data selection technique is necessary but not sufficient. The decision framework identified *which* algorithms to apply; now we must ensure those algorithms actually deliver their promised speedups when deployed on real hardware with real data pipelines. A naive active learning loop that scans the entire dataset every epoch to select the "best" samples will turn a compute-bound training job into an I/O-bound bottleneck. This section examines the architectural patterns required to implement data selection in production.
### The Selection Bottleneck {#sec-data-selection-selection-bottleneck-4d00}
\index{Selection Latency!dynamic selection bottleneck}
Dynamic data selection introduces a new bottleneck: *selection latency*. In standard training, the data loader reads the next batch sequentially. In active learning or curriculum learning, the system must evaluate a selection function $f(x)$ over a large candidate pool to determine the next batch. Concretely, scoring a 1M image dataset with a large model can take `{python} coreset_scoring_time_str`, potentially negating the savings from a `{python} coreset_pct_str` coreset if not performed with a smaller proxy model.
\index{Selection Inequality!definition}
For a selection strategy to be systems-efficient, it must satisfy the **Selection Inequality** expressed in @eq-selection-inequality:
$$ T_{selection} + T_{train}(N_{subset}) < T_{train}(N_{total}) $$ {#eq-selection-inequality}
Here $T_{selection}$ is the time spent scoring the pool and $T_{train}$ is the compute time. If $f(x)$ requires a forward pass of a large model, the cost of selection can exceed the cost of training, producing negative ROI. A concrete scenario illustrates this trade-off.
```{python}
#| label: selection-inequality-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SELECTION INEQUALITY WORKED EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Selection Inequality in Practice" callout
# │
# │ Goal: Demonstrate the selection inequality with a concrete 1M image scenario.
# │ Show: That proxy scoring (0.6 hrs) is essential to preserve coreset efficiency gains.
# │ How: Contrast total training time for full-model vs. proxy scoring.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: score_a_str, savings_b_pct_str, trap_pct_str, etc.
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (1M image coreset scenario) ---
n_images_value = 1_000_000 # total images
n_coreset_value = 100_000 # 10% coreset
n_epochs_value = 100 # training epochs
resnet50_time_per_image_value = 0.01 # sec/image (full model)
resnet18_time_per_image_value = 0.002 # sec/image (proxy)
trap_sel_hrs_value = 50 # hrs for 7B model scoring
# --- Process (compare options A, B, and trap scenario) ---
score_a_sec_value = n_images_value * resnet50_time_per_image_value
train_a_sec_value = n_coreset_value * n_epochs_value * resnet50_time_per_image_value
total_a_sec_value = score_a_sec_value + train_a_sec_value
total_a_hrs_value = total_a_sec_value / SEC_PER_HOUR
score_b_sec_value = n_images_value * resnet18_time_per_image_value
train_b_sec_value = train_a_sec_value # same training time
total_b_sec_value = score_b_sec_value + train_b_sec_value
total_b_hrs_value = total_b_sec_value / SEC_PER_HOUR
baseline_sec_value = n_images_value * n_epochs_value * resnet50_time_per_image_value
baseline_hrs_value = baseline_sec_value / SEC_PER_HOUR
savings_a_hrs_value = baseline_hrs_value - total_a_hrs_value
savings_a_pct_value = savings_a_hrs_value / baseline_hrs_value * 100
savings_b_hrs_value = baseline_hrs_value - total_b_hrs_value
savings_b_pct_value = savings_b_hrs_value / baseline_hrs_value * 100
b_beats_a_hrs_value = total_a_hrs_value - total_b_hrs_value
trap_total_hrs_value = trap_sel_hrs_value + train_a_sec_value / SEC_PER_HOUR
trap_overhead_pct_value = trap_sel_hrs_value / (baseline_hrs_value - trap_total_hrs_value) * 100
# --- Outputs (formatted strings for prose) ---
score_a_str = fmt(score_a_sec_value, precision=0, commas=True) # e.g. "10,000"
score_a_hrs_str = fmt(score_a_sec_value/SEC_PER_HOUR, precision=1, commas=False) # e.g. "2.8"
train_a_str = fmt(train_a_sec_value, precision=0, commas=True) # e.g. "100,000"
train_a_hrs_str = fmt(train_a_sec_value/SEC_PER_HOUR, precision=1, commas=False) # e.g. "27.8"
total_a_hrs_str = fmt(total_a_hrs_value, precision=1, commas=False) # e.g. "30.6"
score_b_str = fmt(score_b_sec_value, precision=0, commas=True) # e.g. "2,000"
score_b_hrs_str = fmt(score_b_sec_value/SEC_PER_HOUR, precision=1, commas=False) # e.g. "0.6"
total_b_hrs_str = fmt(total_b_hrs_value, precision=1, commas=False) # e.g. "28.3"
baseline_str = fmt(baseline_sec_value, precision=0, commas=True) # e.g. "1,000,000"
baseline_hrs_str = fmt(baseline_hrs_value, precision=0, commas=False) # e.g. "278"
savings_a_str = fmt(savings_a_hrs_value, precision=0, commas=False) # e.g. "247"
savings_a_pct_str = fmt(savings_a_pct_value, precision=0, commas=False) # e.g. "89"
savings_b_str = fmt(savings_b_hrs_value, precision=0, commas=False)                 # e.g. "249"
savings_b_pct_str = fmt(savings_b_pct_value, precision=0, commas=False) # e.g. "90"
b_beats_a_str = fmt(b_beats_a_hrs_value, precision=1, commas=False) # e.g. "2.2"
trap_total_str = fmt(trap_total_hrs_value, precision=1, commas=False) # e.g. "77.8"
trap_pct_str = fmt(trap_overhead_pct_value, precision=0, commas=False) # e.g. "25"
```
::: {.callout-example title="Selection Inequality in Practice"}
**Scenario**: You have 1 million training images and want to select a 100k coreset (10%) using EL2N scoring.
**Option A: Full Model Selection**
- Score all 1M images with your target ResNet-50: 1M $\times$ 0.01 s = **`{python} score_a_str` seconds** (`{python} score_a_hrs_str` hours)
- Train on 100k coreset for 100 epochs: 100k $\times$ 100 $\times$ 0.01 s = **`{python} train_a_str` seconds** (`{python} train_a_hrs_str` hours)
- **Total: `{python} total_a_hrs_str` hours**
**Option B: Proxy Model Selection**
- Score all 1M images with a small proxy (ResNet-18): 1M $\times$ 0.002 s = **`{python} score_b_str` seconds** (`{python} score_b_hrs_str` hours)
- Train on 100k coreset for 100 epochs: **`{python} train_a_str` seconds** (`{python} train_a_hrs_str` hours)
- **Total: `{python} total_b_hrs_str` hours**
**Baseline: No Selection**
- Train on full 1M dataset for 100 epochs: 1M $\times$ 100 $\times$ 0.01 s = **`{python} baseline_str` seconds** (`{python} baseline_hrs_str` hours)
**Analysis**:
- Option A saves `{python} savings_a_str` hours vs. baseline (`{python} savings_a_pct_str`% reduction) ✓
- Option B saves `{python} savings_b_str` hours vs. baseline (`{python} savings_b_pct_str`% reduction) ✓
- Option B beats Option A by `{python} b_beats_a_str` hours. Proxy selection yields better ROI.
**The Trap**: If your selection required 50 hours (e.g., running a 7B parameter model), you would spend `{python} trap_total_str` hours total. That is still better than baseline, but the selection overhead now equals `{python} trap_pct_str`% of the net savings that remain.
**Rule of thumb**: Selection time should be <10% of subset training time for good ROI.
:::
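The arithmetic in this example can be captured in a few lines. The following is a minimal sketch rather than part of the chapter's build code: the function name `selection_pays_off` and its parameters are hypothetical, and the per-image timings are the illustrative values from the scenario above.

```python
def selection_pays_off(n_total, n_subset, epochs, t_train, t_score):
    """Evaluate the Selection Inequality for a one-shot selection pass.

    Times are seconds per image. Returns (holds, hours_saved, overhead_ratio).
    """
    t_selection = n_total * t_score             # score every candidate once
    t_subset = n_subset * epochs * t_train      # train on the selected subset
    t_full = n_total * epochs * t_train         # baseline: train on everything
    holds = t_selection + t_subset < t_full     # the Selection Inequality
    hours_saved = (t_full - t_selection - t_subset) / 3600
    overhead_ratio = t_selection / t_subset     # rule of thumb: keep below ~0.10
    return holds, hours_saved, overhead_ratio


# Option A: score with the target model (0.01 s/image).
full_model = selection_pays_off(1_000_000, 100_000, 100, t_train=0.01, t_score=0.01)
# Option B: score with a 5x cheaper proxy (0.002 s/image).
proxy = selection_pays_off(1_000_000, 100_000, 100, t_train=0.01, t_score=0.002)

print(f"Full-model scoring: saves {full_model[1]:.1f} h, overhead {full_model[2]:.0%}")
print(f"Proxy scoring:      saves {proxy[1]:.1f} h, overhead {proxy[2]:.0%}")
```

Under these assumptions Option A sits right at the 10% threshold, while the proxy keeps selection overhead at about 2% of subset training time.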
The following analysis formalizes the 10% heuristic as *the selection inequality*, normalizing selection cost in epoch-equivalents.
```{python}
#| label: selection-inequality-math-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SELECTION INEQUALITY MATH DERIVATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Selection Inequality" callout (formal derivation)
# │
# │ Goal: Formalize the selection inequality using epoch-normalized costs.
# │ Show: That iterative selection can be slower than baseline training.
# │ How: Contrast one-shot vs. per-epoch selection overheads for a 100-epoch run.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: n_epochs_full_str, speedup_efficient_str, cost_total_iterative_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (normalized epoch costs) ---
n_epochs_full = 100 # baseline epochs
subset_fraction = 0.1 # keep 10%
cost_selection_full = 1 # 1 epoch equivalent
proxy_factor = 0.1 # proxy is 10x faster
# --- Process (compute costs for different strategies) ---
n_epochs_subset = n_epochs_full * subset_fraction
cost_total_efficient = cost_selection_full + n_epochs_subset
speedup_efficient = n_epochs_full / cost_total_efficient
cost_selection_iterative = n_epochs_full * 1 # selection every epoch
cost_total_iterative = cost_selection_iterative + n_epochs_subset
cost_selection_proxy = cost_selection_full * proxy_factor
cost_total_proxy = cost_selection_proxy + n_epochs_subset
# --- Outputs (formatted strings for prose) ---
n_epochs_full_str = fmt(n_epochs_full, precision=0, commas=False) # e.g. "100"
subset_fraction_pct_str = fmt(subset_fraction * 100, precision=0, commas=False) # e.g. "10"
cost_selection_full_str = fmt(cost_selection_full, precision=0, commas=False) # e.g. "1"
n_epochs_subset_str = fmt(n_epochs_subset, precision=0, commas=False) # e.g. "10"
cost_total_efficient_str = fmt(cost_total_efficient, precision=0, commas=False) # e.g. "11"
speedup_efficient_str = fmt(speedup_efficient, precision=0, commas=False) # e.g. "9"
cost_selection_iterative_str = fmt(cost_selection_iterative, precision=0, commas=False) # e.g. "100"
cost_total_iterative_str = fmt(cost_total_iterative, precision=0, commas=False) # e.g. "110"
proxy_factor_inv_str = fmt(1/proxy_factor, precision=0, commas=False) # e.g. "10"
cost_selection_proxy_str = fmt(cost_selection_proxy, precision=1, commas=False) # e.g. "0.1"
cost_total_proxy_str = fmt(cost_total_proxy, precision=1, commas=False) # e.g. "10.1"
subset_fraction_str = fmt(subset_fraction, precision=1, commas=False) # e.g. "0.1"
```
::: {.callout-notebook title="The Selection Inequality"}
**Problem**: You are using active learning to select the best `{python} subset_fraction_pct_str`% of samples for training. Your selection algorithm requires running the full model on the unlabeled pool. Is this efficient?
**The Math**:
1. **Full Training**: `{python} n_epochs_full_str` epochs. Total cost = `{python} n_epochs_full_str` $\times$ C_epoch.
2. **Selection (Full Model)**: Scoring the full dataset is equivalent to **`{python} cost_selection_full_str` epoch** of training. T_selection = `{python} cost_selection_full_str` $\times$ C_epoch.
3. **Subset Training**: `{python} n_epochs_full_str` epochs on `{python} subset_fraction_pct_str`% data = `{python} n_epochs_full_str` $\times$ `{python} subset_fraction_str` $\times$ C_epoch = `{python} n_epochs_subset_str` $\times$ C_epoch.
4. **Total Time**: `{python} cost_selection_full_str` + `{python} n_epochs_subset_str` = **`{python} cost_total_efficient_str`** $\times$ C_epoch.
5. **Speedup**: `{python} n_epochs_full_str` / `{python} cost_total_efficient_str` ≈ **`{python} speedup_efficient_str` $\times$**.
**The Trap**: If your selection algorithm is iterative (e.g., repeating selection every epoch), T_selection becomes `{python} n_epochs_full_str` $\times$ 1 = `{python} cost_selection_iterative_str` $\times$ C_epoch. Total time = `{python} cost_selection_iterative_str` + `{python} n_epochs_subset_str` = `{python} cost_total_iterative_str` $\times$ C_epoch. You are now **slower** than the baseline.
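**The Failure Condition**: If the cost of selecting data exceeds the cost of training on the discarded data, you have failed. The goal is to spend compute to save *more* compute. As discussed in the coreset selection section, proxy models solve this problem by reducing T_selection by an order of magnitude: with a proxy that is `{python} proxy_factor_inv_str` $\times$ faster, one-shot scoring costs `{python} cost_selection_proxy_str` $\times$ C_epoch and the total falls to `{python} cost_total_proxy_str` $\times$ C_epoch.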
:::
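The same accounting extends to any re-selection schedule. The sketch below works in epoch-equivalents (one unit is one pass of the full model over the full dataset) and assumes that selection cost grows linearly with the number of selection rounds and that subset training cost is unaffected by how often the subset is refreshed. `total_cost` is a hypothetical helper introduced for illustration.

```python
def total_cost(epochs, subset_fraction, rounds, cost_per_round=1.0):
    """Total cost in epoch-equivalents: selection rounds plus subset training."""
    return rounds * cost_per_round + epochs * subset_fraction


EPOCHS, FRACTION = 100, 0.1   # 100-epoch run, keep 10% of the data

schedules = {
    "one-shot, full model": total_cost(EPOCHS, FRACTION, rounds=1),
    "every 10 epochs, full model": total_cost(EPOCHS, FRACTION, rounds=10),
    "every epoch, full model": total_cost(EPOCHS, FRACTION, rounds=100),
    "every epoch, 10x proxy": total_cost(EPOCHS, FRACTION, rounds=100, cost_per_round=0.1),
}

for name, cost in schedules.items():
    verdict = "beats" if cost < EPOCHS else "loses to"
    print(f"{name:<28} {cost:6.1f} epoch-equivalents "
          f"({verdict} baseline, {EPOCHS / cost:.1f}x)")
```

Under these assumptions, re-selecting every 10 epochs with the full model and re-selecting every epoch with a 10 $\times$ cheaper proxy both cost 20 epoch-equivalents, whereas full-model scoring every epoch costs 110 and loses to the baseline.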
The stacked bars in @fig-selection-inequality show this trade-off using normalized, illustrative costs: efficient selection (center) saves 55% of total compute, while expensive selection (right) consumes all the savings in overhead.
::: {#fig-selection-inequality fig-cap="**The Selection Inequality**\index{Data Selection Systems!selection inequality}: Data selection only improves end-to-end efficiency if the overhead of selection plus training on the subset is less than training on the full dataset. A lightweight selection function (proxy model, cached embeddings) keeps selection overhead low; an expensive selection function (full model forward pass) can negate the savings." fig-alt="Stacked bar chart comparing three approaches: Baseline shows a single tall bar (100) for full training; Efficient Selection shows two short stacked bars (5 selection overhead plus 40 subset training) totaling 45 with a 55 percent savings annotation; Expensive Selection shows two stacked bars (60 selection overhead plus 40 subset training) totaling 100 with a No savings annotation."}
```{python}
#| label: fig-selection-inequality
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SELECTION INEQUALITY FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-selection-inequality (Selection Bottleneck section)
# │
# │ Goal: Visualize the total training cost components under selection.
# │ How: Plot stacked bars for baseline, efficient, and expensive selection scenarios.
# │
# │ Imports: numpy, mlsys.viz
# │ Exports: (figure output only)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot(figsize=(8, 6))
# --- Plot: Selection Inequality Stacked Bars ---
categories = ['Baseline', 'Efficient Selection', 'Expensive Selection']
full_train_cost = np.array([100, 0, 0])
selection_overhead = np.array([0, 5, 60])
subset_train_cost = np.array([0, 40, 40])
x = np.arange(len(categories))
width = 0.6
p1 = ax.bar(x, full_train_cost, width, label='Full Training', color=COLORS['BlueFill'], edgecolor=COLORS['BlueLine'])
p2 = ax.bar(x, selection_overhead, width, bottom=full_train_cost, label='Selection Overhead', color=COLORS['OrangeL'], edgecolor=COLORS['OrangeLine'])
p3 = ax.bar(x, subset_train_cost, width, bottom=full_train_cost + selection_overhead, label='Subset Training', color=COLORS['GreenFill'], edgecolor=COLORS['GreenLine'])
ax.set_ylabel('Total Time (Normalized)')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylim(0, 130)
ax.legend(loc='upper right', ncol=3, fontsize=9)
ax.annotate("", xy=(1, 45), xytext=(1, 100), arrowprops=dict(arrowstyle="<->", color=COLORS['GreenLine'], lw=2))
ax.text(1.1, 72, "55% Savings", color=COLORS['GreenLine'], fontweight='bold', va='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.annotate("", xy=(2, 100), xytext=(0, 100), arrowprops=dict(arrowstyle="-", linestyle="--", color=COLORS['grid']))
ax.text(2, 105, "No Savings!", color=COLORS['RedLine'], fontweight='bold', ha='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
plt.show()
```
:::
The lesson from @fig-selection-inequality is unambiguous: selection overhead can negate the benefits of training on a smaller subset. Before examining hardware-aware optimizations, verify your understanding of this trade-off.
::: {.callout-checkpoint title="The Selection Inequality" collapse="false"}
Data selection is not free. It introduces a new term to the Iron Law.
**The Equation**
- [ ] **Selection Cost**: Do you understand why $T_{selection} + T_{train}(N_{subset})$ must be less than $T_{train}(N_{total})$ for the technique to pay off?
- [ ] **Overhead Management**: How do proxy models and cached embeddings keep $T_{selection}$ low?
**Systems Implications**
- [ ] **I/O patterns**: Why does random access (required for dynamic selection) kill data loader throughput compared to sequential reads?
:::
### Hardware Empathy: The Random Access Penalty {#sec-data-selection-hardware-empathy-random-access-penalty-ef16}
The selection inequality addresses compute overhead, but data selection introduces a second, often overlooked cost: I/O pattern degradation. Data selection strategies like coresets or dynamic sampling often require **random access** to samples across the dataset, jumping to sample 47,231, then 892,104, then 3,417 based on selection scores. Standard training uses sequential reads that benefit from hardware readahead and OS page caching; random access patterns devastate throughput, especially on distributed filesystems or traditional hard drives. @tbl-io-performance quantifies this penalty across storage tiers.
| **Storage Tier** | **Sequential Throughput** | **Random I/O (IOPS)** | **Random Throughput (approx)** | **Random Penalty** |
|:-----------------|--------------------------:|----------------------:|-------------------------------:|-------------------:|
| **HDD (7.2k)** | ~150 MB/s | ~80 | ~0.3 MB/s | **500 $\times$** |
| **SATA SSD** | ~550 MB/s | ~10k | ~40 MB/s | **14 $\times$** |
| **NVMe SSD** | ~3,500 MB/s | ~500k | ~2,000 MB/s | **1.75 $\times$** |
| **Cloud (S3)**   |      ~100 MB/s (per conn) |      ~10--50 ms (lat) |            Very Low (per conn) |        **Extreme** |
: **The Cost of Randomness.** Comparative I/O throughput for sequential vs. random 4KB reads across different storage tiers. Standard data loaders optimize for sequential throughput, while data selection strategies often incur the random access penalty. {#tbl-io-performance .striped .hover}
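The "Random Throughput" and "Random Penalty" columns follow from multiplying IOPS by the request size. Here is a minimal sketch of that arithmetic, assuming 4 KB requests in decimal units and the approximate device figures from the table. The `random_penalty` helper is hypothetical.

```python
REQUEST_KB = 4  # random-read size assumed by the table

# (sequential MB/s, random IOPS): approximate figures from the table
tiers = {
    "HDD (7.2k)": (150, 80),
    "SATA SSD": (550, 10_000),
    "NVMe SSD": (3_500, 500_000),
}


def random_penalty(seq_mb_s, iops, request_kb=REQUEST_KB):
    """Return (random MB/s, sequential/random slowdown) for small random reads."""
    random_mb_s = iops * request_kb / 1000   # KB/s -> MB/s (decimal units)
    return random_mb_s, seq_mb_s / random_mb_s


for name, (seq, iops) in tiers.items():
    rand, penalty = random_penalty(seq, iops)
    print(f"{name:<11} random ~{rand:,.2f} MB/s, penalty ~{penalty:,.1f}x")
```

The HDD row comes out near 470 $\times$ rather than 500 $\times$ because the table rounds 0.32 MB/s down to 0.3 MB/s. The order of magnitude is what matters.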
\index{Vector Index!embedding-based selection}
High-efficiency systems mitigate this penalty through several techniques. Small proxy models (a 10M parameter "student" scoring on behalf of a 7B "teacher") reduce selection cost by an order of magnitude or more while largely preserving ranking quality. Embedding indices (e.g., FAISS[^fn-faiss]) replace $O(N)$ linear scans with sublinear approximate nearest-neighbor lookups, roughly $O(\log N)$ per query for graph-based indices such as HNSW. Both approaches share a common principle: decoupling selection from training enables independent optimization.
[^fn-faiss]: **FAISS (Facebook AI Similarity Search)**: Open-sourced by Meta AI Research in 2017 [@johnson2019billion]. FAISS provides highly optimized implementations of approximate nearest-neighbor search for dense vectors, using techniques like inverted file indices (IVF), product quantization (PQ), and hierarchical navigable small world graphs (HNSW). For data selection, FAISS enables efficient similarity-based operations at billion-scale: finding the $k$ nearest neighbors for each sample (used in coreset selection), identifying near-duplicate embeddings (for deduplication), and clustering large embedding datasets (for stratified sampling). A single GPU can search billions of vectors in milliseconds, making embedding-based selection computationally tractable for web-scale datasets.
\index{Shard-based Data Loading!sequential I/O optimization}
\index{Shuffle Buffer!approximate random sampling}
Data loaders also require architectural adaptation. Formats such as WebDataset and FFCV pack many samples into large contiguous files or shards, enabling efficient bulk reads even when target samples are scattered. Shuffle buffers provide a practical approximation: the loader reads large sequential shards into memory and samples randomly within the buffer, preserving sequential I/O throughput while achieving most of the statistical benefits of random sampling.
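The shuffle-buffer idea can be sketched as a small generator. This is an illustrative, framework-independent version rather than the implementation used by any particular loader: the function name `shuffle_buffer` and its parameters are hypothetical, and it assumes `buffer_size` is at least 1.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Yield items from a sequential stream in approximately random order.

    Items are read in storage order (sequential I/O) and held in a fixed-size
    buffer; each output is drawn uniformly at random from the buffer.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered item...
        buffer[idx] = item           # ...and put the new item in its slot
    rng.shuffle(buffer)              # drain whatever remains at the end
    yield from buffer


# Stand-in for samples read sequentially from shards on disk.
samples = range(20)
order = list(shuffle_buffer(samples, buffer_size=8))
print(order)
```

The randomness is local: a sample can be emitted much later than its storage position, but fewer than `buffer_size` positions earlier. Larger buffers therefore trade memory for better mixing.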
### Data Echoing: Amortizing I/O Costs {#sec-data-selection-data-echoing-amortizing-io-costs-b4e0}
The optimizations discussed so far address I/O bandwidth, but modern data selection pipelines introduce another bottleneck: CPU computation. Synthetic data generation and heavy augmentation shift the constraint from disk speed to augmentation throughput. Heavy augmentations like 3D rotations and MixUp, or on-the-fly generative synthesis, can leave the GPU idle if the CPU cannot keep pace with sample production. When the data pipeline produces samples slower than the GPU can consume them, GPU utilization drops and training time extends, negating the efficiency gains from smarter data selection.
\index{Data Echoing!amortizing I/O costs}
Data echoing[^fn-data-echoing] [@choi2020dataechoing] offers an elegant solution to this CPU-GPU imbalance. The technique reuses batches of data multiple times before fetching new samples, effectively trading sample diversity for GPU utilization. When the data pipeline (reading, decoding, augmenting) is slower than GPU processing, the GPU idles waiting for data.
[^fn-data-echoing]: **Data Echoing**: Introduced by Dami Choi et al. at Google Brain (ICML 2020). The technique exploits the observation that in modern training pipelines, the data preprocessing stage (CPU-bound: decoding, resizing, augmenting) is often slower than the training step (GPU-bound: forward pass, backward pass). Rather than letting the GPU idle, data echoing "echoes" (repeats) each preprocessed batch multiple times with different random augmentations, keeping the GPU busy while the CPU prepares fresh data. The key insight is that moderate repetition (2--4 $\times$) with varied augmentations degrades convergence minimally compared to the alternative of GPU idling. The paper showed that data echoing after augmentation preserves more than 95% of the convergence benefit of unique samples while nearly eliminating data pipeline stalls.
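Data echoing fills this gap by "echoing" (repeating) each batch $e$ times. Where the echo is inserted determines what is amortized: echoing before augmentation gives every repetition a fresh random transform, so the model still sees varied inputs, but it only saves the stages upstream of the echo point (reading and decoding). Echoing after augmentation also amortizes the augmentation cost, at the price of repeating identical samples.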
The optimal echo factor depends on the ratio $R$ of upstream processing time to downstream training time:
$$
R = \frac{T_{\text{data pipeline}}}{T_{\text{GPU training}}}
$$
If $R > 1$ (data pipeline is the bottleneck), choose an echo factor $e \leq R$: each increment in $e$ recovers idle GPU time, and $e = R$ saturates the GPU, so larger values add repetition without adding throughput. If $R \leq 1$ (GPU is the bottleneck), data echoing provides no benefit. The following worked example calculates these trade-offs for a realistic scenario.
```{python}
#| label: data-echoing-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DATA ECHOING ROI CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Worked Example: Data Echoing ROI" callout
# │
# │ Goal: Demonstrate when data echoing provides positive ROI.
# │ How: Calculate GPU utilization and training duration with and without echoing.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: pipeline_throughput_str, pipeline_ratio_str, idle_pct_str, echo_hrs_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (ImageNet training with heavy augmentation) ---
pipeline_throughput_value = 300 # images/sec (CPU-bound)
gpu_throughput_value = 800 # images/sec (GPU capacity)
n_epochs_echo_value = 90 # standard ImageNet epochs
imagenet_size_value = 1_280_000 # ~1.28M images
echo_factor_value = 2 # repeat each batch 2x
# --- Process (compute throughputs and training times) ---
ratio_r_value = gpu_throughput_value / pipeline_throughput_value
gpu_idle_pct_value = (1 - pipeline_throughput_value / gpu_throughput_value) * 100
no_echo_throughput_value = pipeline_throughput_value
no_echo_sec_value = n_epochs_echo_value * imagenet_size_value / no_echo_throughput_value
no_echo_hrs_value = no_echo_sec_value / SEC_PER_HOUR
gpu_util_no_echo_value = pipeline_throughput_value / gpu_throughput_value * 100
echo_throughput_value = pipeline_throughput_value * echo_factor_value
echo_sec_value = n_epochs_echo_value * imagenet_size_value / echo_throughput_value
echo_hrs_value = echo_sec_value / SEC_PER_HOUR
# --- Outputs (formatted strings for prose) ---
pipeline_throughput_str = fmt(pipeline_throughput_value, precision=0, commas=False) # e.g. "300"
gpu_throughput_str = fmt(gpu_throughput_value, precision=0, commas=False) # e.g. "800"
pipeline_ratio_str = fmt(ratio_r_value, precision=2, commas=False) # e.g. "2.67"
idle_pct_str = fmt(gpu_idle_pct_value, precision=0, commas=False) # e.g. "63"
no_echo_sec_str = fmt(no_echo_sec_value, precision=0, commas=True) # e.g. "384,000"
no_echo_hrs_str = fmt(no_echo_hrs_value, precision=0, commas=False) # e.g. "107"
gpu_util_str = fmt(gpu_util_no_echo_value, precision=0, commas=False) # e.g. "38"
echo_sec_str = fmt(echo_sec_value, precision=0, commas=True) # e.g. "192,000"
echo_hrs_str = fmt(echo_hrs_value, precision=0, commas=False) # e.g. "53"
echo_factor_str = fmt(echo_factor_value, precision=0, commas=False) # e.g. "2"
effective_throughput_str = fmt(echo_throughput_value, precision=0, commas=False) # e.g. "600"
```
::: {.callout-example title="Worked Example: Data Echoing ROI"}
**Scenario**: Training ResNet-50 on ImageNet with heavy augmentation (RandAugment + MixUp).
**Measurements**:
- Data pipeline throughput: `{python} pipeline_throughput_str` images/second (reading, decoding, augmenting on CPU)
- GPU training throughput: `{python} gpu_throughput_str` images/second (forward + backward pass)
- Ratio $R = T_{\text{pipeline}} / T_{\text{GPU}}$ = (1/`{python} pipeline_throughput_str`) / (1/`{python} gpu_throughput_str`) = `{python} gpu_throughput_str`/`{python} pipeline_throughput_str` ≈ `{python} pipeline_ratio_str` (GPU waiting `{python} idle_pct_str`% of time)
**Without Echoing**:
- Effective throughput: `{python} pipeline_throughput_str` images/second (limited by data pipeline)
- Training time for 90 epochs: 90 $\times$ 1.28M / `{python} pipeline_throughput_str` = **`{python} no_echo_sec_str` seconds (`{python} no_echo_hrs_str` hours)**
- GPU utilization: ~`{python} gpu_util_str`%
**With Echo Factor $e$ = `{python} echo_factor_str`**:
- Each batch is processed twice with different augmentations
- Effective throughput: `{python} effective_throughput_str` images/second (still below GPU capacity)
- Unique images per second: `{python} pipeline_throughput_str` (unchanged)
- Training time: 90 $\times$ 1.28M / `{python} effective_throughput_str` = **`{python} echo_sec_str` seconds (`{python} echo_hrs_str` hours)** if echoed data is equally valuable
**Echoed data has diminishing returns**: Research shows echoed samples provide approximately 70--90% of the value of fresh samples, depending on augmentation diversity. Empirically, Choi et al. measured a **3.25 $\times$ speedup** on ResNet-50 ImageNet training when reading data over a network, with minimal accuracy degradation.
**The Trade-Off**: Data echoing trades sample diversity for GPU utilization. It works best when:
1. Augmentation is diverse (each echo sees different transforms)
2. The dataset is already somewhat redundant
3. The echo factor $e$ stays below the critical threshold (~$4\times$ for ImageNet)
Above this threshold, the model starts memorizing and accuracy degrades.
:::
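The echo-factor rule can be turned into a small planning helper. The sketch below is an illustration under stated assumptions, not a prescribed procedure: it rounds $R$ down to an integer echo factor, caps it at 4 (the approximate ImageNet threshold quoted above), and treats echoed samples as equally valuable, so the echoed training time is a best case. The names `echo_plan` and `training_hours` are hypothetical.

```python
import math


def echo_plan(pipeline_rate, gpu_rate, max_echo=4):
    """Pick an echo factor and report the resulting throughput.

    Rates are in samples/second, so R = T_pipeline / T_GPU = gpu_rate / pipeline_rate.
    Returns (R, echo factor, effective samples/second, GPU utilization).
    """
    r = gpu_rate / pipeline_rate
    if r <= 1:
        echo = 1                               # GPU-bound: echoing cannot help
    else:
        echo = min(math.floor(r), max_echo)    # keep e <= R and under the cap
    effective_rate = min(pipeline_rate * echo, gpu_rate)
    return r, echo, effective_rate, effective_rate / gpu_rate


def training_hours(n_samples, epochs, rate):
    """Wall-clock hours to process epochs * n_samples at the given rate."""
    return n_samples * epochs / rate / 3600


r, echo, rate, util = echo_plan(pipeline_rate=300, gpu_rate=800)
print(f"R = {r:.2f}, echo factor = {echo}, GPU utilization = {util:.0%}")
print(f"No echo: {training_hours(1_280_000, 90, 300):.0f} h")
print(f"Echoed:  {training_hours(1_280_000, 90, rate):.0f} h (best case)")
```

With the measurements from the worked example, the helper selects $e = 2$ and raises GPU utilization to 75%. Whether the halved wall-clock time is realized depends on how much value each echoed sample retains.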
Data echoing also interacts with batch normalization. When the same image appears multiple times in a batch (or across nearby batches), batch normalization statistics become less representative of the true data distribution. This correlation violates the independence assumption underlying batch normalization's effectiveness. Practitioners address this by excluding consecutive echoes from the same batch or by maintaining separate batch normalization statistics for echoed samples.
These engineering patterns provide production-ready implementations of data selection principles. Proxy selection reduces the computational cost of identifying valuable samples. Sharded formats and shuffle buffers reconcile random access algorithms with sequential storage hardware. Data echoing maximizes GPU utilization when the data pipeline becomes the bottleneck. Together, they transform data selection from an algorithmic idea into a deployable system.
Engineering patterns solve the *how*, but practitioners still need to answer a more fundamental question: *should I invest in data selection at all?* A deduplication pipeline that costs \$50K to build but saves \$10K per training run pays for itself only after five runs, so justifying it requires a cost model. The next section provides the quantitative framework for these investment decisions.
## Cost Modeling {#sec-data-selection-cost-modeling-9b02}
\index{Cost Modeling!data selection economics}
The systems framing of data selection demands quantitative answers: *Should I label 10,000 more samples or buy more GPU hours? When does active learning pay for itself? What is the ROI of investing in deduplication infrastructure?*
### Quantifying Data Costs and ROI {#sec-data-selection-quantifying-data-costs-roi-f0b4}
\index{Total Cost of Data!formula}
Answering these questions requires understanding what training data actually costs. Total expense encompasses the full lifecycle of data acquisition, preparation, and utilization, extending well beyond storage fees. @tbl-cost-components breaks down the four cost components:
$$
C_{\text{total}} = C_{\text{acquire}} + C_{\text{label}} + C_{\text{store}} + C_{\text{process}}
$$
where:
| **Component** | **Formula** | **Typical Range** |
|:-------------------------|:------------------------------------------------------|----------------------------------------------:|
| **$C_{\text{acquire}}$** | $N \times c_{\text{sample}}$                          | \$0.001--\$10/sample (web scrape vs. licensed) |
| **$C_{\text{label}}$**   | $N_{\text{labeled}} \times c_{\text{label}}$          |        \$0.10--\$100/sample (crowd vs. expert) |
| **$C_{\text{store}}$**   | $S_{\text{bytes}} \times c_{\text{storage}} \times T$ |                        \$0.02--\$0.10/GB/month |
| **$C_{\text{process}}$** | $N \times E \times c_{\text{FLOP}}$ | Proportional to training FLOPs |
: **Total Cost of Training Data.** The four cost components span the full data lifecycle. Labeling costs ($C_{\text{label}}$) vary by three orders of magnitude depending on whether crowd workers or domain experts are required, making this the component most amenable to optimization through data selection. {#tbl-cost-components .striped .hover}
For a concrete example, consider training a vision model:
```{python}
#| label: cost-breakdown-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ TRAINING COST BREAKDOWN
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Cost Breakdown: ImageNet-Scale Training" callout
# │
# │ Goal: Demonstrate that data costs often exceed compute costs.
# │ How: Calculate total costs for raw data, labeling, storage, and training.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: c_raw_str, c_label_str, c_total_str, p_data_str, p_compute_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (ImageNet training cost scenario) ---
c_raw_value = 50000 # $ for licensed dataset
n_labels_value = 1_200_000 # images to label
cost_per_label_value = 0.05 # $/label (crowd)
c_store_value = 200 # $ storage (150GB × 12mo)
c_train_value = 25000 # $ GPU compute
# --- Process (compute totals and percentages) ---
c_label_value = n_labels_value * cost_per_label_value
c_total_value = c_raw_value + c_label_value + c_store_value + c_train_value
p_data_value = (c_raw_value + c_label_value + c_store_value) / c_total_value * 100
p_compute_value = c_train_value / c_total_value * 100
# --- Outputs (formatted strings for table) ---
c_raw_str = f"${c_raw_value:,}" # e.g. "$50,000"
c_label_str = f"${c_label_value:,.0f}" # e.g. "$60,000"
c_store_str = f"${c_store_value}" # e.g. "$200"
c_train_str = f"${c_train_value:,}" # e.g. "$25,000"
c_total_str = f"${c_total_value:,.0f}" # e.g. "$135,200"
p_data_str = f"{p_data_value:.0f}%"                      # e.g. "82%"
p_compute_str = f"{p_compute_value:.0f}%"                # e.g. "18%"
n_labels_str = fmt(n_labels_value / MILLION, precision=1) + "M" # e.g. "1.2M"
cb_cost_per_label_str = fmt(cost_per_label_value, precision=2, commas=False) # e.g. "0.05"
storage_gb = 150 # GB stored
storage_months = 12 # months
storage_str = f"{storage_gb} GB × {storage_months} months" # e.g. "150 GB × 12 months"
train_epochs = 100 # training epochs
train_gpus = 8 # A100 GPUs
train_hours = 24 # hours
train_desc_str = f"{train_epochs} epochs × {train_gpus} A100s × {train_hours} h"
```
::: {.callout-example title="Cost Breakdown: ImageNet-Scale Training"}
| **Cost Component** | **Calculation** | **Amount** |
|:-----------------------------------------------------------------------------------|:-----------------|:-----------------------------------------------------------------|
| **Raw data (`{python} n_labels_str` images)** | Licensed dataset | `{python} c_raw_str` |
| **Labels (`{python} n_labels_str` $\times$ USD `{python} cb_cost_per_label_str`)** | Crowd annotation | `{python} c_label_str` |
| **Storage (`{python} storage_str`)** | Cloud storage | `{python} c_store_str` |
| **Training (`{python} train_desc_str`)** | GPU compute | `{python} c_train_str` |
| **Total** | | **`{python} c_total_str`** |
| **Data vs. Compute ratio** | | **`{python} p_data_str` data, `{python} p_compute_str` compute** |
This ratio, where data costs dominate, is typical for supervised learning. The ratio inverts for self-supervised learning on web-scraped data, where compute dominates.
:::
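The breakdown above is a direct application of the total-cost equation. As a minimal sketch, the hypothetical helper `cost_breakdown` below sums the four components and reports each one's share. The dollar amounts are the illustrative figures from the callout, not measured prices.

```python
def cost_breakdown(acquire, label, store, process):
    """Return the total cost and each component's share of it."""
    components = {
        "acquire": acquire,
        "label": label,
        "store": store,
        "process": process,
    }
    total = sum(components.values())
    shares = {name: amount / total for name, amount in components.items()}
    return total, shares


label_cost = 1_200_000 * 0.05          # 1.2M crowd labels at $0.05 each
total, shares = cost_breakdown(
    acquire=50_000,                    # licensed raw data
    label=label_cost,
    store=200,                         # cloud storage for the project
    process=25_000,                    # GPU compute for training
)
data_share = 1 - shares["process"]     # everything except training compute

print(f"Total: ${total:,.0f}")
print(f"Data: {data_share:.1%}, compute: {shares['process']:.1%}")
```

Because labeling is the largest single term in this scenario, techniques that reduce the number of labeled samples act on the biggest lever in the budget.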
### ROI Framework for Data Selection Techniques {#sec-data-selection-roi-framework-data-selection-techniques-dade}
\index{Return on Investment!data selection framework}
Understanding total costs enables rational decisions about which efficiency techniques merit investment. Every technique carries both a cost (implementation effort, compute overhead) and a benefit (reduced data requirements, faster training). Comparing these trade-offs requires a common framework: Return on Investment (ROI).
$$
\text{ROI} = \frac{\text{Savings} - \text{Investment}}{\text{Investment}} \times 100\%
$$
The challenge lies in quantifying both sides accurately. Different techniques offer distinct cost-benefit profiles, summarized in @tbl-roi-profiles:
| **Technique** | **Investment (Cost)** | **Savings (Benefit)** |
|:----------------------|:--------------------------------------------------------|:-------------------------------------------------------------|
| **Deduplication** | One-time compute for hashing + infrastructure | Reduced storage, fewer epochs for same accuracy |
| **Coreset Selection** | Proxy model training + selection compute                | Train on 10–50% of data with minimal accuracy loss           |
| **Active Learning**   | Inference on unlabeled pool + human-in-the-loop latency | 2–10 $\times$ reduction in labeling budget for same acc. |
| **Data Augmentation** | CPU/GPU cycles for transforms | Effective dataset size increase without new data acquisition |
: **ROI Profiles for Data Selection Techniques.** Each technique occupies a different point in the investment-versus-savings space. Deduplication offers the lowest-risk entry point (minimal investment, guaranteed returns), while active learning offers the highest potential savings but requires the most infrastructure. {#tbl-roi-profiles .striped .hover}
### Break-Even Analysis {#sec-data-selection-breakeven-analysis-9a38}
ROI calculations assume that techniques deliver their promised benefits, but actual outcomes vary. For any technique, there exists a **break-even point** where investment equals savings. Below this threshold, the technique costs more than it saves; above it, the technique generates value. Identifying this threshold determines whether a technique makes sense for a given project.
```{python}
#| label: breakeven-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ACTIVE LEARNING BREAK-EVEN ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Break-Even Analysis" section (Example: Active Learning Break-Even)
# │
# │ Goal: Calculate the ROI and break-even point for active learning investments.
# │ How: At $10/label and $50/round inference cost, active learning reaches target
# │ accuracy with 2,000 labels + $500 inference versus 5,000 labels for random
# │ sampling. The resulting ~144% ROI shows active learning is economically
# │ justified when labeling savings exceed inference cost.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: cost_random_total_str, cost_active_total_str, roi_pct_str, be_n_random_str, be_n_active_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (active learning cost scenario) ---
cost_label = 10 # $/label
n_initial = 1000 # initial labeled set
n_queries_per_round = 100 # samples per round
cost_inference = 50 # $/round (scoring pool)
n_random = 5000 # random sampling needs
n_active = 2000 # active learning needs
n_rounds = 10 # AL query rounds
# --- Process (compute costs and ROI) ---
cost_random_total = n_random * cost_label
cost_active_label = n_active * cost_label
cost_active_inference = n_rounds * cost_inference
cost_active_total = cost_active_label + cost_active_inference
roi_pct = (cost_random_total - cost_active_total) / cost_active_total * 100
# --- Outputs (formatted strings for prose) ---
cost_label_str = fmt(cost_label, precision=0, commas=False) # e.g. "10"
n_initial_str = fmt(n_initial, precision=0, commas=True) # e.g. "1,000"
cost_initial_str = fmt(n_initial * cost_label, precision=0, commas=True) # e.g. "10,000"
n_queries_str = fmt(n_queries_per_round, precision=0, commas=False) # e.g. "100"
cost_inference_str = fmt(cost_inference, precision=0, commas=False) # e.g. "50"
be_n_random_str = fmt(n_random, precision=0, commas=True) # e.g. "5,000"
be_n_active_str = fmt(n_active, precision=0, commas=True) # e.g. "2,000"
n_rounds_str = fmt(n_rounds, precision=0, commas=False) # e.g. "10"
cost_random_total_str = fmt(cost_random_total, precision=0, commas=True) # e.g. "50,000"
cost_active_label_str = fmt(cost_active_label, precision=0, commas=True) # e.g. "20,000"
cost_active_inference_str = fmt(cost_active_inference, precision=0, commas=True) # e.g. "500"
cost_active_total_str = fmt(cost_active_total, precision=0, commas=True) # e.g. "20,500"
roi_pct_str = fmt(roi_pct, precision=0, commas=False) # e.g. "144"
```
Suppose labeling costs \$`{python} cost_label_str`/sample and active learning requires:
- Initial labeled set: `{python} n_initial_str` samples (\$`{python} cost_initial_str`)
- Oracle queries per round: `{python} n_queries_str` samples
- Inference cost per round: \$`{python} cost_inference_str` (scoring unlabeled pool)
- Target accuracy achievable with `{python} be_n_random_str` random samples
If active learning reaches target accuracy with only `{python} be_n_active_str` labeled samples:
**Random labeling cost** = `{python} be_n_random_str` $\times$ \$`{python} cost_label_str` = **\$`{python} cost_random_total_str`**
**Active learning cost** = `{python} be_n_active_str` $\times$ \$`{python} cost_label_str` + `{python} n_rounds_str` rounds $\times$ \$`{python} cost_inference_str` = **\$`{python} cost_active_total_str`**
**ROI** = (\$`{python} cost_random_total_str` $-$ \$`{python} cost_active_total_str`) / \$`{python} cost_active_total_str` $\times$ 100% = **`{python} roi_pct_str`%**
Break-even occurs when the labeling savings equal the selection overhead. If active learning reduces labeling by only 20% and selection overhead is high, ROI may be negative.
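The break-even condition can be made concrete with a short sketch. The function names are illustrative and not part of the chapter's `mlsys` package, and the model assumes, as in the example above, that the only selection overhead is a fixed inference cost per query round.

```python
def active_learning_roi(n_random, n_active, cost_label, n_rounds, cost_inference):
    """ROI (%) of active learning relative to labeling n_random samples at random."""
    cost_random = n_random * cost_label
    cost_active = n_active * cost_label + n_rounds * cost_inference
    return (cost_random - cost_active) / cost_active * 100


def break_even_labels(n_random, cost_label, n_rounds, cost_inference):
    """Largest labeled-set size at which active learning costs no more than random."""
    overhead = n_rounds * cost_inference  # assumption: overhead is inference only
    return n_random - overhead / cost_label


# Scenario from the text: $10/label, 10 rounds at $50/round, 5,000 random labels.
print(active_learning_roi(5000, 2000, 10, 10, 50))
print(break_even_labels(5000, 10, 10, 50))

# Hypothetical unfavorable case: only a 20% label reduction and $5,000/round overhead.
print(active_learning_roi(5000, 4000, 10, 10, 5000))
```

Under these assumptions the overhead is worth only a few dozen labels, so even a small labeling reduction breaks even. The last call shows how a modest reduction combined with expensive selection turns the ROI negative.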
### Amortization across Training Runs {#sec-data-selection-amortization-across-training-runs-f6b8}
Break-even analysis captures a snapshot in time, but many data selection investments span multiple projects. Techniques with high upfront costs yield significant returns when their benefits compound across repeated training runs. **Amortized ROI** accounts for this temporal dimension, as @tbl-dedup-costs and @tbl-amortized-roi illustrate for a deduplication pipeline:
$$
\text{Amortized ROI} = \frac{N_{runs} \times \text{Per-Run Savings} - \text{One-Time Investment}}{\text{One-Time Investment}} \times 100\%
$$
```{python}
#| label: amortization-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DEDUPLICATION AMORTIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Amortization across Training Runs" section
# │
# │ Goal: Demonstrate how infrastructure ROI depends on model reuse.
# │ How: Calculate amortized ROI for data deduplication infrastructure over time.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: cost_build_str, savings_per_run_str, roi_1_str, roi_50_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Inputs (deduplication infrastructure scenario) ---
cost_build = 50000 # $ engineering time
cost_compute_once = 5000 # $ one-time MinHash compute
savings_per_run = 10000 # $ saved per training run
# --- Process (compute amortized ROI at different run counts) ---
cost_investment = cost_build + cost_compute_once
runs = [1, 5, 10, 50]
rois = [(r * savings_per_run - cost_investment) / cost_investment * 100 for r in runs]
# --- Outputs (formatted strings for tables) ---
cost_build_str = fmt(cost_build, precision=0, commas=True) # e.g. "50,000"
cost_compute_once_str = fmt(cost_compute_once, precision=0, commas=True) # e.g. "5,000"
savings_per_run_str = fmt(savings_per_run, precision=0, commas=True) # e.g. "10,000"
roi_1_str = fmt(rois[0], precision=0, commas=False) # e.g. "-82"
roi_5_str = fmt(rois[1], precision=0, commas=False) # e.g. "-9"
roi_10_str = fmt(rois[2], precision=0, commas=False) # e.g. "82"
roi_50_str = fmt(rois[3], precision=0, commas=False) # e.g. "809"
```
| **Component** | **Cost** |
|:------------------------------------------|-----------------------------------------------:|
| **Build deduplication pipeline** | \$`{python} cost_build_str` (engineering time) |
| **Compute MinHash signatures (one-time)** | \$`{python} cost_compute_once_str` |
| **Per-run savings (20% less data)** | \$`{python} savings_per_run_str`/run |
: **Deduplication Infrastructure Cost Components.** The one-time investment covers engineering effort and initial compute; the per-run savings accrue with every subsequent training run on the deduplicated data. {#tbl-dedup-costs .striped .hover}
| **Number of Runs** | **Amortized ROI** |
|:-------------------|--------------------------------------------:|
| 1 run | `{python} roi_1_str`% (net loss) |
| 5 runs | `{python} roi_5_str`% (near break-even) |
| 10 runs | +`{python} roi_10_str`% (positive) |
| 50 runs | +`{python} roi_50_str`% (highly profitable) |
: **Amortized ROI over Multiple Training Runs.** A deduplication pipeline that loses money on its first use becomes highly profitable when amortized across 10+ runs, illustrating why infrastructure investments should be evaluated over their full expected lifetime. {#tbl-amortized-roi .striped .hover}
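The amortization table can be reproduced, and the break-even run count derived, with a few lines of Python. This is a minimal sketch using the scenario's figures; the helper names are illustrative, and it assumes per-run savings stay constant across runs.

```python
import math


def amortized_roi(n_runs, savings_per_run, one_time_investment):
    """Amortized ROI in percent after n_runs training runs."""
    return (n_runs * savings_per_run - one_time_investment) / one_time_investment * 100


def break_even_runs(savings_per_run, one_time_investment):
    """Smallest whole number of runs whose cumulative savings cover the investment."""
    return math.ceil(one_time_investment / savings_per_run)


investment = 50_000 + 5_000  # pipeline engineering + one-time MinHash compute
savings = 10_000             # saved per training run

for n in (1, 5, 10, 50):
    print(n, round(amortized_roi(n, savings, investment)))
print("break-even after", break_even_runs(savings, investment), "runs")
```

Expressing the threshold as a run count makes it easy to compare against the number of training runs a team realistically expects over the dataset's lifetime.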
This pattern reveals which circumstances favor infrastructure investment. Data selection investments deliver the highest returns under three conditions: training runs repeat frequently (hyperparameter search, model iterations, or scheduled retraining); datasets are shared across multiple teams or model architectures; and the technique generalizes broadly. Deduplication exemplifies a high-transfer investment because it benefits all models trained on the cleaned dataset. Task-specific coresets, by contrast, may not transfer across architectures, limiting their amortization potential. For one-off training runs, simple techniques like random sampling or basic augmentation often yield better ROI than sophisticated methods requiring substantial infrastructure investment. The following guidelines summarize these considerations.
::: {.callout-war-story title="The Test Set Leak"}
**The Context**: For years, ImageNet and CIFAR-10 were the gold standards for computer vision. Researchers competed to squeeze every 0.1% accuracy gain, assuming higher scores meant better generalization.
**The Failure**: In 2019, researchers at UC Berkeley (Recht et al.) discovered that these datasets contained significant near-duplicates between the training and test sets. A model could "solve" a test image simply by memorizing a nearly identical training image.
**The Consequence**: When the researchers constructed truly independent test sets (CIFAR-10.1 and ImageNetV2) by re-collecting images from the original sources with the same methodology, model accuracy dropped by 3–15% on CIFAR-10 and 11–14% on ImageNet. The "superhuman" performance was partly an illusion of memorization.
**The Systems Lesson**: Data leakage is the silent killer of generalization. If your test set is not rigorously deduplicated against your training set, your accuracy metric is a lie. You are measuring memory, not intelligence [@recht2019imagenet].
:::
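A first line of defense against this kind of leakage is to fingerprint every training item and check the test set against those fingerprints. The sketch below is illustrative only: it uses short strings as stand-ins for dataset items and catches only duplicates that are identical after case and whitespace normalization. Near-duplicate images would need perceptual hashing or embedding similarity, which this sketch does not attempt.

```python
import hashlib


def fingerprint(text):
    """SHA-256 of case- and whitespace-normalized content."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked_test_items(train_items, test_items):
    """Return indices of test items whose normalized content also appears in train."""
    train_hashes = {fingerprint(item) for item in train_items}
    return [i for i, item in enumerate(test_items) if fingerprint(item) in train_hashes]


# Hypothetical items: the first test entry differs from train only in case and spacing.
train = ["A photo of a red car", "Two dogs on a beach"]
test = ["a photo of a  RED car", "A cat on a sofa"]

leaked = find_leaked_test_items(train, test)
print(f"{len(leaked)}/{len(test)} test items also appear in train: {leaked}")
```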
::: {.callout-perspective title="When to Invest in Data Selection"}
**High ROI scenarios:**
- Labeling is expensive (medical, legal, scientific domains)
- Dataset is large and redundant (web-scraped corpora)
- Training runs are repeated frequently (hyperparameter search, retraining)
- Iteration speed matters more than final accuracy
**Low ROI scenarios:**
- Labeling is cheap or already done
- Dataset is small and curated
- Single training run (one-time cost)
- Accuracy matters more than efficiency
:::
These cost models assume a single machine with centralized access to the full dataset. Production ML training, however, distributes data across many workers, introducing coordination overhead that complicates every technique discussed so far. A coreset algorithm designed for a single GPU may behave differently when its dataset is sharded across hundreds of workers, and that difference can erode or amplify the ROI calculated above.
## Distributed Selection {#sec-data-selection-distributed-selection-d03e}
\index{Distributed Training!data selection challenges}
The preceding sections assumed centralized access to the full dataset: a single-machine view where one process can see the entire dataset, compute global statistics, and make coordinated selection decisions. This assumption simplifies algorithm design: coreset selection can rank all samples globally, curriculum learning can establish a universal difficulty ordering, and active learning can query the single most uncertain example. Production ML training breaks this assumption. When data is sharded across hundreds of workers, each seeing only a local slice, difficult questions arise: How do you compute a global coreset when no single node sees all samples? How do you maintain consistent curriculum difficulty rankings when the model updates asynchronously across workers?
The distributed training infrastructure that underlies these challenges (collective communication, fault tolerance, and elastic scheduling) constitutes an advanced topic beyond this chapter's scope. This section focuses specifically on how data selection techniques adapt to distributed settings, and where they fail to do so.
### Strategies for Distributed Selection {#sec-data-selection-strategies-distributed-selection-6119}
In standard distributed training, data parallelism is straightforward: shard the dataset across workers and let each worker process its shard independently. Data selection techniques, however, introduce *selection dependencies* (@tbl-selection-dependencies):
| **Technique** | **Single-Node Assumption** | **Distributed Challenge** |
|:------------------------|:--------------------------------|:---------------------------------------------|
| **Coreset Selection** | Global view of dataset | Each worker sees only its shard |
| **Active Learning** | Centralized uncertainty scoring | Scoring requires model synchronization |
| **Curriculum Learning** | Global difficulty ordering | Workers may have different "hardest" samples |
| **Deduplication** | Hash table fits in memory | Distributed hash tables add latency |
: **Selection Dependencies in Distributed Training.** Each data selection technique assumes centralized data access that distributed training violates. The distributed challenge column identifies the specific coordination problem that must be solved. {#tbl-selection-dependencies .striped .hover}
These selection dependencies admit several architectural solutions, each navigating a different point in the consistency-scalability trade-off space.
The most straightforward approach centralizes selection while distributing training. A coordinator node performs selection on the full dataset, then distributes selected indices to workers. This preserves selection quality but introduces a single bottleneck:
```
Coordinator: score_all_samples() → selected_indices
Broadcast: selected_indices → all workers
Workers: train on subset(local_shard, selected_indices)
```
The semantics remain clean, but the coordinator becomes a single point of failure and a bandwidth bottleneck for large selections. For modest cluster sizes, this overhead is acceptable; for thousand-node deployments, it becomes prohibitive.
Hierarchical selection addresses this scalability limitation by distributing the selection computation itself. Each worker performs local selection on its shard, then a coordinator merges results:
```
Workers: local_selected = select_top_k(local_shard)
Coordinator: global_selected = merge_and_rerank(all local_selected)
Broadcast: final_indices → all workers
```
This approach reduces coordinator load substantially, but introduces a quality trade-off: the system may miss globally important samples that fall below the cutoff within their local shard. If one shard happens to hold a disproportionate share of hard examples, a sample that ranks below that shard's local top-$k$ may still be harder than every sample another worker submits.
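The quality trade-off is easy to reproduce on a toy example. The sketch below is not a production implementation: the scores and sample ids are hypothetical, shards are plain Python lists, and `local_k` is an illustrative parameter for how many candidates each worker forwards to the coordinator.

```python
import heapq


def global_top_k(shards, k):
    """Centralized selection: rank every (score, sample_id) pair across all shards."""
    return set(heapq.nlargest(k, (item for shard in shards for item in shard)))


def hierarchical_top_k(shards, local_k, k):
    """Each worker keeps its local top local_k; the coordinator re-ranks the union."""
    candidates = []
    for shard in shards:
        candidates.extend(heapq.nlargest(local_k, shard))
    return set(heapq.nlargest(k, candidates))


# Hypothetical difficulty scores; shard 0 happens to hold all of the hard samples.
shards = [
    [(0.9, "a0"), (0.8, "a1"), (0.7, "a2"), (0.6, "a3")],
    [(0.5, "b0"), (0.4, "b1"), (0.3, "b2"), (0.2, "b3")],
]

exact = global_top_k(shards, k=4)
approx = hierarchical_top_k(shards, local_k=2, k=4)
recall = len(exact & approx) / len(exact)
print(f"recall of hierarchical selection: {recall:.2f}")
```

Raising `local_k` to the global budget `k` recovers the exact result, at the price of sending more candidates to the coordinator. Shuffling data before sharding reduces the skew that causes the loss in the first place.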
\index{Distributed MinHash!cross-shard deduplication}
When even hierarchical approaches prove too expensive, approximate global selection offers a fallback. These methods trade exactness for scalability through distributed approximate algorithms. Distributed MinHash enables deduplication by having each worker compute MinHash signatures independently; signatures are then aggregated to find near-duplicates across shards without requiring any single node to see all the data. Similarly, federated uncertainty sampling allows workers to compute local uncertainty scores, with a global threshold determined by score distribution statistics rather than exact ranking.
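The distributed MinHash idea can be sketched with the standard library alone. This is a simplified illustration, not a scalable implementation: word-level shingles, 64 salted BLAKE2b hashes, and an all-pairs comparison at the coordinator are assumptions made for brevity. Real pipelines typically replace the all-pairs step with locality-sensitive hashing bands.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """One minimum per salted hash function; computed locally on each worker."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def worker_signatures(shard):
    """Worker step: map each local document id to its signature."""
    return {doc_id: minhash_signature(text) for doc_id, text in shard.items()}


def find_near_duplicates(all_signatures, threshold=0.8):
    """Coordinator step: compare signatures only, never the raw documents."""
    return [
        (a, b) for a, b in combinations(sorted(all_signatures), 2)
        if estimated_jaccard(all_signatures[a], all_signatures[b]) >= threshold
    ]


# Hypothetical shards held by two different workers.
shard_a = {"a1": "the quick brown fox jumps over the lazy dog near the river bank"}
shard_b = {
    "b1": "The quick brown fox jumps over the lazy dog near the river bank",
    "b2": "gradient descent updates model weights using minibatch loss estimates",
}

merged = {**worker_signatures(shard_a), **worker_signatures(shard_b)}
duplicates = find_near_duplicates(merged)
print(duplicates)
```

Only fixed-size signatures cross the network, so communication per document is independent of document length.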
### Consistency Challenges in Active Learning {#sec-data-selection-consistency-challenges-active-learning-59c9}
The approximate selection strategies above assume static selection criteria, but active learning introduces an additional complication: the model changes during selection. Consider what happens when Worker A scores samples using the model at step $t$ while Worker B simultaneously updates the model to step $t+1$. Worker A's scores are now stale, so it may select samples that the updated model would rank differently.
Several strategies mitigate this staleness problem, each with distinct overhead characteristics. Synchronous scoring forces all workers to pause training and score simultaneously, guaranteeing consistency but at substantial cost in GPU utilization. Periodic score refresh offers a middle ground by re-scoring every $k$ epochs rather than every batch, trading freshness for reduced overhead. The most robust approach selects samples that exhibit high uncertainty under multiple model checkpoints, ensuring that selection decisions remain valid even as the model evolves. The following example demonstrates how these distributed selection strategies combine in practice.
```{python}
#| label: distributed-overhead-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DISTRIBUTED CORESET SELECTION OVERHEAD
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Distributed Coreset Selection" callout (8x A100 cluster example)
# │
# │ Goal: Quantify the end-to-end overhead of distributed coreset selection.
# │ How: Sum latencies for embedding, deduplication, scoring, and selection phases.
# │
# │ Imports: (none)
# │ Exports: t_embed_str, t_dedup_str, t_score_str, t_total_overhead_str
# └─────────────────────────────────────────────────────────────────────────────
# --- Inputs (distributed selection timing on 8x A100) ---
t_embed_value = 20 # minutes (parallel)
t_dedup_value = 15 # minutes (distributed hash)
t_score_value = 30 # minutes (parallel proxy)
t_select_value = 2 # minutes (centralized)
# --- Process ---
t_total_overhead_value = t_embed_value + t_dedup_value + t_score_value + t_select_value
# --- Outputs (formatted strings for prose) ---
t_embed_str = f"{t_embed_value} minutes" # e.g. "20 minutes"
t_dedup_str = f"{t_dedup_value} minutes" # e.g. "15 minutes"
t_score_str = f"{t_score_value} minutes" # e.g. "30 minutes"
t_select_str = f"{t_select_value} minutes" # e.g. "2 minutes"
t_total_overhead_str = f"{t_total_overhead_value} minutes" # e.g. "67 minutes"
```
::: {.callout-example title="Distributed Coreset Selection"}
**Scenario**: Select a 10% coreset from ImageNet (1.2M images) using 8 workers with 4 GPUs each.
**Architecture**:
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
% Coordinator node
\node[draw, rectangle, rounded corners, fill=blue!10, minimum width=10cm, minimum height=2cm, align=left] (coord) at (0, 2.5) {
\textbf{Coordinator Node}\\[2pt]
\textbullet\ Maintains global embedding index (FAISS)\\
\textbullet\ Merges local selections\\
\textbullet\ Broadcasts final coreset indices
};
% Worker nodes
\node[draw, rectangle, rounded corners, fill=green!10, minimum width=2.5cm, minimum height=1.8cm, align=center] (w0) at (-4, -1) {
\textbf{Worker 0}\\
150K images\\
Local EL2N
};
\node[draw, rectangle, rounded corners, fill=green!10, minimum width=2.5cm, minimum height=1.8cm, align=center] (w1) at (0, -1) {
\textbf{Worker 1}\\
150K images\\
Local EL2N
};
\node[draw, rectangle, rounded corners, fill=green!10, minimum width=2.5cm, minimum height=1.8cm, align=center] (wn) at (4, -1) {
\textbf{Worker N}\\
150K images\\
Local EL2N
};
% Arrows
\draw[->, thick] (w0.north) -- node[left, font=\footnotesize\usefont{T1}{phv}{m}{n}] {local\_scores} (w0.north |- coord.south);
\draw[->, thick] (w1.north) -- node[right, font=\footnotesize\usefont{T1}{phv}{m}{n}] {local\_scores} (w1.north |- coord.south);
\draw[->, thick] (wn.north) -- node[right, font=\footnotesize\usefont{T1}{phv}{m}{n}] {local\_scores} (wn.north |- coord.south);
% Ellipsis between workers
\node at (2, -1) {\Large$\cdots$};
\node at (-2, -1) {\Large$\cdots$};
\end{tikzpicture}
```
**Pipeline**:
1. **Embedding phase** (parallel): Each worker computes ResNet-18 embeddings for its shard → store in shared filesystem
2. **Deduplication phase** (distributed): Coordinator builds FAISS index, workers query for near-duplicates → remove 15% duplicates
3. **Scoring phase** (parallel): Each worker computes EL2N scores on its deduplicated shard using proxy model
4. **Selection phase** (centralized): Coordinator collects top-20% scores from each worker, re-ranks globally, selects final 10%
5. **Broadcast**: Selected indices distributed to all workers for training
**Performance** (measured on $8\times$ A100 cluster):
- Embedding: `{python} t_embed_str` (parallel)
- Deduplication: `{python} t_dedup_str` (distributed hash join)
- Scoring: `{python} t_score_str` (parallel, 5 epochs proxy training)
- Selection: `{python} t_select_str` (centralized)
- **Total overhead: `{python} t_total_overhead_str`** for $10\times$ training speedup
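**Key insight**: With a $10\times$ speedup, coreset training saves 90% of full-training time, so the `{python} t_total_overhead_str` selection overhead pays for itself in raw time once full training exceeds roughly 1.2 hours, and it falls below 10% of full-training time once training exceeds roughly 11 hours. For ImageNet with modern architectures, full training is ~24 hours, so coreset selection has clear positive ROI.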
:::
This positive ROI can erode quickly when workers must coordinate frequently during training. Distributed data selection always incurs a *coordination tax*: the overhead of maintaining consistent selection across workers. This tax must be smaller than the efficiency gains, or distributed selection yields negative ROI. As a rule of thumb, if selection overhead exceeds 10% of training time, simplify the selection strategy or increase the selection interval.
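Both thresholds can be checked with simple arithmetic. The sketch below uses the timings from the distributed coreset example. It assumes that the speedup applies uniformly to training time and that the 10% rule of thumb is measured against full-dataset training time; the text leaves that reference point open, and the function names are illustrative.

```python
def break_even_full_training_min(overhead_min, speedup):
    """Full-training time at which the time saved by the speedup equals the overhead."""
    return overhead_min / (1 - 1 / speedup)


def overhead_fraction(overhead_min, training_min):
    """Selection overhead as a fraction of a reference training time."""
    return overhead_min / training_min


overhead = 20 + 15 + 30 + 2  # minutes: embedding + deduplication + scoring + selection
full_training = 24 * 60      # minutes, the ~24-hour ImageNet run from the example

# Hours of full training needed before selection recovers its own cost.
print(break_even_full_training_min(overhead, speedup=10) / 60)

# Coordination tax relative to the full run (assumed reference for the 10% rule).
print(overhead_fraction(overhead, full_training))

# Hours of full training at which the overhead equals exactly 10%.
print(overhead / 0.10 / 60)
```

If selection is repeated during training, multiply the overhead by the number of selection rounds before applying either check.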
So far we have examined data selection techniques individually and in distributed settings. Real ML systems, however, combine data selection with model compression, hardware acceleration, and distributed training simultaneously, and these optimizations interact in ways that can amplify or undermine each other. Understanding these interactions is essential for designing efficient end-to-end pipelines.
## Cross-Layer Interactions {#sec-data-selection-crosslayer-interactions-1f39}
\index{Cross-Layer Interactions!data selection with other optimizations}
Data selection does not exist in isolation. A coreset-trained model will eventually be quantized for deployment. A curriculum-learning pipeline will run on specialized accelerators. An actively learned dataset will feed into distributed training. These downstream optimizations interact with data selection in ways that can amplify gains or introduce unexpected conflicts. Understanding these interactions helps practitioners design end-to-end efficient systems rather than optimizing components independently.
### Model Compression {#sec-data-selection-model-compression-9aef}
Model compression (@sec-model-compression) reduces the size of the trained model through pruning, quantization, and distillation. The training dataset directly affects how compressible the resulting model becomes. Perhaps counterintuitively, models trained on smaller, higher-quality datasets may be *more* compressible than those trained on larger, noisier ones.
The mechanism relates to how models encode information. A model trained on repetitive data learns redundant features that pruning later removes. The training compute required to learn those features was wasted, only to be discarded during compression. By contrast, a model trained on diverse, informative samples learns compact, non-redundant representations from the start, making subsequent compression more effective. Empirical evidence supports this relationship: in experiments on ImageNet, models trained on 50% coresets selected by EL2N compress to 4-bit precision with 2% less accuracy loss than models trained on the full dataset, because the curated training produced cleaner weight distributions that quantize more gracefully.
Data selection and model compression are therefore *complementary*. The techniques in this chapter can reduce both training cost *and* post-training compression effort. When planning an efficiency pipeline, apply data selection first; the resulting model will be easier to compress.
::: {.callout-war-story title="The 99% Sparsity Trap"}
**The Context**: Researchers at Google Brain investigated the impact of pruning on model performance. They pruned a ResNet model to 90%+ sparsity, removing the vast majority of weights.
**The Failure**: They found that while FLOPs decreased by 90%, the inference latency on standard hardware (GPUs/TPUs) often *increased*.
**The Consequence**: Standard hardware is optimized for dense matrix multiplication. Sparse matrices require irregular memory access (checking indices, jumping addresses). Without specialized hardware support (like NVIDIA's Sparse Tensor Cores) or structured pruning (removing entire channels), the overhead of managing sparsity outweighed the reduction in FLOPs.
**The Systems Lesson**: FLOPs are not latency. A 99% reduction in operations can yield a 0% reduction in time if the remaining operations are memory-bound or cache-inefficient. Optimization must target the hardware's actual bottleneck, not just an abstract metric [@hooker2020hardware].
:::
### Hardware Acceleration {#sec-data-selection-hardware-acceleration-cf40}
While model compression affects what happens after training, hardware acceleration determines how efficiently training itself proceeds. Hardware acceleration (@sec-hardware-acceleration) increases throughput through specialized accelerators, kernel optimization, and parallelization. Data selection affects which hardware bottlenecks dominate, and this relationship is more nuanced than simple speedup calculations suggest, as @tbl-bottleneck-shifts illustrates.
| **Scenario** | **Likely Bottleneck** | **Hardware Optimization** |
|:------------------------------|:------------------------------------|:------------------------------------------|
| **Large, sequential dataset** | Memory bandwidth | Larger batch sizes, gradient accumulation |
| **Small, curated dataset** | Compute (GPU idle waiting for data) | Faster data loaders, data echoing |
| **Dynamic selection** | Selection compute | Proxy models, cached embeddings |
: **Bottleneck Shifts from Data Selection.** Data selection changes the dataset characteristics, which in turn shifts which hardware resource becomes the bottleneck. Practitioners must re-profile their systems after applying aggressive data reduction to ensure hardware optimizations target the correct constraint. {#tbl-bottleneck-shifts .striped .hover}
Data selection can therefore shift the system from one bottleneck regime to another. A technique that reduces dataset size by 80% may move the bottleneck from I/O to GPU compute, requiring different hardware optimizations. Before applying aggressive data reduction, profile your system to understand which bottleneck you're targeting.
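A coarse way to see which regime a training loop is in is to time how long each step waits for its batch versus how long it spends computing. The sketch below simulates both sides with `time.sleep`, so the loader and step are hypothetical stand-ins. With a real accelerator, asynchronous kernels would need to be synchronized before reading the clock, which this sketch does not do.

```python
import time


def profile_input_vs_compute(batches, train_step, num_steps):
    """Return the fraction of wall-clock time spent waiting for the next batch."""
    wait, compute = 0.0, 0.0
    iterator = iter(batches)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(iterator)   # time blocked on the input pipeline
        t1 = time.perf_counter()
        train_step(batch)        # time spent in the training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait / (wait + compute)


def slow_loader():
    """Simulated input pipeline that takes 20 ms to produce each batch."""
    while True:
        time.sleep(0.02)
        yield [0] * 32


def fast_step(batch):
    """Simulated training step that takes 2 ms."""
    time.sleep(0.002)


fraction = profile_input_vs_compute(slow_loader(), fast_step, num_steps=10)
print(f"time waiting on data: {fraction:.0%}")
```

A high waiting fraction points to the input pipeline as the constraint, and a low one points to compute. Re-running the measurement after data reduction shows whether the bottleneck has moved.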
### Distributed Training {#sec-data-selection-distributed-training-9266}
The hardware bottleneck analysis above assumes single-machine training. The interactions become more complex when scaling to multiple machines, because data selection affects different parallelism strategies in distinct ways.
Under strong scaling, where a fixed dataset is distributed across more workers, data selection reduces communication overhead by reducing the number of gradient updates per epoch. Fewer samples mean fewer synchronization points, and communication costs often dominate at large worker counts. Under weak scaling, where total data would normally grow with the cluster to keep per-worker load constant, data selection techniques can maintain accuracy while adding workers without proportionally increasing total data. This capability proves essential when data collection rather than compute is the bottleneck. Even within straightforward data parallelism, smaller curated datasets reduce per-worker shard sizes, potentially improving cache utilization and reducing I/O stalls on each node.
These benefits must be weighed against the distributed selection challenges discussed in @sec-data-selection-distributed-selection-d03e. A technique that works well on a single GPU may incur prohibitive coordination overhead across 1,000 workers, negating its efficiency gains.
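The strong-scaling effect on synchronization can be estimated with a one-line calculation. The sketch assumes synchronous data parallelism with one gradient all-reduce per step and a fixed per-worker batch size; the worker count and batch size are hypothetical values chosen for illustration.

```python
import math


def sync_steps_per_epoch(num_samples, per_worker_batch, num_workers):
    """Gradient synchronizations per epoch under synchronous data parallelism."""
    return math.ceil(num_samples / (per_worker_batch * num_workers))


# Full ImageNet-scale dataset versus a 10% coreset on the same hypothetical cluster.
full = sync_steps_per_epoch(1_200_000, per_worker_batch=32, num_workers=256)
coreset = sync_steps_per_epoch(120_000, per_worker_batch=32, num_workers=256)
print(full, coreset)
```

The reduction in synchronization points is only a gain if the coordination needed to build and distribute the coreset costs less than the all-reduces it removes.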
### The Optimization Stack {#sec-data-selection-optimization-stack-ba66}
\index{Optimization Stack!multiplicative effects}
The preceding sections examined pairwise interactions, but production systems apply all these optimizations together. Trace the full optimization stack in @fig-optimization-stack, from data to deployment: each stage in this pipeline amplifies or attenuates the effects of others.
```{python}
#| label: fig-optimization-stack
#| echo: false
#| fig-cap: "**The Optimization Stack**: The complete pipeline from raw data to deployed system, showing how optimizations at each stage propagate downstream. Data artifacts (rounded boxes) flow through processing stages (rectangular boxes). Optimizations early in the pipeline, particularly data selection, have multiplicative effects because they reduce the workload for all subsequent stages."
#| fig-alt: "Pipeline diagram with two rows. Top row shows Raw Data flowing through Data Selection to Curated Data, then through Training to produce a Model. Bottom row shows the Model flowing through Compression to a Compact Model, then through Hardware optimization to a Deployed System."
import matplotlib.patches as mpatches
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot(figsize=(10, 3.5))
ax.set_xlim(-1, 14)
ax.set_ylim(-2.5, 1.5)
ax.set_aspect('equal')
ax.axis('off')
ax.grid(False)
bw, bh = 2.0, 0.7
arrow_kw = dict(arrowstyle='->', color='#555555', lw=1.5)
def box(ax, x, y, label, is_data=True):
fc = COLORS['BlueL'] if is_data else COLORS['OrangeL']
ec = COLORS['BlueLine'] if is_data else COLORS['OrangeLine']
style = "round,pad=0.12" if is_data else "square,pad=0.05"
rect = mpatches.FancyBboxPatch((x - bw/2, y - bh/2), bw, bh, boxstyle=style,
facecolor=fc, edgecolor=ec, linewidth=1.2, zorder=2)
ax.add_patch(rect)
ax.text(x, y, label, ha='center', va='center', fontsize=9, fontweight='bold', zorder=3)
# Row 1
for x, label, is_data in [(0, 'Raw Data', True), (3, 'Data Selection', False), (6, 'Curated Data', True),
(9, 'Training', False), (12, 'Model', True)]:
box(ax, x, 0.5, label, is_data)
# Row 2
for x, label, is_data in [(3, 'Compression', False), (6, 'Compact Model', True),
(9, 'Hardware', False), (12, 'Deployed System', True)]:
box(ax, x, -1.2, label, is_data)
# Row 1 arrows
for x1, x2 in [(1, 2), (4, 5), (7, 8), (10, 11)]:
ax.annotate('', xy=(x2, 0.5), xytext=(x1, 0.5), arrowprops=arrow_kw)
# Connector: Model -> Compression (down and left)
ax.annotate('', xy=(12, -0.1), xytext=(12, 0.15), arrowprops=dict(arrowstyle='-', color='#555555', lw=1.5))
ax.plot([12, 3], [-0.1, -0.1], color='#555555', lw=1.5, zorder=1)
ax.annotate('', xy=(3, -0.85), xytext=(3, -0.1), arrowprops=arrow_kw)
# Row 2 arrows
for x1, x2 in [(4, 5), (7, 8), (10, 11)]:
    ax.annotate('', xy=(x2, -1.2), xytext=(x1, -1.2), arrowprops=arrow_kw)
plt.show()
```
The pipeline in @fig-optimization-stack reveals why data selection occupies a strategic position. It sits at the head of the optimization stack, matching the ordering of the D·A·M taxonomy introduced in @sec-data-selection-data-selection-fundamentals-e839: Data first, then Algorithm, then Machine. Reducing the dataset by 50% through intelligent selection does not just halve data processing time; at a fixed number of epochs it also halves the training compute, and the resulting model may require less aggressive compression, which in turn demands less from hardware acceleration. Each downstream stage inherits the efficiency gains (or quality losses) from upstream decisions.
This *multiplicative effect* means that every FLOP saved in data processing is a FLOP that never needs to be executed, compressed, or accelerated. Conversely, poor data selection that degrades model quality forces downstream stages to compensate, whether through longer training, less aggressive compression, or over-provisioned hardware.
How do we quantify this multiplicative effect? How do we know whether a 50% dataset reduction actually delivers 50% compute savings, or whether it has inadvertently degraded model quality in ways that surface only in production? Answering these questions requires a rigorous measurement framework: metrics that capture both the efficiency gains and the quality costs of data selection decisions.
## Measurement Framework {#sec-data-selection-measurement-framework-733b}
\index{Measurement Framework!data selection metrics}
The techniques in this chapter (coreset selection, active learning, augmentation) all claim to improve efficiency. Only rigorous measurement separates effective techniques from intuition.
### Core Metrics {#sec-data-selection-core-metrics-32e9}
\index{Performance-Per-Data!metric definition}
Three metrics form the core measurement toolkit for evaluating data selection effectiveness.
#### Performance-Per-Data {.unnumbered}
The most direct metric, Performance-Per-Data (PPD), measures accuracy gain per sample:
$$
\text{PPD}(n) = \frac{\text{Accuracy}(n) - \text{Accuracy}(0)}{n}
$$
where $n$ is the number of training samples and $\text{Accuracy}(0)$ is the baseline accuracy with no training data (for example, chance level). A higher PPD indicates more efficient use of data. The key insight is that PPD exhibits **diminishing returns**: the first 10,000 samples contribute far more to model performance than the next 10,000.
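The definition can be checked on a toy learning curve. The sketch below is illustrative only: the accuracy values and the helper name `performance_per_data` are assumptions made for this example, not measurements from the chapter.

```python
def performance_per_data(acc_n, acc_0, n):
    """PPD(n) = (Accuracy(n) - Accuracy(0)) / n, in accuracy points per sample."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (acc_n - acc_0) / n

# Hypothetical learning curve: dataset size -> accuracy (%).
# Accuracy(0) is the untrained baseline (chance level for 10 classes).
curve = {0: 10.0, 10_000: 70.0, 20_000: 78.0, 40_000: 84.0}

ppd = {n: performance_per_data(acc, curve[0], n) for n, acc in curve.items() if n > 0}

# Marginal gain of the first 10,000 samples versus the next 10,000.
first_10k_gain = curve[10_000] - curve[0]
next_10k_gain = curve[20_000] - curve[10_000]

for n, value in ppd.items():
    print(f"n={n:>6}: PPD = {value * 1000:.2f} accuracy points per 1,000 samples")
print(f"first 10k: +{first_10k_gain:.0f} points, next 10k: +{next_10k_gain:.0f} points")
```

In this made-up curve, PPD falls as $n$ grows, which is the diminishing-returns pattern described above.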
#### Area Under the Learning Curve {.unnumbered}
Rather than comparing at a single point, AULC integrates performance across all dataset sizes:
$$
\text{AULC} = \int_0^N \text{Accuracy}(n) \, dn
$$
A data-efficient strategy has higher AULC because it achieves good accuracy faster. This metric is particularly useful for comparing coreset selection algorithms.
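In practice the learning curve is only known at a handful of dataset sizes, so the integral has to be approximated. A minimal sketch using the trapezoidal rule follows; the checkpoint values and the function name `aulc` are hypothetical, and dividing by $N$ is an optional normalization added here for readability rather than part of the definition above.

```python
def aulc(sizes, accuracies):
    """Approximate the integral of Accuracy(n) dn with the trapezoidal rule."""
    if len(sizes) != len(accuracies) or len(sizes) < 2:
        raise ValueError("need at least two matching (size, accuracy) points")
    area = 0.0
    for i in range(1, len(sizes)):
        width = sizes[i] - sizes[i - 1]
        if width <= 0:
            raise ValueError("sizes must be strictly increasing")
        area += 0.5 * (accuracies[i] + accuracies[i - 1]) * width
    return area

# Hypothetical checkpoints: accuracy (%) measured at increasing dataset sizes.
sizes = [0, 10_000, 20_000, 50_000]
acc_random = [10.0, 55.0, 68.0, 80.0]
acc_selected = [10.0, 72.0, 78.0, 81.0]

aulc_random = aulc(sizes, acc_random)
aulc_selected = aulc(sizes, acc_selected)

# Dividing by N gives the average accuracy along the curve.
n_max = sizes[-1]
print(f"random:   {aulc_random / n_max:.1f}% average accuracy")
print(f"selected: {aulc_selected / n_max:.1f}% average accuracy")
```

Both strategies end at nearly the same accuracy in this toy data, yet the selected curve has the larger area because it gets there sooner.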
#### Data Compression Ratio {.unnumbered}
For coreset methods, the Data Compression Ratio (DCR) measures how much data reduction is achieved at a target accuracy:
$$
\text{DCR} = \frac{N_{\text{full}}}{N_{\text{coreset}}} \text{ at } \text{Accuracy}_{\text{target}}
$$
A DCR of 5 $\times$ means the coreset achieves target accuracy with 20% of the data.
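Computing DCR requires accuracy measurements at several coreset sizes. The following sketch assumes such a table already exists; the numbers and the helper name `data_compression_ratio` are hypothetical.

```python
def data_compression_ratio(n_full, coreset_curve, target_accuracy):
    """Return N_full / N_coreset for the smallest coreset meeting the target.

    coreset_curve maps coreset size -> measured accuracy (%).
    Returns None when no evaluated coreset reaches the target.
    """
    qualifying = [n for n, acc in coreset_curve.items() if acc >= target_accuracy]
    if not qualifying:
        return None
    return n_full / min(qualifying)

# Hypothetical measurements for a 50,000-sample dataset.
n_full = 50_000
coreset_curve = {5_000: 84.0, 10_000: 90.5, 25_000: 91.8, 50_000: 92.0}

dcr = data_compression_ratio(n_full, coreset_curve, target_accuracy=90.0)
print(f"DCR at 90% target: {dcr:.0f}x ({100 / dcr:.0f}% of the data)")
```

Returning `None` when no coreset reaches the target is a deliberate choice: a DCR is only meaningful relative to an accuracy level that was actually achieved.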
### The Compute-Optimal Frontier {#sec-data-selection-computeoptimal-frontier-24ba}
The metrics above measure individual techniques. How, then, do you diagnose whether your *overall* training strategy is limited by data or by compute? Scaling laws provide the answer.
Research on neural scaling laws [@kaplan2020scaling; @hoffmann2022training] established that model performance follows predictable power laws with respect to compute, data, and model size. These laws are of more than theoretical interest: they offer a diagnostic framework for deciding whether a training run is limited by data quality or by compute budget. For the relationship between scaling laws and the **Information Roofline** (the theoretical ceiling on what can be learned from any dataset), see @sec-dam-taxonomy.
\index{Chinchilla!compute-optimal training}
The Chinchilla study[^fn-chinchilla] [@hoffmann2022training] revealed a key insight: for any fixed compute budget, there exists an **optimal balance** between model size and training data. Train on too little data relative to model size, and you waste compute on an undertrained model. Train on too much data with too small a model, and you waste data on a model that cannot absorb it.
[^fn-chinchilla]: **Chinchilla**: A 70-billion-parameter language model trained by DeepMind (Hoffmann et al., March 2022), named after the small South American rodent. The paper's central finding upended the "bigger is better" assumption: GPT-3 (175B parameters trained on 300B tokens) was significantly *undertrained* relative to its size. Chinchilla, with only 70B parameters but trained on 1.4T tokens (4.7 $\times$ more data), outperformed GPT-3 on most benchmarks. The "Chinchilla scaling law" prescribes that model parameters and training tokens should scale roughly equally: doubling model size should be accompanied by doubling training data. This finding redirected the field's emphasis from model scaling toward data scaling, making data selection techniques directly relevant to frontier model training.
This optimal balance defines a **compute-optimal frontier**\index{Compute-optimal frontier!Chinchilla scaling laws}: the best achievable performance at each compute budget when data and model size are properly balanced (@fig-compute-optimal-frontier).
```{python}
#| label: fig-compute-optimal-frontier
#| echo: false
#| fig-cap: "**The Compute-Optimal Frontier**: For any training compute budget, there is a best achievable performance when data and model size are optimally balanced (green curve). Operating points below the frontier indicate inefficiency. **Data-starved** systems (orange) have compute capacity but insufficient quality data; the techniques in this chapter move them toward the frontier. **Compute-starved** systems (red) have quality data but insufficient training budget; hardware acceleration or distributed training helps here. The goal is to operate *on* the frontier, extracting maximum performance from available resources."
#| fig-alt: "A log-log plot with Training Compute on x-axis and Model Performance on y-axis. A green curve shows the optimal frontier. Orange point below curve labeled Data-starved. Red point below curve labeled Compute-starved. Purple point on curve labeled Optimal."
import numpy as np
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot(figsize=(8, 6))
# Compute-optimal frontier curve
x = np.logspace(0, 3, 200)
# log10 matches the decade-spaced ticks and places the 'Optimal' marker on the curve
y = 95 - 45 * np.exp(-0.8 * np.log10(x))
# Suboptimal region fill
ax.fill_between(x, 50, y, color='gray', alpha=0.08)
# Frontier curve
ax.plot(x, y, color=COLORS['GreenLine'], linewidth=2.5, label='Compute-optimal frontier', zorder=3)
# Data-starved point
ax.scatter([100], [72], color=COLORS['OrangeLine'], s=100, zorder=4, edgecolors='white')
ax.annotate('', xy=(100, 83), xytext=(100, 73), arrowprops=dict(arrowstyle='->', color=COLORS['OrangeLine'], lw=1.5, linestyle='dashed'))
ax.text(150, 78, 'Better data\nselection', fontsize=8, color=COLORS['OrangeLine'], va='center',
bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
ax.text(100, 69, 'Data-starved', fontsize=9, fontweight='bold', color=COLORS['OrangeLine'], ha='center', va='top',
bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
# Compute-starved point
ax.scatter([10], [62], color=COLORS['RedLine'], s=100, zorder=4, edgecolors='white')
ax.annotate('', xy=(28, 74), xytext=(11, 63), arrowprops=dict(arrowstyle='->', color=COLORS['RedLine'], lw=1.5, linestyle='dashed'))
ax.text(12, 66, 'More\ntraining', fontsize=8, color=COLORS['RedLine'], va='bottom',
bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
ax.text(10, 59, 'Compute-starved', fontsize=9, fontweight='bold', color=COLORS['RedLine'], ha='center', va='top',
bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
# Optimal point
ax.scatter([300], [88.5], color=COLORS['VioletLine'], s=100, zorder=4, edgecolors='white')
ax.text(340, 89.5, 'Optimal', fontsize=9, fontweight='bold', color=COLORS['VioletLine'],
bbox=dict(facecolor='white', edgecolor='none', alpha=0.9, pad=2))
# Suboptimal region label
ax.text(25, 54, 'Suboptimal region', fontsize=9, color='gray')
ax.set_xscale('log')
ax.set_xlabel('Training Compute (FLOPs)')
ax.set_ylabel('Model Performance (Accuracy)')
ax.set_xlim(1, 1000)
ax.set_ylim(50, 100)
ax.set_xticks([1, 10, 100, 1000])
ax.set_xticklabels(['$10^{18}$', '$10^{19}$', '$10^{20}$', '$10^{21}$'])
ax.set_yticks([50, 60, 70, 80, 90, 100])
ax.legend(loc='upper left', fontsize=9, frameon=False)
plt.show()
```
#### Diagnosing Your Position {.unnumbered}
The frontier provides a practical diagnostic framework:
- **Data-starved** (orange): You have training compute available, but performance falls short of what the frontier predicts. The bottleneck is data quality or quantity. *Solution*: Apply the techniques from this chapter (deduplication, coreset selection, curriculum learning, or synthetic augmentation) to extract more learning per sample.
- **Compute-starved** (red): You have high-quality data, but insufficient compute to fully exploit it. Adding more data will not help. *Solution*: Invest in hardware acceleration (@sec-hardware-acceleration), longer training runs, or distributed training.
- **On the frontier** (purple): Data and compute are balanced. You are extracting maximum value from both resources. Further improvement requires increasing *both* data quality and compute proportionally.
#### The Chinchilla Rule of Thumb {.unnumbered}
For compute-optimal training, the number of training tokens should scale roughly as $D_{opt} \propto C^{0.5}$. Doubling your compute budget means you should increase data by about 40%, not 100%. This explains why the Data Wall is so constraining: as compute grows exponentially, the demand for quality data grows with its square root, but even that slower growth outpaces the supply of high-quality human-generated content.
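The square-root relationship can be made concrete with a short calculation. The sketch below rests on two widely used approximations that do not appear in this section and should be treated as assumptions: training compute $C \approx 6ND$ for $N$ parameters and $D$ tokens, and roughly 20 tokens per parameter at the compute-optimal point. The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumption: D ~= 20 * N at the compute-optimal point

def compute_optimal_allocation(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

base_budget = 1e21
_, tokens_1x = compute_optimal_allocation(base_budget)
_, tokens_2x = compute_optimal_allocation(2 * base_budget)
_, tokens_100x = compute_optimal_allocation(100 * base_budget)

data_growth_2x = tokens_2x / tokens_1x       # sqrt(2): about 41% more data
data_growth_100x = tokens_100x / tokens_1x   # sqrt(100): 10x more data
print(f"2x compute   -> {data_growth_2x:.2f}x tokens")
print(f"100x compute -> {data_growth_100x:.1f}x tokens")
```

Under these assumptions, a budget of $6 \times 70\text{B} \times 1.4\text{T}$ FLOPs recovers the 70B-parameter, 1.4T-token configuration cited in the Chinchilla footnote, which is a useful sanity check on the constants.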
#### Applying the Diagnostic {.unnumbered}
If your training run underperforms expectations, ask: *Am I data-starved or compute-starved?* A simple test: train for 2 $\times$ longer. If performance improves substantially, you were compute-starved. If it plateaus quickly, you are data-starved and need better data, not more training. The techniques in this chapter address the data-starved regime; hardware acceleration and distributed training address the compute-starved regime.
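The doubling test can be written as a small decision rule. This is a sketch under stated assumptions: the function name `diagnose_regime` and the 1-point threshold for a "substantial" improvement are hypothetical, and both accuracies are assumed to come from the same held-out validation set.

```python
def diagnose_regime(acc_baseline, acc_after_2x, min_gain=1.0):
    """Classify a run from the accuracy change after doubling training time.

    min_gain (accuracy points) is a hypothetical threshold for "substantial";
    tune it to the noise level of the validation metric.
    """
    gain = acc_after_2x - acc_baseline
    if gain >= min_gain:
        return "compute-starved"   # more training still helps
    return "data-starved"          # plateau: better data is needed

regime_a = diagnose_regime(acc_baseline=80.0, acc_after_2x=84.0)
regime_b = diagnose_regime(acc_baseline=80.0, acc_after_2x=80.2)
print(regime_a, regime_b)
```

The threshold matters more than the rule itself: if run-to-run variance is around half a point, a gain of that size should not be read as evidence of either regime.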
Watch how the two curves diverge in @fig-ppd-curve: a data-efficient selection strategy (blue) reaches the performance plateau much faster than random sampling (gray). The gap between the curves at any dataset size represents the efficiency opportunity: compute that could be saved by smarter data curation.
::: {#fig-ppd-curve fig-cap="**Diminishing Returns of Data**: Random sampling (gray) versus data-efficient selection (blue). The efficient strategy achieves higher performance with less data, reaching the convergence plateau much earlier. The red arrow shows the efficiency gap at a fixed dataset size." fig-alt="A plot with X-axis 'Dataset Size' and Y-axis 'Performance'. Two curves start at 0. The 'Random' curve rises slowly. The 'Efficient' curve rises steeply and plateaus early."}
```{python}
#| label: fig-ppd-curve
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DIMINISHING RETURNS FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-ppd-curve (Measuring Data Selection section)
# │
# │ Goal: Visualize the performance benefits of data selection.
# │ Show: That efficient selection reaches the accuracy plateau faster than random sampling.
# │ How: Plot exponential saturation curves for random vs. curated datasets.
# │
# │ Imports: numpy, mlsys.viz
# │ Exports: (figure output only)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsys import viz
fig, ax, COLORS, plt = viz.setup_plot(figsize=(8, 6))
# --- Plot: Efficient vs Random Data Selection ---
x = np.linspace(0, 100, 200)
y_random = 95 * (1 - np.exp(-0.04 * x))
y_efficient = 95 * (1 - np.exp(-0.15 * x))
ax.plot(x, y_random, '--', color=COLORS['grid'], label='Random Sampling', linewidth=2)
ax.plot(x, y_efficient, '-', color=COLORS['BlueLine'], label='Efficient Selection', linewidth=2.5)
ax.fill_between(x, y_random, y_efficient, color=COLORS['BlueL'], alpha=0.1)
ax.set_xlabel('Dataset Size (% of Total)')
ax.set_ylabel('Model Accuracy (%)')
ax.set_xlim(0, 100)
ax.set_ylim(0, 100)
idx = 40
x_val, y_eff, y_rnd = x[idx], y_efficient[idx], y_random[idx]
ax.annotate("", xy=(x_val, y_eff), xytext=(x_val, y_rnd),
arrowprops=dict(arrowstyle="<->", color=COLORS['RedLine'], lw=1.5))
ax.text(x_val+2, (y_eff+y_rnd)/2, "Efficiency Gap\n(Saved Compute)", color=COLORS['RedLine'], fontsize=9, va='center', fontweight='bold', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.legend(loc='lower right', fontsize=9)
plt.show()
```
:::
The practical question for practitioners: at what point should you stop collecting data and start curating it? When does adding more samples waste compute rather than improve accuracy? These questions require rigorous metrics that quantify diminishing returns, compare selection strategies, and evaluate the cost-effectiveness of different data sources.
Data selection techniques all make implicit claims about the value of different samples, and validating that a curated dataset actually preserves model quality requires systematic benchmarking across three dimensions. Coverage metrics validate that coreset selection preserved representation across classes and demographic groups. Distribution alignment metrics (such as KL divergence and PSI; see @sec-data-foundations-measuring-drift-divergence-a6dd) detect whether the curated training set drifted from the deployment distribution. Label quality metrics (inter-annotator agreement, confident learning) validate that active learning did not introduce systematic labeling errors. A 50% dataset reduction is only valuable if benchmarking confirms the model maintains target accuracy, calibration, and robustness.
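A distribution alignment check can be sketched in a few lines. The example below compares class frequencies of a full dataset against two candidate coresets using KL divergence and PSI. The label data, the smoothing constant `eps`, and the helper names are illustrative assumptions; where they differ, the definitions in @sec-data-foundations-measuring-drift-divergence-a6dd take precedence.

```python
import math
from collections import Counter

def class_distribution(labels, classes, eps=1e-6):
    """Smoothed class frequencies so empty classes do not produce log(0)."""
    counts = Counter(labels)
    raw = [counts.get(c, 0) + eps for c in classes]
    total = sum(raw)
    return [r / total for r in raw]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def psi(expected, actual):
    """Population Stability Index between a reference and a candidate distribution."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

classes = ["a", "b", "c"]
full = ["a"] * 500 + ["b"] * 400 + ["c"] * 100
balanced_coreset = ["a"] * 50 + ["b"] * 40 + ["c"] * 10   # preserves class mix
skewed_coreset = ["a"] * 70 + ["b"] * 30                  # class "c" pruned away

p_full = class_distribution(full, classes)
p_balanced = class_distribution(balanced_coreset, classes)
p_skewed = class_distribution(skewed_coreset, classes)

kl_balanced = kl_divergence(p_full, p_balanced)
kl_skewed = kl_divergence(p_full, p_skewed)
psi_skewed = psi(p_full, p_skewed)
print(f"KL balanced={kl_balanced:.4f}, KL skewed={kl_skewed:.4f}, PSI skewed={psi_skewed:.4f}")
```

Class frequencies are the coarsest possible view; the same two functions apply to binned feature or embedding statistics when label balance alone is not informative.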
For a comprehensive treatment of data selection metrics and benchmarking methodologies, including how initiatives like DataPerf are standardizing evaluation protocols, see @sec-benchmarking.
::: {.callout-lighthouse title="Lighthouse Data Selection"}
This chapter has applied data selection principles to all five Lighthouse Models, demonstrating that the techniques are universal but the priorities differ by bottleneck:
| **Lighthouse** | **Primary Bottleneck** | **Data Selection Priority** |
|:---------------------|:-----------------------|:---------------------------------------------------------------------------------|
| **ResNet-50** | Compute | Coreset selection directly reduces training FLOPs |
| **GPT-2/Llama** | Memory bandwidth | Deduplication reduces corpus size; curriculum learning improves token efficiency |
| **MobileNet** | Latency/Power | Aggressive augmentation compensates for reduced model capacity |
| **DLRM** | Memory capacity | Interaction deduplication and embedding pruning reduce table size |
| **Keyword Spotting** | Extreme constraints | Augmentation and synthesis create datasets from minimal seeds |
The common thread: **data selection is not a single technique but a systems optimization** tailored to whichever resource is most constrained.
:::
These measurement tools and lighthouse examples demonstrate what data selection can achieve when applied correctly. But the techniques involve counterintuitive trade-offs, and practitioners frequently fall into predictable traps. The following section catalogs the most common errors so that readers can avoid them.
## Fallacies and Pitfalls {#sec-data-selection-fallacies-pitfalls-f4d0}
Data selection involves counterintuitive diminishing returns that contradict the "more is better" intuition from traditional machine learning. The errors below fall into three groups: *conceptual fallacies* about what data selection can achieve, *implementation pitfalls* that arise when correct strategies meet engineering realities, and *transfer errors* that occur when benchmark results are applied uncritically to new domains.
```{python}
#| label: fp-scaling-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACIES AND PITFALLS CALCULATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Quantitative backing for F&P section claims
# │
# │ Goal: Provide quantitative backing for data selection misconceptions.
# │ Show: How curated small datasets can outperform raw 10× larger ones.
# │ How: Pre-compute comparative stats for accuracy per dollar.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: Various formatted strings for inline use
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
# --- Fallacy 1: Diminishing returns (scaling law based) ---
# Approximate scaling: accuracy ~ log(data_size)
# Doubling data from 1M to 2M adds less than doubling from 100K to 200K
data_1m_value = 1_000_000
data_10m_value = 10_000_000
# Illustrative: 10x data yields ~3-5% accuracy gain at scale
acc_gain_10x_value = 4.0
cost_10x_data_value = 9 # 9x more compute for 10x data (sublinear storage)
cost_efficiency_10x_value = acc_gain_10x_value / cost_10x_data_value
# Curated vs raw comparison
curated_size_value = 100_000
raw_size_value = 1_000_000
curated_accuracy_value = 92.0
raw_accuracy_value = 88.0
curated_cost_ratio_value = raw_size_value / curated_size_value
# --- Fallacy 2: Synthetic data / Model Collapse ---
# Research shows accuracy drops after training on model-generated data
synthetic_gen1_acc_value = 95.0 # First generation
synthetic_gen5_acc_value = 78.0 # After 5 generations (collapse)
synthetic_acc_drop_value = synthetic_gen1_acc_value - synthetic_gen5_acc_value
optimal_synthetic_mix_min_value = 50
optimal_synthetic_mix_max_value = 80
# --- Fallacy 4: Scale economics ---
training_run_cost_value = 100_000_000 # $100M
efficiency_gain_pct_value = 10
savings_value = training_run_cost_value * efficiency_gain_pct_value / 100
# --- Pitfall 1: Selection overhead ---
selection_time_bad_value = 10 # hours
training_time_value = 2 # hours for subset
full_training_time_value = 8 # hours for full dataset
selection_overhead_ratio_value = selection_time_bad_value / training_time_value
# Good scenario
selection_time_good_value = 0.5 # 30 minutes with proxy
selection_overhead_good_pct_value = selection_time_good_value / full_training_time_value * 100
# --- Pitfall 2: Rare class pruning ---
# Class imbalance scenario
total_samples_value = 1_000_000
rare_class_pct_value = 0.1 # 0.1% of data
rare_class_count_value = int(total_samples_value * rare_class_pct_value / 100)
coreset_pct_value = 10 # Keep 10%
expected_rare_in_coreset_value = int(rare_class_count_value * coreset_pct_value / 100)
min_samples_threshold_value = 50
# --- Pitfall 3: Deduplication leakage ---
# Studies show 3-15% test set contamination in web-scraped data
test_contamination_pct_value = 8
inflated_acc_value = 94.0
true_acc_value = 89.0
inflation_gap_value = inflated_acc_value - true_acc_value
# --- Pitfall 4: Active learning latency ---
al_latency_days_value = 14 # 2 weeks for expert labels
model_drift_epochs_value = 10
batch_size_small_value = 100
batch_size_large_value = 1000
# --- Fallacy 5: Benchmark transfer ---
cifar10_coreset_pct_value = 50
cifar10_acc_retained_value = 98
imagenet_coreset_pct_value = 50
imagenet_acc_retained_value = 95
medical_coreset_pct_value = 50
medical_acc_retained_value = 72
# --- Pitfall 5: Deployment metrics ---
coreset_size_pct_value = 10
ppd_score_value = 0.95
rare_class_acc_value = 45
majority_class_acc_value = 97
# --- Outputs (formatted strings) ---
# Fallacy 1
data_1m_str = fmt(data_1m_value / MILLION, precision=0) + "M"
data_10m_str = fmt(data_10m_value / MILLION, precision=0) + "M"
acc_gain_10x_str = fmt(acc_gain_10x_value, precision=0, commas=False)
curated_size_str = fmt(curated_size_value / 1e3, precision=0) + "K"
raw_size_str = fmt(raw_size_value / MILLION, precision=0) + "M"
curated_accuracy_str = fmt(curated_accuracy_value, precision=0, commas=False)
raw_accuracy_str = fmt(raw_accuracy_value, precision=0, commas=False)
curated_cost_ratio_str = fmt(curated_cost_ratio_value, precision=0, commas=False)
# Fallacy 2
synthetic_gen1_acc_str = fmt(synthetic_gen1_acc_value, precision=0, commas=False)
synthetic_gen5_acc_str = fmt(synthetic_gen5_acc_value, precision=0, commas=False)
synthetic_acc_drop_str = fmt(synthetic_acc_drop_value, precision=0, commas=False)
optimal_synthetic_mix_str = f"{optimal_synthetic_mix_min_value}–{optimal_synthetic_mix_max_value}"
# Fallacy 4
training_run_cost_str = fmt(training_run_cost_value / MILLION, precision=0) + "M"
efficiency_gain_pct_str = fmt(efficiency_gain_pct_value, precision=0, commas=False)
savings_str = fmt(savings_value / MILLION, precision=0) + "M"
# Pitfall 1
selection_time_bad_str = fmt(selection_time_bad_value, precision=0, commas=False)
training_time_str = fmt(training_time_value, precision=0, commas=False)
full_training_time_str = fmt(full_training_time_value, precision=0, commas=False)
selection_overhead_ratio_str = fmt(selection_overhead_ratio_value, precision=0, commas=False)
selection_time_good_str = fmt(selection_time_good_value * 60, precision=0, commas=False) # in minutes
selection_overhead_good_pct_str = fmt(selection_overhead_good_pct_value, precision=0, commas=False)
# Pitfall 2
total_samples_str = fmt(total_samples_value / MILLION, precision=0) + "M"
rare_class_pct_str = fmt(rare_class_pct_value, precision=1, commas=False)
rare_class_count_str = fmt(rare_class_count_value, precision=0, commas=True)
fp_coreset_pct_str = fmt(coreset_pct_value, precision=0, commas=False)
expected_rare_in_coreset_str = fmt(expected_rare_in_coreset_value, precision=0, commas=False)
min_samples_threshold_str = fmt(min_samples_threshold_value, precision=0, commas=False)
# Pitfall 3
test_contamination_pct_str = fmt(test_contamination_pct_value, precision=0, commas=False)
inflated_acc_str = fmt(inflated_acc_value, precision=0, commas=False)
true_acc_str = fmt(true_acc_value, precision=0, commas=False)
inflation_gap_str = fmt(inflation_gap_value, precision=0, commas=False)
# Pitfall 4
al_latency_days_str = fmt(al_latency_days_value, precision=0, commas=False)
model_drift_epochs_str = fmt(model_drift_epochs_value, precision=0, commas=False)
batch_size_small_str = fmt(batch_size_small_value, precision=0, commas=False)
batch_size_large_str = fmt(batch_size_large_value, precision=0, commas=True)
# Fallacy 5
cifar10_coreset_pct_str = fmt(cifar10_coreset_pct_value, precision=0, commas=False)
cifar10_acc_retained_str = fmt(cifar10_acc_retained_value, precision=0, commas=False)
imagenet_acc_retained_str = fmt(imagenet_acc_retained_value, precision=0, commas=False)
medical_acc_retained_str = fmt(medical_acc_retained_value, precision=0, commas=False)
# Pitfall 5
coreset_size_pct_str = fmt(coreset_size_pct_value, precision=0, commas=False)
rare_class_acc_str = fmt(rare_class_acc_value, precision=0, commas=False)
majority_class_acc_str = fmt(majority_class_acc_value, precision=0, commas=False)
```
**Fallacy:** *Data is the new oil, so more is always better.*
Engineers assume linear returns from data scaling: 10 $\times$ more data should yield proportional accuracy gains. In reality, the ICR framework (@sec-data-selection-informationcompute-ratio-8c0b) reveals severe diminishing returns. Scaling from `{python} data_1m_str` to `{python} data_10m_str` samples typically yields only `{python} acc_gain_10x_str` percentage points of accuracy gain while incurring 9 $\times$ the compute cost. @tbl-scaling-asymmetry quantifies the asymmetry: GPU compute grows 10 $\times$ every 3 years while high-quality data grows only 2 $\times$ every 5 years. A curated `{python} curated_size_str` dataset achieving `{python} curated_accuracy_str`% accuracy often outperforms a raw `{python} raw_size_str` dataset at `{python} raw_accuracy_str`%, despite `{python} curated_cost_ratio_str` $\times$ fewer samples. Teams that blindly scale data budgets waste compute on redundant samples that contribute near-zero gradient signal.
**Fallacy:** *Synthetic data can completely replace real data.*
Engineers assume generative models produce unlimited training data at marginal cost. However, @sec-data-selection-bridging-domain-gap-100b and @tbl-synthetic-mix establish that synthetic-only training suffers from model collapse: accuracy degrades from `{python} synthetic_gen1_acc_str`% to `{python} synthetic_gen5_acc_str`% after five generations of training on model-generated data, a `{python} synthetic_acc_drop_str`-point drop. Synthetic data is bounded by the generator's knowledge and amplifies distributional errors through feedback loops. @fig-domain-gap illustrates why: the synthetic distribution diverges from the real distribution, causing the learned decision boundary to misclassify real-world inputs. Optimal ratios are `{python} optimal_synthetic_mix_str`% synthetic mixed with 20–50% real data; pure synthetic training fails catastrophically on deployment distributions.
**Fallacy:** *Data selection is just data cleaning.*
Engineers conflate data quality (removing errors) with data value (maximizing ICR). A perfectly clean dataset can still be highly inefficient if filled with redundant, easy examples far from the decision boundary. @fig-coreset-selection illustrates the distinction: random sampling selects uniformly, wasting budget on samples deep within class regions. Coreset selection (@sec-data-selection-coreset-selection-algorithms-2c74) prioritizes samples near the decision boundary where uncertainty is highest. EL2N and GraNd methods (@tbl-coreset-comparison) achieve 1.8 $\times$ higher ICR than random sampling by focusing on informative samples, not just clean ones. Cleaning addresses label errors; selection optimizes information content per FLOP.
**Fallacy:** *Data selection is only for resource-constrained settings.*
Practitioners view data selection as relevant only for TinyML or budget-limited startups. In reality, data selection delivers maximum ROI at scale. A `{python} efficiency_gain_pct_str`% efficiency gain on a \$`{python} training_run_cost_str` training run saves \$`{python} savings_str`. The Data Wall (@fig-running-out-of-human-data) affects frontier labs most acutely: they have compute but lack quality data. @sec-data-selection-amortization-across-training-runs-f6b8 shows that amortized ROI grows with reuse: deduplication infrastructure yielding \$10K savings per run becomes highly profitable across 50 training runs. Organizations with "unlimited" budgets increasingly adopt data selection because it addresses their true bottleneck: high-quality training data, not GPU hours.
These conceptual misunderstandings often lead to flawed strategies. Equally damaging are the implementation pitfalls that arise when correct strategies meet messy engineering realities.
**Pitfall:** *Optimizing selection without measuring selection overhead.*
A sophisticated coreset algorithm requiring `{python} selection_time_bad_str` hours to select samples for a `{python} training_time_str`-hour training run has `{python} selection_overhead_ratio_str` $\times$ overhead, yielding negative ROI. @sec-data-selection-selection-bottleneck-4d00 establishes the Selection Inequality: $T_{selection} + T_{train}(subset) < T_{train}(full)$. If full training takes `{python} full_training_time_str` hours and subset training takes `{python} training_time_str` hours, selection breaks even only when it costs less than the difference between the two; as a practical target, keep selection under 10% of full training time so that most of the savings survive. Use lightweight proxy models (ResNet-18 for 5 epochs instead of ResNet-50 for 100) or cached embeddings: proxy-based EL2N scoring completes in `{python} selection_time_good_str` minutes (`{python} selection_overhead_good_pct_str`% overhead), satisfying the inequality while achieving comparable selection quality.
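The Selection Inequality is simple enough to check before committing to a selection method. The sketch below plugs in the hours from the scenario above; the function names are hypothetical.

```python
def selection_pays_off(t_selection, t_train_subset, t_train_full):
    """Selection Inequality: T_selection + T_train(subset) < T_train(full)."""
    return t_selection + t_train_subset < t_train_full

def net_hours_saved(t_selection, t_train_subset, t_train_full):
    """Positive when selection saves wall-clock time, negative when it costs time."""
    return t_train_full - (t_selection + t_train_subset)

# Hours, taken from the scenario in the text.
T_TRAIN_SUBSET = 2.0
T_TRAIN_FULL = 8.0

expensive_saved = net_hours_saved(10.0, T_TRAIN_SUBSET, T_TRAIN_FULL)  # full-model scoring
proxy_saved = net_hours_saved(0.5, T_TRAIN_SUBSET, T_TRAIN_FULL)       # proxy-model scoring

print(f"10 h selection:  {expensive_saved:+.1f} h, pays off: "
      f"{selection_pays_off(10.0, T_TRAIN_SUBSET, T_TRAIN_FULL)}")
print(f"0.5 h selection: {proxy_saved:+.1f} h, pays off: "
      f"{selection_pays_off(0.5, T_TRAIN_SUBSET, T_TRAIN_FULL)}")
```

This treats a single training run in isolation; when the same selected subset is reused across several runs, the selection cost is paid once and the break-even point moves accordingly.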
**Pitfall:** *Pruning rare classes into oblivion.*
Aggressive coreset selection can remove rare classes entirely because they contribute little to average loss. In a `{python} total_samples_str` dataset with `{python} rare_class_pct_str`% rare class samples (`{python} rare_class_count_str` examples), even a uniformly sampled `{python} fp_coreset_pct_str`% coreset retains only `{python} expected_rare_in_coreset_str` rare examples on average, and selection criteria weighted toward average loss can push that count below the `{python} min_samples_threshold_str`-sample minimum needed for reliable learning. @sec-data-selection-coreset-selection-algorithms-2c74 recommends stratified selection: set minimum samples per class before applying pruning, ensuring rare classes retain sufficient representation. Production models failing on rare cases despite excellent average accuracy often trace to this pitfall.
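One way to implement a per-class floor is to reserve the minimum for every class first and only then fill the remaining budget by score. The sketch below is one possible realization, not the algorithm from the referenced section; the function name, the toy labels, and the scores are all hypothetical.

```python
from collections import defaultdict

def stratified_select(labels, scores, keep_fraction, min_per_class):
    """Keep the top-scoring samples while guaranteeing a floor per class.

    labels: class label per sample; scores: importance per sample (higher = keep).
    Returns sorted indices of the selected samples.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = set()
    # Step 1: reserve the highest-scoring min_per_class samples of every class.
    for indices in by_class.values():
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:min_per_class])

    # Step 2: fill the remaining budget by global score.
    budget = max(int(len(labels) * keep_fraction), len(selected))
    for idx in sorted(range(len(labels)), key=lambda i: scores[i], reverse=True):
        if len(selected) >= budget:
            break
        selected.add(idx)
    return sorted(selected)

# Toy data: 980 common samples with high scores, 20 rare samples with low scores.
labels = ["common"] * 980 + ["rare"] * 20
scores = [1.0 + (i % 97) / 97 for i in range(980)] + [0.1] * 20

selected = stratified_select(labels, scores, keep_fraction=0.10, min_per_class=10)
rare_kept = sum(1 for i in selected if labels[i] == "rare")

# Baseline without a floor: take the global top 10% by score.
naive = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)[:100]
naive_rare = sum(1 for i in naive if labels[i] == "rare")
print(f"stratified keeps {rare_kept} rare samples, naive top-k keeps {naive_rare}")
```

If the floors alone exceed the budget, this sketch lets the coreset grow rather than dropping a class, which favors coverage over strict size control.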
**Pitfall:** *Training on deduplicated data while evaluating on duplicated test sets.*
Web-scraped datasets contain `{python} test_contamination_pct_str`% train-test overlap on average. Deduplicating only within the training data leaves that overlap in place, so apparent test accuracy reads `{python} inflated_acc_str`% when true generalization is `{python} true_acc_str`%, a `{python} inflation_gap_str`-point gap that masks real-world performance. @sec-data-selection-data-deduplication-6c20 emphasizes joint deduplication: hash both train and test sets, removing any sample appearing in both. As noted in that section, GPT-3 and LLaMA training studies confirm that deduplicated data improves both efficiency and generalization. Teams observing accuracy drops after joint deduplication are seeing a more honest estimate of generalization, not a degraded model.
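Joint deduplication by hashing can be sketched as follows. This minimal version detects exact duplicates only, after lowercasing and whitespace normalization; near-duplicate detection needs fuzzier techniques and is out of scope here. Dropping overlapping samples from the training side, so that the test set stays fixed, is a design choice assumed for this example, and the function names are hypothetical.

```python
import hashlib

def content_hash(text):
    """Hash of whitespace- and case-normalized text (exact-duplicate detection only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def joint_deduplicate(train, test):
    """Deduplicate test, then drop train samples that repeat or overlap with test."""
    test_hashes, clean_test = set(), []
    for sample in test:
        h = content_hash(sample)
        if h not in test_hashes:
            test_hashes.add(h)
            clean_test.append(sample)

    # Seeding `seen` with the test hashes removes train/test overlap in the same pass.
    seen, clean_train = set(test_hashes), []
    for sample in train:
        h = content_hash(sample)
        if h not in seen:
            seen.add(h)
            clean_train.append(sample)
    return clean_train, clean_test

train = ["The cat sat.", "the  cat sat.", "Dogs bark loudly.", "Birds sing."]
test = ["Birds   sing.", "Fish swim.", "Fish swim."]
clean_train, clean_test = joint_deduplicate(train, test)
print(clean_train)
print(clean_test)
```

Here "Birds sing." is removed from training because it also appears in the test set, which is exactly the leakage that train-only deduplication misses.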
**Pitfall:** *Active learning without considering annotation latency.*
Active learning theory assumes instant oracle responses. @sec-data-selection-active-learning-humanintheloop-6932 notes that expert labels require days or weeks in practice. With `{python} al_latency_days_str`-day annotation latency, a model trained for `{python} model_drift_epochs_str` epochs between query rounds may have drifted significantly: samples selected as uncertain become irrelevant as the decision boundary shifts. Select larger batches (`{python} batch_size_large_str` vs `{python} batch_size_small_str` samples) to amortize latency and use diversity sampling to hedge against model drift. @sec-data-selection-breakeven-analysis-9a38 shows that active learning ROI depends critically on matching batch size to annotation turnaround time.
A subtler class of errors emerges when practitioners assume that benchmark results transfer directly to their specific domains and deployment contexts.
**Fallacy:** *If a technique works on ImageNet, it will work on my dataset.*
Benchmark papers report impressive results, but data selection effectiveness depends critically on dataset redundancy. CIFAR-10 is highly redundant: `{python} cifar10_coreset_pct_str`% coresets retain `{python} cifar10_acc_retained_str`% accuracy. ImageNet has moderate redundancy: the same coreset retains `{python} imagenet_acc_retained_str`% accuracy. Domain-specific datasets (medical imaging, satellite imagery) have near-zero redundancy: `{python} cifar10_coreset_pct_str`% coresets may retain only `{python} medical_acc_retained_str`% accuracy because every sample captures unique diagnostic information. @sec-data-selection-computeoptimal-frontier-24ba and @fig-ppd-curve show how the compute-optimal frontier varies by dataset structure. Start with conservative pruning (20–30%) and validate on held-out data before aggressive reduction; the "free lunch" ratios from benchmark papers rarely transfer directly.
**Pitfall:** *Optimizing data selection metrics instead of deployment metrics.*
A team creates a `{python} coreset_size_pct_str`% coreset with excellent PPD and DCR scores, but the model fails catastrophically on production edge cases: `{python} majority_class_acc_str`% accuracy on majority classes but only `{python} rare_class_acc_str`% on rare subgroups. @sec-data-selection-core-metrics-32e9 defines efficiency metrics, but @sec-data-selection-measurement-framework-733b emphasizes that deployment success requires stratified evaluation. If the task demands 99.9% reliability on edge cases, the coreset must *oversample* those cases even at the cost of reduced average PPD. Include demographic subgroups, rare classes, and failure-mode coverage in selection optimization. The goal is deployment success, not benchmark efficiency.
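Stratified evaluation can be as simple as reporting accuracy per subgroup alongside the aggregate. The following sketch assumes each evaluation sample carries a subgroup tag; the function names and the toy labels are hypothetical and are not drawn from the chapter's measurement framework.

```python
from collections import defaultdict


def stratified_accuracy(y_true, y_pred, groups):
    """Return accuracy per subgroup as a {group: accuracy} dictionary."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def worst_group(per_group):
    """Return the (group, accuracy) pair with the lowest accuracy."""
    return min(per_group.items(), key=lambda kv: kv[1])


# Toy evaluation set: six majority-class samples and two rare-subgroup samples.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
groups = ["majority"] * 6 + ["rare"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = stratified_accuracy(y_true, y_pred, groups)
print(overall)                 # 0.75
print(per_group)               # {'majority': 1.0, 'rare': 0.0}
print(worst_group(per_group))  # ('rare', 0.0)
```

The aggregate of 0.75 hides a complete failure on the rare subgroup, which is why worst-group accuracy, not average accuracy, is the number to check before accepting a coreset.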
These fallacies and pitfalls share a common thread: they arise when practitioners treat data selection as a purely algorithmic exercise divorced from the systems context in which it operates. With that caution in mind, we can now consolidate the chapter's key ideas.
## Summary {#sec-data-selection-summary-cf3e}
This chapter opened with a question: why do smaller, curated datasets sometimes outperform massive ones? The answer lies in recognizing data selection as a *systems* problem rather than a purely statistical one. It is the "D" pillar of the D·A·M taxonomy and addresses the first question in the optimization ordering: can we reduce the work before it begins? Where traditional machine learning asks "how few samples achieve target accuracy?", the systems perspective asks "how do we minimize total cost across the entire pipeline?"
This reframing transforms how practitioners approach the ML development lifecycle. The shift introduced in the Purpose section, from accumulating data as a massive liability to curating it as a precise resource, becomes actionable through the ICR metric, the Selection Inequality, and the cost modeling framework. The goal is minimizing total cost across compute, storage, labeling, energy, and time, not merely maximizing accuracy.
We explored the three-stage optimization pipeline: **Static Pruning** removes redundancy before training through coreset selection and deduplication, **Dynamic Selection** prioritizes informative examples during training through curriculum and active learning, and **Synthetic Generation** creates data where none exists through augmentation, simulation, and distillation. Together, these strategies address the "Data Wall," the structural asymmetry between exponentially growing compute and slowly growing high-quality data.
The self-supervised learning paradigm represents a ceiling of data selection: by eliminating task-specific labels entirely, foundation models achieve 1,000 $\times$ multipliers on downstream tasks through cost amortization. This structural transformation from "train from scratch" to "pre-train once, fine-tune many" has become the dominant approach in production ML precisely because of its superior data economics.
Translating these techniques into production requires systems engineering: the Selection Inequality ($T_{selection} + T_{train}(subset) < T_{train}(full)$) gates every technique, proxy models and shard-based data loaders reconcile selection algorithms with storage hardware, and data echoing maximizes GPU utilization when pipelines become the bottleneck. The cost modeling framework (total data cost, ROI analysis, and break-even thresholds) provides the quantitative tools to evaluate which techniques merit investment for a given workload, while core metrics (PPD, AULC, DCR) and the compute-optimal frontier diagnostic help practitioners determine whether their training is data-starved or compute-starved.
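The Selection Inequality lends itself to a quick feasibility check before committing to a selection method. The sketch below is illustrative only: the function names and the hour figures are hypothetical, all times are assumed to share one unit, and the chapter's guideline that selection overhead stay below 10% of training time is assumed here to be measured against full-dataset training time.

```python
def selection_pays_off(t_selection, t_train_subset, t_train_full):
    """Selection Inequality: T_selection + T_train(subset) < T_train(full)."""
    return t_selection + t_train_subset < t_train_full


def net_savings(t_selection, t_train_subset, t_train_full):
    """Time saved by selecting; negative means selection cost more than it saved."""
    return t_train_full - (t_selection + t_train_subset)


def overhead_fraction(t_selection, t_train_full):
    """Selection time as a fraction of full training time (assumed baseline)."""
    return t_selection / t_train_full


# Hypothetical GPU-hour figures for a cheap proxy-model selection pass.
cheap = dict(t_selection=2.0, t_train_subset=40.0, t_train_full=100.0)
# Hypothetical figures for an expensive selection algorithm on the same job.
costly = dict(t_selection=70.0, t_train_subset=40.0, t_train_full=100.0)

print(selection_pays_off(**cheap), net_savings(**cheap))    # True 58.0
print(selection_pays_off(**costly), net_savings(**costly))  # False -10.0
print(overhead_fraction(2.0, 100.0) < 0.10)                 # True
```

The second scenario shows how an expensive selection algorithm can erase the benefit of a smaller subset: training on 40% of the data still loses once selection itself costs 70 hours.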
::: {.callout-takeaways title="Curate, Don't Accumulate"}
* **Data selection is a systems problem, and ICR is its metric**: The goal is reduced total cost across the entire pipeline (compute, storage, labeling, energy), not just "fewer samples for same accuracy." The Information-Compute Ratio quantifies this: learning gained per FLOP spent. Maximizing ICR is mathematically equivalent to improving hardware throughput, but often cheaper to achieve.
* **Start with deduplication**: It is the only data selection technique with guaranteed zero accuracy penalty and immediate compute savings. Deduplication should precede all other selection methods in any data pipeline.
* **The Selection Inequality gates every technique**: $T_{selection} + T_{train}(subset) < T_{train}(full)$. Selection overhead should stay below 10% of training time. Proxy models and cached embeddings keep $T_{selection}$ low; expensive selection algorithms can consume all the savings they promise.
* **Dynamic selection adapts the data diet as the model learns**: Curriculum learning (easy-to-hard ordering) and active learning (uncertainty-guided labeling) exploit the insight that the optimal training distribution changes during training, improving convergence speed and label efficiency respectively.
* **Self-supervised pre-training delivers a 1,000 $\times$ labeled-data multiplier**: Foundation models amortize expensive pre-training across many downstream tasks, reducing per-task label requirements by 100 $\times$ and marginal compute by 20 $\times$. This cost amortization is strongest when techniques are reused across teams and training runs.
* **Synthetic data is a supplement, not a replacement**: Mixing 50–80% synthetic with 20–50% real data typically yields the best results. Pure synthetic training risks model collapse and domain gap degradation.
* **Data selection sits at the head of the D·A·M optimization stack**: Savings from data selection are multiplicative with downstream optimizations (model compression, hardware acceleration). Every FLOP eliminated upstream is a FLOP that never needs to be compressed or accelerated.
:::
The techniques explored throughout this chapter (deduplication, coreset selection, curriculum learning, active learning, and synthetic generation) provide practitioners with a systematic toolkit for breaking through the Data Wall. Organizations that master these techniques gain compound advantages: reduced labeling budgets, faster iteration cycles, lower storage costs, and models that generalize better because they learn from higher-quality examples rather than redundant noise.
::: {.callout-chapter-connection title="From Data to Algorithms"}
With high-quality data in hand, we have optimized the source of the system. Even the best data, however, cannot make an inefficient model run fast on constrained hardware. In @sec-model-compression, we move from optimizing *what* the system learns to optimizing *how* it represents that knowledge, applying pruning, quantization, and knowledge distillation to reduce the computational cost of the model artifact itself.
:::
<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::