mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-04-30 17:48:27 -05:00
style: Vol1 full SQS Phase 3 pass — prose quality, LEGO headers, locality
- Remove AI-pattern phrases: leverage/leverages (40+), utilize (10+), powerful (30+)
- Eliminate recap-style openers: 'With X established...' pattern (25+ instances)
- Fix sentence-initial coordinating conjunctions: But/And/So in body prose
- Replace vague intensifiers: very/significantly/somewhat → quantitative language
- Standardize LEGO headers: old P.I.C.O. naming → LOAD/EXECUTE/GUARD/OUTPUT (introduction, ml_workflow, benchmarking, responsible_engr)
- Fix unit spacing: 80ms→80 ms, 40GB→40 GB, 3GB→3 GB (training, hw_acceleration)
- Correct hyphen-as-en-dash in numeric ranges: 1-2%→1--2%, 50-200 ms→50--200 ms
- Convert bold paragraph starters in body prose to flowing paragraphs (ml_ops: data consistency/freshness/quality; training: Flash Attention conditions)
- Rewrite abstract section openers with concrete scenarios
- Fix contractions in body prose (doesn't/isn't/wasn't → expanded forms)
- Add end_chapter bookend to ml_workflow (was missing)
- Add end_chapter to mlsys/registry.py and export from __init__.py
- Standardize LEGO cell GUARD sections where missing (noted for author pass)
@@ -97,7 +97,7 @@ from mlsys.formatting import fmt, sci
class BenchmarkingSetup:
    """Chapter-wide hardware and model constants for all benchmarking sections and callouts."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # Hardware specs from Digital Twins
    _a100_fp16 = Hardware.A100.peak_flops.m_as(TFLOPs/second)
    _a100_fp32 = V100_FLOPS_FP32.m_as(TFLOPs/second)  # V100 constant for fp32 baseline
@@ -132,7 +132,7 @@ class BenchmarkingSetup:
    _accel_power_w = 300
    _lat_fast_ms = 10
    _lat_slow_ms = 100
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    _a100_ridge = Hardware.A100.ridge_point().m_as('flop/byte')
    _mv1_size_mb = _mobilenet_v1_m * 4
    _mv1_int8_mb = _mobilenet_v1_m * 1
@@ -143,7 +143,7 @@ class BenchmarkingSetup:
    _e_slow_j = _accel_power_w * (_lat_slow_ms / 1000)
    _e_slow_wh = _e_slow_j / SEC_PER_HOUR
    _gpt3_b_str = fmt(_gpt3_params_b, precision=0, commas=False)
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    a100_tflops_fp16_str = fmt(_a100_fp16, precision=0, commas=False)
    a100_tflops_fp32_str = fmt(_a100_fp32, precision=1, commas=False)
    a100_bw_gbs_str = fmt(_a100_bw_gbs, precision=0, commas=True)
@@ -408,7 +408,7 @@ Individual organizations learned these lessons independently—often painfully
A team evaluating edge deployment hardware needs to compare five different SoCs for a smart camera product. Vendor A reports 8 TOPS at INT8; Vendor B reports 15 TOPS at INT4; Vendor C reports inference latency on a proprietary model; Vendor D cites MLPerf scores from two generations ago; Vendor E provides only peak throughput at maximum batch size. None of these numbers are comparable. The team cannot make a procurement decision because every vendor measured a different thing, under different conditions, using different definitions of "performance." This fragmentation—not a lack of data, but a lack of *commensurable* data—is precisely the problem that benchmarking suites exist to solve.

-The preceding section established three lessons from benchmark history—representative workloads, multi-objective evaluation, and integrated measurement—and identified the additional challenge unique to ML: inherent probabilistic variability. Modern benchmarking suites encode these lessons into standardized frameworks that make the kind of cross-organization comparison our hardware procurement team needs actually possible.
+Three lessons from benchmark history—representative workloads, multi-objective evaluation, and integrated measurement—converge with the challenge unique to ML: inherent probabilistic variability. Modern benchmarking suites encode these lessons into standardized frameworks that make the kind of cross-organization comparison our hardware procurement team needs actually possible.

ML benchmarks must evaluate the interplay between algorithms, hardware, and data, not merely computational efficiency alone. Early benchmarks focused on algorithmic performance [@lecun1998gradient], but scaling demands expanded the focus to hardware efficiency [@jouppi2017datacenter], and high-profile deployment failures elevated data quality as a third evaluation dimension [@gebru2021datasheets]. This probabilistic nature elevates accuracy to a first-class evaluation dimension alongside speed and energy consumption: the same ML system can produce different results depending on the data it encounters. Energy efficiency cuts across all three framework dimensions, since algorithmic choices affect computational complexity, hardware capabilities determine energy-performance trade-offs, and dataset characteristics influence training energy costs [@hernandez2020measuring].
@@ -559,14 +559,14 @@ from mlsys.constants import A100_MEM_BW, A100_FLOPS_FP16_TENSOR, TB, TFLOPs, sec
from mlsys.constants import BILLION, MILLION
from mlsys.formatting import fmt, check

-# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
+# ┌── LEGO ───────────────────────────────────────────────
class RooflineExamples:
    """
    Namespace for Roofline Analysis Examples (ResNet vs BERT).
    Scenario: Comparing compute-bound vs memory-bound workloads on A100.
    """

-    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # A100 Specs (re-derived locally for safety)
    peak_flops = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs/second)
    peak_bw = A100_MEM_BW.m_as(TB/second)
@@ -582,7 +582,7 @@ class RooflineExamples:
    bert_weight_mb = 440.0
    bert_util_peak = 0.85

-    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # ResNet Performance
    resnet_perf_tflops = peak_flops * (resnet_util_max / 100.0)
@@ -591,11 +591,11 @@ class RooflineExamples:
    bert_perf_b1 = bert_ai_b1 * peak_bw
    bert_util_b1 = (bert_perf_b1 / peak_flops) * 100.0

-    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
+    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(resnet_ai > ridge_point, f"ResNet AI ({resnet_ai}) must be > Ridge ({ridge_point:.0f}) to be compute-bound.")
    check(bert_ai_b1 < ridge_point, f"BERT AI ({bert_ai_b1:.0f}) must be < Ridge ({ridge_point:.0f}) to be memory-bound.")

-    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    # A100 context
    a100_tflops_fp16_str = fmt(peak_flops, precision=0, commas=False)
    a100_bw_tbs_str = fmt(peak_bw, precision=1, commas=False)
@@ -672,14 +672,14 @@ from mlsys.constants import (
)
from mlsys.formatting import fmt, check

-# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
+# ┌── LEGO ───────────────────────────────────────────────
class BertRoofline:
    """
    Namespace for BERT Roofline Calculation.
    Scenario: Comparing Batch-1 (Memory Bound) vs Batch-32 (Shift to Compute).
    """

-    # ┌── 1. PARAMETERS (Inputs from Twins) ───────────────────────────────────
+    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Model
    m_bert = Models.Language.BERT_Base
    params_m = m_bert.parameters.m_as(Mparam)
@@ -697,7 +697,7 @@ class BertRoofline:
    batch_32 = 32
    util_peak = 0.85

-    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Batch 1
    ai_b1 = (flops_b_per_inf * BILLION) / (weight_mb * MILLION)
    perf_b1 = ai_b1 * peak_bw
@@ -715,12 +715,12 @@ class BertRoofline:
    # Performance at Batch 32 (capped by compute if AI > Ridge)
    perf_b32 = peak_flops * util_peak if is_compute_bound_b32 else (ai_b32 * peak_bw)

-    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
+    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(ai_b32 > ai_b1, "Batching must increase Arithmetic Intensity.")
    if ai_b32 < 1000:  # Sanity check, should be huge (50 * 32 = 1600)
        pass

-    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    bert_params_m_str = fmt(params_m, precision=0, commas=False)
    bert_flops_b_str = fmt(flops_b_per_inf, precision=0, commas=False)
    bert_weight_mb_str = fmt(weight_mb, precision=0, commas=False)
@@ -922,7 +922,7 @@ Community standards ensure reproducibility, but they do not prescribe the level
## Benchmarking Granularity {#sec-benchmarking-benchmarking-granularity-3855}

-Community-driven standardization establishes common measurement protocols, but a second design decision remains: at what level of detail should evaluation occur? Standardization answers "how do we measure consistently?" while granularity answers "what exactly do we measure?" Each validation dimension can be assessed at different scales, from individual operations to complete workflows, with each granularity level revealing different kinds of problems:
+A GPU kernel that runs 3$\times$ faster in isolation may deliver zero end-to-end speedup if the data pipeline cannot keep pace. This diagnostic failure illustrates a fundamental design question: at what level of detail should evaluation occur? Standardization answers "how do we measure consistently?" while granularity answers "what exactly do we measure?" Each validation dimension can be assessed at different scales, from individual operations to complete workflows, with each granularity level revealing different kinds of problems:

\index{Micro Benchmarks!component isolation}
\index{Macro Benchmarks!subsystem evaluation}
@@ -1190,7 +1190,7 @@ Choosing a granularity level, however, is only half the design problem. The othe
## Benchmark Components {#sec-benchmarking-benchmark-components-97cc}

-The granularity level established above shapes how benchmark components are instantiated. Micro-benchmarks require synthetic inputs that isolate specific computational patterns; macro-benchmarks demand representative datasets like ImageNet; end-to-end benchmarks must incorporate real-world data with all its noise and distributional shift. Evaluation metrics similarly shift focus: FLOPS and memory bandwidth at the micro level, accuracy and inference speed at the macro level, system reliability and operational efficiency at the end-to-end level. Despite this variation, all benchmarks share common implementation components that enable consistent evaluation.
+Choosing between micro, macro, and end-to-end granularity determines what a benchmark can diagnose, but every benchmark at every granularity must still answer the same implementation questions: what task are we measuring, on what data, with which model, against which metrics, and under what rules? Micro-benchmarks require synthetic inputs that isolate specific computational patterns; macro-benchmarks demand representative datasets like ImageNet; end-to-end benchmarks must incorporate real-world data with all its noise and distributional shift. Despite this variation, all benchmarks share common implementation components that enable consistent evaluation.

The essential components interconnect to form a complete evaluation pipeline. Study the workflow in @fig-benchmark-components carefully: each stage—task definition, dataset selection, model selection, and evaluation metrics—feeds directly into the next, creating a chain where decisions made early constrain every downstream choice.
@@ -1536,7 +1536,7 @@ Neural network compression—pruning, quantization, knowledge distillation, and
The most basic compression metric is raw size reduction: parameter count, memory footprint in bytes, and compressed storage requirements. But size alone is misleading. MobileNetV2 achieves approximately 72% ImageNet top-1 accuracy with `{python} mobilenet_params_m_str` million parameters versus ResNet-50's 76% accuracy with `{python} resnet50_params_m_str` million parameters—a 7.5$\times$ efficiency improvement in the parameter-to-accuracy ratio that matters far more than raw parameter counts.

-Pruning benchmarks must distinguish between structured and unstructured approaches, because they behave very differently on real hardware. Structured pruning removes entire neurons or filters, achieving consistent speedups but typically lower compression ratios (2--4$\times$). Unstructured pruning eliminates individual weights for higher compression ratios (10--100$\times$), but realizing actual speedups requires specialized sparse computation support—meaning benchmark protocols must specify hardware platform and software implementation.
+Pruning benchmarks must distinguish between structured and unstructured approaches, because they produce qualitatively different results on real hardware. Structured pruning removes entire neurons or filters, achieving consistent speedups but typically lower compression ratios (2--4$\times$). Unstructured pruning eliminates individual weights for higher compression ratios (10--100$\times$), but realizing actual speedups requires specialized sparse computation support—meaning benchmark protocols must specify hardware platform and software implementation.

\index{Mixed-Precision!layer-wise precision assignment}
\index{Knowledge Distillation!benchmark evaluation}
@@ -1611,9 +1611,9 @@ Whether benchmarking cloud servers or microcontrollers, however, a critical dist
## Training vs. Inference {#sec-benchmarking-training-vs-inference-evaluation-a3be}

-Training and inference pursue fundamentally different objectives, and these contrasting goals create evaluation requirements so different that separate benchmarking frameworks emerged for each: MLPerf Training and MLPerf Inference. The following sections detail how each framework validates the hardware acceleration claims from preceding chapters, revealing whether theoretical TFLOPS translate to practical time-to-train or queries-per-second. Training seeks optimal parameters through iterative refinement (@sec-model-training), processing billions of examples over hours or days, stressing memory bandwidth, multi-GPU scaling, and sustained throughput. Inference applies those parameters to individual inputs under deployment strategies (@sec-ml-operations), often within millisecond deadlines, stressing latency consistency, cold-start time, and power efficiency.
+Training and inference pursue fundamentally different objectives, and these contrasting goals create evaluation requirements so different that separate benchmarking frameworks emerged for each: MLPerf Training and MLPerf Inference. The critical question is whether theoretical TFLOPS translate to practical time-to-train or queries-per-second. Training seeks optimal parameters through iterative refinement (@sec-model-training), processing billions of examples over hours or days, stressing memory bandwidth, multi-GPU scaling, and sustained throughput. Inference applies those parameters to individual inputs under deployment strategies (@sec-ml-operations), often within millisecond deadlines, stressing latency consistency, cold-start time, and power efficiency.

-The differences cascade through every aspect of system design. Training involves bidirectional computation (forward and backward passes), while inference performs single forward passes with fixed parameters. Memory allocation diverges sharply: training requires simultaneous access to parameters, gradients, optimizer states, and activations, creating 3--4$\times$ memory overhead compared to inference. Training employs mixed-precision computation and gradient compression to manage this overhead, while inference leverages more aggressive precision reduction (detailed in @sec-benchmarking-inference-metrics-78d4) and techniques like post-training quantization and knowledge distillation. Resource utilization patterns also contrast: training targets sustained GPU saturation, whereas inference contends with variable request patterns that leave hardware underutilized, as the roofline analysis in @sec-benchmarking-system-benchmarks-393c demonstrated.
+The differences cascade through every aspect of system design. Training involves bidirectional computation (forward and backward passes), while inference performs single forward passes with fixed parameters. Memory allocation diverges sharply: training requires simultaneous access to parameters, gradients, optimizer states, and activations, creating 3--4$\times$ memory overhead compared to inference. Training employs mixed-precision computation and gradient compression to manage this overhead, while inference uses more aggressive precision reduction (detailed in @sec-benchmarking-inference-metrics-78d4) and techniques like post-training quantization and knowledge distillation. Resource utilization patterns also contrast: training targets sustained GPU saturation, whereas inference contends with variable request patterns that leave hardware underutilized, as the roofline analysis in @sec-benchmarking-system-benchmarks-393c demonstrated.

Energy costs follow different patterns. Training energy costs are amortized across model lifetime and measured in total energy per trained model; estimates for large training runs can reach the scale of thousands of megawatt-hours (GPT-3 has been estimated at roughly 1,287 MWh) [@patterson2021carbon]. Inference energy costs accumulate per query and can become a dominant operational consideration at scale. A durable way to reason about per-query energy is the identity \(E = P \times t\). For example, a `{python} accel_power_w_str` W accelerator running a `{python} latency_fast_ms_str` ms inference consumes \(`{python} accel_power_w_str` \times 0.01 = `{python} energy_fast_j_str`\) joules, which is about \(`{python} energy_fast_wh_str`\) Wh; at `{python} latency_slow_ms_str` ms, that becomes about \(`{python} energy_slow_wh_str`\) Wh.
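To make the per-query identity concrete, here is a minimal standalone sketch of the \(E = P \times t\) arithmetic that the paragraph above renders from its LEGO cell. The 300 W accelerator and the 10 ms / 100 ms latencies mirror the constants defined in `BenchmarkingSetup` earlier in this diff; everything else is illustrative rather than the chapter's actual code.

```python
# Per-query inference energy from E = P * t (illustrative sketch).
SEC_PER_HOUR = 3600

def energy_per_query(power_w: float, latency_ms: float):
    """Return (joules, watt-hours) consumed by a single inference."""
    joules = power_w * (latency_ms / 1000.0)
    return joules, joules / SEC_PER_HOUR

for lat_ms in (10, 100):  # "fast" and "slow" latencies from the chapter constants
    j, wh = energy_per_query(300, lat_ms)
    print(f"{lat_ms:>3} ms @ 300 W -> {j:.1f} J ({wh:.5f} Wh)")
```

At 10 ms this gives 3 J (about 0.00083 Wh) and at 100 ms about 30 J (0.0083 Wh), matching the figures the prose reports.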
@@ -1623,7 +1623,7 @@ This comparative framework guides benchmark design by highlighting which metrics
\index{Training Benchmarks!convergence throughput scalability}
\index{Training Benchmarks!definition}
-Training benchmarks divide into three categories: convergence metrics that measure learning progress, throughput metrics that measure computational efficiency, and scalability metrics that measure distributed performance. We examine each in turn.
+A team purchases a \$10M GPU cluster expecting 5$\times$ the training speed of their \$2M setup, only to discover that communication overhead and memory bottlenecks limit the actual speedup to 2.8$\times$. Training benchmarks exist to catch this kind of gap before procurement. They divide into three categories: convergence metrics that measure learning progress, throughput metrics that measure computational efficiency, and scalability metrics that measure distributed performance.

Training benchmarks validate whether hardware acceleration delivers promised training throughput. The GPU clusters, TPU pods, and distributed training strategies examined in @sec-hardware-acceleration all claim dramatic speedups, and training benchmarks reveal which claims hold under realistic workloads. They evaluate how hardware configurations, data loading mechanisms, and distributed training strategies actually perform when training production-scale models.
@@ -1911,16 +1911,16 @@ from mlsys.formatting import fmt, check
class ScalingEfficiencyCalc:
    """Strong scaling efficiency for 8-GPU ResNet-50 training: 75% efficiency, 25% overhead loss."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    t1_hours = 24  # single-GPU training time
    n_gpus = 8
    tn_hours = 4  # actual N-GPU training time
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    ideal_hours = t1_hours / n_gpus
    efficiency_pct = t1_hours / (n_gpus * tn_hours) * 100
    loss_pct = 100 - efficiency_pct
    eff_denom = n_gpus * tn_hours
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    ideal_str = fmt(ideal_hours, precision=0, commas=False)
    eff_str = fmt(efficiency_pct, precision=0, commas=False)
    loss_str = fmt(loss_pct, precision=0, commas=False)
@@ -2019,13 +2019,13 @@ from mlsys.constants import ENERGY_DRAM_PJ_PER_BYTE, ENERGY_FLOP_FP32_PJ, ENERGY
class EnergyBreakdownCalc:
    """FP32 vs INT8 MobileNet inference energy: memory load dominates; INT8 attacks both sources."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    m_params = 4.3  # Million params
    m_macs = 300  # Million MACs
    _dram_pj = ENERGY_DRAM_PJ_PER_BYTE.m_as(ureg.picojoule / byte)
    _fp32_pj = ENERGY_FLOP_FP32_PJ.m_as(ureg.picojoule / ureg.flop)
    _int8_pj = ENERGY_FLOP_INT8_PJ.m_as(ureg.picojoule / ureg.flop)
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # FP32: 4 bytes/param for load; INT8: 1 byte/param
    e_fp32_load = (m_params * MILLION * 4 * _dram_pj) / MILLION  # uJ
    e_fp32_compute = (m_macs * MILLION * _fp32_pj) / MILLION  # uJ
@@ -2040,7 +2040,7 @@ class EnergyBreakdownCalc:
    s_total = e_fp32_total / e_int8_total
    e_fp32_load_mj = e_fp32_load / 1000
    e_fp32_compute_mj = e_fp32_compute / 1000
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    dram_energy_pj_str = fmt(_dram_pj, precision=0, commas=False)
    m_params_str = f"{m_params}"
    m_macs_str = fmt(m_macs, precision=0, commas=False)
@@ -2364,18 +2364,18 @@ from mlsys.formatting import fmt, check
class AmdahlBenchmarkCalc:
    """Amdahl ceiling: 5× inference speedup → only 1.8× end-to-end when preprocessing dominates."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    bench_preprocess_ms = 8
    bench_inference_ms = 10
    bench_inf_speedup = 5
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    bench_total_ms = bench_preprocess_ms + bench_inference_ms
    bench_opt_inference_ms = bench_inference_ms / bench_inf_speedup
    bench_opt_total_ms = bench_preprocess_ms + bench_opt_inference_ms
    bench_e2e_improvement = bench_total_ms / bench_opt_total_ms
    preprocess_fraction = bench_preprocess_ms / bench_total_ms
    amdahl_ceiling = 1 / preprocess_fraction
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    bench_preprocess_ms_str = fmt(bench_preprocess_ms, precision=0, commas=False)
    bench_inference_ms_str = fmt(bench_inference_ms, precision=0, commas=False)
    bench_inf_speedup_str = fmt(bench_inf_speedup, precision=0, commas=False)
@@ -2427,7 +2427,7 @@ Comprehensive latency reporting therefore requires specifying which components a
#### Throughput and Batch Efficiency {#sec-benchmarking-throughput-batch-efficiency-fe85}

\index{Queries per Second!inference throughput metric}
-Throughput measures how many inference requests a system can process per second, typically expressed as queries per second (QPS) or frames per second (FPS). Single-instance systems process each input independently on arrival; batch systems process multiple inputs in parallel, leveraging hardware optimizations for higher efficiency.
+Throughput measures how many inference requests a system can process per second, typically expressed as queries per second (QPS) or frames per second (FPS). Single-instance systems process each input independently on arrival; batch systems process multiple inputs in parallel, exploiting hardware parallelism for higher efficiency.

For example, cloud-based services handling millions of queries per second benefit from batch inference, where large groups of inputs are processed together to maximize computational efficiency. In contrast, applications like robotics, interactive AI, and augmented reality require low-latency single-instance inference, where the system must respond immediately to each new input.
@@ -2644,21 +2644,21 @@ from mlsys.formatting import fmt, check
class EdgeTPUSpeedupCalc:
    """EdgeTPU vs Cortex-M7: 7.5× inference speedup, higher peak power, lower energy per inference."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    edgetpu_latency_ms = 2
    cpu_latency_ms = 15
    edgetpu_e2e_ms = 6
    cpu_e2e_ms = 18
    edgetpu_power_mw = 500
    cpu_power_mw = 120
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    inference_speedup = cpu_latency_ms / edgetpu_latency_ms
    e2e_speedup = cpu_e2e_ms / edgetpu_e2e_ms
    power_ratio = edgetpu_power_mw / cpu_power_mw
    cpu_energy_mj = cpu_power_mw * cpu_latency_ms / 1000
    edgetpu_energy_mj = edgetpu_power_mw * edgetpu_latency_ms / 1000
    energy_ratio = cpu_energy_mj / edgetpu_energy_mj
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    edgetpu_latency_ms_str = fmt(edgetpu_latency_ms, precision=0, commas=False)
    cpu_latency_ms_str = fmt(cpu_latency_ms, precision=0, commas=False)
    edgetpu_e2e_ms_str = fmt(edgetpu_e2e_ms, precision=0, commas=False)
@@ -2715,7 +2715,7 @@ Training benchmarks measure learning speed; inference benchmarks measure serving
## Power Measurement Techniques {#sec-benchmarking-power-measurement-techniques-bcc2}

-Power measurement completes the evaluation triad by quantifying the energy cost of performance, enabling the efficiency comparisons that increasingly determine deployment decisions.
+A chip vendor advertises "10 TOPS at 0.5 W," but under sustained inference load, thermal throttling drops actual throughput to 3 TOPS at 2 W. Without standardized power measurement, this 6.7$\times$ efficiency gap between the datasheet and reality goes undetected until deployment.

\index{TOPS per Watt!primary design objective}
This third dimension is critical because @sec-hardware-acceleration established TOPS/Watt as a primary design objective alongside raw TOPS. Power benchmarks validate whether efficiency-optimized accelerators actually deliver their promised energy savings. Power claims are particularly susceptible to gaming: a chip advertising "10 TOPS at 0.5 W" might achieve that ratio only at minimal utilization; under sustained load, thermal throttling and voltage scaling may deliver 3 TOPS at 2 W. Power benchmarks expose these gaps.
@@ -2899,7 +2899,7 @@ The relationship between computational performance and energy efficiency is a ce
In deployment scenarios with strict energy constraints, particularly battery-powered edge devices and mobile applications, optimizing this performance-energy tradeoff becomes essential for practical viability. Model optimization techniques offer promising approaches to achieve better efficiency without material accuracy degradation. Numerical precision optimization techniques, which reduce computational requirements while maintaining model quality, demonstrate this tradeoff effectively. Research shows that reduced-precision computation can maintain model accuracy within 1–2% of the original while delivering 3--4$\times$ improvements in both inference speed and energy efficiency.

-These optimization strategies span three interconnected dimensions: accuracy, computational performance, and energy efficiency. Advanced optimization methods enable fine-tuned control over this tradeoff space. Similarly, model optimization and compression techniques require careful balancing of accuracy losses against efficiency gains. The optimal operating point among these factors depends heavily on deployment requirements and constraints; mobile applications typically prioritize energy efficiency to extend battery life, while cloud-based services might optimize for accuracy even at higher power consumption costs, leveraging economies of scale and dedicated cooling infrastructure.
+These optimization strategies span three interconnected dimensions: accuracy, computational performance, and energy efficiency. Advanced optimization methods enable fine-tuned control over this tradeoff space. Similarly, model optimization and compression techniques require careful balancing of accuracy losses against efficiency gains. The optimal operating point among these factors depends heavily on deployment requirements and constraints; mobile applications typically prioritize energy efficiency to extend battery life, while cloud-based services might optimize for accuracy even at higher power consumption costs, benefiting from economies of scale and dedicated cooling infrastructure.

Energy efficiency metrics now occupy a central position in AI system evaluation. Power measurement standards such as MLPerf Power [@tschand2024mlperf] provide standardized frameworks for comparing energy efficiency across hardware platforms and deployment scenarios. These standards enable engineers to systematically balance performance, power consumption, and environmental impact when selecting hardware and optimization strategies.
@@ -3297,13 +3297,13 @@ anchor=south,fill=white,inner sep=1pt]at (axis description cs: 0.82,0.49) {Bench
Analysis of these MLPerf Power trends reveals two notable patterns. First, energy efficiency improvements for traditional ML workloads (ResNet, BERT, RNN-T) have plateaued after initial gains; the low-hanging fruit of optimization has been harvested. Second, generative AI applications show dramatic efficiency increases (378$\times$ for Llama2, 113$\times$ for GPTJ), reflecting rapid innovation as researchers optimize these newer, larger models. This dichotomy suggests that established workloads have reached optimization maturity while frontier models still offer substantial efficiency headroom, a pattern likely to repeat as each new model architecture matures.

-The measurement techniques examined above—from timing protocols to power instrumentation—provide the raw data for benchmarking. Yet raw data alone does not guarantee sound conclusions. Converting measurements into meaningful comparisons requires understanding the systematic sources of error, bias, and misalignment that can make even carefully collected benchmark numbers misleading.
+Timing protocols and power instrumentation provide the raw data for benchmarking. Raw data alone, however, does not guarantee sound conclusions. Converting measurements into meaningful comparisons requires understanding the systematic sources of error, bias, and misalignment that can make even carefully collected benchmark numbers misleading.

## Benchmarking Best Practices {#sec-benchmarking-benchmarking-limitations-best-practices-9d65}

-The preceding sections established what benchmarks measure: training throughput, inference latency, power efficiency, and their validation through MLPerf. Knowing what to measure, however, is insufficient without understanding what benchmarks *cannot* capture—and why this gap has derailed countless deployments.
+Training throughput, inference latency, and power efficiency each have established measurement protocols validated through MLPerf. Knowing *what* to measure, however, is insufficient without understanding what benchmarks *cannot* capture—and why this gap has derailed countless deployments.

-Every benchmark makes simplifying assumptions that enable standardized comparison but diverge from production reality. Training benchmarks assume fixed datasets and reproducible random seeds; production data drifts continuously. Inference benchmarks assume steady-state operation; production traffic spikes unpredictably. Power benchmarks assume controlled thermal environments; real hardware throttles under sustained load. Four categories of limitations—statistical, deployment-related, system design, and organizational—determine whether benchmark results translate to deployment success; the following discussion covers each and actionable practices for bridging the gaps.
+Every benchmark makes simplifying assumptions that enable standardized comparison but diverge from production reality. Training benchmarks assume fixed datasets and reproducible random seeds; production data drifts continuously. Inference benchmarks assume steady-state operation; production traffic spikes unpredictably. Power benchmarks assume controlled thermal environments; real hardware throttles under sustained load. Four categories of limitations—statistical, deployment-related, system design, and organizational—determine whether benchmark results translate to deployment success.

### Statistical & Methodological Issues {#sec-benchmarking-statistical-methodological-issues-7aa5}
@@ -3683,17 +3683,17 @@ In the Hennessy & Patterson tradition of quantitative systems, we must acknowled
Common "cheating" techniques in ML benchmarking include:

-* **Precision Dropping**: Compilers may silently reduce precision (e.g., from FP32 to BF16) only during the benchmark run to inflate throughput, even if the user didn't request it.
-* **Operator Removal**: A compiler might identify that a benchmark only cares about top-1 accuracy and "optimize out" the activation functions or layer norms if they don't affect that specific metric, yielding unrealistic speedups.
+* **Precision Dropping**: Compilers may silently reduce precision (e.g., from FP32 to BF16) only during the benchmark run to inflate throughput, even if the user did not request it.
+* **Operator Removal**: A compiler might identify that a benchmark only cares about top-1 accuracy and "optimize out" the activation functions or layer norms if they do not affect that specific metric, yielding unrealistic speedups.
* **Weight Pre-Loading**: Hardcoding the benchmark model's weights into the chip's on-chip SRAM, bypassing the "Memory Wall" bottlenecks that real production models must face.

-**The MLPerf Response**: MLPerf prevents this gaming through its **Reference vs. Submission** validation. Every submitter must run the *exact same* model structure and reach a *verifiable accuracy target* (e.g., 75.9% on ImageNet) to qualify. If a compiler "cheats" by dropping precision or removing operators, the accuracy check fails, and the result is disqualified. This "Accuracy Guardrail" is what transforms a simple speed test into a rigorous engineering benchmark, forcing vendors to optimize for the **Silicon Contract** rather than just gaming the numbers.
+MLPerf prevents this gaming through its **Reference vs. Submission** validation. Every submitter must run the *exact same* model structure and reach a *verifiable accuracy target* (e.g., 75.9% on ImageNet) to qualify. If a compiler "cheats" by dropping precision or removing operators, the accuracy check fails, and the result is disqualified. This "Accuracy Guardrail" transforms a simple speed test into a rigorous engineering benchmark, forcing vendors to optimize for the **Silicon Contract** rather than gaming the numbers.

Yet even the most rigorous system benchmarks validate only one dimension of deployment readiness. A system achieving record throughput and efficiency on MLPerf says nothing about whether the model it runs is accurate on real-world inputs, or whether the data it was trained on represents the population it will serve. Hardware that delivers promised TFLOPS is necessary but insufficient; the model running on that hardware must preserve the quality users depend on, and the data that shaped that model must represent the world it will encounter. Completing the validation stack requires turning from hardware to the model and data dimensions of our three-dimensional framework.

## Model and Data Evaluation {#sec-benchmarking-model-data-benchmarking-e0ca}

-The preceding sections validated hardware acceleration through training throughput, inference latency, and power efficiency, completing the system dimension of our three-dimensional benchmarking framework. Hardware validation alone, however, cannot ensure deployment success. The optimization pipeline from Part III also included model compression (@sec-model-compression) and data selection (@sec-data-selection), each requiring its own validation — a compressed model running on accelerated hardware trained on biased data will fail despite excellent system benchmarks. This section completes the validation stack by addressing the remaining two dimensions: model benchmarks verify that compression preserved accuracy and critical model properties, while data benchmarks verify that training data enables robust generalization.
+System benchmarks can confirm that hardware delivers promised training throughput, inference latency, and power efficiency. Hardware validation alone, however, cannot ensure deployment success. The optimization pipeline from Part III also included model compression (@sec-model-compression) and data selection (@sec-data-selection), each requiring its own validation — a compressed model running on accelerated hardware trained on biased data will fail despite excellent system benchmarks. The remaining two dimensions of the framework address this gap: model benchmarks verify that compression preserved accuracy and critical model properties, while data benchmarks verify that training data enables robust generalization.

### Model Benchmarking {#sec-benchmarking-model-benchmarking-4847}
@@ -3807,7 +3807,7 @@ from mlsys.formatting import fmt, check
class MobileNetINT8Calc:
    """MobileNetV2 FP32 vs INT8: aggregate accuracy holds but calibration and edge cases degrade."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    acc_fp32 = 71.8
    params_m = 3.5
    size_fp32_mb = 14.0
@@ -3819,10 +3819,10 @@ class MobileNetINT8Calc:
    ece_int8 = 0.089
    edge_fp32 = 68.2
    edge_int8 = 61.4
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    acc_drop = acc_fp32 - acc_int8
    edge_drop = edge_fp32 - edge_int8
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    mv2_acc_fp32_str = fmt(acc_fp32, precision=1, commas=False)
    mv2_params_m_str = fmt(params_m, precision=1, commas=False)
    mv2_size_fp32_mb_str = fmt(size_fp32_mb, precision=1, commas=False)
@@ -3931,14 +3931,14 @@ from mlsys.formatting import fmt, check
class LLMThroughputCalc:
    """25 vs 100 tok/s on a 750-token response: 30 s vs 7.5 s — a 4× user-perceived difference."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    response_tokens = 750  # ~500 words
    slow_toks = 25
    fast_toks = 100
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    slow_sec = response_tokens / slow_toks
    fast_sec = response_tokens / fast_toks
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    slow_str = fmt(slow_sec, precision=0, commas=False)
    fast_str = fmt(fast_sec, precision=1, commas=False)
    response_tokens_str = fmt(response_tokens, precision=0, commas=False)
@@ -4370,7 +4370,7 @@ With benchmarking principles, methodologies, and production considerations estab
class FallaciesPitfallsSetup:
    """Quantitative backing for all Fallacy/Pitfall items in the benchmarking chapter."""
-    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # Fallacy 1: Benchmark vs. production accuracy gap
    benchmark_accuracy_pct = 92
    production_accuracy_low_pct = 78
@@ -4415,13 +4415,13 @@ class FallaciesPitfallsSetup:
    e2e_latency_high_ms = 100
    availability_nines = 99.9
    downtime_minutes_month = 43
-    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    accuracy_drop_pct = benchmark_accuracy_pct - production_accuracy_high_pct
    latency_multiplier = production_p99_low_ms / benchmark_latency_mean_ms
    power_ratio = high_power_w / low_power_w
    throughput_loss_pct = round((1 - low_throughput_qps / high_throughput_qps) * 100)
    throughput_degradation_pct = round((1 - production_throughput_high_qps / isolated_throughput_qps) * 100)
-    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    benchmark_accuracy_pct_str = f"{benchmark_accuracy_pct}"
    production_accuracy_range_str = f"{production_accuracy_low_pct}-{production_accuracy_high_pct}"
    accuracy_drop_pct_str = f"{accuracy_drop_pct}"
@@ -4561,3 +4561,10 @@ We have validated our optimizations in the lab, but a benchmark is a map, not th
```{=latex}
\part{key:vol1_deploy}
```
+
+```{python}
+#| echo: false
+#| label: chapter-end
+from mlsys.registry import end_chapter
+end_chapter("vol1:benchmarking")
+```
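The bookend added above calls `end_chapter`, which this commit introduces in `mlsys/registry.py` and exports from `__init__.py`. The registry implementation itself is not part of this diff, so the following is only a hypothetical sketch of how a start/end pair could be tracked; everything beyond the two entry-point names is an assumption, not the repository's actual code.

```python
# Hypothetical sketch only: the real mlsys/registry.py is not shown in this diff.
# Illustrates one way start_chapter/end_chapter could bracket a chapter so that
# cross-chapter constant resolution knows which chapter is currently rendering.
_active_chapter = None      # assumed module-level state
_finished_chapters = set()  # assumed module-level state

def start_chapter(chapter_id: str) -> None:
    """Record the chapter that is currently being rendered (assumed behavior)."""
    global _active_chapter
    _active_chapter = chapter_id

def end_chapter(chapter_id: str) -> None:
    """Close the bookend opened by start_chapter (assumed behavior)."""
    global _active_chapter
    if _active_chapter != chapter_id:
        raise RuntimeError(f"end_chapter({chapter_id!r}) without a matching start_chapter")
    _finished_chapters.add(chapter_id)
    _active_chapter = None
```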
@@ -16,7 +16,6 @@ engine: jupyter
# │ Goal: Initialize the chapter and register with the mlsys registry.
# │ Show: Correct registration for cross-chapter constant resolution.
# │ How: Invoke start_chapter from the mlsys registry module.
# │
# │ Imports: mlsys.registry (start_chapter)
# │ Exports: (none — side effect only)
@@ -48,19 +47,19 @@ start_chapter("vol1:conclusion")
from mlsys import Hardware, Models
from mlsys.formatting import md_math, check

-# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
+# ┌── LEGO ───────────────────────────────────────────────
class ConclusionRoofline:
    """
    Namespace for Conclusion Roofline Analysis.
    Scenario: Llama-2-70B inference on H100 (Memory Bound).
    """

-    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
+    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    gpu = Hardware.H100
    model = Models.Language.Llama2_70B
    precision_bytes = 2.0  # FP16

-    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
+    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    d_vol = model.parameters * precision_bytes
    compute_req = model.parameters * 2  # 2 FLOPs per param per token
@@ -69,10 +68,10 @@ class ConclusionRoofline:
    ratio = t_mem / t_comp

-    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
+    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(ratio >= 10, f"LLM Inference should be heavily memory bound. Ratio is only {ratio:.1f}x")

-    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
+    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    llama_params_str = "70B"
    llama_dvol_gb_str = f"{d_vol.m_as('GB'):.0f}"
    llama_compute_gflops_str = f"{compute_req.m_as('GFLOPs'):.0f}"
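A hedged back-of-envelope version of the ConclusionRoofline calculation above: the 70B parameter count, 2 bytes per FP16 weight, and 2 FLOPs per parameter per token come from the cell in the hunk, while the H100 bandwidth and FP16 peak below are approximate assumptions (roughly 3.35 TB/s HBM and roughly 990 TFLOPS dense); the chapter reads exact values from `Hardware.H100` instead.

```python
# Illustrative sketch: why single-token Llama-2-70B decoding is memory bound on an H100.
params = 70e9
bytes_per_param = 2.0            # FP16 weights
flops_per_param_per_token = 2    # one multiply-accumulate per parameter

d_vol_gb = params * bytes_per_param / 1e9                   # ~140 GB streamed per token
compute_gflop = params * flops_per_param_per_token / 1e9    # ~140 GFLOPs per token

t_mem = d_vol_gb / 3350.0        # seconds to move the weights (assumed ~3.35 TB/s)
t_comp = compute_gflop / 990e3   # seconds of FP16 compute (assumed ~990 TFLOPS)
print(f"memory/compute time ratio ~ {t_mem / t_comp:.0f}x")  # roughly 300x, well past the >= 10 guard
```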
@@ -153,9 +152,9 @@ This is the central lesson of every chapter in this book. The introduction (@sec
This book began with a mathematical formula: the Iron Law of ML Systems (Principle \ref{pri-iron-law}) (@sec-introduction-iron-law-ml-systems-c32a). Its terms, Data Movement, Compute, and Overhead, once seemed abstract. Now they are primary engineering levers for quantitative analysis of systems that once seemed opaque. Building intelligence requires more than writing algorithms; it requires honoring the Silicon Contract (Principle \ref{pri-silicon-contract}), the *physical and economic agreement* between the model and the machine. @sec-hardware-acceleration equipped us to calculate arithmetic intensity and identify whether workloads are memory-bound or compute-bound, transforming vague performance intuitions into quantitative engineering decisions.

\index{Transformer!systems integration}
-This quantitative foundation leads to a broader point: contemporary artificial intelligence[^fn-ai-systems-view] achievements are not the product of any single algorithmic insight but of careful integration across interacting components. This systems perspective places machine learning within the same engineering tradition that built reliable computers, where powerful capabilities arise from coordinating many parts together. The Transformer architectures [@vaswani2017attention] enabling large language models exemplify this principle—their mathematical elegance alone does not explain their dominance. Their practical utility depends on integrating attention mechanisms with distributed training infrastructure, memory-efficient optimization techniques, and reliable operational frameworks that keep them reliable in production.
+This quantitative foundation leads to a broader point: contemporary artificial intelligence[^fn-ai-systems-view] achievements are not the product of any single algorithmic insight but of careful integration across interacting components. This systems perspective places machine learning within the same engineering tradition that built reliable computers, where emergent capabilities arise from coordinating many parts together. The Transformer architectures [@vaswani2017attention] enabling large language models exemplify this principle—their mathematical elegance alone does not explain their dominance. Their practical utility depends on integrating attention mechanisms with distributed training infrastructure, memory-efficient optimization techniques, and reliable operational frameworks that keep them reliable in production.

-[^fn-ai-systems-view]: **Artificial Intelligence (Systems Perspective)**: The capacity of an integrated system---not any single algorithm---to exhibit intelligent behavior. Throughout this book, we have seen that success depends on coordinating data pipelines, training infrastructure, inference serving, security, and governance: systems engineering excellence across all components.
+[^fn-ai-systems-view]: **Artificial Intelligence (Systems Perspective)**: The capacity of an integrated system—not any single algorithm—to exhibit intelligent behavior. Throughout this book, we have seen that success depends on coordinating data pipelines, training infrastructure, inference serving, security, and governance: systems engineering excellence across all components.

What does this integration mean in practice? We often speak of the "model" as the weights file—a 500 MB blob of floating-point numbers. In a production environment, however, the weights are just one component of the true model, and often not even the most important one. A model that produces perfect predictions is useless if it receives corrupted inputs, and a model that trains flawlessly will fail if it cannot be deployed reliably. The *true model* is the sum of the data pipeline that defines what the model sees, the training infrastructure that determines what it learns, the serving system that decides how it interacts with the world, and the monitoring loop that keeps it tethered to reality. Optimize the system, and the model improves. Neglect the system, and the model degrades. Systems engineering is not a wrapper around ML; it is the implementation of ML. The system *is* the model.
@@ -180,7 +179,7 @@ Part III shifted from building to optimizing. Model compression (@sec-model-comp
Finally, Part IV confronted production reality. Serving systems (@sec-model-serving) had to meet latency budgets under load. Operational practices (@sec-ml-operations) maintained model health over time as data distributions shifted. Responsible engineering (@sec-responsible-engineering) ensured that systems serve all users fairly, not just the populations best represented in training data.

-Each chapter contributed a piece. But the real lesson is not in any individual piece—it is in *how the pieces constrain each other*. An architecture choice enabled a compression choice, which enabled an acceleration choice, which shaped a serving constraint, which defined an operational requirement. Depthwise separable convolutions in MobileNetV2 allowed INT8 quantization with minimal accuracy loss. That in turn enabled mobile NPU deployment, which shaped a P99 < 50 ms latency constraint and required drift monitoring across heterogeneous device populations. Every decision propagated forward, and the engineer who understands only one layer cannot predict how changes ripple through the rest.
+Each chapter contributed a piece. The real lesson, however, is not in any individual piece—it is in *how the pieces constrain each other*. An architecture choice enabled a compression choice, which enabled an acceleration choice, which shaped a serving constraint, which defined an operational requirement. Depthwise separable convolutions in MobileNetV2 allowed INT8 quantization with minimal accuracy loss. That in turn enabled mobile NPU deployment, which shaped a P99 < 50 ms latency constraint and required drift monitoring across heterogeneous device populations. Every decision propagated forward, and the engineer who understands only one layer cannot predict how changes ripple through the rest.

This chapter distills that integrated perspective into a framework for reasoning about ML systems as wholes rather than as collections of parts. We begin by revisiting the Lighthouse Models that traced these constraint interactions across chapters, then formalize twelve quantitative invariants—rooted in physics, information theory, and statistics—that govern ML system behavior regardless of framework, hardware generation, or model family. We then examine how these principles apply across three domains, explore future directions where systems thinking will matter most, and close with the engineering responsibility that accompanies building systems of this power.
@@ -188,7 +187,7 @@ This chapter distills that integrated perspective into a framework for reasoning
The five Lighthouse Models introduced in @sec-introduction-iron-law-ml-systems-c32a made this constraint propagation concrete, serving as systems detectives throughout the book. Each revealed how different workloads expose different bottlenecks.

-ResNet-50 taught compute-bound optimization, showing how batch size transforms memory-bound inference into compute-bound throughput and why the same pruning strategy achieves different speedups on different hardware. GPT-2/Llama exposed a different wall entirely—memory bandwidth—revealing why autoregressive decoding is memory-bound, KV-caches dominate serving costs, and model parallelism becomes necessary at scale. Where these two Lighthouses stressed throughput and bandwidth, MobileNetV2 demonstrated efficiency under constraint: depthwise separable convolutions trading representational capacity for computational efficiency, quantization enabling deployment on mobile NPUs, and the Pareto frontier between accuracy and power consumption. DLRM shifted the binding constraint yet again—from memory *bandwidth* to memory *capacity*—where terabyte-scale embedding tables force the system architecture to be designed around where the data physically resides, and where sparse operations behave fundamentally differently from the dense matrix multiplications that dominate the other Lighthouses. Finally, Keyword Spotting (KWS) and Wake Vision brought us to the extreme edge: sub-megabyte models running on microcontrollers with always-on inference under microwatt power budgets, where every byte and every milliwatt matters.
+ResNet-50 taught compute-bound optimization, showing how batch size transforms memory-bound inference into compute-bound throughput and why the same pruning strategy achieves different speedups on different hardware. GPT-2/Llama exposed a different wall entirely: memory bandwidth. Their autoregressive decoding is memory-bound, KV-caches dominate serving costs, and model parallelism becomes necessary at scale. Where these two Lighthouses stressed throughput and bandwidth, MobileNetV2 demonstrated efficiency under constraint: depthwise separable convolutions trading representational capacity for computational efficiency, quantization enabling deployment on mobile NPUs, and the Pareto frontier between accuracy and power consumption. DLRM shifted the binding constraint yet again, from memory *bandwidth* to memory *capacity*, where terabyte-scale embedding tables force engineers to design the system architecture around where the data physically resides and where sparse operations behave fundamentally differently from the dense matrix multiplications that dominate the other Lighthouses. Finally, Keyword Spotting (KWS) and Wake Vision brought us to the extreme edge: sub-megabyte models running on microcontrollers with always-on inference under microwatt power budgets, where every byte and every milliwatt matters.

Together, these five workloads span the full deployment spectrum from datacenter to microcontroller, probing every bottleneck the invariants predict and testing every optimization strategy the book has taught. The systems thinking we developed by tracing these Lighthouses across chapters—from architecture design through training, optimization, and deployment—is the integrated perspective that distinguishes ML systems engineering from isolated algorithm development.
@@ -264,7 +263,7 @@ These four invariants explain why @sec-ml-operations devoted extensive attention
|
||||
|
||||
These principles are not a checklist to apply sequentially. They form a web of mutual constraints. As the Conservation of Complexity dictates, a single engineering decision ripples through multiple invariants simultaneously.
|
||||
|
||||
To see this concretely, trace what happens when you quantize a model from FP16 to INT8. This single decision navigates the Pareto Frontier (Principle \ref{pri-pareto-frontier}), trading precision for bandwidth. But the consequences do not stop there: quantization changes the model's Silicon Contract (Principle \ref{pri-silicon-contract}), shifting where it sits on the Arithmetic Intensity curve (Principle \ref{pri-arithmetic-intensity}) and altering its energy profile (Principle \ref{pri-energy-movement}). When you deploy that quantized model, the Latency Budget (Principle \ref{pri-latency-budget}) governs whether the speedup meets the SLO, while the Training-Serving Skew Law (Principle \ref{pri-training-serving-skew}) demands verification that reduced precision did not introduce a divergence between training and serving behavior. A single quantization decision ripples through the Pareto Frontier, Silicon Contract, and Latency Budget simultaneously, where a win in one (bandwidth) must be validated against a risk in another (numerical skew).
|
||||
To see this concretely, trace what happens when you quantize a model from FP16 to INT8. This single decision navigates the Pareto Frontier (Principle \ref{pri-pareto-frontier}), trading precision for bandwidth. The consequences do not stop there: quantization changes the model's Silicon Contract (Principle \ref{pri-silicon-contract}), shifting where it sits on the Arithmetic Intensity curve (Principle \ref{pri-arithmetic-intensity}) and altering its energy profile (Principle \ref{pri-energy-movement}). When you deploy that quantized model, the Latency Budget (Principle \ref{pri-latency-budget}) governs whether the speedup meets the SLO, while the Training-Serving Skew Law (Principle \ref{pri-training-serving-skew}) demands verification that reduced precision did not introduce a divergence between training and serving behavior. A single quantization decision ripples through the Pareto Frontier, Silicon Contract, and Latency Budget simultaneously, where a win in one (bandwidth) must be validated against a risk in another (numerical skew).
|
||||
|
||||
Meanwhile, the Data Gravity Invariant (Principle \ref{pri-data-gravity}) determines where the model runs, the Data as Code Invariant (Principle \ref{pri-data-as-code}) determines what it learned, the Iron Law (Principle \ref{pri-iron-law}) determines how fast it runs, and Amdahl's Law (Principle \ref{pri-amdahl}) determines how much faster it can ever run. The Verification Gap (Principle \ref{pri-verification-gap}) reminds us that statistical tests can only *bound* the resulting accuracy loss, and the Statistical Drift Invariant (Principle \ref{pri-statistical-drift}) warns that even a validated deployment will degrade over time. Complexity is conserved; the engineer's task is to allocate it wisely.
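To make the ripple concrete, the short sketch below traces the FP16-to-INT8 decision through three of the invariants named above: weight footprint, arithmetic intensity for weight-streaming decode, and the latency budget check. It is plain Python with illustrative assumptions (a 7 B-parameter model, 2 TB/s memory bandwidth, a 50 ms per-token budget), not values drawn from the book's digital twins.

```python
# A minimal, self-contained sketch; the hardware and model numbers below are
# illustrative assumptions, not values from the mlsys digital twins.
params_b = 7.0                         # model size in billions of parameters (assumed)
flops_per_token = 2 * params_b * 1e9   # ~2 FLOPs per parameter per generated token
hbm_bw_gbs = 2000.0                    # assumed accelerator memory bandwidth, GB/s
latency_slo_ms = 50.0                  # assumed per-token latency budget

for name, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    weight_bytes = params_b * 1e9 * bytes_per_param
    # Arithmetic intensity for weight-streaming decode: FLOPs per byte moved.
    intensity = flops_per_token / weight_bytes
    # Memory-bound decode latency: time to stream the weights once per token.
    latency_ms = weight_bytes / (hbm_bw_gbs * 1e9) * 1e3
    meets_slo = latency_ms <= latency_slo_ms
    print(f"{name}: {weight_bytes / 1e9:.0f} GB weights, "
          f"intensity {intensity:.1f} FLOP/byte, "
          f"decode {latency_ms:.1f} ms/token, SLO met: {meets_slo}")
```

Halving the bytes per parameter doubles the arithmetic intensity and halves the memory-bound decode time; whether the accuracy cost is acceptable remains the Pareto question raised above.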
|
||||
|
||||
@@ -304,7 +303,7 @@ To see this cycle of mutual constraint in action, trace the flow in @fig-invaria
|
||||
```
|
||||
:::
|
||||
|
||||
The critical insight the figure reveals is the Deploy-to-Foundations feedback arrow. Invariants 9–12—the deployment invariants that detect drift, skew, and verification failures—are the signals that force the system to evolve. When drift erodes accuracy or skew corrupts predictions, the system must return to its foundations: new data, retrained models, fresh optimization passes through the entire stack. This cycle operates within a single node today, but the same physics governs fleet-scale systems—a transition we will return to at the chapter's close.
|
||||
The critical insight the figure reveals is the Deploy-to-Foundations feedback arrow. Invariants 9–12, the deployment invariants that detect drift, skew, and verification failures, are the signals that force the system to evolve. When drift erodes accuracy or skew corrupts predictions, the system must return to its foundations: new data, retrained models, fresh optimization passes through the entire stack. This cycle operates within a single node today, but the same physics governs fleet-scale systems—a transition we return to at the chapter's close.
|
||||
|
||||
::: {.callout-checkpoint title="Applying the Invariants" collapse="false"}
|
||||
A colleague proposes quantizing your model from FP32 to INT8 to reduce serving costs.
|
||||
@@ -340,7 +339,7 @@ The Cost of a Token calculation illustrates a broader truth: the invariant frame
|
||||
|
||||
## Principles in Practice {#sec--principles-practice-1005}
|
||||
|
||||
The twelve invariants gain their power not from theoretical elegance but from practical application. Throughout this book, these quantitative constraints have shaped engineering decisions across three domains spanning the full ML lifecycle: building technical foundations, engineering for scale, and navigating production reality. Each domain foregrounds different invariants, but all three demonstrate the same underlying lesson: systems thinking connects what isolated component analysis cannot.
|
||||
A team that memorizes all twelve invariants but cannot apply them to a real deployment decision has learned nothing. Throughout this book, these quantitative constraints have shaped engineering decisions across three domains spanning the full ML lifecycle: building technical foundations, engineering for scale, and navigating production reality. Each domain foregrounds different invariants, but all three demonstrate the same underlying lesson: systems thinking connects what isolated component analysis cannot.
|
||||
|
||||
### Building Technical Foundations { .unnumbered}
|
||||
|
||||
@@ -371,24 +370,24 @@ Building and optimizing a model, however, is only half the engineering challenge
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TailLatencyRatio:
|
||||
"""
|
||||
Namespace for Tail Latency Ratio Calculation.
|
||||
Scenario: Comparing mean latency vs P99 tail latency.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
mean_latency_ms = 50.0
|
||||
p99_latency_ms = 2000.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
ratio = p99_latency_ms / mean_latency_ms
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(ratio >= 10, f"P99/mean ratio ({ratio:.1f}x) is below the 10x threshold expected for tail-dominated latency.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
conclusion_tail_ratio_str = fmt(ratio, precision=0, commas=False)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -452,11 +451,11 @@ The most ambitious application of these invariants lies ahead: engineering the p
|
||||
|
||||
Universal generalization imposes extraordinary systems demands. Every invariant becomes simultaneously active: the Iron Law (Principle \ref{pri-iron-law}) governs computation at a scale where models may contain trillions of parameters. The Silicon Contract (Principle \ref{pri-silicon-contract}) must be honored across heterogeneous hardware spanning GPUs, TPUs, and custom accelerators. The Pareto Frontier expands from two or three metrics (accuracy, latency, memory) to dozens (safety, fairness, reasoning quality, factuality, multilinguality). The Statistical Drift Invariant applies not to a single domain but to the entire distribution of human knowledge and interaction. No monolithic model can navigate this complexity alone.
|
||||
|
||||
This realization has driven the emergence of **compound AI systems**[^fn-compound-ai]\index{Compound AI Systems!reliability through composition}—architectures that chain multiple models and deterministic tools to achieve reliability exceeding their individual components. Rather than building a single model that does everything, compound systems decompose tasks into specialized steps: a retrieval component finds relevant information, a reasoning component processes it, and a verification component checks the output. Each step can be independently updated, monitored, and debugged. This decomposition trades latency and architectural complexity for control and correctness—a trade-off that the Pareto Frontier predicts and the Conservation of Complexity demands.
|
||||
This realization has driven the emergence of **compound AI systems**[^fn-compound-ai]\index{Compound AI Systems!reliability through composition}—architectures that chain multiple models and deterministic tools to achieve reliability exceeding their individual components. Rather than building a single model that does everything, compound systems decompose tasks into specialized steps: a retrieval component finds relevant information, a reasoning component processes it, and a verification component checks the output. Each step can be independently updated, monitored, and debugged. This decomposition trades latency and architectural complexity for control and correctness, a trade-off that the Pareto Frontier predicts and the Conservation of Complexity demands.
|
||||
|
||||
[^fn-compound-ai]: **Compound AI Systems**\index{Compound AI Systems!etymology}: Coined by researchers at Berkeley AI Research (BAIR) in 2024 to describe systems that compose multiple AI components---models, retrievers, tools, and verifiers---into pipelines, rather than relying on a single monolithic model. Examples include retrieval-augmented generation (RAG) and tool-augmented agents. From a systems perspective, compound AI systems trade single-model simplicity for orchestration complexity, but gain independently updatable components, debuggable intermediate outputs, and the ability to enforce deterministic constraints alongside probabilistic generation.
|
||||
[^fn-compound-ai]: **Compound AI Systems**\index{Compound AI Systems!etymology}: Coined by researchers at Berkeley AI Research (BAIR) in 2024 to describe systems that compose multiple AI components—models, retrievers, tools, and verifiers—into pipelines, rather than relying on a single monolithic model. Examples include retrieval-augmented generation (RAG) and tool-augmented agents. From a systems perspective, compound AI systems trade single-model simplicity for orchestration complexity, but gain independently updatable components, debuggable intermediate outputs, and the ability to enforce deterministic constraints alongside probabilistic generation.
|
||||
|
||||
The compound AI systems framework aligns naturally with the systems engineering principles we have studied. Modular components can be independently compressed and accelerated using the techniques from @sec-model-compression and @sec-hardware-acceleration. Each component has its own Silicon Contract (Principle \ref{pri-silicon-contract}) and Arithmetic Intensity profile, allowing hardware-specific optimization. The interfaces between components create natural monitoring points for detecting drift, skew, and degradation. The engineering challenges ahead—reliable orchestration of multiple models, efficient routing of requests across specialized components, maintaining consistency across distributed state—require mastery across the full stack we have explored, from data engineering and distributed training to model optimization and operational infrastructure. These quantitative invariants, not algorithmic breakthroughs alone, define the path toward artificial general intelligence—an endeavor that unfolds within what Hennessy and Patterson have called *a new golden age for computer architecture*.
|
||||
The compound AI systems framework aligns naturally with the systems engineering principles we have studied. Modular components can be independently compressed and accelerated using the techniques from @sec-model-compression and @sec-hardware-acceleration. Each component has its own Silicon Contract (Principle \ref{pri-silicon-contract}) and Arithmetic Intensity profile, allowing hardware-specific optimization. The interfaces between components create natural monitoring points for detecting drift, skew, and degradation. The engineering challenges ahead require mastery across the full stack we have explored: reliable orchestration of multiple models, efficient routing of requests across specialized components, and maintaining consistency across distributed state all demand integration from data engineering through model optimization to operational infrastructure. These quantitative invariants, not algorithmic breakthroughs alone, define the path toward artificial general intelligence, an endeavor that unfolds within what Hennessy and Patterson have called *a new golden age for computer architecture*.
|
||||
|
||||
\index{Hennessy and Patterson}
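A minimal sketch of that retrieve-reason-verify decomposition appears below. Every component is a placeholder stub rather than a real retriever or model call, and the function names and metrics structure are assumptions chosen for illustration; the point is the architectural one from the text: each stage is independently replaceable, and the boundaries between stages are natural monitoring points.

```python
# Placeholder compound pipeline: each stage is a stub, not a real model or
# retriever. Stage boundaries double as monitoring points (per-stage latency).
from dataclasses import dataclass, field
import time

@dataclass
class StageMetrics:
    latencies_ms: dict = field(default_factory=dict)

    def record(self, stage: str, start: float) -> None:
        self.latencies_ms[stage] = (time.perf_counter() - start) * 1e3

def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]          # stand-in for a vector-store lookup

def reason(query: str, docs: list[str]) -> str:
    return f"answer({query}) grounded in {len(docs)} docs"   # stand-in for an LLM call

def verify(answer: str) -> bool:
    return "grounded" in answer            # stand-in for a deterministic checker

def compound_pipeline(query: str, metrics: StageMetrics) -> str:
    t = time.perf_counter()
    docs = retrieve(query)
    metrics.record("retrieve", t)

    t = time.perf_counter()
    draft = reason(query, docs)
    metrics.record("reason", t)

    t = time.perf_counter()
    ok = verify(draft)
    metrics.record("verify", t)

    return draft if ok else "REJECTED: failed verification"

metrics = StageMetrics()
print(compound_pipeline("keyword spotting power budgets", metrics))
print(metrics.latencies_ms)   # per-stage latency: the monitoring hook described above
```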
|
||||
|
||||
@@ -489,7 +488,7 @@ Exercising that responsibility at the scale these applications demand, however,
|
||||
|
||||
### Node to Fleet { .unnumbered}
|
||||
|
||||
Every principle we have established—from measuring bottlenecks to co-designing for hardware—was developed within the scope of a single system. But training a frontier model requires thousands of GPUs running for months, petabytes of data flowing through distributed pipelines, and failure rates measured in failures per hour rather than failures per year. The systems that will define the next decade of AI operate at a scale where individual machines become components of something far larger. That transition is not merely an increase in quantity; it is a qualitative shift in the engineering challenges involved.
|
||||
Every principle we have established—from measuring bottlenecks to co-designing for hardware—was developed within the scope of a single system. Training a frontier model, however, requires thousands of GPUs running for months, petabytes of data flowing through distributed pipelines, and failure rates measured in failures per hour rather than failures per year. The systems that will define the next decade of AI operate at a scale where individual machines become components of something far larger. That transition is not merely an increase in quantity; it is a qualitative shift in the engineering challenges involved.
|
||||
|
||||
This book has deliberately focused on **Mastering the ML Node**. We established principles that can be directly observed and experimented with on a single system. Understanding bottlenecks on one machine—whether memory bandwidth limitations, CPU-GPU data transfer overhead, or preprocessing inefficiencies—enables recognition of when and why scaling becomes necessary. We learned to calculate arithmetic intensity, optimize data pipelines, and prune models to fit within strict constraints.
|
||||
|
||||
@@ -505,6 +504,8 @@ Mastery, however, carries a recurring temptation: the belief that understanding
|
||||
|
||||
## Fallacies and Pitfalls {#sec--fallacies-pitfalls-12ef}
|
||||
|
||||
The errors below arise from a common source: treating ML systems as decomposable into independent parts. Each fallacy assumes that optimizing one dimension, one metric, or one stage suffices; each pitfall shows the consequence when that assumption meets production reality.
|
||||
|
||||
**Fallacy:** *Systems engineering complexity disappears with better tools and abstractions.*
|
||||
|
||||
Tools abstract complexity; they do not eliminate it. A high-level framework that hides memory management still consumes memory. An AutoML system that tunes hyperparameters still faces the Pareto Frontier. The Conservation of Complexity guarantees that simplifying one interface pushes complexity to another. The engineer who believes tools eliminate fundamental constraints will be surprised when those constraints resurface at scale, often in forms harder to diagnose than the original problem.
|
||||
@@ -541,13 +542,13 @@ This chapter distilled the integrated perspective that distinguishes ML systems
|
||||
|
||||
::: {.callout-takeaways title="Reasoning Across Boundaries"}
|
||||
|
||||
- **Twelve quantitative invariants define ML systems engineering**: From the Data as Code Invariant through the Latency Budget Invariant, these principles quantify the constraints that govern every design decision, organized across Foundations (data physics), Build (computation physics), Optimize (efficiency physics), and Deploy (reliability physics).
|
||||
- **The Conservation of Complexity unifies all twelve**: You cannot destroy complexity in an ML system; you can only move it between Data, Algorithm, and Machine. Every invariant quantifies a specific consequence of where complexity currently resides.
|
||||
- **The system is the model**: The true model is data pipeline + training infrastructure + serving system + monitoring loop. Optimize the system to improve the model.
|
||||
- **Production ML demands continuous operation and designed-in robustness**: The Verification Gap, Statistical Drift Invariant, and Training-Serving Skew Law guarantee that models degrade without code changes and that some failures reach production. Redundancy, uncertainty quantification, and continuous monitoring are first-class design requirements, not optional add-ons.
|
||||
- **Every deployment context stresses different invariants, but no context escapes them**: Cloud, edge, generative AI, and TinyML each foreground different terms of the Iron Law, but the Pareto Frontier and Energy-Movement Invariant govern all of them; success requires applying multiple principles simultaneously rather than optimizing any single metric.
|
||||
- **Technical excellence must combine with ethical commitment**: The Verification Gap and drift invariants apply equally to fairness metrics. Build systems that are efficient, accessible, sustainable, and beneficial.
|
||||
- **Mastering the node prepares you for the fleet**: The principles developed for single systems, from bottleneck diagnosis to hardware co-design and drift monitoring, scale to the Warehouse-Scale Computer, where the datacenter becomes the computer and the Iron Law spans racks and zones.
|
||||
* **Twelve quantitative invariants define ML systems engineering**: From the Data as Code Invariant through the Latency Budget Invariant, these principles quantify the constraints that govern every design decision, organized across Foundations (data physics), Build (computation physics), Optimize (efficiency physics), and Deploy (reliability physics).
|
||||
* **The Conservation of Complexity unifies all twelve**: You cannot destroy complexity in an ML system; you can only move it between Data, Algorithm, and Machine. Every invariant quantifies a specific consequence of where complexity currently resides.
|
||||
* **The system is the model**: The true model is data pipeline + training infrastructure + serving system + monitoring loop. Optimize the system to improve the model.
|
||||
* **Production ML demands continuous operation and designed-in robustness**: The Verification Gap, Statistical Drift Invariant, and Training-Serving Skew Law guarantee that models degrade without code changes and that some failures reach production. Redundancy, uncertainty quantification, and continuous monitoring are first-class design requirements, not optional add-ons.
|
||||
* **Every deployment context stresses different invariants, but no context escapes them**: Cloud, edge, generative AI, and TinyML each foreground different terms of the Iron Law, but the Pareto Frontier and Energy-Movement Invariant govern all of them; success requires applying multiple principles simultaneously rather than optimizing any single metric.
|
||||
* **Technical excellence must combine with ethical commitment**: The Verification Gap and drift invariants apply equally to fairness metrics. Build systems that are efficient, accessible, sustainable, and beneficial.
|
||||
* **Mastering the node prepares you for the fleet**: The principles developed for single systems, from bottleneck diagnosis to hardware co-design and drift monitoring, scale to the Warehouse-Scale Computer, where the datacenter becomes the computer and the Iron Law spans racks and zones.
|
||||
|
||||
:::
|
||||
|
||||
@@ -569,3 +570,10 @@ This book established the principles for mastering the ML node—the single syst
|
||||
```{=latex}
|
||||
\part{key:backmatter}
|
||||
```
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:conclusion")
|
||||
```
|
||||
|
||||
@@ -88,7 +88,7 @@ from mlsys import Hardware, Models
|
||||
from mlsys.formatting import fmt, sci, md_math
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DataloaderStats:
|
||||
"""
|
||||
Namespace for Dataloader Choke Point statistics.
|
||||
@@ -99,14 +99,14 @@ class DataloaderStats:
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
resnet_worker_count_str = DataloaderStats.resnet_worker_count_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DataEngineeringSetup:
|
||||
"""
|
||||
Namespace for Data Engineering chapter setup.
|
||||
Scenario: Constants for tables, KWS case study, and physical invariants.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Digital Twins
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
m_gpt3 = Models.GPT3
|
||||
@@ -163,7 +163,7 @@ class DataEngineeringSetup:
|
||||
_dataset_pb = 1
|
||||
_network_10g_gbs = 10 / 8 # 10 Gbps in GB/s
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Hardware/Energy
|
||||
a100_tflops_fp16 = h_a100.peak_flops.m_as(TFLOPs/second)
|
||||
a100_tflops_fp16_sparse = (h_a100.peak_flops * 2).m_as(TFLOPs/second)
|
||||
@@ -199,7 +199,7 @@ class DataEngineeringSetup:
|
||||
SECONDS_PER_MINUTE * MINUTES_PER_HOUR * HOURS_PER_DAY
|
||||
)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
# Energy/Hardware
|
||||
a100_tflops_fp16_str = fmt(a100_tflops_fp16, precision=0, commas=False)
|
||||
a100_tflops_fp16_sparse_str = fmt(a100_tflops_fp16_sparse, precision=0, commas=False)
|
||||
@@ -393,14 +393,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FeedingProblem:
|
||||
"""
|
||||
Namespace for The Feeding Problem calculation.
|
||||
Scenario: Saturating an A100 with ResNet-50 images from a standard disk.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
gpu_flops = A100_FLOPS_FP16_TENSOR
|
||||
model_flops = RESNET50_FLOPs
|
||||
|
||||
@@ -410,7 +410,7 @@ class FeedingProblem:
|
||||
# Standard Cloud Disk (e.g. AWS gp3 baseline)
|
||||
disk_bw_mbs = 250.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Throughput (Images/sec) = GPU_Peak / Model_FLOPs
|
||||
img_per_sec = (gpu_flops / model_flops).to_base_units().m_as('count/second')
|
||||
|
||||
@@ -422,11 +422,11 @@ class FeedingProblem:
|
||||
eta = min(disk_bw_mbs / (req_bw_bytes_sec / (1 * MB).m_as('B')), 1.0)
|
||||
feeding_tax_pct = (1.0 - eta) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(req_bw_gbs > 1.0, f"Required bandwidth ({req_bw_gbs:.1f} GB/s) is lower than expected.")
|
||||
check(feeding_tax_pct > 50, f"Feeding tax ({feeding_tax_pct:.1f}%) should exceed 50% for standard disks.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
req_bw_gbs_str = fmt(req_bw_gbs, precision=1, commas=False)
|
||||
disk_bw_mbs_str = fmt(disk_bw_mbs, precision=0, commas=False)
|
||||
feeding_tax_pct_str = fmt(feeding_tax_pct, precision=0, commas=False)
|
||||
@@ -492,20 +492,20 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check, md_math
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DataGravity:
|
||||
"""
|
||||
Namespace for Data Gravity calculation.
|
||||
Scenario: Moving 1 PB of data vs. moving the compute.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
dataset_pb = 1
|
||||
network = Hardware.Networks.Ethernet_100G
|
||||
egress_cost_gb = CLOUD_EGRESS_PER_GB.m_as(USD / GB)
|
||||
tpu_hourly_cost = TPU_V4_PER_HOUR.m_as(USD / hour)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
dataset_gb = dataset_pb * MILLION
|
||||
network_gbs = network.bandwidth.m_as(GB/second)
|
||||
network_gbps = network.bandwidth.m_as(Gbps)
|
||||
@@ -520,11 +520,11 @@ class DataGravity:
|
||||
transfer_cost = dataset_gb * egress_cost_gb
|
||||
equiv_tpu_hours = transfer_cost / tpu_hourly_cost
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(transfer_hours >= 20, f"Transfer time ({transfer_hours:.1f}h) is too fast. Data gravity argument fails.")
|
||||
check(transfer_cost >= 10000, f"Transfer cost (${transfer_cost}) is too cheap.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
transfer_seconds_str = fmt(transfer_seconds, precision=0, commas=True)
|
||||
transfer_hours_str = fmt(transfer_hours, precision=0, commas=False)
|
||||
transfer_days_10g_str = fmt(transfer_days_10g, precision=0, commas=False)
|
||||
@@ -668,7 +668,7 @@ Data cascades occur when teams skip establishing clear quality criteria, reliabi
|
||||
**The Systems Lesson**: This is a **Pipeline Jungle** failure. Without explicit **Data Contracts** and schema validation at the ingestion interface, changes in one system ("we need string zip codes") cause catastrophic, silent failures in downstream systems. Data engineering is the defense against this entropy.
|
||||
:::
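As one illustration of the defense this lesson calls for, the sketch below implements a minimal data contract check at an ingestion boundary. It uses only the standard library; the field names and rules are invented for the example, with the zip-code rule mirroring the failure mode described in the callout above.

```python
# Minimal data contract enforced at ingestion: each field maps to a validation
# rule, and violations are reported instead of silently propagating downstream.
RECORD_CONTRACT = {
    "user_id":  lambda v: isinstance(v, int) and v > 0,
    "zip_code": lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit(),
    "amount":   lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = [f"missing field: {k}" for k in RECORD_CONTRACT if k not in record]
    errors += [
        f"contract violation on '{k}': {record[k]!r}"
        for k, rule in RECORD_CONTRACT.items()
        if k in record and not rule(record[k])
    ]
    return errors

good = {"user_id": 42, "zip_code": "02138", "amount": 19.99}
bad  = {"user_id": 42, "zip_code": 2138,    "amount": 19.99}   # upstream sent an int
print(validate(good))   # []
print(validate(bad))    # ["contract violation on 'zip_code': 2138"]
```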
|
||||
|
||||
With the cascade pattern understood, we now define the four pillars that prevent these failures.
|
||||
Preventing these cascading failures requires more than ad hoc fixes; it demands a systematic framework that organizes data engineering decisions into four interdependent dimensions.
|
||||
|
||||
### Four Foundational Pillars {#sec-data-engineering-four-foundational-pillars-c119}
|
||||
|
||||
@@ -974,7 +974,7 @@ Look at @fig-keywords to see how a KWS system operates as a lightweight, always-
|
||||
|
||||
{#fig-keywords width=55% fig-alt="Diagram showing voice-activated device with microphone, always-on wake word detector, and connection to main voice assistant that activates upon keyword detection."}
|
||||
|
||||
With this understanding established, we apply the problem definition approach to the KWS example, demonstrating how the four pillars guide practical engineering decisions.
|
||||
The four pillars translate directly into engineering constraints for the KWS system.
|
||||
|
||||
The core problem is deceptively simple: detect specific keywords amidst ambient sounds and other spoken words, with high accuracy, low latency, and minimal false activations, on devices with severely limited computational resources. A well-specified problem definition identifies the desired keywords, the envisioned application, and the deployment scenario. The objectives that follow must balance competing requirements: performance targets of `{python} kws_accuracy_target_str`% accuracy in keyword detection with latency under `{python} kws_latency_limit_ms_str` milliseconds, alongside resource constraints demanding minimal power consumption and model sizes optimized for available device memory.
|
||||
|
||||
@@ -1006,29 +1006,29 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FalsePositiveTarget:
|
||||
"""
|
||||
Namespace for KWS False Positive Target calculation.
|
||||
Scenario: Always-on device (24h) with 1 false wake-up tolerance per month.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
duty_cycle_hours = 24
|
||||
window_sec = 1
|
||||
tolerance_per_month = 1
|
||||
|
||||
days_month = DAYS_PER_MONTH
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
windows_per_month = (days_month * duty_cycle_hours * SEC_PER_HOUR) / window_sec
|
||||
target_fpr = tolerance_per_month / windows_per_month
|
||||
rejection_pct = (1 - target_fpr) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(rejection_pct >= 99.999, f"Rejection target ({rejection_pct:.4f}%) is too lenient. Should be > 99.999%.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
sec_str = "60"
|
||||
min_str = "60"
|
||||
hr_str = "24"
|
||||
@@ -1109,11 +1109,11 @@ The following worked example demonstrates how to apply this design space analysi
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BudgetAllocation:
|
||||
"""Budget planning as an engineering optimization problem."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
total_budget = 150_000
|
||||
labeling_pct = 0.60
|
||||
storage_pct = 0.25
|
||||
@@ -1121,7 +1121,7 @@ class BudgetAllocation:
|
||||
cost_per_label = 0.10
|
||||
review_overhead = 0.20
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
labeling_budget = total_budget * labeling_pct
|
||||
storage_budget = total_budget * storage_pct
|
||||
governance_budget = total_budget * governance_pct
|
||||
@@ -1129,7 +1129,7 @@ class BudgetAllocation:
|
||||
labeled_examples = labeling_budget / effective_cost
|
||||
review_overhead_pct = int(review_overhead * 100)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cost_per_label_str = fmt(cost_per_label, precision=2, commas=False)
|
||||
review_overhead_pct_str = fmt(review_overhead_pct, precision=0, commas=False)
|
||||
labeling_k = f"{labeling_budget/1000:.0f}"
|
||||
@@ -1702,7 +1702,7 @@ For our KWS system, `{python} kws_dataset_size_m_round_str` million training exa
|
||||
|
||||
Governance constraints further shape acquisition: privacy regulations (GDPR, CCPA, HIPAA)\index{GDPR!data collection requirements}\index{CCPA!privacy regulations} limit what data can be collected and how, while ethical sourcing requires fair compensation and transparent use of human contributions. @sec-responsible-engineering-data-governance-compliance-bd1a examines the full governance infrastructure for production ML systems.
|
||||
|
||||
With acquisition strategies established, the diversity of sources—crowdsourced audio, synthetic waveforms, web-scraped content—creates specific challenges at the boundary where external data enters our controlled pipeline. We now cross that boundary into the infrastructure that receives, validates, and routes this heterogeneous data.
|
||||
The diversity of sources—crowdsourced audio, synthetic waveforms, web-scraped content—creates specific challenges at the boundary where external data enters our controlled pipeline. Each source arrives in a different format, at a different cadence, with different quality guarantees, and the infrastructure that receives, validates, and routes this heterogeneous data must reconcile all of them.
|
||||
|
||||
## Data Pipeline Architecture {#sec-data-engineering-data-pipeline-architecture-b527}
|
||||
|
||||
@@ -1840,25 +1840,25 @@ import math
|
||||
from mlsys.constants import KS_TEST_COEFFICIENT
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class KSTest:
|
||||
"""
|
||||
Namespace for K-S Test Critical Value calculation.
|
||||
Scenario: Detecting drift with n=1000 samples at alpha=0.05.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
n = 1000
|
||||
coeff = KS_TEST_COEFFICIENT # 1.36 for alpha=0.05
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
d_crit = coeff / math.sqrt(n)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(d_crit > 0, "Critical value must be positive.")
|
||||
check(d_crit <= 0.1, f"Critical value ({d_crit:.3f}) is too loose for n=1000. Check formula.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
ks_dcrit_str = fmt(d_crit, precision=3, commas=False)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -2025,9 +2025,9 @@ Moving beyond ad-hoc error handling, cascade failure prevention requires circuit
|
||||
|
||||
[^fn-circuit-breaker-ml]: **Circuit Breaker**: Named for its three-state behavior -- closed (normal flow), open (faults blocked), half-open (recovery probe) -- after the electrical safety device that interrupts current on overload. In ML data pipelines, the circuit breaker prevents a failing feature computation service from cascading timeouts through the entire serving path: once failure count exceeds a threshold, the breaker opens and the pipeline falls back to cached or default features rather than waiting on a dead service. \index{Circuit Breaker!cascade prevention}
|
||||
|
||||
Automated recovery engineering extends beyond simple retry logic. Progressive timeout increases prevent overwhelming struggling services while maintaining rapid recovery for transient issues: initial requests timeout after 1 second, but after detecting service degradation, timeouts extend to 5 seconds, then 30 seconds, giving the service time to stabilize. Multi-tier fallback systems provide degraded service when primary data sources fail: serving slightly stale cached features when real-time computation fails, or using approximate features when exact computation times out. A recommendation system unable to compute user preferences from the past 30 days might fall back to preferences from the past 90 days, providing somewhat less accurate but still useful recommendations rather than failing entirely. Comprehensive alerting and escalation procedures ensure human intervention occurs when automated recovery fails, with sufficient diagnostic information captured during the failure to enable rapid debugging.
|
||||
Automated recovery engineering extends beyond simple retry logic. Progressive timeout increases prevent overwhelming struggling services while maintaining rapid recovery for transient issues: initial requests time out after 1 second, but after detecting service degradation, timeouts extend to 5 seconds, then 30 seconds, giving the service time to stabilize. Multi-tier fallback systems provide degraded service when primary data sources fail: serving slightly stale cached features when real-time computation fails, or using approximate features when exact computation times out. A recommendation system unable to compute user preferences from the past 30 days might fall back to preferences from the past 90 days, providing less precise but still useful recommendations rather than failing entirely. Comprehensive alerting and escalation procedures ensure human intervention occurs when automated recovery fails, with sufficient diagnostic information captured during the failure to enable rapid debugging.
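A minimal circuit-breaker sketch, assuming illustrative thresholds rather than a production policy, shows how these pieces fit together: consecutive failures open the breaker, the caller falls back to cached features, and a timed half-open probe tests recovery.

```python
# Minimal circuit breaker for a feature-computation call (standard library only).
# Thresholds are illustrative: after max_failures consecutive errors the breaker
# opens and the caller falls back to cached features; after reset_after_s it
# half-opens and allows one probe request through.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()          # open: skip the failing service entirely
            self.opened_at = None          # half-open: allow one probe request
        try:
            result = fn()
            self.failures = 0              # closed: normal flow resumes
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def compute_fresh_features():
    raise TimeoutError("feature service unavailable")   # simulated outage

def cached_features():
    return {"user_pref_90d": 0.42}                       # stale-but-usable fallback

breaker = CircuitBreaker()
for _ in range(5):
    print(breaker.call(compute_fresh_features, cached_features))
```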
|
||||
|
||||
These patterns—retry logic, dead letter queues, circuit breakers—are the runtime error handlers of our dataset compiler: they catch malformed inputs without halting the entire compilation. With these defensive patterns established, we can now examine the specific ingestion mechanisms that feed data into the pipeline. The choice of ingestion pattern—batch versus streaming, ETL versus ELT—determines how quickly new data reaches the model, how much infrastructure the system requires, and how the reliability patterns above are concretely deployed.
|
||||
These patterns—retry logic, dead letter queues, circuit breakers—are the runtime error handlers of our dataset compiler: they catch malformed inputs without halting the entire compilation. The next question is how data enters the pipeline in the first place. The choice of ingestion pattern—batch versus streaming, ETL versus ELT—determines how quickly new data reaches the model, how much infrastructure the system requires, and how these reliability patterns are concretely deployed.
|
||||
|
||||
### Data Ingestion {#sec-data-engineering-data-ingestion-8efc}
|
||||
|
||||
@@ -2115,18 +2115,18 @@ Production systems must balance cost versus latency trade-offs when selecting pa
|
||||
from mlsys.constants import KB, GB, TB
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class RealtimeCost:
|
||||
"""Streaming vs. batch ingestion daily cost comparison."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
events_per_sec = 1_000_000
|
||||
event_size_kb = 1
|
||||
stream_cores = 100
|
||||
stream_hours = 24
|
||||
stream_cost_per_hr = 0.05
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
throughput_gbs = (events_per_sec * event_size_kb * KB).m_as(GB) # GB/s
|
||||
stream_cost_day = stream_cores * stream_hours * stream_cost_per_hr
|
||||
hourly_data_tb = (throughput_gbs * SEC_PER_HOUR * GB).m_as(TB) # TB per hour
|
||||
@@ -2134,7 +2134,7 @@ class RealtimeCost:
|
||||
batch_cost_day = batch_core_hours * stream_cost_per_hr
|
||||
cost_ratio = stream_cost_day / batch_cost_day
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
throughput_str = fmt(throughput_gbs, precision=0, commas=False)
|
||||
hourly_tb_str = fmt(hourly_data_tb, precision=1, commas=False)
|
||||
stream_cost_str = fmt(stream_cost_day, precision=0, commas=False)
|
||||
@@ -2433,11 +2433,11 @@ storage_savings_mo = elt_storage_mo - etl_storage_mo
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EtlEltCost:
|
||||
"""ETL vs. ELT monthly cost comparison for a 10 TB/day data warehouse."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
daily_raw_tb = 10
|
||||
s3_per_tb_mo = 23
|
||||
spark_per_tb = 5
|
||||
@@ -2447,7 +2447,7 @@ class EtlEltCost:
|
||||
etl_datasets = 3
|
||||
etl_tb_each = 2
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
etl_spark_daily = daily_raw_tb * spark_per_tb
|
||||
etl_storage_tb = etl_datasets * etl_tb_each
|
||||
etl_storage_mo = etl_storage_tb * s3_per_tb_mo
|
||||
@@ -2458,7 +2458,7 @@ class EtlEltCost:
|
||||
|
||||
storage_savings_mo = elt_storage_mo - etl_storage_mo
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
etl_spark_str = f"{etl_spark_daily}"
|
||||
etl_storage_str = f"{etl_storage_mo}"
|
||||
elt_storage_tb_str = f"{elt_storage_tb}"
|
||||
@@ -2690,7 +2690,7 @@ Code version ties processing results to the exact code that produced them. When
|
||||
|
||||
The governance pillar in the KWS pipeline tracks audio processing parameters that critically affect model behavior. When audio is normalized to standard volume, the reference volume level is persisted. When FFT transforms audio to frequency domain, the window size, hop length, and window function (Hamming, Hanning, etc.) are recorded. When MFCCs are computed, the number of coefficients, frequency range, and mel filterbank parameters are captured. This comprehensive parameter tracking enables several critical capabilities: reproducing training data exactly when debugging model failures, validating that serving uses identical preprocessing to training, and systematically studying how preprocessing choices affect model accuracy. Without this governance infrastructure, teams resort to manual documentation that inevitably becomes outdated or incorrect, leading to subtle training-serving skew that degrades production performance.
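A minimal sketch of that governance record, with illustrative parameter values rather than the KWS pipeline's actual configuration, shows how the preprocessing manifest and its fingerprint might be persisted and compared between the training and serving paths.

```python
# Capture the audio preprocessing configuration as a manifest whose hash travels
# with the dataset version. Parameter values below are illustrative assumptions.
import hashlib, json

preprocess_manifest = {
    "sample_rate_hz": 16000,
    "normalization_ref_dbfs": -3.0,
    "fft_window": "hamming",
    "fft_window_size": 512,
    "hop_length": 160,
    "n_mfcc": 13,
    "mel_fmin_hz": 20,
    "mel_fmax_hz": 8000,
}

def manifest_hash(manifest: dict) -> str:
    """Stable fingerprint of the preprocessing configuration (key order normalized)."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

training_hash = manifest_hash(preprocess_manifest)
serving_hash = manifest_hash(preprocess_manifest)   # recomputed in the serving path
assert training_hash == serving_hash, "training-serving preprocessing skew detected"
print(f"preprocessing config fingerprint: {training_hash}")
```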
|
||||
|
||||
With raw inputs cleaned, normalized, and transformed into usable features, the remaining question is how we assign meaning to those features. Labels provide that meaning, and they introduce human judgment into the pipeline.
|
||||
Clean, normalized, feature-ready data is still inert without meaning. The remaining question is how we assign that meaning: labels declare which audio clips contain the wake word and which are background noise, and they introduce human judgment into what has been an automated pipeline.
|
||||
|
||||
## Data Labeling {#sec-data-engineering-data-labeling-6836}
|
||||
|
||||
@@ -2876,7 +2876,7 @@ Building on these precise timing markers, the extraction system generates clean
|
||||
|
||||
Modern voice assistant developers often build on this automated labeling foundation. While automated corpora may not contain the specific wake words a product requires, they provide starting points for KWS prototyping, particularly in underserved languages where commercial datasets do not exist. Production systems typically layer targeted human recording and verification for challenging cases (unusual accents, rare words, or difficult acoustic environments), coordinating between automated processing and human expertise.
|
||||
|
||||
With data acquired, ingested, processed, and labeled, the pipeline has produced its compilation artifacts: millions of feature vectors paired with ground truth labels. The question now shifts from *what* data we have to *where* it lives and *how fast* it reaches the accelerators. Storage architecture determines whether expensive GPUs spend their time computing or waiting.
|
||||
The pipeline has now produced its compilation artifacts: millions of feature vectors paired with ground truth labels. The question shifts from *what* data we have to *where* it lives and *how fast* it reaches the accelerators. Storage architecture determines whether expensive GPUs spend their time computing or waiting.
|
||||
|
||||
## Storage Architecture {#sec-data-engineering-strategic-storage-architecture-1a6b}
|
||||
|
||||
@@ -2968,21 +2968,21 @@ Beyond the functional differences between storage systems, cost and performance
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class StorageLoading:
|
||||
"""KWS dataset load time comparison: NVMe vs. object storage."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
kws_dataset_gb = 736
|
||||
nvme_bw_gbs = 5 # effective NVMe throughput
|
||||
obj_bw_gbs = 0.1 # typical object storage throughput
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
nvme_load_s = kws_dataset_gb / nvme_bw_gbs
|
||||
obj_load_s = kws_dataset_gb / obj_bw_gbs
|
||||
load_speedup = obj_load_s / nvme_load_s
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
nvme_load_str = fmt(nvme_load_s, precision=0, commas=False)
|
||||
obj_load_str = fmt(obj_load_s, precision=0, commas=True)
|
||||
load_speedup_str = fmt(load_speedup, precision=0, commas=False)
|
||||
@@ -3050,18 +3050,18 @@ Understanding these quantitative relationships enables informed architectural de
|
||||
from mlsys.constants import KB, GB, flop, GFLOPs, TFLOPs
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class StorageBandwidth:
|
||||
"""Storage bandwidth budget for saturating an A100 training ResNet-50."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
m_resnet = Models.Vision.ResNet50
|
||||
image_size_kb = 150
|
||||
sata_bw_mbs = 500
|
||||
s3_bw_mbs = 100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
a100_flops_val = h_a100.peak_flops.m_as(flop/second)
|
||||
resnet_flops_per_img = (m_resnet.inference_flops * 3).m_as(flop) # forward + backward
|
||||
max_img_per_sec = int(a100_flops_val / resnet_flops_per_img)
|
||||
@@ -3070,7 +3070,7 @@ class StorageBandwidth:
|
||||
sata_utilization_pct = int(round(sata_max_img_sec / max_img_per_sec * 100))
|
||||
s3_workers = int(round(required_bw_gbs * 1e3 / s3_bw_mbs))
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
max_img_per_sec_str = f"{max_img_per_sec:,}"
|
||||
required_bw_gbs_str = fmt(required_bw_gbs, precision=1, commas=False)
|
||||
a100_flops_t_str = str(int(h_a100.peak_flops.m_as(TFLOPs/second)))
|
||||
@@ -3155,21 +3155,21 @@ File format selection dramatically impacts the **Data Term** ($\frac{D_{vol}}{BW
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FormatEfficiency:
|
||||
"""CSV vs. Parquet I/O efficiency for a 20-of-100-column fraud model."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
needed_cols = 20
|
||||
total_cols = 100
|
||||
eta_parquet = 1.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
eta_csv = needed_cols / total_cols
|
||||
waste_pct = (1 - eta_csv) * 100
|
||||
throughput_ratio = eta_parquet / eta_csv
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
eta_csv_str = fmt(eta_csv, precision=1, commas=False)
|
||||
eta_parquet_str = fmt(eta_parquet, precision=1, commas=False)
|
||||
waste_pct_str = fmt(waste_pct, precision=0, commas=False)
|
||||
@@ -3226,17 +3226,17 @@ Columnar storage formats\index{Parquet!columnar format}\index{ORC!columnar forma
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CompressionTradeoff:
|
||||
"""Snappy vs. Gzip decompression time over 50 training epochs."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
snappy_decompress_mbs = 500 # MB/s
|
||||
gzip_decompress_mbs = 120 # MB/s
|
||||
dataset_gb = 100
|
||||
n_epochs = 50
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
decompress_speedup = snappy_decompress_mbs / gzip_decompress_mbs
|
||||
dataset_mb = dataset_gb * 1000
|
||||
gzip_decompress_min = dataset_mb / gzip_decompress_mbs / 60
|
||||
@@ -3244,7 +3244,7 @@ class CompressionTradeoff:
|
||||
time_diff_min = gzip_decompress_min - snappy_decompress_min
|
||||
total_diff_hours = time_diff_min * n_epochs / 60
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
decompress_speedup_str = fmt(decompress_speedup, precision=0, commas=False)
|
||||
gzip_min_str = fmt(gzip_decompress_min, precision=0, commas=False)
|
||||
snappy_min_str = fmt(snappy_decompress_min, precision=0, commas=False)
|
||||
@@ -3273,7 +3273,7 @@ Storage performance optimization extends beyond format and compression to data l
|
||||
|
||||
### Storage Across the ML Lifecycle {#sec-data-engineering-storage-across-ml-lifecycle-2b22}
|
||||
|
||||
\index{ML Lifecycle!storage requirements}Storage requirements evolve substantially as ML systems progress from development through deployment. The same dataset is accessed very differently during exploratory analysis (random sampling for visualization), model training (sequential scanning for epochs), and production serving (random access for individual predictions)—requiring storage architectures that accommodate these diverse patterns.
|
||||
\index{ML Lifecycle!storage requirements}Storage requirements evolve substantially as ML systems progress from development through deployment. The same dataset is accessed through fundamentally different patterns during exploratory analysis (random sampling for visualization), model training (sequential scanning for epochs), and production serving (random access for individual predictions)—requiring storage architectures that accommodate these diverse patterns.
|
||||
|
||||
During development, flexibility matters more than raw performance. The key challenge is managing dataset versions without overwhelming storage capacity: 10 experiments on a 100 GB dataset would naively require 1 TB of copies. Tools like DVC address this by tracking versions through pointers and storing only deltas. Governance considerations demand tiered access controls where synthetic or anonymized datasets are broadly available for experimentation, while production data containing sensitive information requires approval and audit trails.
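As a conceptual sketch of the pointer-and-delta idea (in the spirit of DVC, not its actual implementation), the snippet below stores each file once under its content hash and represents a dataset version as a manifest of pointers.

```python
# Content-addressed storage with versions as pointer manifests: ten experiments
# that reuse most files do not require ten full copies. Conceptual sketch only.
import hashlib

object_store: dict[str, bytes] = {}            # content-addressed blob store

def put(content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()
    object_store.setdefault(digest, content)   # deduplicated: stored at most once
    return digest

v1 = {f"clip_{i}.wav": put(f"audio-{i}".encode()) for i in range(1000)}
v2 = dict(v1)
v2["clip_7.wav"] = put(b"audio-7-relabeled")   # version 2 changes a single file

print(f"2 versions x 1000 files, but only {len(object_store)} blobs stored")  # 1001
```

Two 1,000-file versions that differ in a single clip require 1,001 stored blobs rather than 2,000, which is the storage behavior the versioning tools above provide at scale.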
|
||||
|
||||
@@ -3345,7 +3345,7 @@ Time-travel capabilities\index{Time Travel!feature stores}\index{Point-in-Time C
|
||||
|
||||
Feature store performance characteristics directly impact both training throughput and serving latency. The offline store must support high-throughput batch reads (millions of feature vectors per minute) using columnar formats that enable efficient reads of specific features from wide tables. The online store must support thousands to millions of reads per second with single-digit millisecond latency. In production, feature freshness adds further pressure: when users add items to shopping carts, recommendation systems need updated features within seconds, not hours. Streaming feature computation\index{Feature Store!streaming updates} pipelines address this by updating online stores continuously rather than through periodic batch jobs, though streaming introduces complexity around exactly-once processing semantics\index{Exactly-Once Semantics!streaming} and handling late-arriving events.
|
||||
|
||||
With the pipeline fully assembled—acquisition, ingestion, processing, labeling, and storage—a tempting conclusion is that data engineering work is "done." But production systems do not stand still. User behavior drifts, upstream schemas evolve, labeling guidelines change, and the careful engineering described in this chapter gradually erodes unless actively maintained. The final section addresses the ongoing health of these systems after deployment.
|
||||
A fully assembled pipeline—acquisition, ingestion, processing, labeling, and storage—might suggest that data engineering work is "done." Production systems, however, do not stand still. User behavior drifts, upstream schemas evolve, labeling guidelines change, and the careful engineering described above gradually erodes unless actively maintained.
|
||||
|
||||
## Operational Data Health {#sec-data-engineering-data-debt-hidden-liability-3335}
|
||||
|
||||
@@ -3732,3 +3732,10 @@ The dataset compiler has produced its output: a clean, versioned, optimized trai
|
||||
```{=latex}
|
||||
\part{key:vol1_build}
|
||||
```
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:data_engineering")
|
||||
```
|
||||
|
||||
@@ -83,7 +83,7 @@ For decades, the dominant strategy was straightforward: more data, better models
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class SelectionEconomicsAnchor:
|
||||
"""
|
||||
Namespace for coreset selection overhead anchor.
|
||||
@@ -99,14 +99,14 @@ class SelectionEconomicsAnchor:
|
||||
coreset_scoring_time_str = SelectionEconomicsAnchor.scoring_time_str
|
||||
coreset_pct_str = SelectionEconomicsAnchor.coreset_pct_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ScalingAsymmetry:
|
||||
"""
|
||||
Namespace for Scaling Asymmetry Table.
|
||||
Scenario: Comparing growth rates of Compute vs Data.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Hardware: 10x every 3 years (approx 2.15x/year)
|
||||
gpu_growth_factor = 10.0
|
||||
gpu_period_years = 3.0
|
||||
@@ -119,7 +119,7 @@ class ScalingAsymmetry:
|
||||
label_growth_factor = 1.5
|
||||
label_period_years = 5.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Annualized growth rates: Rate = Factor^(1/Period)
|
||||
gpu_annual = gpu_growth_factor ** (1.0 / gpu_period_years)
|
||||
web_annual = web_growth_factor ** (1.0 / web_period_years)
|
||||
@@ -127,10 +127,10 @@ class ScalingAsymmetry:
|
||||
# Divergence
|
||||
gap_ratio = gpu_annual / web_annual
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(gap_ratio >= 1.5, f"GPU growth ({gpu_annual:.2f}x/yr) isn't fast enough vs Data ({web_annual:.2f}x/yr). Gap: {gap_ratio:.2f}x")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpu_growth_str = fmt(gpu_growth_factor, precision=0, commas=False) + "×"
|
||||
gpu_period_str = f"{int(gpu_period_years)} years"
|
||||
|
||||
@@ -223,14 +223,14 @@ from mlsys.constants import Bparam, BILLION, TRILLION, SEC_PER_HOUR, MILLION, TH
|
||||
from mlsys import Models
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ComputeDataGap:
|
||||
"""
|
||||
Namespace for Compute-Data Gap calculation.
|
||||
Scenario: 10k H100s vs Available Quality Tokens.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
h100_count = 10000
|
||||
months = 3
|
||||
model = Models.Language.Llama2_70B
|
||||
@@ -238,13 +238,13 @@ class ComputeDataGap:
|
||||
tokens_available = 5e12 # 5T tokens (RedPajama/RefinedWeb scale)
|
||||
tokens_capacity = 10e12 # Capacity of the cluster
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
gap_ratio = tokens_capacity / tokens_available
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(gap_ratio >= 1.0, f"Compute ({tokens_capacity:.1e}) is less than Data ({tokens_available:.1e}). No Data Wall.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
llama_params_str = fmt(model.parameters.m_as(Bparam), precision=0, commas=False) + "B"
|
||||
h100_count_str = fmt(h100_count, precision=0, commas=True)
|
||||
tokens_capacity_str = fmt(tokens_capacity / TRILLION, precision=0, commas=False) + "T"
|
||||
@@ -301,14 +301,14 @@ The systems framing reveals optimization opportunities invisible to the ML frami
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class IronLawSavings:
|
||||
"""
|
||||
Namespace for Iron Law Multiplicative Savings.
|
||||
Scenario: 2x Data Selection * 2x Compression * 2x Hardware = 8x Total.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
budget_m = 100 # $100M training run
|
||||
|
||||
# Optimization factors
|
||||
@@ -319,18 +319,18 @@ class IronLawSavings:
|
||||
# Derived
|
||||
data_pruning_pct = (1 - (1/factor_data)) * 100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Multiplicative effect
|
||||
total_speedup = factor_data * factor_model * factor_hw
|
||||
|
||||
# Savings
|
||||
compute_savings_m = budget_m * (data_pruning_pct / 100.0)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
additive_sum = factor_data + factor_model + factor_hw
|
||||
check(total_speedup > additive_sum, f"Multiplicative speedup ({total_speedup}x) should exceed additive sum ({additive_sum}).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
training_cost_m_str = fmt(budget_m, precision=0, commas=False)
|
||||
dataset_reduction_pct_str = fmt(data_pruning_pct, precision=0, commas=False)
|
||||
compute_savings_m_str = fmt(compute_savings_m, precision=0, commas=False)
|
||||
@@ -372,7 +372,7 @@ The systems framing established above calls for a quantitative metric. The Optim
|
||||
\index{Roofline Model!data selection interaction}
|
||||
In the optimization triad (@fig-optimization-triad), data selection plays the role of *Input Optimization*, reducing total workload before it enters the model or hardware. Model compression minimizes the math per parameter; hardware acceleration maximizes the math per second; data selection minimizes the total math required to reach convergence. The three edges of the triad capture the dominant bottlenecks: *Compute Bound* describes systems limited by arithmetic throughput, *I/O Bound* describes systems limited by data movement, and *Sample Efficiency* describes systems limited by the information content of training data.
|
||||
|
||||
::: {#fig-optimization-triad fig-env="figure" fig-pos="htb" fig-cap="**The Optimization Triad**: Machine learning performance relies on three pillars: Algorithms (models), Machine (hardware/software), and Data Selection. While algorithms and machines have traditionally received the most attention, optimizing data selection (Input Optimization) offers a third, powerful lever for scaling performance." fig-alt="A triangular diagram with three nodes: Algorithms (Model), Machine (Hardware), and Data Selection. Bidirectional arrows connect all three with edge labels: Compute Bound between Algorithms and Machine, I/O Bound between Machine and Data Selection, and Sample Efficiency between Data Selection and Algorithms. Data Selection is highlighted with a bold border. ML Performance appears at the center."}
|
||||
::: {#fig-optimization-triad fig-env="figure" fig-pos="htb" fig-cap="**The Optimization Triad**: Machine learning performance relies on three pillars: Algorithms (models), Machine (hardware/software), and Data Selection. While algorithms and machines have traditionally received the most attention, optimizing data selection (Input Optimization) offers a third, independent lever for scaling performance." fig-alt="A triangular diagram with three nodes: Algorithms (Model), Machine (Hardware), and Data Selection. Bidirectional arrows connect all three with edge labels: Compute Bound between Algorithms and Machine, I/O Bound between Machine and Data Selection, and Sample Efficiency between Data Selection and Algorithms. Data Selection is highlighted with a bold border. ML Performance appears at the center."}
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| out-width: "70%"
|
||||
@@ -547,13 +547,13 @@ from mlsys.formatting import fmt, check
|
||||
class IcrCoresetComparison:
|
||||
"""Compare learning-per-FLOP for random sampling vs. coreset selection."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
imagenet_size_value = IMAGENET_IMAGES.m_as('count')
|
||||
acc_gain_random_value = 5.0 # % accuracy per epoch
|
||||
acc_gain_coreset_value = 4.5 # % with 50% coreset
|
||||
coreset_fraction_value = 0.5 # keep 50% of data
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
m_resnet = Models.ResNet50
|
||||
resnet50_fwd_gflops_value = m_resnet.inference_flops.m_as(GFLOPs)
|
||||
resnet50_fwdbwd_gflops_value = (m_resnet.inference_flops * 2).m_as(GFLOPs)
|
||||
@@ -567,11 +567,11 @@ class IcrCoresetComparison:
|
||||
icr_ratio_value = icr_coreset_value / icr_random_value
|
||||
acc_diff_value = acc_gain_random_value - acc_gain_coreset_value
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(icr_ratio_value > 1.0, f"Coreset ICR ({icr_ratio_value:.2f}) should exceed random ICR.")
|
||||
check(coreset_fraction_value < 1.0, "Coreset fraction must be less than 1.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
resnet50_fwd_gflops_str = fmt(m_resnet.inference_flops.to(GFLOPs), precision=1)
|
||||
resnet50_fwdbwd_gflops_str = fmt((m_resnet.inference_flops * 2).to(GFLOPs), precision=1)
|
||||
full_epoch_flops_str = f"{full_epoch_flops_value:.2e}"
|
||||
@@ -671,17 +671,17 @@ Why does this heterogeneity exist? The answer lies in how neural networks learn
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class QualityMultiplier:
|
||||
"""
|
||||
Namespace for Data Quality Multiplier.
|
||||
Scenario: Comparing sample complexity for Clean (1/N) vs Noisy (1/sqrt(N)) data.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
epsilon = 0.01 # 1% Target Error
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Clean: Error ~ 1/N => N ~ 1/Error
|
||||
n_clean = 1.0 / epsilon
|
||||
|
||||
@@ -690,10 +690,10 @@ class QualityMultiplier:
|
||||
|
||||
ratio = n_noisy / n_clean
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(ratio >= 50, f"Noisy penalty ({ratio:.1f}x) is too small to justify cleaning investment.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
epsilon_str = fmt(epsilon, precision=2, commas=False)
|
||||
epsilon_pct_str = fmt(epsilon * 100, precision=0, commas=False)
|
||||
n_clean_str = fmt(n_clean, precision=0, commas=False)
|
||||
@@ -878,19 +878,19 @@ from mlsys.formatting import fmt, check
|
||||
class CoresetPractice:
|
||||
"""Practical 10× coreset workflow: 5-epoch proxy selects 100K from 1M images."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_train_images_value = 1_000_000 # 1M training images
|
||||
coreset_fraction_value = 0.1 # keep 10%
|
||||
n_epochs_proxy_value = 5 # proxy training epochs
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
n_coreset_value = int(n_train_images_value * coreset_fraction_value)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(n_coreset_value > 0, "Coreset must be non-empty.")
|
||||
check(coreset_fraction_value < 1.0, "Coreset fraction must be less than 1.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_train_images_str = fmt(n_train_images_value / MILLION, precision=0) + " million"
|
||||
coreset_fraction_pct_str = fmt(coreset_fraction_value * 100, precision=0, commas=False)
|
||||
n_coreset_str = fmt(n_coreset_value, precision=0, commas=True)
|
||||
@@ -976,9 +976,9 @@ Near-duplicate detection addresses the more subtle problem of semantically redun
|
||||
|
||||
\index{Perceptual Hashing!image deduplication}
|
||||
\index{Embedding-based Deduplication!semantic similarity}
|
||||
For images, perceptual hashing produces signatures robust to minor transformations like resizing and compression, identifying visually identical images stored in different formats. Embedding-based similarity offers the most powerful detection by computing dense representations (CLIP[^fn-clip-dedup] for images, sentence transformers for text) and clustering similar items, though this approach incurs higher computational overhead.
|
||||
For images, perceptual hashing produces signatures robust to minor transformations like resizing and compression, identifying visually identical images stored in different formats. Embedding-based similarity offers the highest-fidelity detection by computing dense representations (CLIP[^fn-clip-dedup] for images, sentence transformers for text) and clustering similar items, though this approach incurs higher computational overhead.
|
||||
|
||||
[^fn-clip-dedup]: **CLIP (Contrastive Language-Image Pre-training)**: Pre-trained on 400 million image-text pairs, CLIP maps visually distinct but semantically similar images to a shared embedding space, enabling powerful semantic deduplication. This power comes at a cost: generating an embedding requires a full forward pass through a large vision transformer, making it over 100$\times$ more computationally expensive per sample than perceptual hashing. \index{CLIP!semantic deduplication}
|
||||
[^fn-clip-dedup]: **CLIP (Contrastive Language-Image Pre-training)**: Pre-trained on 400 million image-text pairs, CLIP maps visually distinct but semantically similar images to a shared embedding space, enabling semantic deduplication across visual concepts. This capability comes at a cost: generating an embedding requires a full forward pass through a large vision transformer, making it over 100$\times$ more computationally expensive per sample than perceptual hashing. \index{CLIP!semantic deduplication}
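As a toy illustration of embedding-based near-duplicate detection (the random embeddings, the planted duplicate, and the 0.95 cosine threshold are all assumptions made for the sketch):

```python
import numpy as np

# Toy sketch: flag pairs whose embedding cosine similarity exceeds a threshold.
# Random vectors stand in for CLIP or sentence-transformer embeddings; 0.95 is an
# assumed threshold that would be tuned per corpus in practice.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
emb[3] = emb[0] + 0.01 * rng.normal(size=8)             # plant a near-duplicate of item 0

emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows
sim = emb @ emb.T                                        # pairwise cosine similarity

threshold = 0.95
near_dupes = [(i, j) for i in range(len(emb)) for j in range(i + 1, len(emb))
              if sim[i, j] > threshold]
print(near_dupes)                                        # expect [(0, 3)]
```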
|
||||
|
||||
For foundation model pre-training, deduplication has become essential rather than optional. Studies on GPT-3 and LLaMA training demonstrate that deduplicated data improves both training efficiency and downstream performance by preventing memorization of repeated content. The benefit is twofold: fewer wasted FLOPs on redundant samples, and better generalization because the model sees more diverse examples per training token.
|
||||
|
||||
@@ -1064,7 +1064,7 @@ from mlsys.formatting import fmt, check
|
||||
class CurriculumBenchmarks:
|
||||
"""Curriculum learning convergence speedups across CIFAR-10, CIFAR-100, ImageNet, MentorNet."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cifar10_baseline_epochs = 150
|
||||
cifar10_curriculum_epochs = 115
|
||||
|
||||
@@ -1077,18 +1077,18 @@ class CurriculumBenchmarks:
|
||||
mentornet_baseline_epochs = 90
|
||||
mentornet_curriculum_epochs = 70
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cifar10_speedup_pct = (cifar10_baseline_epochs - cifar10_curriculum_epochs) / cifar10_baseline_epochs * 100
|
||||
cifar100_speedup_pct = (cifar100_baseline_epochs - cifar100_curriculum_epochs) / cifar100_baseline_epochs * 100
|
||||
imagenet_speedup_pct = (imagenet_baseline_epochs - imagenet_curriculum_epochs) / imagenet_baseline_epochs * 100
|
||||
mentornet_speedup_pct = (mentornet_baseline_epochs - mentornet_curriculum_epochs) / mentornet_baseline_epochs * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(cifar10_speedup_pct > imagenet_speedup_pct, "CIFAR-10 (more redundant) should show larger speedup than ImageNet.")
|
||||
check(all(p > 0 for p in [cifar10_speedup_pct, cifar100_speedup_pct, imagenet_speedup_pct, mentornet_speedup_pct]),
|
||||
"All speedups must be positive.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cifar10_speedup_str = fmt(cifar10_speedup_pct, precision=0, commas=False)
|
||||
cifar100_speedup_str = fmt(cifar100_speedup_pct, precision=0, commas=False)
|
||||
imagenet_speedup_str = fmt(imagenet_speedup_pct, precision=0, commas=False)
|
||||
@@ -1217,14 +1217,14 @@ from mlsys.formatting import fmt, check
|
||||
class ActiveLearningRoi:
|
||||
"""Medical imaging active learning: 20× speedup, $4.75M savings vs. naive labeling."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_unlabeled_value = 1_000_000 # scans in pool
|
||||
cost_per_label_value = 5.00 # $/label (specialist)
|
||||
budget_value = 500_000 # $ available
|
||||
deadline_months_value = 1 # time constraint
|
||||
n_active_value = 50_000 # samples needed with AL
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_all_value = n_unlabeled_value * cost_per_label_value
|
||||
n_random_value = int(budget_value / cost_per_label_value)
|
||||
n_random_pct_value = n_random_value / n_unlabeled_value * 100
|
||||
@@ -1234,11 +1234,11 @@ class ActiveLearningRoi:
|
||||
speedup_value = n_unlabeled_value / n_active_value
|
||||
cost_saving_value = cost_all_value - cost_active_value
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(cost_active_value < budget_value, "Active learning cost must be within budget.")
|
||||
check(speedup_value > 1.0, "Active learning must require fewer labels than naive labeling.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_unlabeled_str = fmt(n_unlabeled_value / MILLION, precision=0) + " Million"
|
||||
cost_per_label_str = fmt(cost_per_label_value, precision=2, commas=False)
|
||||
budget_str = fmt(budget_value, precision=0, commas=True)
|
||||
@@ -1388,7 +1388,7 @@ from mlsys.formatting import fmt, check
|
||||
class FixmatchLabelEfficiency:
|
||||
"""FixMatch CIFAR-10: 200× label reduction for ~8× total cost savings."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cifar10_full_labels = 50000
|
||||
cifar10_full_acc = 96.1
|
||||
|
||||
@@ -1410,7 +1410,7 @@ class FixmatchLabelEfficiency:
|
||||
fixmatch_labels = 250
|
||||
fixmatch_compute_cost = 250 # 5x more training
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cifar10_fixmatch_4k_eff = cifar10_full_labels / cifar10_fixmatch_4k_labels
|
||||
cifar10_fixmatch_250_eff = cifar10_full_labels / cifar10_fixmatch_250_labels
|
||||
cifar10_fixmatch_40_eff = cifar10_full_labels / cifar10_fixmatch_40_labels
|
||||
@@ -1425,11 +1425,11 @@ class FixmatchLabelEfficiency:
|
||||
cost_reduction = supervised_total / fixmatch_total
|
||||
acc_loss = cifar10_full_acc - cifar10_fixmatch_250_acc
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(cost_reduction > 1.0, "FixMatch must be cheaper than supervised baseline.")
|
||||
check(cifar10_fixmatch_250_eff > 1.0, "FixMatch must require fewer labels than full supervision.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
supervised_label_cost_str = fmt(supervised_label_cost, precision=0, commas=True)
|
||||
supervised_total_str = fmt(supervised_total, precision=0, commas=True)
|
||||
fixmatch_label_cost_str = fmt(fixmatch_label_cost, precision=0, commas=True)
|
||||
@@ -1496,7 +1496,7 @@ Despite these limitations, semi-supervised learning reduces label requirements b
|
||||
\index{Self-supervised Learning!definition}
|
||||
\index{Masked Modeling!self-supervised breakthrough}
|
||||
\index{Self-supervised Learning!etymology}
|
||||
GPT was trained to predict the next word in a sentence. BERT was trained to fill in masked words. Neither task required a single human label. **Self-supervised learning**[^fn-self-supervised-paradigm] generalizes this insight: by designing *pretext tasks* that derive supervision from the data's inherent structure, models learn powerful representations from unlabeled data at scale. Where the progression from active learning to semi-supervised learning drove required labels asymptotically toward zero, SSL breaks through that asymptote entirely. It represents the field's most powerful response to the Data Wall introduced in @sec-data-selection-data-selection-fundamentals-e839: rather than searching for more high-quality labeled data in a finite pool, SSL redefines what counts as training data by extracting supervision from the structure of unlabeled corpora that exist at web scale.
|
||||
GPT was trained to predict the next word in a sentence. BERT was trained to fill in masked words. Neither task required a single human label. **Self-supervised learning**[^fn-self-supervised-paradigm] generalizes this insight: by designing *pretext tasks* that derive supervision from the data's inherent structure, models learn general-purpose representations from unlabeled data at scale. Where the progression from active learning to semi-supervised learning drove required labels asymptotically toward zero, SSL breaks through that asymptote entirely. It represents the field's most effective response to the Data Wall introduced in @sec-data-selection-data-selection-fundamentals-e839: rather than searching for more high-quality labeled data in a finite pool, SSL redefines what counts as training data by extracting supervision from the structure of unlabeled corpora that exist at web scale.
|
||||
|
||||
[^fn-self-supervised-paradigm]: **Self-supervised Learning**: The pretext task—predicting the next word (GPT) or a masked word (BERT)—provides a supervisory signal inherent to the unlabeled data itself, removing the human-labeling bottleneck. This reframes the system-building challenge from one of data acquisition to one of pure computational investment, where pre-training can cost over 10,000$\times$ more than a single downstream fine-tuning run. The expense is amortized across thousands of downstream tasks. \index{Self-supervised Learning!Iron Law restructuring}
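A minimal sketch of how a masked-word pretext task manufactures supervision from raw text is shown below; the whitespace tokenizer and the 30% mask rate are simplifying assumptions (BERT masks roughly 15%, but a higher rate makes the toy sentence show several masks).

```python
import random

# Minimal sketch of a masked-word pretext task: the labels come from the text itself.
# Whitespace tokenization and the 30% mask rate are simplifying assumptions.
random.seed(1)
tokens = "models learn general structure from unlabeled text at web scale".split()

inputs, targets = [], []
for tok in tokens:
    if random.random() < 0.30:
        inputs.append("[MASK]")
        targets.append(tok)       # the hidden token becomes the training target
    else:
        inputs.append(tok)
        targets.append(None)      # no loss is computed at unmasked positions

print(inputs)
print(targets)
```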
|
||||
|
||||
@@ -1554,7 +1554,7 @@ from mlsys.formatting import fmt, check
|
||||
class FoundationCostAmortization:
|
||||
"""Foundation model amortization: 10 tasks, 100× label reduction, 20× marginal compute drop."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cost_scratch_per_task_value = 1000 # GPU-hrs per task
|
||||
n_tasks_value = 10 # number of tasks
|
||||
cost_pretrain_value = 10000 # GPU-hrs (one-time)
|
||||
@@ -1563,7 +1563,7 @@ class FoundationCostAmortization:
|
||||
cost_per_label = 1 # $/label
|
||||
labels_per_task_finetune = 1_000 # labels for fine-tuning
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_scratch_total_value = cost_scratch_per_task_value * n_tasks_value
|
||||
cost_foundation_total_value = cost_pretrain_value + (cost_finetune_value * n_tasks_value)
|
||||
|
||||
@@ -1574,12 +1574,12 @@ class FoundationCostAmortization:
|
||||
marginal_compute_reduction = cost_scratch_per_task_value / cost_finetune_value
|
||||
crossover_tasks_value = cost_pretrain_value / (cost_scratch_per_task_value - cost_finetune_value)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(label_cost_reduction > 1.0, "Fine-tuning must require fewer labels than scratch training.")
|
||||
check(marginal_compute_reduction > 1.0, "Fine-tuning marginal compute must be less than scratch training.")
|
||||
check(crossover_tasks_value > 0, "Crossover must be a positive number of tasks.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
total_a_hrs_str = f"{cost_scratch_total_value:,}"
|
||||
total_b_hrs_str = f"{cost_foundation_total_value:,}"
|
||||
labels_per_task_scratch_str = fmt(labels_per_task_scratch, precision=0, commas=True)
|
||||
@@ -1638,17 +1638,17 @@ This explains *why* the fine-tuning paradigm dominates production ML. The pre-tr
|
||||
class FoundationAmortizationData:
|
||||
"""Figure data: scratch vs. foundation model GPU-hours for 10 tasks."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cost_scratch_per_task_value = 1000 # GPU-hrs per task
|
||||
n_tasks_value = 10 # number of tasks
|
||||
cost_pretrain_value = 10000 # GPU-hrs (one-time)
|
||||
cost_finetune_value = 50 # GPU-hrs per task
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_scratch_total_value = cost_scratch_per_task_value * n_tasks_value
|
||||
cost_foundation_total_value = cost_pretrain_value + (cost_finetune_value * n_tasks_value)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
total_a_hrs_str = f"{cost_scratch_total_value:,}"
|
||||
total_b_hrs_str = f"{cost_foundation_total_value:,}"
|
||||
|
||||
@@ -1856,7 +1856,7 @@ In practice, the best results often come from mixing synthetic and real data rat
|
||||
: **Synthetic-to-Real Data Mixing Ratios.** Pure synthetic data suffers from distribution shift; pure real data is expensive. The optimal ratio varies by domain but typically falls in the 50–80% synthetic range when simulation fidelity is high. {#tbl-synthetic-mix .striped .hover}
|
||||
|
||||
\index{Model Collapse!recursive synthetic training}
|
||||
The optimal mix depends on simulation fidelity, domain complexity, and the cost differential between synthetic and real data. When synthetic data comes from ML models rather than simulators, there is a risk of *model collapse*[^fn-model-collapse-mechanism]: training on model-generated data amplifies errors and reduces diversity over generations. This concern is particularly acute for foundation models, where synthetic data from earlier model generations may contaminate future training corpora. With appropriate safeguards, synthetic data generation remains a powerful tool. The following example illustrates how to combine multiple data selection techniques (augmentation, noise injection, and simulation) into a coherent strategy for a real deployment scenario.
|
||||
The optimal mix depends on simulation fidelity, domain complexity, and the cost differential between synthetic and real data. When synthetic data comes from ML models rather than simulators, there is a risk of *model collapse*[^fn-model-collapse-mechanism]: training on model-generated data amplifies errors and reduces diversity over generations. This concern is particularly acute for foundation models, where synthetic data from earlier model generations may contaminate future training corpora. With appropriate safeguards, synthetic data generation remains an effective tool. The following example illustrates how to combine multiple data selection techniques (augmentation, noise injection, and simulation) into a coherent strategy for a real deployment scenario.
|
||||
|
||||
[^fn-model-collapse-mechanism]: **Model Collapse**: Formally analyzed by Shumailov et al. in 2024, this phenomenon occurs because generative models systematically underrepresent tail distributions---rare but important patterns in the training distribution. When generation $n+1$ trains on output from generation $n$, each successive generation further compresses the tails, producing increasingly homogeneous data. The degradation is rapid: original diversity can drop below 50% by generation 5, which is why the Fallacies section of this chapter warns against pure synthetic training. \index{Model Collapse!tail distribution compression}
|
||||
|
||||
@@ -1892,7 +1892,7 @@ The techniques above create new input samples, but there is another form of synt
|
||||
\index{Temperature Parameter!softmax for distillation}
|
||||
The key insight is that the teacher's soft predictions contain more information than hard labels: a teacher predicting [0.7, 0.2, 0.1] for three classes reveals inter-class relationships (class 2 is more similar to class 1 than class 3 is) that a hard label [1, 0, 0] obscures entirely.
|
||||
|
||||
This richer supervision signal enables student models to learn more efficiently from the same data. From a systems perspective, distillation is particularly powerful for creating synthetic labels at scale: run a large model (such as GPT-4) on unlabeled data to generate high-quality annotations, then train a smaller model on these synthetic labels. The smaller model inherits much of the teacher's capability at a fraction of the inference cost, amortizing the expensive teacher computation across many student deployments.
|
||||
This richer supervision signal enables student models to learn more efficiently from the same data. From a systems perspective, distillation is particularly effective for creating synthetic labels at scale: run a large model (such as GPT-4) on unlabeled data to generate high-quality annotations, then train a smaller model on these synthetic labels. The smaller model inherits much of the teacher's capability at a fraction of the inference cost, amortizing the expensive teacher computation across many student deployments.
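The soft-label mechanism can be seen in a few lines; the logits and the temperature value below are illustrative assumptions, not values from any particular distillation recipe.

```python
import numpy as np

# Sketch of temperature-scaled teacher predictions vs. a one-hot label.
def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [4.0, 2.5, 1.0]                           # assumed teacher outputs
print(softmax(teacher_logits, temperature=1.0).round(3))   # [0.786 0.175 0.039]
print(softmax(teacher_logits, temperature=4.0).round(3))   # [0.463 0.318 0.219]: softer, keeps the ranking
print(np.array([1.0, 0.0, 0.0]))                           # hard label: inter-class structure lost
```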
|
||||
|
||||
Together, augmentation, generative synthesis, and distillation complete the third stage of our data selection pipeline. Where static pruning removes redundancy and dynamic selection focuses compute on high-value samples, synthetic generation fills gaps by creating samples that never existed. These three stages form a complementary toolkit: pruning reduces what you have, selection focuses how you use it, and synthesis expands what you can access.
|
||||
|
||||
@@ -2108,7 +2108,7 @@ from mlsys.formatting import fmt, check
|
||||
class SelectionInequalityCalc:
|
||||
"""1M image scenario: proxy scoring (0.6 hrs) preserves 90% compute savings vs full-model scoring."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_images_value = 1_000_000 # total images
|
||||
n_coreset_value = 100_000 # 10% coreset
|
||||
n_epochs_value = 100 # training epochs
|
||||
@@ -2116,7 +2116,7 @@ class SelectionInequalityCalc:
|
||||
resnet18_time_per_image_value = 0.002 # sec/image (proxy)
|
||||
trap_sel_hrs_value = 50 # hrs for 7B model scoring
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
score_a_sec_value = n_images_value * resnet50_time_per_image_value
|
||||
train_a_sec_value = n_coreset_value * n_epochs_value * resnet50_time_per_image_value
|
||||
total_a_sec_value = score_a_sec_value + train_a_sec_value
|
||||
@@ -2139,11 +2139,11 @@ class SelectionInequalityCalc:
|
||||
trap_total_hrs_value = trap_sel_hrs_value + train_a_sec_value / SEC_PER_HOUR
|
||||
trap_overhead_pct_value = trap_sel_hrs_value / (baseline_hrs_value - trap_total_hrs_value) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(savings_b_pct_value > savings_a_pct_value, "Proxy selection must outperform full-model selection.")
|
||||
check(total_b_hrs_value < baseline_hrs_value, "Selection + subset training must beat full training.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
score_a_str = fmt(score_a_sec_value, precision=0, commas=True)
|
||||
score_a_hrs_str = fmt(score_a_sec_value / SEC_PER_HOUR, precision=1, commas=False)
|
||||
train_a_str = fmt(train_a_sec_value, precision=0, commas=True)
|
||||
@@ -2235,13 +2235,13 @@ from mlsys.formatting import fmt, check
|
||||
class SelectionInequalityMath:
|
||||
"""Epoch-normalized selection inequality: one-shot (9× speedup) vs. iterative (slower than baseline)."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_epochs_full = 100 # baseline epochs
|
||||
subset_fraction = 0.1 # keep 10%
|
||||
cost_selection_full = 1 # 1 epoch equivalent
|
||||
proxy_factor = 0.1 # proxy is 10x faster
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
n_epochs_subset = n_epochs_full * subset_fraction
|
||||
cost_total_efficient = cost_selection_full + n_epochs_subset
|
||||
speedup_efficient = n_epochs_full / cost_total_efficient
|
||||
@@ -2252,11 +2252,11 @@ class SelectionInequalityMath:
|
||||
cost_selection_proxy = cost_selection_full * proxy_factor
|
||||
cost_total_proxy = cost_selection_proxy + n_epochs_subset
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(speedup_efficient > 1.0, "One-shot selection must yield positive speedup.")
|
||||
check(cost_total_iterative > n_epochs_full, "Iterative selection must be slower than baseline.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_epochs_full_str = fmt(n_epochs_full, precision=0, commas=False)
|
||||
subset_fraction_pct_str = fmt(subset_fraction * 100, precision=0, commas=False)
|
||||
cost_selection_full_str = fmt(cost_selection_full, precision=0, commas=False)
|
||||
@@ -2405,7 +2405,7 @@ $$
|
||||
R = \frac{T_{\text{data pipeline}}}{T_{\text{GPU training}}}
|
||||
$$
|
||||
|
||||
If $R > 1$ (data pipeline is the bottleneck), set echo factor $e \leq R$ to fully utilize GPU capacity. If $R < 1$ (GPU is the bottleneck), data echoing provides no benefit. The following worked example calculates these trade-offs for a realistic scenario.
|
||||
If $R > 1$ (data pipeline is the bottleneck), set echo factor $e \leq R$ to fully use GPU capacity. If $R < 1$ (GPU is the bottleneck), data echoing provides no benefit. The following worked example calculates these trade-offs for a realistic scenario.
|
||||
|
||||
```{python}
|
||||
#| label: data-echoing-calc
|
||||
@@ -2427,14 +2427,14 @@ from mlsys.formatting import fmt, check
|
||||
class DataEchoingRoi:
|
||||
"""ImageNet heavy augmentation: echo factor 2 cuts training from 107 hrs to 53 hrs."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
pipeline_throughput_value = 300 # images/sec (CPU-bound)
|
||||
gpu_throughput_value = 800 # images/sec (GPU capacity)
|
||||
n_epochs_echo_value = 90 # standard ImageNet epochs
|
||||
imagenet_size_value = 1_280_000 # ~1.28M images
|
||||
echo_factor_value = 2 # repeat each batch 2x
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
ratio_r_value = gpu_throughput_value / pipeline_throughput_value
|
||||
gpu_idle_pct_value = (1 - pipeline_throughput_value / gpu_throughput_value) * 100
|
||||
|
||||
@@ -2447,11 +2447,11 @@ class DataEchoingRoi:
|
||||
echo_sec_value = n_epochs_echo_value * imagenet_size_value / echo_throughput_value
|
||||
echo_hrs_value = echo_sec_value / SEC_PER_HOUR
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(ratio_r_value > 1.0, "Pipeline must be slower than GPU for echoing to help.")
|
||||
check(echo_hrs_value < no_echo_hrs_value, "Echoing must reduce training time.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
pipeline_throughput_str = fmt(pipeline_throughput_value, precision=0, commas=False)
|
||||
gpu_throughput_str = fmt(gpu_throughput_value, precision=0, commas=False)
|
||||
pipeline_ratio_str = fmt(ratio_r_value, precision=2, commas=False)
|
||||
@@ -2505,7 +2505,7 @@ effective_throughput_str = DataEchoingRoi.effective_throughput_str
|
||||
**System Implication:** Data echoing trades sample diversity for GPU utilization. It works best when:
|
||||
|
||||
1. Augmentation is diverse (each echo sees different transforms)
|
||||
2. The dataset is already somewhat redundant
|
||||
2. The dataset contains redundant samples
|
||||
3. The echo factor $e$ stays below the critical threshold (~4$\times$ for ImageNet)
|
||||
|
||||
Above this threshold, the model starts memorizing and accuracy degrades.
|
||||
@@ -2564,7 +2564,7 @@ from mlsys.formatting import fmt, check
|
||||
class CostBreakdown:
|
||||
"""ImageNet-scale training cost breakdown: data costs (~81%) dominate compute (~19%)."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
c_raw_value = 50000 # $ for licensed dataset
|
||||
n_labels_value = 1_200_000 # images to label
|
||||
cost_per_label_value = 0.05 # $/label (crowd)
|
||||
@@ -2576,18 +2576,18 @@ class CostBreakdown:
|
||||
train_gpus = 8
|
||||
train_hours = 24
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
c_label_value = n_labels_value * cost_per_label_value
|
||||
c_total_value = c_raw_value + c_label_value + c_store_value + c_train_value
|
||||
|
||||
p_data_value = (c_raw_value + c_label_value + c_store_value) / c_total_value * 100
|
||||
p_compute_value = c_train_value / c_total_value * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(p_data_value > p_compute_value, "Data costs must dominate compute costs in this scenario.")
|
||||
check(abs(p_data_value + p_compute_value - 100) < 0.1, "Data + compute percentages must sum to 100.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
c_raw_str = f"${c_raw_value:,}"
|
||||
c_label_str = f"${c_label_value:,.0f}"
|
||||
c_store_str = f"${c_store_value}"
|
||||
@@ -2674,7 +2674,7 @@ from mlsys.formatting import fmt, check
|
||||
class BreakevenCalc:
|
||||
"""Active learning break-even: 2K labels + $500 inference achieves same accuracy as 5K random labels."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cost_label = 10 # $/label
|
||||
n_initial = 1000 # initial labeled set
|
||||
n_queries_per_round = 100 # samples per round
|
||||
@@ -2683,7 +2683,7 @@ class BreakevenCalc:
|
||||
n_active = 2000 # active learning needs
|
||||
n_rounds = 10 # AL query rounds
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_random_total = n_random * cost_label
|
||||
cost_active_label = n_active * cost_label
|
||||
cost_active_inference = n_rounds * cost_inference
|
||||
@@ -2691,11 +2691,11 @@ class BreakevenCalc:
|
||||
|
||||
roi_pct = (cost_random_total - cost_active_total) / cost_active_total * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(cost_random_total > cost_active_total, "Active learning must be cheaper than random sampling.")
|
||||
check(roi_pct > 0, "ROI must be positive for active learning to be justified.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cost_label_str = fmt(cost_label, precision=0, commas=False)
|
||||
n_initial_str = fmt(n_initial, precision=0, commas=True)
|
||||
cost_initial_str = fmt(n_initial * cost_label, precision=0, commas=True)
|
||||
@@ -2771,12 +2771,12 @@ from mlsys.formatting import fmt, check
|
||||
class DeduplicationAmortization:
|
||||
"""Deduplication pipeline ROI: negative at 1 run, highly profitable at 50 runs."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cost_build = 50000 # $ engineering time
|
||||
cost_compute_once = 5000 # $ one-time MinHash compute
|
||||
savings_per_run = 10000 # $ saved per training run
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_investment = cost_build + cost_compute_once
|
||||
runs = [1, 5, 10, 50]
|
||||
# For-loop (not list comprehension) — class bodies cannot access class attrs in comprehension scopes
|
||||
@@ -2784,11 +2784,11 @@ class DeduplicationAmortization:
|
||||
for _r in runs:
|
||||
rois.append((_r * savings_per_run - cost_investment) / cost_investment * 100)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(rois[0] < 0, "Single-run ROI must be negative (investment not yet recovered).")
|
||||
check(rois[3] > 0, "50-run ROI must be positive (highly profitable).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cost_build_str = fmt(cost_build, precision=0, commas=True)
|
||||
cost_compute_once_str = fmt(cost_compute_once, precision=0, commas=True)
|
||||
savings_per_run_str = fmt(savings_per_run, precision=0, commas=True)
|
||||
@@ -2924,21 +2924,21 @@ Several strategies mitigate this staleness problem, each with distinct overhead
|
||||
class DistributedOverheadCalc:
|
||||
"""8× A100 cluster coreset selection: 67-minute total overhead for 10× training speedup."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
t_embed_value = 20 # minutes (parallel)
|
||||
t_dedup_value = 15 # minutes (distributed hash)
|
||||
t_score_value = 30 # minutes (parallel proxy)
|
||||
t_select_value = 2 # minutes (centralized)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
t_total_overhead_value = t_embed_value + t_dedup_value + t_score_value + t_select_value
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(t_total_overhead_value > 0, "Total overhead must be positive.")
|
||||
check(t_score_value == max(t_embed_value, t_dedup_value, t_score_value, t_select_value),
|
||||
"Scoring must be the dominant overhead phase.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
t_embed_str = f"{t_embed_value} minutes"
|
||||
t_dedup_str = f"{t_dedup_value} minutes"
|
||||
t_score_str = f"{t_score_value} minutes"
|
||||
@@ -3019,7 +3019,7 @@ t_total_overhead_str = DistributedOverheadCalc.t_total_overhead_str
|
||||
|
||||
This positive ROI can erode quickly when workers must coordinate frequently during training. Distributed data selection always incurs a *coordination tax*: the overhead of maintaining consistent selection across workers. This tax must be smaller than the efficiency gains, or distributed selection yields negative ROI. As a rule of thumb, if selection overhead exceeds 10% of training time, simplify the selection strategy or increase the selection interval.
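As a back-of-envelope check of this rule of thumb (the 67-minute overhead mirrors the cluster example above; the 12-hour subset training time is an assumed figure):

```python
# Rule-of-thumb check: selection overhead should stay under ~10% of training time.
# The overhead mirrors the 67-minute cluster example; training time is an assumption.
selection_overhead_hrs = 67 / 60
training_time_hrs = 12.0

overhead_ratio = selection_overhead_hrs / training_time_hrs
if overhead_ratio > 0.10:
    print(f"Overhead {overhead_ratio:.0%}: simplify selection or lengthen the selection interval")
else:
    print(f"Overhead {overhead_ratio:.0%}: within budget")
```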
|
||||
|
||||
So far we have examined data selection techniques individually and in distributed settings. Real ML systems, however, combine data selection with model compression, hardware acceleration, and distributed training simultaneously, and these optimizations interact in ways that can amplify or undermine each other. Understanding these interactions is essential for designing efficient end-to-end pipelines.
|
||||
Real ML systems combine data selection with model compression, hardware acceleration, and distributed training simultaneously. These optimizations interact in ways that can amplify or undermine each other, and understanding these interactions is essential for designing efficient end-to-end pipelines.
|
||||
|
||||
## Cross-Layer Interactions {#sec-data-selection-crosslayer-interactions-1f39}
|
||||
|
||||
@@ -3028,7 +3028,7 @@ Data selection does not exist in isolation. A coreset-trained model will eventua
|
||||
|
||||
### Model Compression {#sec-data-selection-model-compression-9aef}
|
||||
|
||||
Model compression (@sec-model-compression) reduces the size of the trained model through pruning, quantization, and distillation. The training dataset directly affects how compressible the resulting model becomes. Perhaps counterintuitively, models trained on smaller, higher-quality datasets may be *more* compressible than those trained on larger, noisier ones.
|
||||
Model compression (@sec-model-compression) reduces the size of the trained model through pruning, quantization, and distillation. The training dataset directly affects how compressible the resulting model becomes. Models trained on smaller, higher-quality datasets are often *more* compressible than those trained on larger, noisier ones.
|
||||
|
||||
The mechanism relates to how models encode information. A model trained on repetitive data learns redundant features that pruning later removes. The training compute required to learn those features was wasted, only to be discarded during compression. By contrast, a model trained on diverse, informative samples learns compact, non-redundant representations from the start, making subsequent compression more effective. Empirical evidence supports this relationship: in experiments on ImageNet, models trained on 50% coresets selected by EL2N compress to 4-bit precision with 2% less accuracy loss than models trained on the full dataset, because the curated training produced cleaner weight distributions that quantize more gracefully.
|
||||
|
||||
@@ -3376,7 +3376,7 @@ from mlsys.formatting import fmt, check
|
||||
class FpScalingCalc:
|
||||
"""Quantitative backing for all Fallacies and Pitfalls in the F&P section."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# Fallacy 1: Diminishing returns
|
||||
data_1m_value = 1_000_000
|
||||
data_10m_value = 10_000_000
|
||||
@@ -3430,7 +3430,7 @@ class FpScalingCalc:
|
||||
rare_class_acc_value = 45
|
||||
majority_class_acc_value = 97
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
curated_cost_ratio_value = raw_size_value / curated_size_value
|
||||
synthetic_acc_drop_value = synthetic_gen1_acc_value - synthetic_gen5_acc_value
|
||||
savings_value = training_run_cost_value * efficiency_gain_pct_value / 100
|
||||
@@ -3440,14 +3440,14 @@ class FpScalingCalc:
|
||||
expected_rare_in_coreset_value = int(rare_class_count_value * coreset_pct_value / 100)
|
||||
inflation_gap_value = inflated_acc_value - true_acc_value
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(curated_accuracy_value > raw_accuracy_value,
|
||||
"Curated small dataset must outperform raw large dataset.")
|
||||
check(synthetic_acc_drop_value > 0, "Model collapse must degrade accuracy.")
|
||||
check(expected_rare_in_coreset_value < min_samples_threshold_value,
|
||||
"Random coreset should drop rare class below minimum threshold.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
# Fallacy 1
|
||||
data_1m_str = fmt(data_1m_value / MILLION, precision=0) + "M"
|
||||
data_10m_str = fmt(data_10m_value / MILLION, precision=0) + "M"
|
||||
@@ -3634,3 +3634,10 @@ With high-quality data in hand, we have optimized the source of the system. Even
|
||||
<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
|
||||
::: { .quiz-end }
|
||||
:::
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:data_selection")
|
||||
```
|
||||
|
||||
@@ -79,7 +79,7 @@ from mlsys import Hardware
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import TFLOPs, second
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GraphOptimizationStats:
|
||||
"""
|
||||
Namespace for graph-level optimization statistics.
|
||||
@@ -92,21 +92,21 @@ class GraphOptimizationStats:
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
graph_flop_reduction_str = GraphOptimizationStats.flop_reduction_range_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class A100BLAS:
|
||||
"""
|
||||
Namespace for A100 BLAS Specs.
|
||||
Scenario: Dense vs Sparse Tensor Core throughput.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
gpu = Hardware.Cloud.A100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
dense_flops = gpu.peak_flops.m_as(TFLOPs/second)
|
||||
sparse_flops = dense_flops * 2
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
dense_tflops_str = fmt(dense_flops, precision=0, commas=False)
|
||||
sparse_tflops_str = fmt(sparse_flops, precision=0, commas=False)
|
||||
|
||||
@@ -165,7 +165,7 @@ The three problems---execution, differentiation, and abstraction---did not emerg
|
||||
|
||||
\index{ML Framework!historical evolution}
|
||||
\index{NumPy!framework foundation}
|
||||
Modern frameworks are not just a collection of tools; they are a **Ladder of Abstraction** built to solve specific systems problems. Each rung of this ladder emerged to solve a bottleneck that made the previous generation impractical for scaling.
|
||||
In 1979, writing a matrix multiplication in Fortran that saturated the hardware required deep knowledge of cache lines, register scheduling, and vector units. By 2016, a single line of Python (`torch.matmul(A, B)`) achieved the same peak throughput without the programmer knowing anything about the silicon. That compression of effort did not happen in one step; it accumulated across four decades of abstraction, each layer solving a bottleneck that made the previous generation impractical for scaling. The result is a **Ladder of Abstraction** where each rung automates what the rung below exposed.
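To make that compression of effort concrete, the sketch below contrasts a hand-written Python triple loop with a single BLAS-backed call (NumPy's `@` as a stand-in for `torch.matmul`); the matrix size is an arbitrary choice and the measured speedup depends on the machine.

```python
import time
import numpy as np

# Illustrative only: hand-written loops vs. one BLAS-backed library call.
# n = 128 is an arbitrary size; absolute timings vary by machine.
n = 128
A, B = np.random.rand(n, n), np.random.rand(n, n)

def naive_matmul(A, B):
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

t0 = time.perf_counter(); C_naive = naive_matmul(A, B); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); C_blas = A @ B;               t_blas = time.perf_counter() - t0

assert np.allclose(C_naive, C_blas)
print(f"naive loops: {t_naive:.3f} s   library call: {t_blas * 1e3:.3f} ms")
```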
|
||||
|
||||
1. **Solving Performance (1979–1992)**: The **Basic Linear Algebra Subprograms (BLAS)**\index{BLAS!historical foundation}[^fn-blas-performance] and **LAPACK**[^fn-lapack-algebra] solved the problem of *Hardware Primitives*. They provided standardized, highly optimized implementations of matrix operations (like GEMM[^fn-gemm-utilization]). This layer ensures that `C = A @ B` runs at near-peak silicon speed, regardless of the language calling it.
|
||||
|
||||
@@ -273,27 +273,27 @@ from mlsys import Hardware
|
||||
from mlsys.constants import TFLOPs, TB, second, flop, byte
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MemoryWallSpecs:
|
||||
"""
|
||||
Namespace for A100 Memory Wall Specs.
|
||||
Scenario: Demonstrating the 150x gap between compute and bandwidth.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
gpu = Hardware.Cloud.A100
|
||||
|
||||
flops_fp16 = gpu.peak_flops.m_as(TFLOPs/second)
|
||||
bw_tbs = gpu.memory_bw.m_as(TB/second)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Arithmetic Intensity "Ridge Point" (Ops / Byte)
|
||||
ridge_point = gpu.ridge_point().m_as(flop/byte)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(ridge_point >= 100, f"A100 ridge point ({ridge_point:.1f}) is too low to claim a 'Memory Wall'.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
a100_tflops_fp16_str = fmt(flops_fp16, precision=0, commas=False)
|
||||
a100_bw_tbs_str = fmt(bw_tbs, precision=1, commas=False)
|
||||
|
||||
@@ -420,7 +420,7 @@ No single execution model optimizes all these dimensions. Frameworks must choose
|
||||
|
||||
### Three Execution Strategies {#sec-ml-frameworks-three-execution-strategies-5934}
|
||||
|
||||
The computational graph representation is powerful, but it raises a critical design question: *when* should the framework build this graph? Consider a simple operation like `y = x * 2`. Two distinct approaches exist:
|
||||
The computational graph representation enables global optimization, but it raises a critical design question: *when* should the framework build this graph? Consider a simple operation like `y = x * 2`. Two distinct approaches exist:
|
||||
|
||||
1. **Immediate execution**: Perform the multiplication right now, storing the result in `y`. Natural and debuggable, but the framework sees only one operation at a time.
|
||||
|
||||
@@ -586,30 +586,30 @@ For a typical ResNet-50 forward pass, eager execution overhead adds approximatel
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DispatchTax:
|
||||
"""
|
||||
Namespace for The Dispatch Tax calculation.
|
||||
Scenario: Comparing overhead for small vs large operations.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
python_overhead_us = 10.0 # Standard Python dispatch (μs)
|
||||
|
||||
# Kernel Durations (μs)
|
||||
small_kernel_us = 1.0 # e.g. ReLU on 1024 elements
|
||||
large_kernel_us = 100.0 # e.g. MatMul 1024x1024
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Dispatch Tax = Overhead / (Overhead + Execution)
|
||||
small_tax_pct = (python_overhead_us / (python_overhead_us + small_kernel_us)) * 100
|
||||
large_tax_pct = (python_overhead_us / (python_overhead_us + large_kernel_us)) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(small_tax_pct > 90, f"Small op tax ({small_tax_pct:.1f}%) should be dominant.")
|
||||
check(large_tax_pct < 15, f"Large op tax ({large_tax_pct:.1f}%) should be negligible.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
python_overhead_str = f"{int(python_overhead_us)}"
|
||||
small_kernel_str = f"{int(small_kernel_us)}"
|
||||
large_kernel_str = f"{int(large_kernel_us)}"
|
||||
@@ -920,14 +920,14 @@ The compilation overhead in these examples (approximately 100ms to compile the f
|
||||
# │ overhead_speedup_str, bw_efficiency_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FusionSpeedup:
|
||||
"""
|
||||
Namespace for Kernel Fusion Speedup calculation.
|
||||
Scenario: Comparing Eager (2 launches) vs Fused (1 launch) overheads.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
python_dispatch_us = 10
|
||||
kernel_launch_us = 5
|
||||
memory_access_us = 1
|
||||
@@ -938,7 +938,7 @@ class FusionSpeedup:
|
||||
eager_mem_factor = 4 # 2R + 2W
|
||||
fused_mem_factor = 2 # 1R + 1W (intermediate fused)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
launch_overhead = python_dispatch_us + kernel_launch_us
|
||||
|
||||
eager_total_overhead = eager_ops * launch_overhead
|
||||
@@ -947,10 +947,10 @@ class FusionSpeedup:
|
||||
speedup = eager_total_overhead / fused_total_overhead
|
||||
bw_efficiency = eager_mem_factor / fused_mem_factor
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(speedup >= 1.5, f"Fusion speedup ({speedup:.1f}x) is too small to justify compilation complexity.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
python_dispatch_us_str = f"{python_dispatch_us}"
|
||||
kernel_launch_only_us_str = f"{kernel_launch_us}"
|
||||
memory_access_us_str = f"{memory_access_us}"
|
||||
@@ -1400,14 +1400,14 @@ From the case study in @sec-ml-frameworks-putting-together-anatomy-training-step
|
||||
from mlsys.constants import KIB_TO_BYTES
|
||||
from mlsys.formatting import fmt, check, md_math
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DispatchTax:
|
||||
"""
|
||||
Namespace for Dispatch Tax Calculation.
|
||||
Scenario: Comparing overhead impact on Small Ops vs Large Ops.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Scenario 1: Small MLP (Overhead Bound)
|
||||
small_ops_count = 6
|
||||
small_dispatch_us = 5.0
|
||||
@@ -1417,7 +1417,7 @@ class DispatchTax:
|
||||
large_hw_us = 100_000.0 # 100ms
|
||||
large_dispatch_us = 50.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Small Model
|
||||
small_sw_total = small_ops_count * small_dispatch_us
|
||||
small_total_time = small_sw_total + small_hw_us
|
||||
@@ -1428,11 +1428,11 @@ class DispatchTax:
|
||||
# Large Model
|
||||
large_overhead_ratio = large_dispatch_us / large_hw_us
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(small_overhead_ratio >= 1.0, f"Small model ratio ({small_overhead_ratio:.1f}) implies it is NOT overhead bound.")
|
||||
check(large_overhead_ratio <= 0.01, f"Large model overhead ({large_overhead_ratio:.4f}) is too high.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
dispatch_n_ops_value = small_ops_count
|
||||
dispatch_us_per_op_value = small_dispatch_us
|
||||
dispatch_hw_time_us_value = small_hw_us
|
||||
@@ -1499,20 +1499,20 @@ from mlsys import Models
|
||||
from mlsys.constants import Bparam
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT3Context:
|
||||
"""
|
||||
Namespace for GPT-3 Parameter Counts.
|
||||
Scenario: Compilation benefits at scale.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.GPT3
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
params_b = model.parameters.m_as(Bparam)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpt3_params_b_str = fmt(params_b, precision=0, commas=False)
|
||||
|
||||
# Note: Use GPT3Context.gpt3_params_b_str directly.
|
||||
@@ -1758,17 +1758,17 @@ from mlsys.constants import GPT3_PARAMS, BYTES_FP16, GB, Bparam
|
||||
from mlsys.formulas import model_memory
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT3MemoryFootprint:
|
||||
"""GPT-3 FP16 weight memory to show single-GPU capacity is exceeded."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
params_b = GPT3_PARAMS.m_as(Bparam) # 175 billion
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fp16_gb = model_memory(GPT3_PARAMS, BYTES_FP16, GB) # 350 GB
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
gpt3_params_b_str = fmt(params_b, precision=0, commas=False) # e.g. "175"
|
||||
gpt3_fp16_gb_str = fmt(fp16_gb, precision=0, commas=False) # e.g. "350"
|
||||
|
||||
@@ -1890,19 +1890,19 @@ from mlsys import Models
|
||||
from mlsys.constants import BYTES_FP32, BYTES_ADAM_STATE, MB, Mparam
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetMemory:
|
||||
"""
|
||||
Namespace for ResNet-50 Memory Breakdown.
|
||||
Scenario: Comparing training vs inference memory costs.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.ResNet50
|
||||
training_min_gb = 10
|
||||
training_max_gb = 15
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
params_m = model.parameters.m_as(Mparam)
|
||||
|
||||
fp32_mb = model.size_in_bytes(BYTES_FP32).m_as(MB)
|
||||
@@ -1910,7 +1910,7 @@ class ResNetMemory:
|
||||
|
||||
training_ratio = (training_min_gb * KIB_TO_BYTES) / fp32_mb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet_params_m_str = fmt(params_m, precision=1, commas=False)
|
||||
resnet_fp32_mb_str = fmt(fp32_mb, precision=0, commas=False)
|
||||
resnet_adam_mb_str = fmt(adam_mb, precision=0, commas=False)
|
||||
@@ -2162,22 +2162,22 @@ from mlsys.constants import BYTES_FP16, BYTES_ADAM_STATE, GB, ureg
|
||||
from mlsys.formulas import model_memory
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class Model7B:
|
||||
"""
|
||||
Namespace for 7B Model Memory.
|
||||
Scenario: Optimizer state overhead.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
params = 7e9 * ureg.param
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
fp16_gb = model_memory(params, BYTES_FP16, GB)
|
||||
adam_gb = model_memory(params, BYTES_ADAM_STATE, GB)
|
||||
total_gb = fp16_gb + adam_gb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
model_7b_fp16_gb_str = fmt(fp16_gb, precision=0, commas=False)
|
||||
model_7b_adam_gb_str = fmt(adam_gb, precision=0, commas=False)
|
||||
model_7b_total_gb_str = fmt(total_gb, precision=0, commas=False)
|
||||
@@ -2271,7 +2271,7 @@ The execution models covered in @sec-ml-frameworks-execution-problem-e1e1, namel
- **PyTorch** [@paszke2019pytorch] builds its autograd tape dynamically during forward execution, providing immediate debugging at the cost of graph-level optimization. The `grad_fn` chain mechanism detailed in @sec-ml-frameworks-pytorch-autograd-internals-4fa0 enables flexible control flow but requires storing the complete graph until backward pass completion.
- **TensorFlow** (in its 1.x incarnation) performed symbolic differentiation during graph construction, enabling ahead-of-time optimization. Modern TensorFlow 2.x uses eager execution by default but provides `tf.function` for graph compilation when performance matters.
- **JAX** [@frostig2018compiling] transforms functions rather than tracking operations. The `jax.grad()` transformation returns a new function that computes gradients, enabling composition with `jax.vmap()` for vectorization and `jax.jit()` for compilation. This approach requires pure functions but enables powerful program transformations.
- **JAX** [@frostig2018compiling] transforms functions rather than tracking operations. The `jax.grad()` transformation returns a new function that computes gradients, enabling composition with `jax.vmap()` for vectorization and `jax.jit()` for compilation. This approach requires pure functions but enables composable program transformations that chain differentiation, vectorization, and compilation in a single expression.
These implementation differences have direct practical consequences for framework selection, which @sec-ml-frameworks-major-framework-platform-analysis-fe96 examines in detail.
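To make the tape-based mechanism concrete, the following minimal sketch (plain PyTorch with illustrative tensor names, not taken from the chapter's `mlsys` cells) builds a small expression eagerly and then walks the `grad_fn` chain that autograd recorded during the forward pass:

```python
import torch

# Eager forward pass: every operation appends a node to the autograd tape.
x = torch.randn(4, requires_grad=True)
w = torch.randn(4, requires_grad=True)
loss = (x * w).sum() ** 2

print(loss.grad_fn)                 # PowBackward0: the last op recorded
print(loss.grad_fn.next_functions)  # links back to SumBackward0
print(loss.grad_fn
          .next_functions[0][0]
          .next_functions)          # ...and back to MulBackward0

# backward() consumes the tape; the whole chain must stay resident until now.
loss.backward()
print(x.grad)                       # equals 2 * (x @ w) * w
```

Each `Backward` node is one entry on the dynamically built tape; wrapping the same expressions in `torch.no_grad()` records nothing and leaves `grad_fn` unset.
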
@@ -2326,7 +2326,7 @@ To make this concrete, trace what must happen when a programmer writes `model(in
|
||||
from mlsys.constants import RESNET50_PARAMS, Mparam
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetAbstraction:
|
||||
"""
|
||||
Namespace for ResNet-50 parameter scale in abstraction section.
|
||||
@@ -2387,20 +2387,20 @@ But how much memory does this single abstraction actually consume? The answer is
|
||||
from mlsys.constants import BYTES_FP16, BYTES_ADAM_STATE, GB, BILLION
|
||||
from mlsys.formatting import fmt, check, md_math
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AdminTax:
|
||||
"""
|
||||
Namespace for Administrative Tax Calculation.
|
||||
Scenario: Memory overhead for 1B parameter model.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
params_count = 1 * BILLION
|
||||
batch_size = 32
|
||||
layers = 100
|
||||
width = 1024
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
weights_gb = (params_count * BYTES_FP16).m_as(GB)
|
||||
grads_gb = weights_gb
|
||||
opt_gb = (params_count * BYTES_ADAM_STATE).m_as(GB)
|
||||
@@ -2412,10 +2412,10 @@ class AdminTax:
|
||||
total_gb = weights_gb + grads_gb + opt_gb + act_gb
|
||||
tax_gb = total_gb - weights_gb
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(tax_gb > 15, f"Administrative tax ({tax_gb:.1f} GB) unexpectedly low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
admin_weights_gb_str = fmt(weights_gb, precision=0, commas=False)
|
||||
admin_grads_gb_str = fmt(grads_gb, precision=0, commas=False)
|
||||
admin_opt_gb_str = fmt(opt_gb, precision=0, commas=False)
|
||||
@@ -2654,17 +2654,17 @@ from mlsys.constants import (PCIE_GEN4_BW, NVLINK_A100_BW, A100_MEM_BW, A100_MEM
|
||||
BILLION, MILLION, THOUSAND)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DeviceBandwidthHierarchy:
|
||||
"""
|
||||
Namespace for Device Bandwidth Hierarchy.
|
||||
Scenario: Comparing PCIe vs NVLink vs HBM speeds.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
tensor_4mb = 4 * MILLION
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
pcie4_gbs = PCIE_GEN4_BW.m_as(GB/second)
|
||||
nvlink_a100_gbs = NVLINK_A100_BW.m_as(GB/second)
|
||||
a100_bw_gbs = A100_MEM_BW.m_as(GB/second)
|
||||
@@ -2683,7 +2683,7 @@ class DeviceBandwidthHierarchy:
|
||||
# Equiv Ops: (ms / 1000) * FLOPS
|
||||
pcie4_1gb_equiv_ops = (pcie4_1gb_ms / THOUSAND) * A100_FLOPS_FP16_TENSOR.m_as(flop/second)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
pcie4_gbs_str = fmt(pcie4_gbs, precision=0, commas=False)
|
||||
pcie4_bidir_gbs_str = fmt(pcie4_gbs * 2, precision=0, commas=False)
|
||||
nvlink_a100_gbs_str = fmt(nvlink_a100_gbs, precision=0, commas=False)
|
||||
@@ -2874,20 +2874,20 @@ When overlap is insufficient, profiling reveals where time is lost. NVIDIA provi
|
||||
from mlsys.constants import PCIE_GEN4_BW, BYTES_FP32, MB, GB, byte, second, MS_PER_SEC
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DataloaderThroughput:
|
||||
"""
|
||||
Namespace for Dataloader Throughput.
|
||||
Scenario: GPU data ingestion requirements.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
img_per_sec = 1000
|
||||
img_res = 224
|
||||
img_channels = 3
|
||||
batch_size = 64
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Throughput requirement
|
||||
throughput_bytes_sec = img_per_sec * img_res * img_res * img_channels * byte
|
||||
dataloader_mbs = throughput_bytes_sec.m_as(MB)
|
||||
@@ -2900,10 +2900,10 @@ class DataloaderThroughput:
|
||||
# PCIe Ref
|
||||
pcie4_gbs = PCIE_GEN4_BW.m_as(GB/second)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(dataloader_mbs > 100, f"Throughput requirement ({dataloader_mbs:.1f} MB/s) too low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
dataloader_mbs_str = fmt(dataloader_mbs, precision=0, commas=False)
|
||||
batch_mb_str = fmt(batch_mb, precision=0, commas=False)
|
||||
batch_transfer_ms_str = fmt(batch_transfer_ms, precision=1, commas=False)
|
||||
@@ -2984,17 +2984,17 @@ from mlsys.constants import GPT3_PARAMS, BYTES_FP16, GB, Bparam
|
||||
from mlsys.formulas import model_memory
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT3ParameterStructures:
|
||||
"""GPT-3 FP16 storage to motivate parameter sharding across devices."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
params_b = GPT3_PARAMS.m_as(Bparam) # 175 billion
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fp16_gb = model_memory(GPT3_PARAMS, BYTES_FP16, GB) # 350 GB
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
gpt3_params_b_str = fmt(params_b, precision=0, commas=False) # e.g. "175"
|
||||
gpt3_fp16_gb_str = fmt(fp16_gb, precision=0, commas=False) # e.g. "350"
|
||||
|
||||
@@ -3230,20 +3230,20 @@ The execution controller coordinates work across multiple processing units and m
|
||||
from mlsys.constants import RESNET50_FLOPs, GFLOPs
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetGFLOPS:
|
||||
"""
|
||||
Namespace for ResNet GFLOPS.
|
||||
Scenario: Compute intensity check.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
flops = RESNET50_FLOPs
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
gflops = flops.m_as(GFLOPs)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet_gflops_str = fmt(gflops, precision=1, commas=False)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -3579,7 +3579,7 @@ The trade-off is therefore a more fragmented deployment path. Because the graph
\index{JAX!transformation composition}
PyTorch's eager execution and TensorFlow's graph compilation represent two points on a spectrum, yet both share an imperative programming heritage where computation proceeds as a sequence of stateful operations. JAX represents a radically different approach, one built on functional programming principles and composable program transformations rather than computational graphs [@jax2018github]. Developed by Google Research, JAX has gained significant traction in research settings, particularly for work requiring custom differentiation, advanced optimization research, and large-scale distributed training.
JAX's architecture reframes the **Differentiation Problem** entirely. Google Research built JAX on a key observation: if functions are pure (no side effects, no mutable state), the compiler can safely reorder, fuse, and parallelize any operation, because outputs depend only on inputs. This constraint, borrowed from functional programming, is what makes JAX's composable transformations possible. Rather than implementing automatic differentiation as a tape-based system (PyTorch) or a graph transformation pass (TensorFlow), JAX treats differentiation as one of several *composable function transformations*. The `jax.grad` function does not compute gradients directly; it returns a *new function* that computes gradients. This subtle distinction enables powerful compositions: you can differentiate a differentiated function (higher-order derivatives), vectorize a gradient computation (`vmap(grad(f))`), or compile a vectorized gradient to XLA (`jit(vmap(grad(f)))`).
JAX's architecture reframes the **Differentiation Problem** entirely. Google Research built JAX on a key observation: if functions are pure (no side effects, no mutable state), the compiler can safely reorder, fuse, and parallelize any operation, because outputs depend only on inputs. This constraint, borrowed from functional programming, is what makes JAX's composable transformations possible. Rather than implementing automatic differentiation as a tape-based system (PyTorch) or a graph transformation pass (TensorFlow), JAX treats differentiation as one of several *composable function transformations*. The `jax.grad` function does not compute gradients directly; it returns a *new function* that computes gradients. This subtle distinction enables arbitrary compositions: differentiating a differentiated function yields higher-order derivatives, vectorizing a gradient computation (`vmap(grad(f))`) parallelizes across examples, and compiling a vectorized gradient to XLA (`jit(vmap(grad(f)))`) eliminates Python overhead entirely.
JAX's functional paradigm requires a genuine mental shift from "tracking state through objects" to "transforming pure functions." The conceptual introduction here covers JAX's core design; transformation composition, pytree handling, and XLA tracing mechanics each warrant dedicated study for production use.
@@ -3611,7 +3611,7 @@ fast_batched_grad = jax.jit(batched_grad)
```
:::
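A self-contained version of that composition, as a minimal sketch (plain JAX; the loss function, shapes, and names are illustrative rather than part of the chapter's code cells):

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Pure scalar loss for a single example; output depends only on inputs.
    return (jnp.dot(w, x) - y) ** 2

grad_fn = jax.grad(loss)                                # new function: d(loss)/dw
batched_grad = jax.vmap(grad_fn, in_axes=(None, 0, 0))  # map it over a batch of (x, y)
fast_batched_grad = jax.jit(batched_grad)               # compile the composed pipeline to XLA

w = jnp.ones(3)
xs = jnp.arange(12.0).reshape(4, 3)          # batch of 4 examples
ys = jnp.ones(4)
print(fast_batched_grad(w, xs, ys).shape)    # (4, 3): one per-example gradient
```
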
This functional approach requires **pure functions** (no side effects) and **immutable data** (arrays cannot be modified in place). These constraints may seem restrictive coming from PyTorch's mutable object model, but they enable powerful guarantees: the compiler can safely reorder, fuse, and parallelize operations because function outputs depend only on inputs. The restriction is the feature; purity is what makes transformation composition possible.
This functional approach requires **pure functions** (no side effects) and **immutable data** (arrays cannot be modified in place). These constraints may seem restrictive coming from PyTorch's mutable object model, but they enable formal guarantees: the compiler can safely reorder, fuse, and parallelize operations because function outputs depend only on inputs. The restriction is the feature; purity is what makes transformation composition possible.
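What immutability means in practice, as a minimal sketch using standard `jax.numpy` semantics:

```python
import jax.numpy as jnp

x = jnp.zeros(4)
# x[0] = 1.0            # would raise TypeError: JAX arrays are immutable
y = x.at[0].set(1.0)    # functional update: returns a new array
print(x)                # [0. 0. 0. 0.]  (original unchanged)
print(y)                # [1. 0. 0. 0.]
```

Under `jit`, XLA is generally free to lower such functional updates to in-place buffer writes, so purity does not have to cost extra copies.
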
#### Key Transformations {#sec-ml-frameworks-key-transformations-4105}
@@ -3624,7 +3624,7 @@ JAX's minimalist core delegates neural network abstractions to companion librari
|
||||
#### Trade-offs and Use Cases {#sec-ml-frameworks-tradeoffs-use-cases-6453}
|
||||
|
||||
\index{JAX!XLA compilation}
|
||||
The functional constraints that JAX imposes become advantages in specific domains. Custom differentiation---higher-order gradients, custom VJP/JVP rules---composes cleanly because pure functions make differentiation rules predictable. Research on optimization algorithms benefits from transformations that let researchers manipulate gradient computation as naturally as they manipulate data. Large-scale distributed training, particularly on TPUs, leverages XLA compilation to extract maximum hardware utilization. Scientific computing with AD requirements benefits from functional purity that enables mathematical reasoning about code. JAX requires more upfront investment than PyTorch: the functional paradigm has a learning curve, state management requires explicit patterns, and debugging compiled code is harder than eager execution. Teams should choose JAX when its strengths align with project requirements, not as a default.
|
||||
The functional constraints that JAX imposes become advantages in specific domains. Custom differentiation---higher-order gradients, custom VJP/JVP rules---composes cleanly because pure functions make differentiation rules predictable. Research on optimization algorithms benefits from transformations that let researchers manipulate gradient computation as naturally as they manipulate data. Large-scale distributed training, particularly on TPUs, uses XLA compilation to extract maximum hardware utilization. Scientific computing with AD requirements benefits from functional purity that enables mathematical reasoning about code. JAX requires more upfront investment than PyTorch: the functional paradigm has a learning curve, state management requires explicit patterns, and debugging compiled code is harder than eager execution. Teams should choose JAX when its strengths align with project requirements, not as a default.
|
||||
|
||||
### Quantitative Platform Performance Analysis {#sec-ml-frameworks-quantitative-platform-performance-analysis-816d}
|
||||
|
||||
@@ -3838,7 +3838,7 @@ These hardware constraints cascade into performance trade-offs that are tightly
|
||||
|
||||
### Development Support and Long-term Viability Assessment {#sec-ml-frameworks-development-support-longterm-viability-assessment-d1d7}
|
||||
|
||||
What determines whether a framework remains viable five years into a production deployment? Technical capabilities are necessary but not sufficient. Community composition shapes framework evolution in measurable ways: PyTorch's academic community drives research-oriented features and reproducibility tools, though production tooling (PyTorch Lightning, TorchServe) has historically lagged; TensorFlow's enterprise community emphasizes production reliability through TFX pipelines, TensorBoard visualization, and TensorFlow Model Analysis; JAX's smaller community concentrates on mathematical rigor, producing powerful research tools but with a steeper onboarding curve.
|
||||
What determines whether a framework remains viable five years into a production deployment? Technical capabilities are necessary but not sufficient. Community composition shapes framework evolution in measurable ways: PyTorch's academic community drives research-oriented features and reproducibility tools, though production tooling (PyTorch Lightning, TorchServe) has historically lagged; TensorFlow's enterprise community emphasizes production reliability through TFX pipelines, TensorBoard visualization, and TensorFlow Model Analysis; JAX's smaller community concentrates on mathematical rigor, producing specialized research tools (composable transformations, custom VJP rules) but with a steeper onboarding curve.
|
||||
|
||||
A framework's practical utility, however, often depends more on its surrounding ecosystem than on its core capabilities. Hugging Face provides consistent model APIs across all three major frameworks, making pretrained model availability a near-commodity. Cross-framework tools (Weights & Biases, MLflow for experiment tracking; ONNX Runtime for serving) reduce lock-in, while framework-native tools (XLA, TorchScript, TensorFlow Serving) offer deeper optimization at the cost of portability. Cloud ML services (SageMaker, Google AI Platform, Azure ML) provide native integration for specific frameworks, creating operational advantages that compound over time.
|
||||
|
||||
@@ -3890,11 +3890,11 @@ optimizer.step()
|
||||
# │ Exports: train_batch, train_input, train_hidden, train_output
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TrainingStepDims:
|
||||
"""Model dimensions for the two-layer MLP training step example."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
train_batch = 32 # batch size
|
||||
train_input = 784 # MNIST input (28 × 28)
|
||||
train_hidden = 256 # hidden layer
|
||||
@@ -3970,18 +3970,18 @@ Applying the Dispatch Overhead Equation (@eq-dispatch-overhead) to this step, @t
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import byte, KB, MB, flop, KFLOPs, MFLOPs
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TrainingStepCalc:
|
||||
"""FLOPs and memory analysis for each operation in a two-layer MLP training step."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# Dimensions from TrainingStepDims (available at module scope via EXPORTS)
|
||||
_batch = train_batch
|
||||
_input = train_input
|
||||
_hidden = train_hidden
|
||||
_output = train_output
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# MatMul 1: [32, 784] @ [784, 256]
|
||||
mm1_flops = 2 * _batch * _input * _hidden # ~12.8M FLOPs
|
||||
mm1_mem_bytes = (_batch * _input + _input * _hidden + _batch * _hidden) * 4 # FP32
|
||||
@@ -4003,11 +4003,11 @@ class TrainingStepCalc:
|
||||
bwd_mem_str = "~3.2 MB" # approximation
|
||||
bwd_ai_str = "~8.0" # approximation
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(mm1_flops / mm1_mem_bytes > relu_flops / relu_mem_bytes,
|
||||
"MatMul must have higher arithmetic intensity than ReLU.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
train_mm1_flops_str = fmt((mm1_flops * flop).m_as(MFLOPs), precision=1, commas=False) + "M"
|
||||
train_mm1_mem_str = fmt((mm1_mem_bytes * byte).m_as(MB), precision=1, commas=False) + " MB"
|
||||
train_mm1_ai_str = fmt(mm1_flops / mm1_mem_bytes, precision=1, commas=False)
|
||||
@@ -4079,18 +4079,18 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class MnistTrainingStepCalc:
|
||||
"""MNIST overhead-bound analysis: dispatch dominates compute and memory on A100."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
total_flops = 40e6 # ~40M FLOPs total
|
||||
mem_traffic_bytes = 5e6 # ~5 MB traffic
|
||||
n_ops = 6 # 6 kernel launches
|
||||
us_per_op = 5 # 5 μs dispatch/op
|
||||
peak_flops = A100_FLOPS_FP16_TENSOR.m_as(flop/second) # 312e12 FLOPS
|
||||
mem_bw_bytes = A100_MEM_BW.m_as(byte/second) # ~2e12 B/s
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
t_compute_us = total_flops / peak_flops * MILLION # ~0.1 μs
|
||||
t_memory_us = mem_traffic_bytes / mem_bw_bytes * MILLION # ~2.5 μs
|
||||
t_overhead_us = n_ops * us_per_op # 30 μs (dominant!)
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
mnist_t_compute_us_str = fmt(t_compute_us, precision=1, commas=False)
|
||||
mnist_t_memory_us_str = fmt(t_memory_us, precision=1, commas=False)
|
||||
mnist_t_overhead_us_str = fmt(t_overhead_us, precision=0, commas=False)
|
||||
@@ -4177,15 +4177,15 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class FrameworkGapsCalc:
|
||||
"""PyTorch vs TensorRT latency gap and PyTorch Mobile vs TFLite Micro memory gap."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
pytorch_ms = 52 # 52 ms inference
|
||||
tensorrt_ms = 3 # 3 ms inference
|
||||
pytorch_mobile_mb = 220 # 220 MB runtime
|
||||
tflite_micro_kb = 32 # 32 KB runtime
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
perf_gap = pytorch_ms / tensorrt_ms # ~17x gap
|
||||
memory_ratio = pytorch_mobile_mb * 1000 / tflite_micro_kb # ~6875x gap (decimal SI: 1 MB = 1000 KB)
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
pytorch_ms_str = fmt(pytorch_ms, precision=0, commas=False) # e.g. "52"
|
||||
tensorrt_ms_str = fmt(tensorrt_ms, precision=0, commas=False) # e.g. "3"
|
||||
pytorch_mobile_mb_str = fmt(pytorch_mobile_mb, precision=0, commas=False) # e.g. "220"
|
||||
@@ -4244,12 +4244,12 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class Model7BMemory:
|
||||
"""7B-parameter FP16 weight memory to show batch-size myth under capacity constraints."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
params = 7e9 # 7 billion parameters
|
||||
a100_mem = A100_MEM_CAPACITY.m_as(GiB) # 80 GB
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fp16_gb = model_memory(params, BYTES_FP16, GB) # 14 GB
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
model_7b_fp16_gb_str = fmt(fp16_gb, precision=0, commas=False) # e.g. "14"
|
||||
a100_remaining_7b_gb_str = fmt(a100_mem - fp16_gb, precision=0, commas=False) # e.g. "66"
|
||||
a100_mem_str = fmt(a100_mem, precision=0, commas=False) # e.g. "80"
|
||||
@@ -4288,17 +4288,17 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class CompilationOverheadCalc:
|
||||
"""Break-even analysis showing frequent recompilation negates compiled-mode throughput gains."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_images = 10_000 # small experiment
|
||||
eager_throughput = 1_450 # images/sec (eager)
|
||||
compiled_throughput = 2_150 # images/sec (compiled)
|
||||
n_recompilations = 10 # code changes
|
||||
compilation_time_s = 30 # seconds per compile
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
eager_total = n_images / eager_throughput # ~6.9 s
|
||||
compiled_total = (n_images / compiled_throughput +
|
||||
n_recompilations * compilation_time_s) # ~304.7 s
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_images_str = f"{n_images:,}" # e.g. "10,000"
|
||||
n_recompilations_str = fmt(n_recompilations, precision=0, commas=False) # e.g. "10"
|
||||
eager_throughput_str = f"{eager_throughput:,}" # e.g. "1,450"
|
||||
@@ -4345,9 +4345,16 @@ Understanding framework internals transforms how practitioners approach performa
::: {.callout-chapter-connection title="From Control Room to Power Plant"}
We have established the software substrate of ML: the frameworks that translate abstract architectures into executable kernels. The computational graphs, autograd tapes, and kernel dispatch pipelines examined here are the control room instruments---they give engineers visibility into and control over the training process. A control room without a source of energy, however, is just a room with glowing lights. We turn next to @sec-model-training, where the concepts introduced here---mixed-precision training, gradient checkpointing, compilation pipelines, and distributed execution contexts---scale from single-device examples to the massive multi-GPU and multi-node orchestration that powers modern AI.
Frameworks are the software substrate that translates abstract architectures into executable kernels. Computational graphs, autograd tapes, and kernel dispatch pipelines are the control room instruments---they give engineers visibility into and control over the training process. A control room without a source of energy, however, is just a room with glowing lights. @sec-model-training puts this machinery to work, scaling mixed-precision training, gradient checkpointing, compilation pipelines, and distributed execution contexts from single-device examples to the massive multi-GPU and multi-node orchestration that powers modern AI.
:::

::: { .quiz-end }
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:frameworks")
```

@@ -25,7 +25,7 @@ engine: jupyter
_Why does moving data cost more than computing it?_
The central surprise of modern computing is that *arithmetic is nearly free while memory access is expensive*. In the time it takes to fetch a single value from main memory, a processor could perform thousands of calculations. This inversion, the "Memory Wall," is not an engineering limitation awaiting a fix; it is a physical consequence of the speed of light and the energy cost of moving electrons across silicon. It explains why specialized accelerators exist: GPUs, TPUs, and neural processing units are not merely faster at math but architected specifically to hide, amortize, and minimize the crushing cost of moving data through deep memory hierarchies, massive parallelism, and specialized data paths. Concretely, hardware acceleration is the only way to sustain the exponential growth required by modern AI models, as general-purpose CPU scaling alone is no longer sufficient. It explains why some optimizations that reduce theoretical computation fail to improve actual runtime: if the operation was already memory-bound, computing less changes nothing because the bottleneck was never computation. And it explains why hardware selection cannot be reduced to comparing peak FLOPS—what matters is whether a workload's data movement patterns align with what the hardware was actually designed to accelerate. For the engineer choosing hardware, this means the question is never "which chip is fastest?" but "which chip's memory system best matches my model's access patterns?" A model with large embedding tables and irregular lookups needs a very different accelerator than one performing dense matrix multiplications over compact weight tensors. Getting this match right is the difference between running at 10% of theoretical peak and running at 80%.
The central surprise of modern computing is that *arithmetic is nearly free while memory access is expensive*. In the time it takes to fetch a single value from main memory, a processor could perform thousands of calculations. This inversion, the "Memory Wall," is not an engineering limitation awaiting a fix; it is a physical consequence of the speed of light and the energy cost of moving electrons across silicon. It explains why specialized accelerators exist: GPUs, TPUs, and neural processing units are not merely faster at math but architected specifically to hide, amortize, and minimize the crushing cost of moving data through deep memory hierarchies, massive parallelism, and specialized data paths. Concretely, hardware acceleration is the only way to sustain the exponential growth required by modern AI models, as general-purpose CPU scaling alone is no longer sufficient. It explains why some optimizations that reduce theoretical computation fail to improve actual runtime: if the operation was already memory-bound, computing less changes nothing because the bottleneck was never computation. It also explains why hardware selection cannot be reduced to comparing peak FLOPS---what matters is whether a workload's data movement patterns align with what the hardware was actually designed to accelerate. For the engineer choosing hardware, this means the question is never "which chip is fastest?" but "which chip's memory system best matches my model's access patterns?" A model with large embedding tables and irregular lookups needs a very different accelerator than one performing dense matrix multiplications over compact weight tensors. Getting this match right is the difference between running at 10% of theoretical peak and running at 80%.
::: {.content-visible when-format="pdf"}
\newpage
@@ -54,7 +54,7 @@ start_chapter("vol1:hw_acceleration")
|
||||
## Acceleration Fundamentals {#sec-hardware-acceleration-ai-hardware-acceleration-fundamentals-9b28}
|
||||
|
||||
\index{D·A·M Taxonomy!machine axis}
|
||||
We have optimized the Data in @sec-data-selection and compressed the Algorithm (Model) in @sec-model-compression. Now we turn to the final axis of the D·A·M taxonomy (@sec-introduction): the Machine. Hardware acceleration exists because of a striking asymmetry in modern computing: arithmetic is *cheap*, but moving data is *expensive*. In the time a modern GPU computes a thousand floating-point operations, a single value travels from main memory. This inversion, where computation is the abundant resource and bandwidth is the scarce one, is the reason specialized hardware matters for machine learning.
|
||||
@sec-data-selection optimized the Data and @sec-model-compression compressed the Algorithm (Model). The final axis of the D·A·M taxonomy (@sec-introduction) is the Machine. Hardware acceleration exists because of a striking asymmetry in modern computing: arithmetic is *cheap*, but moving data is *expensive*. In the time a modern GPU computes a thousand floating-point operations, a single value travels from main memory. This inversion, where computation is the abundant resource and bandwidth is the scarce one, is the reason specialized hardware matters for machine learning.
|
||||
|
||||
::: {.callout-definition title="Hardware Acceleration"}
|
||||
|
||||
@@ -68,7 +68,7 @@ We have optimized the Data in @sec-data-selection and compressed the Algorithm (
|
||||
|
||||
The definition above frames the chapter's central engineering tradeoff. General-purpose processors devote substantial silicon area to branch prediction\index{Branch Prediction!eliminated in accelerators}, speculative execution\index{Speculative Execution!eliminated in accelerators}, and complex cache coherence protocols\index{Cache Coherence!accelerator trade-offs}. Accelerators strip away that generality, filling the die with arithmetic units tuned to the regular, data-parallel patterns that characterize neural network computation. The result is order-of-magnitude improvements in throughput per watt for the workloads that match these patterns.
|
||||
|
||||
Hardware alone, however, cannot achieve these gains. The algorithms must be designed to leverage what the hardware offers, and the hardware must be built to accelerate the operations algorithms actually use. This symbiosis motivates a complementary principle: *hardware-software co-design*.
|
||||
Hardware alone, however, cannot achieve these gains. The algorithms must be designed to exploit what the hardware offers, and the hardware must be built to accelerate the operations algorithms actually use. This symbiosis motivates a complementary principle: *hardware-software co-design*.
|
||||
|
||||
::: {.callout-definition title="Hardware-Software Co-design"}
|
||||
|
||||
@@ -104,7 +104,7 @@ $$ Speedup = \frac{1}{(1 - p) + \frac{p}{S}} $$ {#eq-amdahl}
|
||||
:::
|
||||
|
||||
\index{Acceleration Wall!diminishing returns}
|
||||
Amdahl's Law is not merely theoretical: it explains *why* many GPU upgrades disappoint in practice. The following heatmap (@fig-iron-law-heatmap) visualizes the *Acceleration Wall*—the diminishing returns from faster hardware when serial bottlenecks persist—showing that unless your workload is highly parallelizable ($p > 0.99$), investing in faster hardware yields diminishing returns. The contour values are illustrative ranges for intuition.
|
||||
Amdahl's Law is not merely theoretical: it explains *why* many GPU upgrades disappoint in practice. The following heatmap (@fig-iron-law-heatmap) visualizes the *Acceleration Wall*, the diminishing returns from faster hardware when serial bottlenecks persist. Unless a workload is highly parallelizable ($p > 0.99$), investing in faster hardware yields diminishing returns. The contour values are illustrative ranges for intuition.
|
||||
|
||||
::: {#fig-iron-law-heatmap fig-env="figure" fig-pos="htb" fig-cap="**The Iron Law Heatmap**: Total system speedup as a function of Accelerator Speed ($S$) and Parallel Fraction ($p$). The 'Acceleration Wall' at the top reveals that if a workload is even slightly serial ($p < 0.9$), increasing hardware speed yields almost no benefit. Contours span roughly 1×–500× speedup." fig-alt="Heatmap of Speedup vs Accelerator Speed and Parallel Fraction. High speedup (green/yellow) is only achieved in the bottom right corner where Parallel Fraction is near 1.0. The rest of the map is dominated by blue (low speedup), showing the serial bottleneck."}
|
||||
```{python}
|
||||
@@ -192,21 +192,21 @@ from mlsys.constants import (
|
||||
BILLION, MILLION, TRILLION, THOUSAND
|
||||
)
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AmdahlH100:
|
||||
"""
|
||||
Namespace for Amdahl's Law on H100.
|
||||
Scenario: Comparing speedup for Compute-Bound (ResNet) vs Memory-Bound (GPT-2).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
hw_speedup_factor = 500.0 # H100 vs CPU matmul
|
||||
|
||||
# Workload Parallel Fractions (p)
|
||||
p_resnet = 0.95 # 95% parallel (Compute Bound)
|
||||
p_gpt2 = 0.80 # 80% parallel (Bandwidth Bound / Serial Overhead)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Amdahl's Law: Speedup = 1 / ((1-p) + (p/s))
|
||||
|
||||
def calc_speedup(p, s):
|
||||
@@ -220,12 +220,12 @@ class AmdahlH100:
|
||||
# Theoretical ceiling (if s -> infinity)
|
||||
ceiling_gpt2 = 1 / (1 - p_gpt2)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(speedup_resnet >= speedup_gpt2 * 3,
|
||||
f"ResNet speedup ({speedup_resnet:.1f}x) should be much higher than GPT-2 ({speedup_gpt2:.1f}x).")
|
||||
check(speedup_gpt2 <= ceiling_gpt2, "Speedup cannot exceed theoretical ceiling.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
# Hardware context
|
||||
h100_tflops_int8 = f"{H100_FLOPS_INT8.m_as(TFLOPs/second):,.0f}"
|
||||
hw_speedup_str = fmt(hw_speedup_factor, precision=0, commas=False)
|
||||
@@ -289,16 +289,16 @@ These examples reveal that the critical question for any hardware optimization i
[^fn-arithmetic-intensity-roofline]: **Arithmetic Intensity**: The ratio of compute operations performed for each byte of data moved from memory (FLOP/byte). This metric provides the direct, quantitative answer to the text's central question: workloads with high arithmetic intensity, like ResNet's convolutions (>50 FLOP/byte), are compute-bound and accelerate with more TFLOPS. Workloads with low intensity, like GPT-2's attention layers (<10 FLOP/byte), are memory-bound, making faster chips irrelevant without more bandwidth. \index{Arithmetic Intensity!roofline diagnostic}
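A back-of-the-envelope version of this diagnostic, as a minimal sketch in plain Python (the 312 TFLOP/s and 2 TB/s peaks are assumed A100-class figures used only to place the ridge point, and the traffic formulas ignore on-chip reuse):

```python
# Arithmetic intensity (FLOP/byte) for two kernel shapes, FP16 operands (2 bytes each).

def gemm_ai(n, bytes_per_el=2):
    flops = 2 * n**3                      # multiply-accumulates in an n x n x n GEMM
    traffic = 3 * n * n * bytes_per_el    # read A and B, write C
    return flops / traffic

def relu_ai(n, bytes_per_el=2):
    return n / (2 * n * bytes_per_el)     # one op per element, read + write each element

ridge = 312e12 / 2.0e12                   # assumed peak FLOP/s over peak bytes/s, ~156

for n in (1024, 4096):
    ai = gemm_ai(n)
    print(f"GEMM n={n}: {ai:6.1f} FLOP/byte ->",
          "compute-bound" if ai > ridge else "memory-bound")
print(f"ReLU       : {relu_ai(1_000_000):6.2f} FLOP/byte -> memory-bound")
```
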
With this analytical lens in place, the chapter proceeds through four major topics. First, we trace the historical evolution of domain-specific architectures, from floating-point coprocessors through graphics processors to contemporary AI accelerators. Second, we examine the computational primitives that characterize ML workloads (matrix multiplication, vector operations, and nonlinear activation functions) and analyze how specialized hardware optimizes these operations through innovations such as systolic arrays and tensor cores. Third, we turn to memory hierarchy design, where data movement energy costs exceeding computation costs by more than 100$\times$ make on-chip buffer optimization and high-bandwidth memory interfaces critical. Fourth, the software stack: compiler optimization and runtime system support determine the extent to which theoretical hardware capabilities translate into measurable performance. Throughout, the focus remains on single-machine systems; multi-machine coordination constitutes an advanced topic beyond this scope.
The analytical tools are now in place. Four topics build on them: the historical evolution of domain-specific architectures, from floating-point coprocessors through graphics processors to contemporary AI accelerators; the computational primitives that characterize ML workloads (matrix multiplication, vector operations, and nonlinear activation functions) and how specialized hardware optimizes them through systolic arrays and tensor cores; memory hierarchy design, where data movement energy costs exceeding computation costs by more than 100$\times$ make on-chip buffer optimization and high-bandwidth memory interfaces critical; and the software stack, where compiler optimization and runtime system support determine the extent to which theoretical hardware capabilities translate into measurable performance. Throughout, the focus remains on single-machine systems; multi-machine coordination constitutes an advanced topic beyond this scope.
The Amdahl's Law analysis and roofline framework establish the analytical tools; the rest of the chapter examines the hardware that these tools diagnose. We begin with the question that precedes all architecture: *why* did specialized hardware emerge, and what recurring design patterns does that history reveal?
|
||||
The question that precedes all architecture is: *why* did specialized hardware emerge, and what recurring design patterns does that history reveal?
|
||||
|
||||
## Hardware Specialization {#sec-hardware-acceleration-evolution-hardware-specialization-fdb7}
|
||||
|
||||
\index{Hardware Specialization!evolution}
|
||||
The definitions above establish *what* hardware acceleration achieves. Understanding *why* these architectural choices emerged requires tracing their historical development. Computing architectures follow a recurring pattern: as workloads grow in complexity, general-purpose processors become inefficient, prompting specialized hardware development. Machine learning acceleration represents the latest stage in this evolution, following a trajectory observed in floating-point arithmetic, graphics processing, and digital signal processing. Understanding this history serves a practical purpose, since the architectural innovations that addressed floating-point bottlenecks in the 1980s, graphics throughput in the 1990s, and media processing in the 2000s inform today's AI accelerator designs. Each era confronted the same constraint introduced in the Purpose section: data movement costs dominate computation costs, and specialization succeeds by minimizing unnecessary data movement.
|
||||
Computing architectures follow a recurring pattern: as workloads grow in complexity, general-purpose processors become inefficient, prompting specialized hardware development. Machine learning acceleration represents the latest stage in this evolution, following a trajectory observed in floating-point arithmetic, graphics processing, and digital signal processing. The architectural innovations that addressed floating-point bottlenecks in the 1980s, graphics throughput in the 1990s, and media processing in the 2000s inform today's AI accelerator designs. Each era confronted the same constraint introduced in the Purpose section: data movement costs dominate computation costs, and specialization succeeds by minimizing unnecessary data movement.
|
||||
|
||||
Modern ML accelerators (GPUs with tensor cores, Google's TPUs[^fn-tpu-origin], Apple's Neural Engine) emerged from these established architectural principles. This section traces the evolution through four phases: specialized computing origins, parallel graphics processing, domain-specific architectures, and the emergence of ML-specific hardware. Each phase reveals design principles that remain relevant for understanding and optimizing contemporary AI systems. The magnitude of the gains from domain-specific design became unmistakable in 2015, when Google's first TPU delivered an *efficiency shock* that reshaped the industry's approach to AI hardware.
|
||||
Modern ML accelerators (GPUs with tensor cores, Google's TPUs[^fn-tpu-origin], Apple's Neural Engine) emerged from these established architectural principles. The evolution spans four phases: specialized computing origins, parallel graphics processing, domain-specific architectures, and the emergence of ML-specific hardware. Each phase reveals design principles that remain relevant for understanding and optimizing contemporary AI systems. The magnitude of the gains from domain-specific design became unmistakable in 2015, when Google's first TPU delivered an *efficiency shock* that reshaped the industry's approach to AI hardware.
|
||||
|
||||
::: {.callout-example title="The TPUv1 vs. K80 Efficiency Shock"}
|
||||
**The Comparison**: In 2015, Google deployed its first Tensor Processing Unit (TPUv1)\index{TPU!v1 efficiency shock} and compared it to the dominant GPU of the era, the NVIDIA K80\index{NVIDIA!K80}.
|
||||
@@ -609,15 +609,15 @@ from mlsys.formatting import fmt
|
||||
class CpuMlInefficiency:
|
||||
"""CPU vs accelerator efficiency gap for ML workloads."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cpu_utilization_min_value = 5 # % utilization on ML workloads
|
||||
cpu_utilization_max_value = 10 # % utilization on ML workloads
|
||||
cpu_gflops_value = 100 # Typical CPU GFLOPS for ML
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# (values are already given; no derivation needed)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cpu_utilization_min_str = fmt(cpu_utilization_min_value, precision=0, commas=False)
|
||||
cpu_utilization_max_str = fmt(cpu_utilization_max_value, precision=0, commas=False)
|
||||
cpu_gflops_str = fmt(cpu_gflops_value, precision=0, commas=False)
|
||||
@@ -695,7 +695,7 @@ This historical progression reveals a key pattern: each wave of hardware special
|
||||
|
||||
What distinguishes AI acceleration from earlier specialization waves is the scale of integration required. AI accelerators must work seamlessly with frameworks like TensorFlow, PyTorch, and JAX. They require deep compiler support for graph-level transformations, kernel fusion, and memory scheduling. They must also deploy across environments from data centers to mobile devices, each with distinct performance and efficiency requirements. Such system-level transformation requires tight hardware-software coupling, a theme that recurs throughout this chapter.
|
||||
|
||||
First, we must understand _what_ bottleneck AI accelerators are designed to solve. Unlike floating-point coprocessors that addressed arithmetic precision or GPUs that addressed graphics throughput, AI accelerators target a qualitatively different constraint. The answer determines every subsequent architectural decision.
|
||||
The central question is _what_ bottleneck AI accelerators are designed to solve. Unlike floating-point coprocessors that addressed arithmetic precision or GPUs that addressed graphics throughput, AI accelerators target a qualitatively different constraint. The answer determines every subsequent architectural decision.
|
||||
|
||||
### The Integration Bottleneck {#sec-hardware-acceleration-integration-bottleneck-ai-needs-specialized-hardware-0b41}
|
||||
|
||||
@@ -1050,14 +1050,14 @@ from mlsys.formatting import fmt
|
||||
class WeightMatrixCalc:
|
||||
"""Weight matrix parameter count for a single linear layer."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
wm_in_value = 256
|
||||
wm_out_value = 512
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
wm_params_value = wm_in_value * wm_out_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
wm_params_str = f"{wm_params_value:,}"
|
||||
wm_out_str = fmt(wm_out_value, precision=0, commas=False)
|
||||
wm_in_str = fmt(wm_in_value, precision=0, commas=False)
|
||||
@@ -1587,7 +1587,7 @@ While tensor cores package matrix operations into structured computational units
|
||||
from mlsys.formatting import fmt
|
||||
from mlsys.constants import ENERGY_DRAM_ACCESS_PJ, SYSTOLIC_ARRAY_DIM
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AcceleratorEfficiencyAnchor:
|
||||
"""
|
||||
Namespace for systolic array efficiency anchor.
|
||||
@@ -1603,14 +1603,14 @@ class AcceleratorEfficiencyAnchor:
|
||||
systolic_macs_cycle_str = AcceleratorEfficiencyAnchor.systolic_macs_cycle_str
|
||||
accelerator_energy_dividend_str = AcceleratorEfficiencyAnchor.energy_dividend_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class SystolicEnergy:
|
||||
"""
|
||||
Namespace for Systolic Array Energy calculation.
|
||||
Scenario: Comparing energy per MAC for Vector Unit vs Systolic Array.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
dram_pj = ENERGY_DRAM_ACCESS_PJ.m_as('pJ')
|
||||
mac_pj = 1.0 # Compute cost
|
||||
|
||||
@@ -1620,7 +1620,7 @@ class SystolicEnergy:
|
||||
# Systolic Array: Amortizes loads across array width
|
||||
array_dim = SYSTOLIC_ARRAY_DIM # 128
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Vector Energy = (4 * DRAM) + MAC
|
||||
e_vector = (vector_dram_accesses * dram_pj) + mac_pj
|
||||
|
||||
@@ -1631,10 +1631,10 @@ class SystolicEnergy:
|
||||
|
||||
efficiency_ratio = e_vector / e_systolic
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(efficiency_ratio >= 100, f"Systolic efficiency ({efficiency_ratio:.1f}×) is too low. Should be >100× to justify TPU design.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
dram_access_str = fmt(dram_pj, precision=0, commas=False)
|
||||
systolic_size_str = fmt(array_dim, precision=0, commas=False)
|
||||
vector_accesses_str = fmt(vector_dram_accesses, precision=0, commas=False)
|
||||
@@ -1796,18 +1796,18 @@ node[right]{Data};
|
||||
from mlsys.constants import SYSTOLIC_ARRAY_DIM
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TilingPrinciple:
|
||||
"""
|
||||
Namespace for The Tiling Principle calculation.
|
||||
Scenario: Mapping a Transformer hidden layer to a Systolic Array.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
layer_dim = 4096 # Standard Transformer layer width
|
||||
array_dim = SYSTOLIC_ARRAY_DIM # 128
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Number of tiles per dimension
|
||||
tiles_per_dim = layer_dim / array_dim
|
||||
total_tiles = tiles_per_dim ** 2
|
||||
@@ -1815,10 +1815,10 @@ class TilingPrinciple:
|
||||
# Each tile of weights is loaded once and used for 'layer_dim' MACs
|
||||
reuse_factor = array_dim
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_tiles == 1024, f"Total tiles for 4096/128 should be 1024. Got {total_tiles:.0f}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
layer_dim_str = f"{layer_dim:,}"
|
||||
array_dim_str = f"{array_dim}"
|
||||
tile_count_str = f"{int(total_tiles):,}"
|
||||
@@ -1902,13 +1902,13 @@ from mlsys.constants import SYSTOLIC_ARRAY_DIM
|
||||
class SystolicOpsCalc:
|
||||
"""Peak MAC throughput of the canonical 128×128 systolic array."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
systolic_dim_value = SYSTOLIC_ARRAY_DIM
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
systolic_ops_value = systolic_dim_value * systolic_dim_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
systolic_dim_str = fmt(systolic_dim_value, precision=0, commas=False)
|
||||
systolic_ops_str = f"{systolic_ops_value:,}"
|
||||
|
||||
@@ -1971,7 +1971,7 @@ Modern AI processors exhibit a range of design trade-offs based on their intende
|
||||
| **Intel Sapphire** | 512-bit AVX | $32\times32$ INT8/BF16 | 56 cores | Inference |
|
||||
| **Apple M1** | 128-bit NEON | $16\times16$ FP16 | 8 NPU cores | Mobile inference |
|
||||
|
||||
: **AI Processor Configurations.** Modern AI processors prioritize different execution unit characteristics for specific workloads: NVIDIA A100 leverages wide SIMD and tensor cores for training, Google TPUv4 emphasizes high-throughput BF16 matrix multiplication, Intel Sapphire Rapids focuses on INT8-optimized inference, and Apple M1 prioritizes low-power FP16 execution. These variations in SIMD width, tensor core size, and processing element count reflect the growing diversity in AI hardware architectures. {#tbl-execution-units}
|
||||
: **AI Processor Configurations.** Modern AI processors prioritize different execution unit characteristics for specific workloads: NVIDIA A100 uses wide SIMD and tensor cores for training, Google TPUv4 emphasizes high-throughput BF16 matrix multiplication, Intel Sapphire Rapids focuses on INT8-optimized inference, and Apple M1 prioritizes low-power FP16 execution. These variations in SIMD width, tensor core size, and processing element count reflect the growing diversity in AI hardware architectures. {#tbl-execution-units}
|
||||
|
||||
The pattern across these configurations reveals a consistent engineering principle: each design sacrifices generality to optimize for its target workload's dominant operation and precision. Training chips invest silicon in wide floating-point datapaths; inference chips trade precision for throughput; mobile chips trade throughput for energy efficiency. No single design dominates across all workloads, which is precisely why hardware selection depends on workload analysis rather than headline specifications.
|
||||
|
||||
@@ -2025,7 +2025,7 @@ from mlsys import Hardware
|
||||
class AcceleratorEconomics:
|
||||
"""Cost-performance comparison across accelerator generations."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
h_v100 = Hardware.Cloud.V100
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
h_h100 = Hardware.Cloud.H100
|
||||
@@ -2041,7 +2041,7 @@ class AcceleratorEconomics:
|
||||
gaudi_tf = 200
|
||||
gaudi_bw_value = 800
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
v100_tf = h_v100.peak_flops.m_as(TFLOPs/second)
|
||||
v100_ratio = price_v100 / v100_tf
|
||||
|
||||
@@ -2056,7 +2056,7 @@ class AcceleratorEconomics:
|
||||
|
||||
gaudi_ratio = price_gaudi / gaudi_tf
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
# V100 specs
|
||||
v100_tflops = f"{h_v100.peak_flops.m_as(TFLOPs/second):.0f}"
|
||||
v100_bw = f"{h_v100.memory_bw.m_as(GB/second):.0f}"
|
||||
@@ -2167,7 +2167,7 @@ Four perspectives inform memory system design. First, we quantify the growing di
|
||||
|
||||
:::
|
||||
|
||||
[^fn-von-neumann-bottleneck]\index{Von Neumann, John!stored-program architecture}The underlying cause of this wall—the Von Neumann Bottleneck that has constrained computing since 1945—is physical: moving data costs orders of magnitude more energy than processing it.
|
||||
[^fn-von-neumann-bottleneck]\index{Von Neumann, John!stored-program architecture}The underlying cause of this wall is physical: the Von Neumann Bottleneck, which has constrained computing since 1945, means that moving data costs orders of magnitude more energy than processing it.
|
||||
|
||||
[^fn-von-neumann-bottleneck]: **Von Neumann Bottleneck**: The physical separation of the processor from its memory forces all instructions and data to traverse an energy-intensive bus. This distance is the direct cause of the high energy cost of data movement; every byte must be fetched, paying a physical tax. Accessing a value from external DRAM can cost over 20,000× more energy than performing an 8-bit integer operation on that value [@horowitz2014computing]. \index{Von Neumann Bottleneck!ML accelerator constraint}
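
A quick back-of-envelope check of that ratio, using assumed per-operation energies in the spirit of the Horowitz survey (approximate 45 nm figures, not values quoted from the text):

```python
# Sanity check of the ">20,000x" energy gap. Values are assumed
# order-of-magnitude figures: an 8-bit integer add vs. fetching a
# 32-bit word from external DRAM.
INT8_ADD_PJ = 0.03     # assumed energy of an 8-bit add, picojoules
DRAM_FETCH_PJ = 640.0  # assumed energy of a 32-bit DRAM read, picojoules

ratio = DRAM_FETCH_PJ / INT8_ADD_PJ
print(f"DRAM fetch / INT8 add ≈ {ratio:,.0f}x")  # ≈ 21,000x
```
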
@@ -2380,16 +2380,16 @@ from mlsys.constants import (
|
||||
class TensorLifecycleCalc:
|
||||
"""Memory-hierarchy costs for a 1-second KWS audio tensor during inference."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
kws_samples_value = 16_000
|
||||
kws_bytes_fp16_value = BYTES_FP16.m_as('B')
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
kws_tensor_kb_value = kws_samples_value * kws_bytes_fp16_value / KIB_TO_BYTES
|
||||
dram_energy_pj_bit_value = ENERGY_DRAM_PJ_PER_BYTE.m_as('pJ/B') / 8
|
||||
a100_fp16_tflops_value = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
kws_tensor_str = fmt(kws_tensor_kb_value, precision=1, commas=False)
|
||||
kws_samples_str = fmt(kws_samples_value, precision=0, commas=True)
|
||||
kws_bytes_str = fmt(kws_bytes_fp16_value, precision=0, commas=False)
|
||||
@@ -2569,10 +2569,10 @@ At the highest level, large-capacity but slow storage devices provide long-term
|
||||
| **Memory Level** | **Approx. Latency** | **Bandwidth** | **Capacity** | **Example Use in Deep Learning** |
|
||||
|:-------------------------------------------------|--------------------:|:--------------|:-------------|:---------------------------------------------------------------------|
|
||||
| **Registers** | ~1 cycle | Highest | Few values | Storing operands for immediate computation |
|
||||
| **L1/L2 Cache (SRAM)**\index{SRAM!on-chip cache} | ~1-10 ns | High | KBs-MBs | Caching frequently accessed activations and small weight blocks |
|
||||
| **Scratchpad Memory** | ~5-20 ns | High | MBs | Software-managed storage for intermediate computations |
|
||||
| **L1/L2 Cache (SRAM)**\index{SRAM!on-chip cache} | ~1--10 ns | High | KBs--MBs | Caching frequently accessed activations and small weight blocks |
|
||||
| **Scratchpad Memory** | ~5--20 ns | High | MBs | Software-managed storage for intermediate computations |
|
||||
| **High-Bandwidth Memory (HBM)** | ~100 ns | Very High | GBs | Storing large model parameters and activations for high-speed access |
|
||||
| **Off-Chip DRAM (DDR, GDDR, LPDDR)** | ~50-150 ns | Moderate | GBs-TBs | Storing entire model weights that do not fit on-chip |
|
||||
| **Off-Chip DRAM (DDR, GDDR, LPDDR)** | ~50--150 ns | Moderate | GBs--TBs | Storing entire model weights that do not fit on-chip |
|
||||
| **Flash Storage (SSD/NVMe)**                      | ~100 µs--1 ms       | Low           | TBs          | Storing pre-trained models and checkpoints for later loading          |
|
||||
|
||||
: **Memory Hierarchy Trade-Offs.** AI accelerators use a multi-level memory hierarchy to balance performance and capacity. Each level provides distinct latency, bandwidth, and capacity characteristics that dictate how neural network components (weights, activations, and intermediate results) should be allocated to minimize bottlenecks and maximize throughput. {#tbl-memory-hierarchy}
|
||||
@@ -2739,14 +2739,14 @@ from mlsys.constants import (
|
||||
GB, second, Gbps
|
||||
)
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class InterconnectHierarchy:
|
||||
"""
|
||||
Namespace for Interconnect Bandwidth Hierarchy.
|
||||
Scenario: The bandwidth taper from Chip -> Node -> Cluster.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Device
|
||||
hbm_bw = A100_MEM_BW.m_as(GB/second)
|
||||
|
||||
@@ -2763,12 +2763,12 @@ class InterconnectHierarchy:
|
||||
|
||||
ib_ndr_gbps = INFINIBAND_NDR_BW.m_as(Gbps)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
# The "Bandwidth Taper" must hold: HBM > NVLink > PCIe > Network
|
||||
check(hbm_bw > nvlink_h100 > pcie_gen4 > ib_hdr_gbs,
|
||||
f"Bandwidth hierarchy violated. HBM({hbm_bw}) > NVLink({nvlink_h100}) > PCIe({pcie_gen4}) > Net({ib_hdr_gbs})")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
nvlink_a100_str = f"{nvlink_a100:.0f}"
|
||||
nvlink_h100_str = f"{nvlink_h100:.0f}"
|
||||
ib_hdr_str = f"{ib_hdr_gbps:.0f}"
|
||||
@@ -2783,7 +2783,7 @@ ib_ndr = InterconnectHierarchy.ib_ndr_str
|
||||
|
||||
1. **Device-Device Interconnect (NVLink / Infinity Fabric)**[^fn-nvlink-bandwidth]\index{NVLink!GPU interconnect}\index{GPU Interconnect!switching fabric}\index{Infinity Fabric!AMD interconnect}: Modern multi-GPU nodes use specialized high-speed bridges like NVLink to connect accelerators directly, bypassing the host CPU. Bandwidth ranges from `{python} nvlink_a100` to `{python} nvlink_h100` GB/s per GPU. The primary use case is gradient synchronization (AllReduce)\index{AllReduce!gradient synchronization}[^fn-allreduce-gradient-sync] during distributed training. This bandwidth is critical for scaling; without it, multi-GPU training often scales poorly.
|
||||
|
||||
[^fn-nvlink-bandwidth]: **NVLink (NVIDIA Link)**: This direct GPU-to-GPU interconnect is required because gradient synchronization (AllReduce) operations must exchange the entire model's gradients on every training step. Without the 600-900 GB/s of bandwidth this provides---roughly 10-14x more than the standard PCIe bus---the communication overhead causes training to become bottlenecked, preventing scaling beyond 2 to 4 GPUs in a node. \index{NVLink!training scaling}
|
||||
[^fn-nvlink-bandwidth]: **NVLink (NVIDIA Link)**: This direct GPU-to-GPU interconnect is required because gradient synchronization (AllReduce) operations must exchange the entire model's gradients on every training step. Without the 600--900 GB/s of bandwidth this provides---roughly 10--14$\times$ more than the standard PCIe bus---the communication overhead causes training to become bottlenecked, preventing scaling beyond 2 to 4 GPUs in a node. \index{NVLink!training scaling}
|
||||
|
||||
[^fn-allreduce-gradient-sync]: **AllReduce**: A collective operation from MPI that aggregates values across all processes (the "reduce") and distributes the result back to every process (the "all"). In multi-GPU training, AllReduce synchronizes gradients every iteration: for a 7B-parameter model in FP16, each step exchanges roughly 14 GB across all GPUs. Ring AllReduce achieves optimal bandwidth utilization by having each GPU send and receive simultaneously, but the operation still imposes a serial fraction in Amdahl's Law that caps multi-GPU scaling efficiency. \index{AllReduce!gradient synchronization cost}
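
The traffic behind that figure can be sketched with a bandwidth-only model of ring AllReduce; the GPU count, link bandwidth, and the assumption of no compute/communication overlap are illustrative choices, not measurements:

```python
# Bandwidth-only ring-AllReduce sketch (assumed values; ignores latency and
# any overlap of communication with the backward pass).
params = 7e9         # 7B-parameter model
bytes_per_param = 2  # FP16 gradients
n_gpus = 8           # assumed GPUs per node
link_bw = 600e9      # assumed per-GPU NVLink bandwidth, bytes/s

grad_bytes = params * bytes_per_param                     # ~14 GB per step
per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # ring send + receive
sync_seconds = per_gpu_traffic / link_bw

print(f"gradients: {grad_bytes / 1e9:.0f} GB, "
      f"per-GPU traffic: {per_gpu_traffic / 1e9:.1f} GB, "
      f"sync ≈ {sync_seconds * 1e3:.0f} ms if not overlapped")
```
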
@@ -2952,14 +2952,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class RooflineGap:
|
||||
"""
|
||||
Namespace for Roofline Utilization Gap.
|
||||
Scenario: Comparing Ridge Points across generations (V100 -> H100).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Thresholds
|
||||
legacy_ai = 200.0
|
||||
relu_ai = 0.125
|
||||
@@ -2969,7 +2969,7 @@ class RooflineGap:
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
h_h100 = Hardware.Cloud.H100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Ridge Points (FLOP/Byte) directly from Twins
|
||||
v100_ridge = h_v100.ridge_point().m_as('flop/byte')
|
||||
a100_ridge = h_a100.ridge_point().m_as('flop/byte')
|
||||
@@ -2983,13 +2983,13 @@ class RooflineGap:
|
||||
flops_growth = h_h100.peak_flops / h_a100.peak_flops
|
||||
relu_gap = h100_ridge / relu_ai
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
# Ridge points must climb: H100 > A100 > V100
|
||||
check(h100_ridge > a100_ridge > v100_ridge,
|
||||
f"Ridge points must climb. H100({h100_ridge:.0f}) > A100({a100_ridge:.0f}) > V100({v100_ridge:.0f}).")
|
||||
check(relu_gap >= 1000, f"ReLU gap ({relu_gap:.0f}x) is too small.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
v100_ridge_str = f"{v100_ridge:.0f}"
|
||||
a100_ridge_str = f"{a100_ridge:.0f}"
|
||||
h100_ridge_str = f"{h100_ridge:.0f}"
|
||||
@@ -3033,11 +3033,11 @@ relu_below_roofline_str = RooflineGap.relu_below_roofline_str
|
||||
|
||||
| **Operation** | **Arithmetic Intensity** | **Classification** | **Lighthouse Example** |
|
||||
|:----------------------|-------------------------:|:-------------------|:------------------------|
|
||||
| **Conv2D (Dense)** | 50-200 FLOP/byte | Compute-bound | **ResNet-50** |
|
||||
| **Dense MatMul** | 64-256 FLOP/byte | Compute-bound | **GPT-2 (Projections)** |
|
||||
| **Depthwise Conv** | 10-20 FLOP/byte | Memory-bound | **MobileNet** |
|
||||
| **Attention Softmax** | 2-5 FLOP/byte | Memory-bound | **GPT-2 (Generation)** |
|
||||
| **LayerNorm** | 5-10 FLOP/byte | Memory-bound | **GPT-2 / Llama** |
|
||||
| **Conv2D (Dense)** | 50--200 FLOP/byte | Compute-bound | **ResNet-50** |
|
||||
| **Dense MatMul** | 64--256 FLOP/byte | Compute-bound | **GPT-2 (Projections)** |
|
||||
| **Depthwise Conv** | 10--20 FLOP/byte | Memory-bound | **MobileNet** |
|
||||
| **Attention Softmax** | 2--5 FLOP/byte | Memory-bound | **GPT-2 (Generation)** |
|
||||
| **LayerNorm** | 5--10 FLOP/byte | Memory-bound | **GPT-2 / Llama** |
|
||||
| **Embedding lookup** | <1 FLOP/byte | Memory-bound | **DLRM** |
|
||||
|
||||
: **Operations on the Roofline.** Neural network layers span a wide range of arithmetic intensities. By mapping these operations to the **Lighthouse Models**, ResNet-50 emerges as compute-bound (high AI) while MobileNet and DLRM are memory-bound (low AI). {#tbl-roofline-operations}
|
||||
@@ -3069,14 +3069,14 @@ from mlsys.constants import (
|
||||
class TransformerLayerCalc:
|
||||
"""Arithmetic intensity for QKV projection vs. softmax in a transformer layer."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
t_hidden_value = TRANSFORMER_HIDDEN_DIM_EXAMPLE
|
||||
t_batch_value = 32
|
||||
t_seq_value = TRANSFORMER_SEQ_LEN_EXAMPLE
|
||||
t_heads_value = TRANSFORMER_HEADS_EXAMPLE
|
||||
t_fp_bytes_value = BYTES_FP16.m_as(byte)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# QKV Projection
|
||||
qkv_flops_value = 3 * t_batch_value * t_seq_value * t_hidden_value * t_hidden_value
|
||||
qkv_flops_b_value = (qkv_flops_value * flop).m_as(GFLOPs)
|
||||
@@ -3095,7 +3095,7 @@ class TransformerLayerCalc:
|
||||
softmax_mb_value = (softmax_bytes_value * byte).m_as(MB)
|
||||
softmax_ai_value = softmax_flops_value / softmax_bytes_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
t_hidden_str = fmt(t_hidden_value, precision=0, commas=False)
|
||||
t_batch_str = fmt(t_batch_value, precision=0, commas=False)
|
||||
t_seq_str = fmt(t_seq_value, precision=0, commas=False)
|
||||
@@ -3177,7 +3177,7 @@ from mlsys.constants import BYTES_FP16, byte, MB, flop, GFLOPs
|
||||
class Conv2dAnalysisCalc:
|
||||
"""Roofline analysis for a 3×3 Conv2D layer showing compute-bound behaviour."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
conv_batch = 32
|
||||
conv_cin = 128
|
||||
conv_h = 56
|
||||
@@ -3186,7 +3186,7 @@ class Conv2dAnalysisCalc:
|
||||
conv_k = 3
|
||||
conv_fp_bytes = BYTES_FP16.m_as('B')
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
conv_out_elements = conv_batch * conv_cout * conv_h * conv_w
|
||||
conv_out_m = conv_out_elements / MILLION
|
||||
conv_flops_per_out = conv_cin * conv_k * conv_k * 2
|
||||
@@ -3199,7 +3199,7 @@ class Conv2dAnalysisCalc:
|
||||
|
||||
conv_ai = conv_total_gflops * 1e3 / conv_total_mb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
conv_out_m_str = fmt(conv_out_m, precision=1, commas=False)
|
||||
conv_flops_per_out_str = f"{conv_flops_per_out:,}"
|
||||
conv_total_gflops_str = fmt(conv_total_gflops, precision=1, commas=False)
|
||||
@@ -3275,12 +3275,12 @@ from mlsys.constants import (
|
||||
class DenseLayerAnalysisCalc:
|
||||
"""Roofline analysis for a small GEMM showing memory-bound behaviour."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
dense_batch = 32
|
||||
dense_in = 2048
|
||||
dense_out = 2048
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
dense_total_mflops = (2 * dense_batch * dense_in * dense_out * flop).m_as(MFLOPs)
|
||||
|
||||
dense_input_kb = dense_batch * dense_in * 2 / KIB_TO_BYTES
|
||||
@@ -3295,7 +3295,7 @@ class DenseLayerAnalysisCalc:
|
||||
dense_attainable_tflops = (a100_bw_gbs_value * dense_ai * GFLOPs).m_as(TFLOPs)
|
||||
dense_util_pct = dense_attainable_tflops / a100_peak * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
dense_total_mflops_str = fmt(dense_total_mflops, precision=0, commas=False)
|
||||
dense_input_kb_str = fmt(dense_input_kb, precision=0, commas=False)
|
||||
dense_weights_mb_str = fmt(dense_weights_mb, precision=1, commas=False)
|
||||
@@ -3371,12 +3371,12 @@ from mlsys.constants import byte, MB, A100_MEM_BW, GB, GFLOPs, TFLOPs, second
|
||||
class LayernormAnalysisCalc:
|
||||
"""Roofline analysis for LayerNorm showing severely memory-bound behaviour."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
ln_batch = 32
|
||||
ln_seq = 512
|
||||
ln_hidden = 768
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
ln_elements = ln_batch * ln_seq * ln_hidden
|
||||
ln_elements_m = ln_elements / MILLION
|
||||
ln_flops_per = 6
|
||||
@@ -3392,7 +3392,7 @@ class LayernormAnalysisCalc:
|
||||
a100_bw_gbs_value = A100_MEM_BW.m_as(GB / second)
|
||||
ln_attainable_tflops = (a100_bw_gbs_value * ln_ai * GFLOPs).m_as(TFLOPs)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
ln_elements_m_str = fmt(ln_elements_m, precision=1, commas=False)
|
||||
ln_total_mflops_str = fmt(ln_total_mflops, precision=1, commas=False)
|
||||
ln_input_mb_str = fmt(ln_input_mb, precision=1, commas=False)
|
||||
@@ -3448,9 +3448,9 @@ The roofline analysis directly informs optimization priorities:
|
||||
1. **High AI (>200 FLOP/byte)**: Compute-bound operations like large convolutions
|
||||
- Priority: Maximize compute utilization
|
||||
- Techniques: Use Tensor Cores, optimize thread block dimensions, maximize occupancy
|
||||
- Impact: Can approach 90-95% of peak TFLOPS
|
||||
- Impact: Can approach 90--95% of peak TFLOPS
|
||||
|
||||
2. **Medium AI (20-200 FLOP/byte)**: Borderline operations like medium-sized dense layers
|
||||
2. **Medium AI (20--200 FLOP/byte)**: Borderline operations like medium-sized dense layers
|
||||
- Priority: Balance compute and memory optimization
|
||||
- Techniques: Increase batch size to improve AI, use register tiling, fuse with adjacent operations
|
||||
- Impact: Can move from memory-bound to compute-bound regime
|
||||
@@ -3458,7 +3458,7 @@ The roofline analysis directly informs optimization priorities:
|
||||
3. **Low AI (<20 FLOP/byte)**: Memory-bound operations like small dense layers, element-wise operations
|
||||
- Priority: Reduce memory traffic
|
||||
- Techniques: Aggressive operator fusion, reduce precision (FP16 → INT8), algorithmic changes
|
||||
- Impact: 2-4$\times$ speedup possible through fusion alone
|
||||
- Impact: 2--4$\times$ speedup possible through fusion alone
|
||||
|
||||
4. **Very Low AI (<2 FLOP/byte)**: Severely memory-bound operations like normalization, activation functions
|
||||
- Priority: Eliminate memory round-trips
|
||||
@@ -3511,10 +3511,10 @@ from mlsys.constants import (
|
||||
class Gpt2ThroughputCalc:
|
||||
"""Throughput ceiling for GPT-2 XL autoregressive inference at batch=1."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
gpt2_weight_gb = model_memory(GPT2_PARAMS, BYTES_FP16, GB)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
gpt2_decode_flops = 2 * GPT2_PARAMS.m_as('param')
|
||||
gpt2_decode_gflops = (gpt2_decode_flops * flop).m_as(GFLOPs)
|
||||
gpt2_decode_ai = gpt2_decode_gflops / gpt2_weight_gb
|
||||
@@ -3524,7 +3524,7 @@ class Gpt2ThroughputCalc:
|
||||
gpt2_max_tflops = gpt2_decode_ai * a100_bw_tbs_val
|
||||
gpt2_utilization = gpt2_max_tflops / a100_tflops_fp16_val * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
gpt2_weight_gb_str = fmt(gpt2_weight_gb, precision=1, commas=False)
|
||||
gpt2_decode_gflops_str = fmt(gpt2_decode_gflops, precision=1, commas=False)
|
||||
gpt2_decode_ai_str = fmt(gpt2_decode_ai, precision=1, commas=False)
|
||||
@@ -4010,17 +4010,17 @@ from mlsys.constants import BYTES_FP32, byte, MB
|
||||
class MemoryFootprintCalc:
|
||||
"""Intermediate tensor memory overhead for naïve ReLU-BatchNorm-scale execution."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
tensor_dim = 1024
|
||||
bytes_fp32 = BYTES_FP32.m_as('B')
|
||||
n_intermediates = 4 # X, X', X'', Y tensors stored
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
tensor_mb = (tensor_dim * tensor_dim * bytes_fp32 * byte).m_as(MB)
|
||||
total_mb = n_intermediates * tensor_mb
|
||||
footprint_ratio = n_intermediates
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tensor_mb_str = fmt(tensor_mb, precision=1, commas=False)
|
||||
total_mb_str = fmt(total_mb, precision=1, commas=False)
|
||||
footprint_ratio_str = fmt(footprint_ratio, precision=0, commas=False)
|
||||
@@ -4081,17 +4081,17 @@ from mlsys.constants import BYTES_FP32
|
||||
class MemoryFootprintTableCalc:
|
||||
"""Integer-rounded memory values for @tbl-memory-footprint caption."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
tensor_dim = 1024
|
||||
bytes_per_float = BYTES_FP32.m_as('B')
|
||||
total_tensors = 4 # X, X', X'', Y stored in naive execution
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
tensor_bytes = tensor_dim * tensor_dim * bytes_per_float
|
||||
tensor_mb = tensor_bytes / MIB_TO_BYTES
|
||||
total_mb = tensor_mb * total_tensors
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tensor_mb_str = fmt(tensor_mb, precision=0, commas=False)
|
||||
total_mb_str = fmt(total_mb, precision=0, commas=False)
|
||||
|
||||
@@ -4141,18 +4141,18 @@ from mlsys.constants import BYTES_FP32
|
||||
class FusionBenefitsCalc:
|
||||
"""Memory savings from fusing ReLU-BatchNorm-scale into a single kernel."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
tensor_dim = 1024
|
||||
bytes_per_float = BYTES_FP32.m_as('B')
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
tensor_mb = tensor_dim * tensor_dim * bytes_per_float / MIB_TO_BYTES
|
||||
total_mb = tensor_mb * 4
|
||||
|
||||
naive_mb = total_mb # 16 MB with all intermediates stored
|
||||
fused_mb = tensor_mb # 4 MB with only final result Y stored
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
naive_mb_str = f"{naive_mb:.0f} MB"
|
||||
fused_mb_str = f"{fused_mb:.0f} MB"
|
||||
|
||||
@@ -4203,7 +4203,7 @@ While modern AI accelerators offer high computational throughput, their performa
|
||||
\index{Tiling!definition}
|
||||
Tiling[^fn-tiling-cache-reuse] mitigates this issue by restructuring computations into smaller, memory-friendly subproblems. The core insight is simple: if we cannot make memory faster, we can at least make fewer trips to it. Instead of processing entire matrices or tensors at once, which leads to excessive memory traffic, tiling partitions computations into smaller blocks (tiles) that fit within fast local memory (e.g., caches, shared memory, or registers) [@lam1991cache].
|
||||
|
||||
[^fn-tiling-cache-reuse]: **Tiling (Loop Blocking)**: This restructuring directly enables the "fewer trips to memory" insight by partitioning a computation into blocks that fit entirely within fast local cache. Instead of fetching an element from slow DRAM $O(N)$ times in a naive matrix multiply, it is fetched just once per tile. This reduction in memory traffic is the primary source of the 10-50x speedup observed between naive and optimized GEMM routines. \index{Tiling!DRAM traffic reduction}
|
||||
[^fn-tiling-cache-reuse]: **Tiling (Loop Blocking)**: This restructuring directly enables the "fewer trips to memory" insight by partitioning a computation into blocks that fit entirely within fast local cache. Instead of fetching an element from slow DRAM $O(N)$ times in a naive matrix multiply, it is fetched just once per tile. This reduction in memory traffic is the primary source of the 10--50$\times$ speedup observed between naive and optimized GEMM routines. \index{Tiling!DRAM traffic reduction}
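
A minimal loop-blocking sketch makes the idea concrete. This is plain NumPy written for clarity (it is not the chapter's @lst-naive_matmul listing, and a production kernel would also block for registers and vector width); the tile size is an assumed parameter that should be chosen so three tile-sized blocks fit in the target cache level.

```python
import numpy as np

def matmul_tiled(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Loop-blocked matrix multiply: each tile of A and B is loaded once and
    reused across the whole block, reducing trips to DRAM."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C
```

The choice of `tile` trades reuse against on-chip capacity: each inner iteration must keep three tile-sized blocks resident at once.
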
Matrix multiplication, widely used in AI models, demonstrates inefficient memory access when implemented naively. @lst-naive_matmul shows how, without tiling, repeated memory accesses for the same data lead to unnecessary bandwidth consumption.
|
||||
|
||||
@@ -4455,7 +4455,7 @@ The mapping strategies and dataflow optimizations examined in preceding sections
|
||||
Machine learning compilers automate the translation of dataflow strategies into executable code, addressing a critical challenge: the mapping decisions analyzed above must be instantiated differently for each hardware target. The gap between "knowing what optimizations exist" and "applying them correctly" is vast: a single convolution can be implemented with dozens of valid tiling strategies, kernel variants, and memory layouts, most of which perform poorly on any given hardware. Compilers navigate this complexity systematically. To see why this matters, consider what happens when you compile ResNet-50 for GPU inference:
|
||||
|
||||
1. **Graph optimization** fuses the 49 Conv2D-BatchNorm-ReLU sequences into 49 single kernels, eliminating 98 intermediate memory writes that would otherwise consume bandwidth
|
||||
2. **Kernel selection** chooses Tensor Core implementations for the $3\times3$ convolutions, exploiting the high arithmetic intensity (50-200 FLOP/byte) we calculated in the Roofline analysis
|
||||
2. **Kernel selection** chooses Tensor Core implementations for the $3\times3$ convolutions, exploiting the high arithmetic intensity (50--200 FLOP/byte) we calculated in the Roofline analysis
|
||||
3. **Memory planning** determines that intermediate activations require approximately 2.1 GB at batch size 32, fitting comfortably in the A100's 40 GB HBM
|
||||
4. **Computation scheduling** overlaps memory transfers for layer N+1 with computation of layer N, hiding a substantial portion of transfer latency
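
The arithmetic behind steps 1 and 4 is small enough to check directly; the per-layer compute and transfer times below are assumed placeholders, not profiled numbers:

```python
# Step 1: each fused Conv2D-BatchNorm-ReLU block avoids writing (and re-reading)
# its two intermediate tensors. Step 4: overlapping transfers with compute hides
# the smaller of the two times. Timing values are assumed placeholders.
blocks = 49
eliminated_writes = blocks * 2                # -> 98 intermediate writes removed

compute_ms, transfer_ms = 0.30, 0.12          # assumed per-layer times
serial_ms = compute_ms + transfer_ms          # no overlap
overlapped_ms = max(compute_ms, transfer_ms)  # transfer hidden behind compute

print(eliminated_writes, f"{serial_ms:.2f} ms -> {overlapped_ms:.2f} ms per layer")
```
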
@@ -4479,14 +4479,14 @@ from mlsys.formatting import fmt
|
||||
class CompilerSpeedupCalc:
|
||||
"""ResNet-50 latency improvement from ML compiler graph and memory optimizations."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
naive_inference_ms = 47 # Naive execution latency (ms)
|
||||
optimized_inference_ms = 8 # Compiler-optimized latency (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
compiler_speedup = naive_inference_ms / optimized_inference_ms
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
compiler_speedup_str = fmt(compiler_speedup, precision=1, commas=False)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -4571,7 +4571,7 @@ At this stage, the compiler translates the abstract operations in the computatio
|
||||
|
||||
Kernel selection builds upon graph optimization, mapping the structured execution plan to the most efficient implementation available for each operation. Poor kernel choices can nullify the benefits of prior optimizations by introducing unnecessary computation overhead or memory bottlenecks [@chen2018tvm].
|
||||
|
||||
In a Transformer model, the matrix multiplications that dominate self-attention computations can be executed using different strategies depending on the available hardware. On a CPU, a general-purpose matrix multiplication routine is typically employed, exploiting vectorized execution to improve efficiency. In contrast, on a GPU, the compiler may select an implementation that leverages tensor cores to accelerate matrix multiplications using mixed-precision arithmetic. When the model is deployed on a TPU, the operation can be mapped onto a systolic array, ensuring that data flows through the accelerator in a manner that maximizes reuse and minimizes off-chip memory accesses. For inference workloads, an integer arithmetic kernel may be preferable, as it performs computations in INT8 instead of floating-point precision, thereby reducing power consumption without significantly compromising accuracy.
|
||||
In a Transformer model, the matrix multiplications that dominate self-attention computations can be executed using different strategies depending on the available hardware. On a CPU, a general-purpose matrix multiplication routine is typically employed, exploiting vectorized execution to improve efficiency. In contrast, on a GPU, the compiler may select an implementation that uses tensor cores to accelerate matrix multiplications using mixed-precision arithmetic. When the model is deployed on a TPU, the operation can be mapped onto a systolic array, ensuring that data flows through the accelerator in a manner that maximizes reuse and minimizes off-chip memory accesses. For inference workloads, an integer arithmetic kernel may be preferable, as it performs computations in INT8 instead of floating-point precision, thereby reducing power consumption without significantly compromising accuracy.
|
||||
|
||||
In many cases, compilers do not generate custom kernels from scratch but instead select from vendor-optimized kernel libraries that provide highly tuned implementations for different architectures. For instance, cuDNN\index{cuDNN!NVIDIA deep learning library} and cuBLAS\index{cuBLAS!NVIDIA linear algebra} offer optimized kernels for deep learning on NVIDIA GPUs, while oneDNN\index{oneDNN!Intel optimization library} provides optimized execution for Intel architectures. Similarly, ACL (Arm Compute Library) is optimized for Arm-based devices, and Eigen and BLIS provide efficient CPU-based implementations of deep learning operations. These libraries allow the compiler to choose pre-optimized, high-performance kernels rather than having to reinvent execution strategies for each hardware platform.
|
||||
|
||||
@@ -4703,10 +4703,10 @@ from mlsys.constants import A100_TDP, watt
|
||||
class RuntimeProductionTdp:
|
||||
"""A100 SXM Thermal Design Power for production thermal-throttling context."""
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# (constant lookup, no derivation)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
a100_tdp = f"{A100_TDP.m_as(watt):.0f}"
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -4777,7 +4777,7 @@ Consider the scale of training GPT-3, which required approximately $3.14 \times
|
||||
## Multi-Chip Scaling {#sec-hardware-acceleration-multichip-scaling-c649}
|
||||
|
||||
\index{Multi-Chip Scaling!beyond single accelerator}
|
||||
This section provides awareness of multi-chip scaling while maintaining our focus on single-machine systems. The techniques we have covered, dataflow optimization, kernel fusion, memory hierarchy exploitation, and compiler optimization, remain the foundation for efficient execution even in distributed settings. Each individual accelerator in a multi-chip system must still be optimized using these principles. However, multi-chip architectures introduce additional concerns around communication overhead, memory coherence, and fault tolerance that transform optimization priorities. The detailed implementation of distributed training systems, including gradient synchronization protocols, parameter server architectures, and cluster-scale orchestration, is covered in advanced treatments of machine learning infrastructure.
|
||||
A single H100 delivers nearly 2,000 TFLOPS of FP8 throughput, yet training a frontier language model still requires thousands of such chips working in concert. The techniques covered in previous sections (dataflow optimization, kernel fusion, memory hierarchy exploitation, and compiler optimization) remain the foundation for efficient execution even in multi-chip settings; each individual accelerator must still be optimized using these principles. Multi-chip architectures introduce additional concerns around communication overhead, memory coherence, and fault tolerance that transform optimization priorities. The detailed implementation of distributed training systems, including gradient synchronization protocols, parameter server architectures, and cluster-scale orchestration, is covered in advanced treatments of machine learning infrastructure.
|
||||
|
||||
When single-accelerator capacity proves insufficient, AI systems must scale across multiple chips. Understanding these scaling approaches is important for practitioners who will encounter multi-chip systems in production environments, even when working primarily with single-accelerator deployments.
|
||||
|
||||
@@ -4818,7 +4818,7 @@ Data center scaling and edge deployment represent opposite ends of a deployment
|
||||
## Heterogeneous SoC Design {#sec-hardware-acceleration-heterogeneous-soc-ai-acceleration-b1bb}
|
||||
|
||||
\index{System-on-Chip!heterogeneous AI acceleration}
|
||||
At the edge end of the deployment spectrum, the hardware acceleration principles established in this chapter (specialized compute units, memory hierarchy optimization, and workload mapping strategies) must operate under dramatically different constraints. A smartphone's SoC operates within a 3–7 watt sustained power budget (with brief peaks to 10–15 W), autonomous vehicles require deterministic sub-100 ms latency for perception-to-action loops, and IoT sensors must function for months to years on battery power. These constraints necessitate heterogeneous System-on-Chip (SoC) architectures that integrate CPU cores, GPU shaders, digital signal processors (DSPs), and dedicated neural processing units (NPUs) within a single chip. Orchestrating these diverse processors to achieve optimal performance under strict power, thermal, and latency requirements demands wholly different approaches than data center deployments.
|
||||
A smartphone's SoC operates within a 3--7 watt sustained power budget (with brief peaks to 10--15 W), autonomous vehicles require deterministic sub-100 ms latency for perception-to-action loops, and IoT sensors must function for months to years on battery power. These constraints force specialized compute units, memory hierarchy optimization, and workload mapping strategies to operate under dramatically different rules than data center hardware. The result is heterogeneous System-on-Chip (SoC) architectures that integrate CPU cores, GPU shaders, digital signal processors (DSPs), and dedicated neural processing units (NPUs) within a single chip. Orchestrating these diverse processors to achieve optimal performance under strict power, thermal, and latency requirements demands wholly different approaches than data center deployments.
|
||||
|
||||
::: {.callout-lighthouse title="The Case for Heterogeneous Microcontrollers"}
|
||||
**The Extreme Edge**: The **Smart Doorbell** (Wake Vision) pushes heterogeneity to its logical limit. Unlike a smartphone SoC with a multi-watt budget, a doorbell camera often runs on a microcontroller with a **milliwatt budget**.
|
||||
@@ -4889,12 +4889,12 @@ The complexity of hardware acceleration, spanning data center architectures to h
|
||||
|
||||
## Fallacies and Pitfalls {#sec-hardware-acceleration-fallacies-pitfalls-dc1f}
|
||||
|
||||
Hardware acceleration involves counterintuitive performance characteristics where impressive specifications mask underlying bottlenecks. The fallacies and pitfalls below capture hardware selection and optimization errors that waste expensive accelerator resources and lead to deployments that achieve only 10-30% of theoretical performance.
|
||||
Hardware acceleration involves counterintuitive performance characteristics where impressive specifications mask underlying bottlenecks. The fallacies and pitfalls below capture hardware selection and optimization errors that waste expensive accelerator resources and lead to deployments that achieve only 10--30% of theoretical performance.
|
||||
|
||||
**Fallacy:** *More specialized hardware always provides better performance than general-purpose alternatives.*
|
||||
|
||||
\index{Hardware Selection!workload matching}
|
||||
Engineers assume specialized accelerators automatically outperform general-purpose processors for all AI workloads. In reality, specialized hardware achieves peak performance only when workloads match architectural assumptions. As demonstrated in @sec-hardware-acceleration-roofline-model-42ff, operations must exceed the accelerator's ridge point to be compute-bound; an A100 GPU has a ridge point of `{python} a100_ridge` FLOP/byte, meaning operations with arithmetic intensity below this threshold are memory-bound regardless of the accelerator's `{python} a100_tflops_fp16` TFLOPS peak compute. A transformer attention softmax with AI = 2-5 FLOP/byte achieves only 4–10 TFLOPS (3% utilization) on an A100, while achieving 80–90% of a CPU's lower peak because CPUs have ridge points of 10–20 FLOP/byte. Models with irregular memory access, small batch sizes, or dynamic computation graphs may perform better on flexible processors. Effective hardware selection requires matching workload arithmetic intensity to architectural ridge points, not assuming specialization always wins.
|
||||
Engineers assume specialized accelerators automatically outperform general-purpose processors for all AI workloads. In reality, specialized hardware achieves peak performance only when workloads match architectural assumptions. As demonstrated in @sec-hardware-acceleration-roofline-model-42ff, operations must exceed the accelerator's ridge point to be compute-bound; an A100 GPU has a ridge point of `{python} a100_ridge` FLOP/byte, meaning operations with arithmetic intensity below this threshold are memory-bound regardless of the accelerator's `{python} a100_tflops_fp16` TFLOPS peak compute. A transformer attention softmax with AI = 2--5 FLOP/byte achieves only 4--10 TFLOPS (3% utilization) on an A100, while achieving 80--90% of a CPU's lower peak because CPUs have ridge points of 10--20 FLOP/byte. Models with irregular memory access, small batch sizes, or dynamic computation graphs may perform better on flexible processors. Effective hardware selection requires matching workload arithmetic intensity to architectural ridge points, not assuming specialization always wins.
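
The softmax numbers in this fallacy follow directly from the roofline formula; the sketch below reproduces them with assumed A100-class figures (~312 TFLOPS FP16 peak, ~2 TB/s HBM bandwidth):

```python
# Attainable throughput for a memory-bound op: min(peak, AI x bandwidth).
# Hardware figures are assumed A100-class round numbers.
peak_tflops = 312.0  # assumed FP16 tensor-core peak
bw_tb_s = 2.0        # assumed HBM bandwidth, TB/s

for ai in (2, 5):    # attention-softmax arithmetic intensity, FLOP/byte
    attainable = min(peak_tflops, ai * bw_tb_s)  # TB/s x FLOP/byte = TFLOP/s
    util = attainable / peak_tflops * 100
    print(f"AI={ai}: {attainable:.0f} TFLOPS ({util:.1f}% of peak)")
```
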
```{python}
|
||||
#| label: fp-memory-energy-calc
|
||||
@@ -4919,7 +4919,7 @@ from mlsys.constants import ENERGY_DRAM_ACCESS_PJ, ENERGY_SRAM_L1_PJ
|
||||
class FpMemoryEnergyCalc:
|
||||
"""Energy cost disparity and ridge-point pitfall for memory-bandwidth-limited ops."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
dram_pj = ENERGY_DRAM_ACCESS_PJ.m_as('pJ')
|
||||
sram_pj = ENERGY_SRAM_L1_PJ.m_as('pJ')
|
||||
|
||||
@@ -4927,14 +4927,14 @@ class FpMemoryEnergyCalc:
|
||||
peak_tflops = 300 # hypothetical accelerator
|
||||
peak_bw_tbs = 2 # TB/s
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
energy_ratio = int(dram_pj / sram_pj)
|
||||
fp_ridge_example = peak_tflops / peak_bw_tbs
|
||||
|
||||
layernorm_tflops = layernorm_ai * 2000 / 1000
|
||||
layernorm_util = layernorm_tflops / peak_tflops * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fp_ridge_example_str = fmt(fp_ridge_example, precision=0, commas=False)
|
||||
fp_layernorm_tflops_str = fmt(layernorm_tflops, precision=0, commas=False)
|
||||
fp_layernorm_util_str = fmt(layernorm_util, precision=0, commas=False)
|
||||
@@ -4947,7 +4947,7 @@ fp_layernorm_util_str = FpMemoryEnergyCalc.fp_layernorm_util_str
|
||||
|
||||
**Pitfall:** *Ignoring memory bandwidth limitations when selecting acceleration strategies.*
|
||||
|
||||
Practitioners focus on peak TFLOPS without analyzing whether their workloads can achieve compute-bound performance. As quantified in @sec-hardware-acceleration-understanding-ai-memory-wall-3ea9, accessing DRAM consumes 100-200 pJ per access versus 1-10 pJ for on-chip memory, creating orders-of-magnitude energy penalties. An accelerator advertising 300 TFLOPS with 2 TB/s bandwidth has a ridge point of `{python} fp_ridge_example_str` FLOP/byte; LayerNorm operations with AI = 1.5 FLOP/byte achieve only `{python} fp_layernorm_tflops_str` TFLOPS (`{python} fp_layernorm_util_str`% utilization). Organizations deploy expensive high-compute accelerators for memory-bound workloads, achieving 10–20% utilization when lower-cost, bandwidth-optimized alternatives would perform identically. Teams must calculate workload arithmetic intensity and compare against hardware ridge points before purchasing accelerators.
|
||||
Practitioners focus on peak TFLOPS without analyzing whether their workloads can achieve compute-bound performance. As quantified in @sec-hardware-acceleration-understanding-ai-memory-wall-3ea9, accessing DRAM consumes 100--200 pJ per access versus 1--10 pJ for on-chip memory, creating orders-of-magnitude energy penalties. An accelerator advertising 300 TFLOPS with 2 TB/s bandwidth has a ridge point of `{python} fp_ridge_example_str` FLOP/byte; LayerNorm operations with AI = 1.5 FLOP/byte achieve only `{python} fp_layernorm_tflops_str` TFLOPS (`{python} fp_layernorm_util_str`% utilization). Organizations deploy expensive high-compute accelerators for memory-bound workloads, achieving 10--20% utilization when lower-cost, bandwidth-optimized alternatives would perform identically. Teams must calculate workload arithmetic intensity and compare against hardware ridge points before purchasing accelerators.
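One way to internalize this pitfall is to compare two hypothetical parts on the same memory-bound kernel. In the sketch below, the 300 TFLOPS and 60 TFLOPS devices, the shared 2 TB/s bandwidth, and the 1.5 FLOP/byte LayerNorm intensity are assumed figures that mirror the example above; once the memory roof binds, the extra peak compute buys nothing.

```python
# Sketch: for a memory-bound workload, a cheaper bandwidth-matched part matches
# an expensive high-FLOPS part. All numbers are hypothetical for illustration.

def attainable_tflops(ai, peak_tflops, bw_tb_s):
    return min(peak_tflops, ai * bw_tb_s)

layernorm_ai = 1.5  # FLOP/byte, as in the text
big_accel  = {"peak_tflops": 300, "bw_tb_s": 2.0}  # high-compute part
lean_accel = {"peak_tflops": 60,  "bw_tb_s": 2.0}  # hypothetical bandwidth-matched part

for name, hw in [("300-TFLOPS part", big_accel), ("60-TFLOPS part", lean_accel)]:
    perf = attainable_tflops(layernorm_ai, **hw)
    print(f"{name}: LayerNorm attainable {perf:.0f} TFLOP/s "
          f"({perf / hw['peak_tflops'] * 100:.0f}% of peak)")
# Both deliver 3 TFLOP/s on LayerNorm; the extra peak compute changes nothing here.
```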
|
||||
|
||||
```{python}
|
||||
#| label: fp-multigpu-scaling-calc
|
||||
@@ -4971,16 +4971,16 @@ from mlsys.constants import NVLINK_A100_BW, GB, second
|
||||
class FpMultigpuScalingCalc:
|
||||
"""NVLink gradient-sync overhead quantifying sublinear multi-GPU scaling."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
nvlink_bw_gbs = NVLINK_A100_BW.m_as(GB / second)
|
||||
gradient_size_gb = 1.0
|
||||
step_time_ms = 50
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
sync_time_ms = gradient_size_gb / nvlink_bw_gbs * 1000
|
||||
sync_overhead_pct = sync_time_ms / step_time_ms * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fp_nvlink_bw_str = fmt(nvlink_bw_gbs, precision=0, commas=False)
|
||||
fp_sync_time_str = fmt(sync_time_ms, precision=2, commas=False)
|
||||
fp_sync_overhead_str = fmt(sync_overhead_pct, precision=1, commas=False)
|
||||
@@ -4993,7 +4993,7 @@ fp_sync_overhead_str = FpMultigpuScalingCalc.fp_sync_overhead_str
|
||||
|
||||
**Fallacy:** *Hardware acceleration benefits scale linearly with additional accelerators.*
|
||||
|
||||
Teams expect 8 GPUs to train 8$\times$ faster than 1 GPU. Multi-accelerator scaling introduces communication overhead that violates linear scaling assumptions. As noted in @sec-hardware-acceleration-multichip-scaling-c649, AllReduce operations for gradient synchronization can require exchanging hundreds of gigabytes per training step for large models. With NVLink at `{python} fp_nvlink_bw_str` GB/s bidirectional, synchronizing 1 GB of gradients requires `{python} fp_sync_time_str` ms; for a 50 ms training step, this represents `{python} fp_sync_overhead_str`% overhead with perfect overlap. Without overlap, 8-GPU setups achieve 7.5$\times$ speedup (94% efficiency) at best, and typical workloads see 6--7$\times$ (75-87% efficiency) due to load imbalance and synchronization barriers. Small models with insufficient parallel work achieve even worse scaling, sometimes seeing 3--4$\times$ speedup on 8 GPUs (37-50% efficiency).
|
||||
Teams expect 8 GPUs to train 8$\times$ faster than 1 GPU. Multi-accelerator scaling introduces communication overhead that violates linear scaling assumptions. As noted in @sec-hardware-acceleration-multichip-scaling-c649, AllReduce operations for gradient synchronization can require exchanging hundreds of gigabytes per training step for large models. With NVLink at `{python} fp_nvlink_bw_str` GB/s bidirectional, synchronizing 1 GB of gradients requires `{python} fp_sync_time_str` ms; for a 50 ms training step, this represents `{python} fp_sync_overhead_str`% overhead with perfect overlap. Without overlap, 8-GPU setups achieve 7.5$\times$ speedup (94% efficiency) at best, and typical workloads see 6--7$\times$ (75--87% efficiency) due to load imbalance and synchronization barriers. Small models with insufficient parallel work achieve even worse scaling, sometimes seeing 3--4$\times$ speedup on 8 GPUs (37--50% efficiency).
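The synchronization arithmetic extends to a back-of-the-envelope scaling estimate. The sketch below is a simplified model rather than a measurement: the 600 GB/s link rate mirrors the NVLink figure above, while the per-step compute times, the 1 GB gradient size, and the 5% straggler allowance are assumptions.

```python
# Sketch: data-parallel speedup when gradient synchronization is not overlapped
# with compute. Link bandwidth, step times, and imbalance are illustrative assumptions.

def dp_speedup(n_gpus, compute_ms, grad_gb, link_gb_s, imbalance=1.05):
    """Speedup over one GPU; `imbalance` models the slowest straggler rank."""
    sync_ms = grad_gb / link_gb_s * 1000         # exposed AllReduce time
    step_ms = compute_ms * imbalance + sync_ms   # non-overlapped step time
    return n_gpus * compute_ms / step_ms

for label, compute_ms in [("large model, 50 ms/step", 50), ("small model, 10 ms/step", 10)]:
    s = dp_speedup(8, compute_ms, grad_gb=1.0, link_gb_s=600)
    print(f"8 GPUs, {label}: {s:.1f}x ({s / 8 * 100:.0f}% efficiency)")
```

Under these assumptions the large model reaches roughly 7.4$\times$ on 8 GPUs while the small model falls to about 6.6$\times$, illustrating why fixed communication costs hurt short steps the most.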
|
||||
|
||||
**Fallacy:** *Peak FLOPS specifications determine real-world accelerator performance.*
|
||||
|
||||
@@ -5023,12 +5023,12 @@ from mlsys.constants import T4_FLOPS_FP16_TENSOR, T4_MEM_BW, TFLOPs, second, GB
|
||||
class FpSmallBatchCalc:
|
||||
"""Small-batch arithmetic intensity and T4 ridge point for inference economics."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
M = 2048
|
||||
N = 2048
|
||||
B = 256
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# Batch=1
|
||||
flops_b1 = 2 * M * N
|
||||
bytes_b1 = (M * N + M + N) * 2
|
||||
@@ -5044,7 +5044,7 @@ class FpSmallBatchCalc:
|
||||
t4_bw = T4_MEM_BW.m_as(GB / second)
|
||||
t4_ridge = t4_flops * 1000 / t4_bw
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fp_ai_b1_str = fmt(ai_b1, precision=0, commas=False)
|
||||
fp_ai_b256_str = fmt(ai_b256, precision=0, commas=False)
|
||||
fp_t4_ridge_str = fmt(t4_ridge, precision=0, commas=False)
|
||||
@@ -5088,7 +5088,7 @@ from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
from mlsys.constants import GB, BYTES_FP16
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FeasibilityAssessment:
|
||||
"""
|
||||
Namespace for Hardware Feasibility Assessment.
|
||||
@@ -5097,7 +5097,7 @@ class FeasibilityAssessment:
|
||||
Check 3: Compute — 30 FPS video processing.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Memory check: Llama-7B on 16 GB GPU
|
||||
params_7b = 7e9
|
||||
gpu_mem_gb = 16
|
||||
@@ -5109,7 +5109,7 @@ class FeasibilityAssessment:
|
||||
# Compute check: video at 30 FPS
|
||||
fps_target = 30
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Memory: does the model fit?
|
||||
model_7b_gb = model_memory(params_7b, BYTES_FP16, GB)
|
||||
headroom_gb = gpu_mem_gb - model_7b_gb
|
||||
@@ -5121,11 +5121,11 @@ class FeasibilityAssessment:
|
||||
# Compute: real-time frame budget
|
||||
frame_budget_ms = 1000 / fps_target
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(headroom_gb > 0, f"Model ({model_7b_gb} GB) does not fit on GPU ({gpu_mem_gb} GB)!")
|
||||
check(token_latency_ms > 0, "Token latency must be positive.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
headroom_str = fmt(headroom_gb, precision=0, commas=False)
|
||||
token_latency_ms_str = fmt(token_latency_ms, precision=0, commas=False)
|
||||
frame_budget_str = fmt(frame_budget_ms, precision=0, commas=False)
|
||||
@@ -5192,7 +5192,7 @@ from mlsys.constants import DAYS_PER_YEAR
|
||||
class CarbonRoiCalc:
|
||||
"""Carbon ROI of specialized NPU silicon vs generic CPU inference fleets."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cpu_power_w = 100 # Watts per CPU server doing inference
|
||||
cpu_tflops = 1 # Peak TFLOPS for CPU inference
|
||||
npu_power_w = 5 # Watts per NPU chip
|
||||
@@ -5202,7 +5202,7 @@ class CarbonRoiCalc:
|
||||
cpu_energy_kwh_day = 2400 # kWh/day for CPU fleet serving 1B inferences
|
||||
npu_energy_kwh_day = 12 # kWh/day for NPU fleet serving 1B inferences
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cpu_eff = cpu_tflops / cpu_power_w
|
||||
npu_eff = npu_tflops / npu_power_w
|
||||
eff_gap = npu_eff / cpu_eff
|
||||
@@ -5211,11 +5211,11 @@ class CarbonRoiCalc:
|
||||
co2_saved_kg_year = energy_savings_kwh_day * DAYS_PER_YEAR * carbon_kg_per_kwh
|
||||
co2_saved_metric_tons = co2_saved_kg_year / 1000
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(eff_gap == 200, f"Efficiency gap should be 200×, got {eff_gap}×")
|
||||
check(co2_saved_metric_tons > 300, f"CO2 savings should exceed 300 tons, got {co2_saved_metric_tons}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cpu_power_str = fmt(cpu_power_w, precision=0)
|
||||
npu_power_str = fmt(npu_power_w, precision=0)
|
||||
cpu_tflops_str = fmt(cpu_tflops, precision=0)
|
||||
@@ -5268,7 +5268,7 @@ The sustainability perspective reinforces a theme that has recurred throughout t
|
||||
|
||||
## Summary {#sec-hardware-acceleration-summary-a5f8}
|
||||
|
||||
The preceding sections established a decision framework for hardware selection and a sustainability perspective grounding these choices in broader responsibility. Hardware acceleration emerged as the force that transformed machine learning from academic curiosity to practical reality, reshaping how we design both computational systems and the algorithms that run on them. The evolution from general-purpose processors to specialized AI accelerators reflects a shift toward domain-specific computing where hardware and software are co-designed to optimize specific computational patterns. The progression from CPUs through GPUs to specialized TPUs, NPUs, and wafer-scale systems demonstrates how understanding workload characteristics drives architectural innovation, creating opportunities for orders-of-magnitude performance improvements through targeted specialization.
|
||||
Hardware acceleration is the force that transformed machine learning from academic curiosity to practical reality, reshaping how we design both computational systems and the algorithms that run on them. The evolution from general-purpose processors to specialized AI accelerators reflects a shift toward domain-specific computing where hardware and software are co-designed to optimize specific computational patterns. The progression from CPUs through GPUs to specialized TPUs, NPUs, and wafer-scale systems demonstrates how understanding workload characteristics drives architectural innovation, creating opportunities for orders-of-magnitude performance improvements through targeted specialization.
|
||||
|
||||
The technical challenges of AI acceleration span multiple layers of the computing stack, from low-level memory hierarchy optimization to high-level compiler transformations and runtime orchestration. Memory bandwidth limitations create bottlenecks that require targeted techniques like data tiling, kernel fusion, and hierarchy-aware scheduling to overcome. Mapping neural network computations to hardware involves complex trade-offs between different dataflow patterns, memory allocation strategies, and execution scheduling approaches that must balance computational efficiency with resource utilization.
|
||||
|
||||
@@ -5298,3 +5298,10 @@ We have now optimized the full D·A·M stack: data selection minimized training
|
||||
|
||||
::: { .quiz-end }
|
||||
:::
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:hw_acceleration")
|
||||
```
|
||||
|
||||
@@ -93,29 +93,29 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AIMomentStats:
|
||||
"""
|
||||
Namespace for opening statistics in 'The AI Moment' section.
|
||||
Establishes the scale of modern AI (searches) and hardware asymmetry (GPU vs CPU).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
searches_per_day = GOOGLE_SEARCHES_PER_DAY
|
||||
h100_flops = H100_FLOPS_FP16_TENSOR
|
||||
cpu_flops = CPU_FLOPS_FP32
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
searches_b = searches_per_day / BILLION
|
||||
gpu_tflops = h100_flops.m_as(TFLOPs / second)
|
||||
cpu_tflops = cpu_flops.m_as(TFLOPs / second)
|
||||
gpu_cpu_ratio = gpu_tflops / cpu_tflops
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(searches_b >= 5, f"Google searches ({searches_b:.1f}B) unexpectedly low.")
|
||||
check(gpu_cpu_ratio >= 500, f"GPU/CPU ratio ({gpu_cpu_ratio:.1f}x) too low for 'massive parallelism'.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
google_search_b_str = fmt(searches_b, precision=1)
|
||||
h100_fp16_tflops_str = fmt(gpu_tflops, precision=0, commas=False)
|
||||
cpu_fp32_tflops_str = fmt(cpu_tflops, precision=1, commas=False)
|
||||
@@ -209,29 +209,29 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class VerificationGap:
|
||||
"""
|
||||
Calculates the dimensionality of the ImageNet input space to illustrate
|
||||
the 'Verification Gap' between test sets and real-world input spaces.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
width = IMAGE_DIM_RESNET
|
||||
height = IMAGE_DIM_RESNET
|
||||
channels = IMAGE_CHANNELS_RGB
|
||||
depth = COLOR_DEPTH_8BIT
|
||||
test_size = IMAGENET_TEST_IMAGES.m_as('count')
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
total_pixels = width * height * channels
|
||||
# Space size is depth^total_pixels. We want log10(depth^total_pixels).
|
||||
digits = total_pixels * math.log10(depth)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(digits > 300_000, f"Verification gap ({digits:.0f} digits) unexpectedly small.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
vg_digits_str = fmt(digits, precision=0, commas=True)
|
||||
imagenet_test_images_str = fmt(test_size, precision=0, commas=True)
|
||||
|
||||
@@ -563,26 +563,26 @@ from mlsys import Models
|
||||
from mlsys.constants import MILLION
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AlexNetBreakthrough:
|
||||
"""
|
||||
Namespace for AlexNet breakthrough statistics.
|
||||
Scenario: ImageNet 2012 competition results.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.Vision.ALEXNET
|
||||
alexnet_top5_error = 15.3
|
||||
second_place_error = 26.2
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
relative_improvement = (second_place_error - alexnet_top5_error) / second_place_error * 100
|
||||
params_m = model.parameters.m_as('Mparam')
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(relative_improvement >= 40, f"AlexNet improvement should be ~42%, got {relative_improvement:.1f}%")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
alexnet_relative_improvement_str = fmt(relative_improvement, precision=0)
|
||||
alexnet_params_m_str = fmt(params_m, precision=0)
|
||||
|
||||
@@ -1038,20 +1038,20 @@ from mlsys import Hardware, Models
|
||||
from mlsys.constants import Bparam, ZFLOPs, byte, GB
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT3Scale:
|
||||
"""
|
||||
Namespace for GPT-3 scale statistics.
|
||||
Scenario: Quantifying the compute and data requirements for GPT-3.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.GPT3
|
||||
gpus_training = 1024
|
||||
tokens_b = 500
|
||||
avg_token_bytes = 1.4
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Size in Bparam
|
||||
params_b = model.parameters.m_as(Bparam)
|
||||
# Compute in ZFLOPs
|
||||
@@ -1059,10 +1059,10 @@ class GPT3Scale:
|
||||
# Data scale in GB
|
||||
data_gb = (tokens_b * BILLION * avg_token_bytes * byte).m_as(GB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(params_b == 175, f"GPT-3 should be 175B params, got {params_b}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpt3_params_b_str = fmt(params_b, precision=0, commas=False)
|
||||
gpt3_params_billion_str = f"{gpt3_params_b_str} billion"
|
||||
gpt3_training_zflops_str = fmt(round(training_zflops), precision=0, commas=False)
|
||||
@@ -1089,7 +1089,7 @@ With these four paradigm shifts traced, @tbl-ai-evolution-strengths summarizes t
|
||||
|
||||
: **AI Paradigm Evolution**: Each era is defined by the systems bottleneck that constrained it. Deep learning (far right) overcame the Feature Engineering bottleneck but introduced new infrastructure challenges, necessitating modern ML systems engineering. {#tbl-ai-evolution-strengths}
|
||||
|
||||
The progression through four paradigms reveals a consistent pattern: each era's breakthrough came not from cleverer algorithms but from removing a systems bottleneck that prevented existing algorithms from leveraging more data and computation. Symbolic AI had the algorithms for logic but lacked the data; expert systems had domain knowledge but could not scale it; statistical learning had the data but required human feature engineering; deep learning automated feature learning but demanded infrastructure that did not yet exist. The recurring theme is that *systems innovations*, not algorithmic innovations, enabled each transition. The pattern raises a provocative question: given limited resources, should organizations invest in better algorithms, larger datasets, or more powerful machines? One of AI's leading researchers examined the historical record systematically and reached a conclusion that challenges our deepest intuitions about how intelligence should be built.
|
||||
The progression through four paradigms reveals a consistent pattern: each era's breakthrough came not from cleverer algorithms but from removing a systems bottleneck that prevented existing algorithms from exploiting more data and computation. Symbolic AI had the algorithms for logic but lacked the data; expert systems had domain knowledge but could not scale it; statistical learning had the data but required human feature engineering; deep learning automated feature learning but demanded infrastructure that did not yet exist. The recurring theme is that *systems innovations*, not algorithmic innovations, enabled each transition. The pattern raises a provocative question: given limited resources, should organizations invest in better algorithms, larger datasets, or higher-throughput hardware? One of AI's leading researchers examined the historical record systematically and reached a conclusion that challenges our deepest intuitions about how intelligence should be built.
|
||||
|
||||
## Bitter Lesson {#sec-introduction-bitter-lesson-79c7}
|
||||
|
||||
@@ -1118,23 +1118,23 @@ from mlsys import Models
|
||||
from mlsys.constants import MILLION
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT4Scale:
|
||||
"""
|
||||
Namespace for GPT-4 training scale.
|
||||
Scenario: Quantifying the infrastructure requirement for GPT-4.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.GPT4
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
gpu_days_m = model.training_gpu_days / MILLION
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(gpu_days_m >= 1.0, f"GPT-4 scale should be >=1M GPU days, got {gpu_days_m:.1f}M")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpt4_gpu_m_str = fmt(gpu_days_m, precision=1)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -1203,10 +1203,21 @@ Rather than beginning with an abstract definition, consider a system you likely
|
||||
from mlsys.constants import GMAIL_EMAILS_PER_DAY, TRILLION
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EmailScale:
|
||||
"""Gmail annual email volume for spam-filtering scale context."""
|
||||
gmail_emails_t_value = GMAIL_EMAILS_PER_DAY * 365 / TRILLION
|
||||
gmail_emails_t_str = fmt(gmail_emails_t_value, precision=0)
|
||||
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
daily_emails = GMAIL_EMAILS_PER_DAY
|
||||
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
gmail_emails_t_value = daily_emails * 365 / TRILLION
|
||||
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(gmail_emails_t_value >= 1, f"Gmail annual volume ({gmail_emails_t_value:.0f}T) unexpectedly low.")
|
||||
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gmail_emails_t_str = fmt(gmail_emails_t_value, precision=0)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
gmail_emails_t_str = EmailScale.gmail_emails_t_str
|
||||
@@ -1218,7 +1229,7 @@ This deceptively simple task reveals *what* distinguishes machine learning syste
|
||||
|
||||
The challenge extends to algorithms: the system must generalize from training examples to recognize spam it has never seen before, balancing precision against recall to avoid false positives that hide legitimate emails while catching actual spam. This probabilistic decision-making differs fundamentally from deterministic software logic.
|
||||
|
||||
And the challenge reaches into infrastructure: servers must process billions of emails daily, storing models that encode learned patterns, updating those models as spam evolves, and serving predictions with sub-100 ms latency across horizontally scaled data centers.
|
||||
The challenge reaches into infrastructure as well: servers must process billions of emails daily, storing models that encode learned patterns, updating those models as spam evolves, and serving predictions with sub-100 ms latency across horizontally scaled data centers.
|
||||
|
||||
Three interconnected concerns appear in every machine learning system: obtaining and managing training data at scale, implementing algorithms that learn and generalize effectively, and building infrastructure that supports both training and real-time prediction. No traditional software system exhibits all three simultaneously. This concrete grounding lets us state precisely what a machine learning system is:
|
||||
|
||||
@@ -1532,14 +1543,14 @@ optimized_time_value = dTime(
|
||||
efficiency_eta=target_eta_value,
|
||||
)
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT3Training:
|
||||
"""
|
||||
Namespace for the 'Training GPT-3' Napkin Math callout.
|
||||
Isolates variables (gpus, eta) so they don't leak into other scenarios.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
m_gpt3 = Models.GPT3
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
|
||||
@@ -1549,7 +1560,7 @@ class GPT3Training:
|
||||
eta_base = 0.45
|
||||
eta_opt = 0.60
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# We use a static method or lambda for internal logic to avoid 'self' clutter
|
||||
@staticmethod
|
||||
def calc_days(ops, n, peak_tflops, eta):
|
||||
@@ -1563,12 +1574,12 @@ class GPT3Training:
|
||||
days_opt = calc_days(ops, num_gpus, peak_tflops, eta_opt)
|
||||
days_saved = days_base - days_opt
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(days_base > 20, f"Text implies >20 days, got {days_base:.1f}")
|
||||
check(days_saved > 5, f"Text claims significant savings, got {days_saved:.1f}")
|
||||
check(days_opt < days_base, "Optimization failed to reduce time")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
# Text strings
|
||||
num_gpus_str = fmt(num_gpus, precision=0, commas=False)
|
||||
eta_base_pct_str = fmt(eta_base * 100, precision=0, commas=False)
|
||||
@@ -1702,23 +1713,23 @@ Each archetype manifests different constraints along the D·A·M axes, ensuring
|
||||
from mlsys.constants import IMAGENET_IMAGES, MILLION
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ImageNetStats:
|
||||
"""
|
||||
Namespace for ImageNet Scale Statistics.
|
||||
Scenario: Quantifying dataset scale (1.2M images) for the ImageNet footnote.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
images_raw = IMAGENET_IMAGES
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
images_million = images_raw.m_as('count') / MILLION
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(images_million >= 1.0, f"ImageNet scale ({images_million}M) is too small.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
images_m_str = fmt(images_million, precision=1)
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -1827,26 +1838,26 @@ text width=85mm](GB8){Data Selection};
|
||||
# │ Imports: (none - pure calculation)
|
||||
# │ Exports: algo_efficiency_max_str, moores_speedup_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EfficiencyGains:
|
||||
"""
|
||||
Namespace for Algorithmic Efficiency and Moore's Law comparison.
|
||||
Scenario: AI compute demand doubling (3.4mo) vs Silicon doubling (24mo).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
algo_efficiency_max = 44.5 # EfficientNet vs AlexNet (Hernandez & Brown 2020)
|
||||
moores_doubling_months = 24 # Silicon scaling
|
||||
ai_compute_doubling_months = 3.4 # Training compute scaling (Amodei 2018)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# How much faster is AI demand growing than Silicon supply?
|
||||
growth_gap_ratio = moores_doubling_months / ai_compute_doubling_months
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(growth_gap_ratio >= 5, f"AI growth ({ai_compute_doubling_months}mo) is not significantly faster than Moore's Law ({moores_doubling_months}mo). Gap: {growth_gap_ratio:.1f}x")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
algo_efficiency_max_str = fmt(algo_efficiency_max, precision=1)
|
||||
moores_speedup_str = fmt(growth_gap_ratio, precision=1)
|
||||
|
||||
@@ -1956,7 +1967,7 @@ plt.show()
|
||||
```
|
||||
:::
|
||||
|
||||
But efficiency gains tell only half the story. @fig-ai-training-compute-growth reveals the countervailing trend: even as individual architectures become more efficient, the field's total appetite for compute has grown exponentially, making efficiency optimization not a luxury but a necessity for continued progress.
|
||||
Efficiency gains tell only half the story. @fig-ai-training-compute-growth reveals the countervailing trend: even as individual architectures become more efficient, the field's total appetite for compute has grown exponentially, making efficiency optimization not a luxury but a necessity for continued progress.
|
||||
|
||||
::: {#fig-ai-training-compute-growth fig-env="figure" fig-pos="htb" fig-cap="**The Era of Scale.** Training Compute (FLOPs) vs. Year (Log Scale). While early Deep Learning (blue) showed rapid growth, the Transformer Era (red) accelerated this trend significantly. From AlexNet (2012) to GPT-4 (2023), compute requirements increased by $10^7$ (10 million times), far outpacing Moore's Law. This exponential demand drives the specialized infrastructure described in this book." fig-alt="Scatter plot of Training Compute FLOPs vs Year. Blue dots (2012-2018) show deep learning models like ResNet. Red dots (2018-2024) show large scale models like GPT-4, rising much faster on the log scale."}
|
||||
```{python}
|
||||
@@ -2058,7 +2069,7 @@ Which efficiency dimensions to prioritize depends heavily on deployment context:
|
||||
|
||||
## Defining AI Engineering {#sec-introduction-defining-ai-engineering-19ce}
|
||||
|
||||
With the Iron Law, Degradation Equation, and Efficiency Framework established, we can now define the discipline that applies them:
|
||||
The Iron Law decomposes performance into physical terms, the Degradation Equation quantifies silent decay, and the Efficiency Framework maps the three levers for managing scale. Together, they demand a discipline that integrates all three:
|
||||
|
||||
::: {.callout-definition title="AI Engineering"}
|
||||
|
||||
@@ -2080,7 +2091,7 @@ Defining a discipline is one thing; practicing it is another. The definition tel
|
||||
|
||||
## ML System Lifecycle {#sec-introduction-ml-system-lifecycle-849f}
|
||||
|
||||
Understanding the ML system lifecycle requires examining three interconnected dimensions: how the development process itself differs from traditional software engineering, how deployment context reshapes that process, and how multiple engineering disciplines must coordinate across it. We begin with the development lifecycle and its distinctive feedback loops, then explore how deployment targets from cloud to microcontroller alter what the lifecycle demands, and finally map the engineering disciplines that sustain it in production.
|
||||
A traditional software project follows a well-understood arc: design, implement, test, deploy, maintain. An ML project follows a different arc shaped by data-dependent behavior and silent degradation. The development process itself differs from traditional software engineering, the deployment context reshapes that process, and multiple engineering disciplines must coordinate across it. We begin with the development lifecycle and its distinctive feedback loops, then examine how deployment targets from cloud to microcontroller alter what the lifecycle demands, and finally map the engineering disciplines that sustain it in production.
|
||||
|
||||
### The ML Development Lifecycle {#sec-introduction-ml-development-lifecycle-4ea0}
|
||||
|
||||
@@ -2138,9 +2149,9 @@ In production, lifecycle stages create either virtuous or vicious cycles. Virtuo
|
||||
|
||||
### The Deployment Spectrum {#sec-introduction-deployment-spectrum-a38c}
|
||||
|
||||
The lifecycle stages apply universally to ML systems, but their specific implementation varies based on deployment environment. Understanding this deployment spectrum, from the most powerful data centers to the most constrained embedded devices, establishes the range of engineering challenges that shape how each lifecycle stage is realized in practice.
|
||||
The lifecycle stages apply universally to ML systems, but their specific implementation varies based on deployment environment. The deployment spectrum spans from megawatt-scale data centers to milliwatt-scale embedded devices, and each position on that spectrum reshapes how every lifecycle stage is realized in practice.
|
||||
|
||||
At one end of the spectrum, cloud-based ML systems\index{Cloud ML} run in massive data centers. These systems, including large language models and recommendation engines, process petabytes of data while serving millions of users simultaneously. They leverage virtually unlimited computing resources but manage enormous operational complexity and costs. @sec-ml-systems examines the architectural patterns for building such large-scale systems, while @sec-hardware-acceleration explores the hardware foundations that make this scale economically viable.
|
||||
At one end of the spectrum, cloud-based ML systems\index{Cloud ML} run in massive data centers. These systems, including large language models and recommendation engines, process petabytes of data while serving millions of users simultaneously. They draw on virtually unlimited computing resources but manage enormous operational complexity and costs. @sec-ml-systems examines the architectural patterns for building such large-scale systems, while @sec-hardware-acceleration explores the hardware foundations that make this scale economically viable.
|
||||
|
||||
At the other end, TinyML systems\index{TinyML} run on microcontrollers[^fn-microcontrollers-tinyml] and embedded devices, performing ML tasks with severe memory, computing power, and energy consumption constraints. Smart home devices like Alexa or Google Assistant must recognize voice commands using less power than LED bulbs, while sensors must detect anomalies on battery power for months or years. The efficiency framework developed earlier in this chapter (@sec-introduction-efficiency-framework-8dd4) introduces the principles underlying constrained deployment, while @sec-model-compression provides the specific techniques (quantization, pruning, distillation) that make TinyML feasible.
|
||||
|
||||
@@ -2220,27 +2231,27 @@ The interdependencies across the D·A·M axes create specific challenge categori
|
||||
from mlsys.constants import WAYMO_DATA_PER_HOUR_LOW, WAYMO_DATA_PER_HOUR_HIGH, TB, hour
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class WaymoStats:
|
||||
"""
|
||||
Namespace for Waymo Data Rates.
|
||||
Scenario: Autonomous vehicles generating massive data volumes (TB/hr).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# From constants (Waymo 1-5 TB/hr citation)
|
||||
rate_low_raw = WAYMO_DATA_PER_HOUR_LOW
|
||||
rate_high_raw = WAYMO_DATA_PER_HOUR_HIGH
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
val_low = rate_low_raw.m_as(TB / hour)
|
||||
val_high = rate_high_raw.m_as(TB / hour)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(val_low >= 1, f"Waymo data rate ({val_low} TB/hr) is too low.")
|
||||
check(val_high > val_low, "High rate must be > Low rate.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
low_str = fmt(val_low, precision=0, commas=False)
|
||||
high_str = fmt(val_high, precision=0, commas=False)
|
||||
|
||||
@@ -2271,7 +2282,7 @@ These four challenge categories, data, model, system, and ethical, do not exist
|
||||
|
||||
## Five-Pillar Framework {#sec-introduction-fivepillar-framework-8118}
|
||||
|
||||
The challenges we have explored, from silent performance degradation and data drift to model complexity and ethical concerns, reveal why ML systems engineering has emerged as a distinct discipline. Traditional software engineering practices cannot address systems that degrade quietly rather than failing visibly. These challenges require systematic engineering practices spanning the entire system lifecycle, from initial data collection through continuous operation and evolution.
|
||||
Industry surveys report that 60--85% of ML projects fail to reach production [@paleyes2022challenges], not because the algorithms are wrong but because no single team owns the full chain from data quality through model reliability to ethical governance. Silent performance degradation, data drift, model complexity, and ethical concerns each demand specialized engineering, yet they interact: a data quality failure degrades the model, which strains the serving infrastructure, which amplifies ethical risks. Traditional software engineering practices cannot address systems that degrade quietly rather than failing visibly. What is needed is a structured framework that assigns clear responsibility for each challenge category while ensuring coordination across all of them.
|
||||
|
||||
This work organizes ML systems engineering around five interconnected disciplines that directly address the challenge categories we have identified. @fig-pillars presents this organizational structure: five engineering pillars, each targeting a distinct challenge category, resting on a shared foundation that reflects the physical and economic constraints every pillar must respect. Together, they represent the core engineering capabilities required to bridge the gap between research prototypes and production systems capable of operating reliably at scale. While these pillars organize the *practice* of ML engineering, they are supported by the foundational technical imperatives of **Performance Optimization** and **Hardware Acceleration** (covered in Part III), which provide the efficiency required to make large-scale training and deployment economically and physically viable.
|
||||
|
||||
@@ -2330,15 +2341,15 @@ Assumptions that hold in traditional software, academic research, or pure mathem
|
||||
|
||||
**Fallacy:** *Better algorithms automatically produce better systems.*
|
||||
|
||||
Engineers assume algorithmic sophistication drives system performance, but this ignores the Iron Law (@sec-introduction-iron-law-ml-systems-c32a). A state-of-the-art Vision Transformer achieves 1-2% higher accuracy than ResNet-50 on ImageNet but requires 4$\times$ the FLOPs and 3$\times$ the memory bandwidth [@dosovitskiy2021image]. In production, a model that is 1% more accurate but violates latency requirements has effectively zero utility. Google's analysis found that only 5% of production ML code is the model itself; the remaining 95% is data pipelines, serving infrastructure, and monitoring [@sculley2015hidden]. A well-engineered system with a simpler model consistently outperforms a state-of-the-art architecture lacking robust infrastructure.
|
||||
Engineers assume algorithmic sophistication drives system performance, but this ignores the Iron Law (@sec-introduction-iron-law-ml-systems-c32a). A state-of-the-art Vision Transformer achieves 1–2% higher accuracy than ResNet-50 on ImageNet but requires 4$\times$ the FLOPs and 3$\times$ the memory bandwidth [@dosovitskiy2021image]. In production, a model that is 1% more accurate but violates latency requirements has effectively zero utility. Google's analysis found that only 5% of production ML code is the model itself; the remaining 95% is data pipelines, serving infrastructure, and monitoring [@sculley2015hidden]. A well-engineered system with a simpler model consistently outperforms a state-of-the-art architecture lacking robust infrastructure.
|
||||
|
||||
**Pitfall:** *Treating ML systems as traditional software that happens to include a model.*
|
||||
|
||||
Engineers apply traditional testing and deployment practices to ML systems, but these systems fail in qualitatively different ways (@sec-introduction-ml-vs-traditional-software-e19a). Traditional bugs produce stack traces within milliseconds; ML systems can silently degrade 10-15% over 3-6 months before anyone notices. A/B tests in conventional software show clear signals within 2-3 days; ML comparisons may require 4-6 weeks to detect 1-2% accuracy differences across subpopulations. Unit tests verify deterministic paths; ML systems require monitoring infrastructure to catch the 5-10% of predictions where models produce unreliable outputs. Teams deploying ML with only CI/CD pipelines risk silent failures affecting 20-30% of predictions before intervention.
|
||||
Engineers apply traditional testing and deployment practices to ML systems, but these systems fail in qualitatively different ways (@sec-introduction-ml-vs-traditional-software-e19a). Traditional bugs produce stack traces within milliseconds; ML systems can silently degrade 10--15% over 3--6 months before anyone notices. A/B tests in conventional software show clear signals within 2--3 days; ML comparisons may require 4--6 weeks to detect 1--2% accuracy differences across subpopulations. Unit tests verify deterministic paths; ML systems require monitoring infrastructure to catch the 5--10% of predictions where models produce unreliable outputs. Teams deploying ML with only CI/CD pipelines risk silent failures affecting 20--30% of predictions before intervention.
|
||||
|
||||
**Fallacy:** *High accuracy on benchmark datasets indicates production readiness.*
|
||||
|
||||
Engineers assume benchmark performance predicts production accuracy, but distribution shift and operational differences cause substantial degradation in deployment. A sentiment analysis model achieving 94% accuracy on curated test data drops to 78-82% accuracy in production as users employ slang, emojis, and context absent from benchmarks. The deployment spectrum (@sec-introduction-deployment-spectrum-a38c) shows that cloud, edge, and mobile environments each introduce distinct constraints: network latency adds 50-200 ms overhead, mobile devices' limited numerical precision reduces accuracy by 2-5%, and edge devices lack the memory for multi-model strategies that boosted benchmark scores. Production systems require failure mode analysis across demographic subgroups where performance may vary by 10-15 percentage points, monitoring infrastructure to detect drift, and validation protocols that match actual operating conditions rather than idealized test sets.
|
||||
Engineers assume benchmark performance predicts production accuracy, but distribution shift and operational differences cause substantial degradation in deployment. A sentiment analysis model achieving 94% accuracy on curated test data drops to 78--82% accuracy in production as users employ slang, emojis, and context absent from benchmarks. The deployment spectrum (@sec-introduction-deployment-spectrum-a38c) shows that cloud, edge, and mobile environments each introduce distinct constraints: network latency adds 50--200 ms overhead, mobile devices' limited numerical precision reduces accuracy by 2--5%, and edge devices lack the memory for multi-model strategies that boosted benchmark scores. Production systems require failure mode analysis across demographic subgroups where performance may vary by 10--15 percentage points, monitoring infrastructure to detect drift, and validation protocols that match actual operating conditions rather than idealized test sets.
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
@@ -2376,20 +2387,20 @@ overall_speedup_value = calc_amdahls_speedup(p_inf_value, s_inf_value)
|
||||
improvement_pct_value = (1 - (1 / overall_speedup_value)) * 100
|
||||
naive_pct_value = (1 - (1 / s_inf_value)) * 100
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AmdahlsPitfall:
|
||||
"""
|
||||
Namespace for Amdahl's Law Pitfall example.
|
||||
Scenario: Optimizing a 45 ms inference component in a 130 ms pipeline.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
t_inference = 45 # ms
|
||||
t_pre = 60 # ms
|
||||
t_post = 25 # ms
|
||||
s_inf = 3 # Component Speedup (3x)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
t_total = t_pre + t_inference + t_post
|
||||
t_inf_new = t_inference / s_inf
|
||||
t_total_new = t_pre + t_inf_new + t_post
|
||||
@@ -2400,11 +2411,11 @@ class AmdahlsPitfall:
|
||||
improvement_pct = (1 - (1 / overall_speedup)) * 100
|
||||
naive_pct = (1 - (1 / s_inf)) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(overall_speedup <= 1.5, f"System speedup ({overall_speedup:.2f}x) is too high for a 'Pitfall'.")
|
||||
check((improvement_pct / naive_pct) <= 0.5, "The discrepancy between naive and actual improvement is too small.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
t_inference_str = fmt(t_inference, precision=0, commas=False)
|
||||
t_inf_new_str = fmt(t_inf_new, precision=0, commas=False)
|
||||
t_pre_str = fmt(t_pre, precision=0, commas=False)
|
||||
@@ -2427,7 +2438,7 @@ naive_p = AmdahlsPitfall.naive_p
|
||||
|
||||
**Pitfall:** *Optimizing individual components without considering system interactions.*
|
||||
|
||||
Engineers optimize inference latency in isolation, but **Amdahl's Law** governs end-to-end performance. A team reduces model inference from `{python} t_inference_str` ms to `{python} t_inf_new_str` ms, expecting proportional improvement. But preprocessing consumes `{python} t_pre_str` ms and postprocessing adds `{python} t_post_str` ms, so total latency drops only from `{python} total_ms` ms to `{python} new_total_ms` ms: `{python} improv_pct`% improvement rather than the expected `{python} naive_p`%. The D·A·M taxonomy (@tbl-dam-taxonomy) shows that data, algorithms, and machines form interdependent systems where optimizing one component shifts bottlenecks rather than eliminating them. A model requiring 3$\times$ more preprocessing can increase total cost 40% while improving accuracy only 2%. Teams optimizing components independently often find 50-70% of their engineering effort fails to improve end-to-end metrics.
|
||||
Engineers optimize inference latency in isolation, but **Amdahl's Law** governs end-to-end performance. A team reduces model inference from `{python} t_inference_str` ms to `{python} t_inf_new_str` ms, expecting proportional improvement. Yet preprocessing consumes `{python} t_pre_str` ms and postprocessing adds `{python} t_post_str` ms, so total latency drops only from `{python} total_ms` ms to `{python} new_total_ms` ms: `{python} improv_pct`% improvement rather than the expected `{python} naive_p`%. The D·A·M taxonomy (@tbl-dam-taxonomy) shows that data, algorithms, and machines form interdependent systems where optimizing one component shifts bottlenecks rather than eliminating them. A model requiring 3$\times$ more preprocessing can increase total cost 40% while improving accuracy only 2%. Teams optimizing components independently often find 50--70% of their engineering effort fails to improve end-to-end metrics.
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
@@ -2447,28 +2458,28 @@ Engineers optimize inference latency in isolation, but **Amdahl's Law** governs
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DriftFallacy:
|
||||
"""
|
||||
Namespace for Drift Fallacy example.
|
||||
Scenario: A recommendation system degrading over 6 months.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
acc_initial = 85.0 # %
|
||||
drift_points_per_month = 0.8 # 0.8% accuracy loss per month (e.g. 85 -> 84.2)
|
||||
months = 6
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Linear degradation model for short-term estimation
|
||||
total_drop = drift_points_per_month * months
|
||||
acc_final = acc_initial - total_drop
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_drop >= 3, f"Degradation ({total_drop:.1f}%) is too small to be a 'Fallacy'.")
|
||||
check(acc_final >= 50, f"Model became random guessing ({acc_final:.1f}%), which is unrealistic for 6 months.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
acc_initial_str = fmt(acc_initial, precision=0, commas=False)
|
||||
acc_final_str = fmt(acc_final, precision=0, commas=False)
|
||||
acc_drop_str = fmt(total_drop, precision=1, commas=False) # Changed to 1 decimal for precision
|
||||
@@ -2483,11 +2494,11 @@ months_str = DriftFallacy.months_str
|
||||
|
||||
**Fallacy:** *ML systems can be deployed once and left to run indefinitely.*
|
||||
|
||||
Engineers assume deployed systems maintain performance indefinitely, but the Degradation Equation (@eq-degradation) quantifies why ML systems decay. A recommendation system deployed at `{python} acc_initial_str`% accuracy drops to `{python} acc_final_str`% within `{python} months_str` months as purchasing patterns shift, losing `{python} acc_drop_str` percentage points without any code changes. The ML development lifecycle (@sec-introduction-ml-development-lifecycle-4ea0) shows continuous monitoring and retraining as operational requirements. Fraud detection models degrade 5-10% per quarter as attackers adapt. NLP systems lose 2-3% accuracy annually from vocabulary drift. Without monitoring, systems appear healthy while 15-25% of predictions become unreliable. Organizations treating deployment as one-time typically discover failures through customer complaints 3-6 months after degradation begins.
|
||||
Engineers assume deployed systems maintain performance indefinitely, but the Degradation Equation (@eq-degradation) quantifies why ML systems decay. A recommendation system deployed at `{python} acc_initial_str`% accuracy drops to `{python} acc_final_str`% within `{python} months_str` months as purchasing patterns shift, losing `{python} acc_drop_str` percentage points without any code changes. The ML development lifecycle (@sec-introduction-ml-development-lifecycle-4ea0) shows continuous monitoring and retraining as operational requirements. Fraud detection models degrade 5--10% per quarter as attackers adapt. NLP systems lose 2--3% accuracy annually from vocabulary drift. Without monitoring, systems appear healthy while 15--25% of predictions become unreliable. Organizations treating deployment as one-time typically discover failures through customer complaints 3--6 months after degradation begins.
|
||||
|
||||
**Pitfall:** *Assuming that ML expertise alone is sufficient for ML systems engineering.*
|
||||
|
||||
Organizations hire ML researchers expecting production-ready systems, but the five engineering disciplines (@sec-introduction-five-engineering-disciplines-fa08) require integrated expertise across algorithms, software, systems, and operations. Teams with strong ML skills but limited systems experience ship systems achieving only 10-20% of throughput targets because they lack API design and database optimization expertise. Conversely, software engineers without ML understanding build infrastructure that introduces preprocessing bugs causing 5-15% accuracy degradation undetected for months. Industry surveys report 60-85% of ML projects fail to reach production, primarily due to systems engineering gaps rather than algorithmic limitations [@paleyes2022challenges]. Effective teams integrate ML researchers, software engineers, and operations specialists rather than expecting one role to master all skills.
|
||||
Organizations hire ML researchers expecting production-ready systems, but the five engineering disciplines (@sec-introduction-five-engineering-disciplines-fa08) require integrated expertise across algorithms, software, systems, and operations. Teams with strong ML skills but limited systems experience ship systems achieving only 10–20% of throughput targets because they lack API design and database optimization expertise. Conversely, software engineers without ML understanding build infrastructure that introduces preprocessing bugs causing 5–15% accuracy degradation undetected for months. Industry surveys report 60–85% of ML projects fail to reach production, primarily due to systems engineering gaps rather than algorithmic limitations [@paleyes2022challenges]. Effective teams integrate ML researchers, software engineers, and operations specialists rather than expecting one role to master all skills.
|
||||
|
||||
## Summary {#sec-introduction-summary-385d}
|
||||
|
||||
@@ -2521,3 +2532,10 @@ Welcome to AI Engineering.

::: { .quiz-end }
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:introduction")
```

@@ -55,7 +55,7 @@ Traditional software fails loudly: a null pointer exception crashes the server,

## MLOps Overview {#sec-ml-operations-introduction-machine-learning-operations-04c6}

The preceding chapters taught you to build, optimize, benchmark, and serve ML systems. Benchmarking (@sec-benchmarking) told you how a model performs at a point in time; serving infrastructure (@sec-model-serving) showed how to answer requests in milliseconds. You deploy to production, and week one looks excellent. But what happens next?
The preceding chapters taught you to build, optimize, benchmark, and serve ML systems. Benchmarking (@sec-benchmarking) told you how a model performs at a point in time; serving infrastructure (@sec-model-serving) showed how to answer requests in milliseconds. You deploy to production, and week one looks excellent. What happens next?

Data distributions shift, user behavior changes, and the world moves on from the conditions under which the model was trained. A large fraction of ML models that succeed in development never reach sustained production use — not because they were built incorrectly, but because no one watched them after deployment. The root cause is what we call *the operational mismatch* between how traditional software fails and how ML systems degrade:

@@ -97,7 +97,7 @@ The telemetry[^fn-telemetry-mlops] flowing through these interfaces provides the

## Principles and Foundations {#sec-ml-operations-mlops-3ea3}

\index{MLOps!DevOps comparison}MLOps builds on DevOps\index{DevOps!MLOps extension} but addresses the specific demands of ML system development and deployment. DevOps achieved remarkable success for traditional software by assuming deterministic behavior: the same code with the same inputs produces the same outputs. Machine learning systems violate this assumption because they depend on training data distributions, learned parameters, and environmental conditions that shift over time.
\index{MLOps!DevOps comparison}MLOps builds on DevOps\index{DevOps!MLOps extension} but addresses the specific demands of ML system development and deployment. DevOps succeeded for traditional software by assuming deterministic behavior: the same code with the same inputs produces the same outputs. Machine learning systems violate this assumption because they depend on training data distributions, learned parameters, and environmental conditions that shift over time.

DevOps integrates and delivers deterministic software. MLOps must manage non-deterministic, data-dependent workflows spanning data acquisition, preprocessing, model training, evaluation, deployment, and continuous monitoring through an iterative cycle connecting design, model development, and operations. Trace the infinity-loop structure in @fig-mlops-diagram to see how these phases feed back into one another continuously. The following definition captures this discipline's scope:

@@ -222,19 +222,19 @@ class SkewEconomics:
Scenario: The business impact of 1% skew-induced error on 1M daily queries.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
queries_daily = MILLION
skew_error_rate = 0.01
cost_per_error = 0.10
days_per_year = DAYS_PER_YEAR

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
annual_cost = queries_daily * skew_error_rate * cost_per_error * days_per_year

# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(annual_cost == 365_000, f"Annual cost should be 365,000, got {annual_cost:.0f}")

# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
queries_daily_str = f"{queries_daily:,}"
error_rate_pct_str = f"{int(skew_error_rate * 100)}"
error_cost_str = f"{cost_per_error:.2f}"
@@ -378,25 +378,25 @@ The abstract notion of technical debt becomes concrete when we examine cost dyna
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class AutomationROI:
"""
Namespace for Automation ROI calculation.
Scenario: Comparing manual retraining cost vs automated pipeline investment.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
hrs_manual_week = 4.0
hrs_automation_once = 80.0
time_horizon_years = 1.0

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
breakeven_weeks = hrs_automation_once / hrs_manual_week

# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(breakeven_weeks <= 26, f"Automation takes too long ({breakeven_weeks} weeks) to justify. Narrative implies fast ROI.")

# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
mc_manual_hours_str = f"{int(hrs_manual_week)}"
mc_pipeline_hours_str = f"{int(hrs_automation_once)}"
mc_breakeven_str = f"{int(breakeven_weeks)}"
@@ -774,11 +774,11 @@ Three principles organize the infrastructure that addresses these root causes: *

{{< margin-video "https://www.youtube.com/watch?v=gz-44N3MMOA&list=PLkDaE6sCZn6GMoA0wbpJLi3t34Gd8l0aK&index=33" "Data Pipelines" "MIT 6.S191" >}}

**Data consistency** requires that every artifact influencing model behavior, from raw datasets to engineered features, is versioned and reproducible. Without versioning, teams cannot trace which data produced which model, making debugging and rollback impossible. Dataset versioning tools such as DVC (Data Version Control)\index{DVC (Data Version Control)!dataset versioning}\index{Data Versioning!DVC tool} [@dvc] enable teams to version large datasets alongside code repositories managed by Git [@git], while cloud-based object storage systems such as [Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage) provide the durable, access-controlled backend for both raw and processed artifacts. @sec-ml-operations-versioning-lineage-b1cf examines implementation details including Git integration, metadata tracking, and lineage preservation. At the feature level, the **feature store**\index{Feature Store!Uber Michelangelo origin}, a concept pioneered by Uber's Michelangelo platform team in 2017, enforces consistency by computing features once and serving them identically to both training and serving pipelines. The team coined the term after realizing that feature engineering was duplicated across hundreds of ML models, and their solution became the template that inspired Feast, Tecton, and dozens of other platforms. @sec-ml-operations-feature-stores-c01c details implementation patterns for training-serving consistency.
The first requirement is data consistency: every artifact influencing model behavior, from raw datasets to engineered features, must be versioned and reproducible. Without versioning, teams cannot trace which data produced which model, making debugging and rollback impossible. Dataset versioning tools such as DVC (Data Version Control)\index{DVC (Data Version Control)!dataset versioning}\index{Data Versioning!DVC tool} [@dvc] enable teams to version large datasets alongside code repositories managed by Git [@git], while cloud-based object storage systems such as [Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage) provide the durable, access-controlled backend for both raw and processed artifacts. @sec-ml-operations-versioning-lineage-b1cf examines implementation details including Git integration, metadata tracking, and lineage preservation. At the feature level, the **feature store**\index{Feature Store!Uber Michelangelo origin}, a concept pioneered by Uber's Michelangelo platform team in 2017, enforces consistency by computing features once and serving them identically to both training and serving pipelines. The team coined the term after realizing that feature engineering was duplicated across hundreds of ML models, and their solution became the template that inspired Feast, Tecton, and dozens of other platforms. @sec-ml-operations-feature-stores-c01c details implementation patterns for training-serving consistency.

**Data freshness** ensures that models train and serve on current data rather than stale snapshots. Automated data pipelines\index{Data Pipelines!automated workflows} maintain freshness by continuously transforming raw data into analysis-ready formats through structured stages: ingestion, schema validation, deduplication, transformation, and loading. Orchestration tools including Apache Airflow\index{Workflow Orchestration!pipeline automation} [@apache_airflow], Prefect [@prefect], and dbt [@dbt] define and manage these workflows. When managed as code, pipelines support versioning, modularity, and integration with CI/CD systems, so that data flows remain synchronized with evolving model requirements.
Consistency alone is insufficient if the underlying data is stale. Data freshness ensures that models train and serve on current data rather than outdated snapshots. Automated data pipelines\index{Data Pipelines!automated workflows} maintain freshness by continuously transforming raw data into analysis-ready formats through structured stages: ingestion, schema validation, deduplication, transformation, and loading. Orchestration tools including Apache Airflow\index{Workflow Orchestration!pipeline automation} [@apache_airflow], Prefect [@prefect], and dbt [@dbt] define and manage these workflows. When managed as code, pipelines support versioning, modularity, and integration with CI/CD systems, so that data flows remain synchronized with evolving model requirements.

**Data quality** governs whether the data reaching models is accurate, complete, and consistently labeled. In supervised learning pipelines, labeling quality directly determines model ceilings. Labeling tools such as Label Studio [@label_studio] support scalable, team-based annotation with integrated audit trails and version histories, capabilities that become essential when labeling conventions evolve over time or require refinement across multiple project iterations.
The third pillar, data quality, governs whether the data reaching models is accurate, complete, and consistently labeled. In supervised learning pipelines, labeling quality directly determines model ceilings. Labeling tools such as Label Studio [@label_studio] support scalable, team-based annotation with integrated audit trails and version histories, capabilities that become essential when labeling conventions evolve over time or require refinement across multiple project iterations.

To illustrate how these three principles reinforce each other in practice, consider a predictive maintenance application in an industrial setting. A continuous stream of sensor data is ingested and joined with historical maintenance logs through a scheduled pipeline managed in Airflow (*freshness*). The resulting features, including rolling averages and statistical aggregates, are stored in a feature store for both retraining and low-latency inference (*consistency*). The entire pipeline is versioned, monitored, and integrated with the model registry (*quality*), enabling full traceability from data to deployed model predictions. Data management, organized around these three principles, establishes the operational backbone for model reproducibility, auditability, and sustained deployment at scale.

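To make the orchestration concrete, the sketch below shows roughly what such a pipeline might look like as an Airflow DAG. It is an illustrative sketch rather than the book's actual pipeline: the DAG name, task names, callables, and hourly schedule are hypothetical stand-ins for the ingestion, join, and feature-materialization stages described above.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stage implementations; real versions would read from the
# sensor stream, join against maintenance logs, and write to the feature store.
def ingest_sensor_window(**context):
    ...

def join_maintenance_logs(**context):
    ...

def materialize_features(**context):
    ...

with DAG(
    dag_id="predictive_maintenance_features",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",               # freshness: recompute every hour
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_sensors", python_callable=ingest_sensor_window)
    join = PythonOperator(task_id="join_logs", python_callable=join_maintenance_logs)
    features = PythonOperator(task_id="write_feature_store", python_callable=materialize_features)

    # Consistency: the same materialized features feed retraining and serving.
    ingest >> join >> features
```
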
@@ -1191,14 +1191,14 @@ class SilentFailureCost:
|
||||
Scenario: Comparing manual (monthly) vs automated (daily) drift detection.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
annual_revenue = 50_000_000 # $50M/year recommendation engine
|
||||
quality_drop = 0.05 # 5% conversion rate degradation
|
||||
days_manual = 28 # monthly review cycle (~4 weeks)
|
||||
days_auto = 1 # daily automated monitoring
|
||||
incidents_per_year = 4 # typical for high-drift domains
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Per-incident loss = Revenue × Quality Drop × (Detection Days / 365)
|
||||
loss_manual = annual_revenue * quality_drop * (days_manual / DAYS_PER_YEAR)
|
||||
loss_auto = annual_revenue * quality_drop * (days_auto / DAYS_PER_YEAR)
|
||||
@@ -1206,10 +1206,10 @@ class SilentFailureCost:
|
||||
savings_per_incident = loss_manual - loss_auto
|
||||
annual_savings = savings_per_incident * incidents_per_year
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(annual_savings >= 500_000, f"Annual savings (${annual_savings:,.0f}) too low to justify MLOps investment.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
annual_revenue_str = f"{int(annual_revenue // 1_000_000)}M"
|
||||
quality_drop_pct_str = f"{int(quality_drop * 100)}"
|
||||
quality_drop_str = f"{quality_drop}"
|
||||
@@ -1345,24 +1345,24 @@ import math
|
||||
from mlsys.formatting import fmt, check
|
||||
from IPython.display import Markdown
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class RetrainingInterval:
|
||||
"""Optimal retraining interval via square-root law for fraud detection."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
retrain_cost = 5000 # $5,000 per retraining run
|
||||
Q = 1_000_000 # transactions per day
|
||||
V = 0.50 # value per accuracy point
|
||||
A0 = 0.95 # initial accuracy
|
||||
lam = 0.02 # daily decay rate (2%)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
numerator = 2 * retrain_cost
|
||||
denominator = Q * V * A0 * lam
|
||||
ratio = numerator / denominator
|
||||
T_star = math.sqrt(ratio)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
retrain_cost_str = fmt(retrain_cost, precision=0, commas=True)
|
||||
Q_str = fmt(Q, precision=0, commas=True)
|
||||
V_str = fmt(V, precision=2, commas=False)
|
||||
@@ -1722,7 +1722,7 @@ These tools and practices, along with distributed orchestration frameworks like

#### Model Format Optimization {#sec-ml-operations-model-format-optimization-c9d6}

A PyTorch model that achieves state-of-the-art accuracy on a benchmark may serve predictions at 200 ms latency in production — ten times slower than the SLO requires. The gap between research frameworks and production serving is often substantial, and format optimization\index{Model Optimization!format conversion} bridges it. Optimized formats routinely achieve 2--10$\times$ latency improvements over naive deployment by converting models into representations tailored for specific hardware. The inference runtimes and precision strategies detailed in @sec-model-serving-inference-runtime-selection-5eef and @sec-model-serving-precision-selection-serving-55ba provide the technical foundations; this section focuses on the operational workflow.
A PyTorch model that achieves top accuracy on a benchmark may serve predictions at 200 ms latency in production — ten times slower than the SLO requires. The gap between research frameworks and production serving is often substantial, and format optimization\index{Model Optimization!format conversion} bridges it. Optimized formats routinely achieve 2--10$\times$ latency improvements over naive deployment by converting models into representations tailored for specific hardware. The inference runtimes and precision strategies detailed in @sec-model-serving-inference-runtime-selection-5eef and @sec-model-serving-precision-selection-serving-55ba provide the technical foundations; this section focuses on the operational workflow.

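As a concrete illustration of the conversion step, the sketch below exports a PyTorch module to ONNX and runs it under ONNX Runtime. The model, file name, and input shape are placeholders, not part of the book's pipeline; actual latency gains depend on the runtime and hardware choices discussed in the serving chapter.

```python
import torch
import onnxruntime as ort

# Placeholder model and input; substitute the trained network and real shapes.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
model.eval()
dummy_input = torch.randn(1, 128)

# Export to ONNX, pinning the operator set and naming the graph inputs/outputs.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["features"],
    output_names=["scores"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)

# Serve with ONNX Runtime; the session selects an execution provider for the host.
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"features": dummy_input.numpy()})
```
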
##### Optimization Frameworks {#sec-ml-operations-optimization-frameworks-d013}

@@ -1789,14 +1789,14 @@ Regardless of which serving paradigm is used (online, offline, or near-online, a
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LatencyBudget:
|
||||
"""
|
||||
Namespace for Latency Budget Breakdown.
|
||||
Scenario: Allocating components for a 100ms P99 SLO.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
slo_p99 = 100
|
||||
|
||||
network = 15
|
||||
@@ -1804,13 +1804,13 @@ class LatencyBudget:
|
||||
inference = 45
|
||||
post_proc = 15
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
total = network + feature_fetch + inference + post_proc
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total == slo_p99, f"Component budgets sum to {total}ms, but SLO is {slo_p99}ms.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
slo_p99_str = f"{slo_p99}"
|
||||
network_str = f"{network}"
|
||||
feature_fetch_str = f"{feature_fetch}"
|
||||
@@ -1993,20 +1993,20 @@ $$\text{Cost per 1K inferences} = \frac{\text{Hourly GPU cost} \times 1000}{\tex
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CostPerInference:
|
||||
"""Unit inference cost for an A100 instance at 50K inferences/hour."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
hourly_gpu_cost = 3.00 # $3/hour A100 instance
|
||||
inferences_per_hour = 50_000
|
||||
batch_size = 1000 # cost denominator: per 1K inferences
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_per_inference = hourly_gpu_cost / inferences_per_hour
|
||||
cost_per_batch = cost_per_inference * batch_size
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
hourly_gpu_cost_str = f"{hourly_gpu_cost:.0f}"
|
||||
inferences_per_hour_str = fmt(inferences_per_hour, precision=0, commas=True)
|
||||
cost_per_batch_str = f"{cost_per_batch:.2f}"
|
||||
@@ -2045,11 +2045,11 @@ Effective monitoring spans both model behavior and infrastructure performance. O
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DriftDetectionDelay:
|
||||
"""Minimum time to detect a 5% accuracy drop at 1 QPS vs. 100 req/day."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
baseline_acc = 0.95 # original accuracy
|
||||
drop = 0.05 # 5% drop to detect
|
||||
confidence = 0.95 # 95% statistical confidence
|
||||
@@ -2057,13 +2057,13 @@ class DriftDetectionDelay:
|
||||
# ~1000 samples for 5% diff at 95% confidence (rule of thumb)
|
||||
samples_needed = 1000
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
target_acc = baseline_acc - drop
|
||||
seconds_needed_high = samples_needed / qps_high
|
||||
minutes_needed_high = seconds_needed_high / 60
|
||||
days_needed_low = samples_needed / 100 # 100 requests per day
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
baseline_acc_pct_str = f"{baseline_acc * 100:.0f}"
|
||||
drop_pct_str = f"{drop * 100:.0f}"
|
||||
target_acc_pct_str = f"{target_acc * 100:.0f}"
|
||||
@@ -2392,11 +2392,11 @@ Translating these unit costs into a concrete budget estimate clarifies the real
|
||||
from mlsys.constants import byte, GB
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MonitoringBudget:
|
||||
"""Monthly monitoring infrastructure cost for a single ML Node."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_variants = 3 # prod, canary, staging
|
||||
metrics_per_variant = 50 # metrics per deployment
|
||||
samples_per_min = 4 # 15-second intervals
|
||||
@@ -2413,7 +2413,7 @@ class MonitoringBudget:
|
||||
work_days = 22
|
||||
query_cost_per = 0.02 # $0.02 per query
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
datapoints_mo = n_variants * metrics_per_variant * (
|
||||
samples_per_min * mins_per_hour * hours_per_day * days
|
||||
)
|
||||
@@ -2425,7 +2425,7 @@ class MonitoringBudget:
|
||||
query_cost = queries_mo * query_cost_per
|
||||
total_cost = ingestion_cost + storage_cost + query_cost
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_variants_str = f"{n_variants}"
|
||||
metrics_per_variant_str = f"{metrics_per_variant}"
|
||||
samples_per_min_str = f"{samples_per_min}"
|
||||
@@ -2501,7 +2501,7 @@ This scales linearly. Platform teams managing 50+ models face additional constra

###### Cost Optimization Strategies {.unnumbered}

The dominant cost driver in monitoring infrastructure is metric cardinality — high-cardinality labels such as user_id or request_id create a combinatorial explosion in storage requirements that can dwarf compute costs by an order of magnitude. Addressing cardinality through sampling or aggregation for high-cardinality dimensions typically yields the largest immediate savings. The second-largest cost driver is temporal resolution: storing all metrics at 15-second granularity for 30 days is rarely necessary, yet it is the default in most monitoring systems. A tiered retention policy — high-resolution (15s) for 24 hours, downsampled to 1-minute for 7 days, and 5-minute for 30 days — reduces storage by 80–90% while preserving the ability to investigate recent incidents at full fidelity. Dashboard query costs accumulate more subtly: each refresh triggers queries against the metrics backend, and default auto-refresh intervals (often 30 seconds) across dozens of dashboards and users generate continuous query load even when no one is actively watching. Setting 5-minute refresh intervals for non-critical dashboards and auto-pausing inactive tabs can reduce query costs by 60–80%. Finally, alert configuration affects both compute costs and operational effectiveness — consolidating related alerts into multi-condition rules reduces evaluation overhead while also reducing alert fatigue, aligning cost optimization with operational quality.
The dominant cost driver in monitoring infrastructure is metric cardinality — high-cardinality labels such as user_id or request_id create a combinatorial explosion in storage requirements that can dwarf compute costs by an order of magnitude. Addressing cardinality through sampling or aggregation for high-cardinality dimensions typically yields the largest immediate savings. The second-largest cost driver is temporal resolution: storing all metrics at 15-second granularity for 30 days is rarely necessary, yet it is the default in most monitoring systems. A tiered retention policy (high-resolution at 15s for 24 hours, downsampled to 1-minute for 7 days, and 5-minute for 30 days) reduces storage by 80–90% while preserving the ability to investigate recent incidents at full fidelity. Dashboard query costs accumulate more subtly: each refresh triggers queries against the metrics backend, and default auto-refresh intervals (often 30 seconds) across dozens of dashboards and users generate continuous query load even when no one is actively watching. Setting 5-minute refresh intervals for non-critical dashboards and auto-pausing inactive tabs can reduce query costs by 60–80%. Finally, alert configuration affects both compute costs and operational effectiveness — consolidating related alerts into multi-condition rules reduces evaluation overhead while also reducing alert fatigue, aligning cost optimization with operational quality.

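The arithmetic behind that 80–90% figure is easy to verify. The sketch below counts per-metric samples for a flat 15-second, 30-day policy versus the tiered policy quoted above; the interpretation of the tiers (full resolution for day 1, 1-minute for days 2–7, 5-minute thereafter) is an assumption made for illustration.

```python
# Per-metric sample counts over a 30-day window (illustrative assumption:
# non-overlapping tiers -- day 1 at 15 s, days 2-7 at 1 min, days 8-30 at 5 min).
SEC_PER_DAY = 24 * 3600

flat = 30 * SEC_PER_DAY // 15          # 15 s resolution for all 30 days

tiered = (
    1 * SEC_PER_DAY // 15              # day 1 at 15 s
    + 6 * SEC_PER_DAY // 60            # days 2-7 at 1 min
    + 23 * SEC_PER_DAY // 300          # days 8-30 at 5 min
)

reduction = 1 - tiered / flat
print(f"{flat:,} vs {tiered:,} samples -> {reduction:.0%} reduction")
# roughly 173k vs 21k samples per metric, an ~88% reduction
```
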
###### Cost-Benefit Framework {.unnumbered}

@@ -2527,19 +2527,19 @@ $$\text{Monitoring ROI} = \frac{\text{Incidents Prevented} \times \text{Avg Inci
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MonitoringROI:
"""5-incident ROI calculation for $50K monitoring spend."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
incidents_prevented = 5
avg_incident_cost = 50_000
annual_monitoring_cost = 50_000

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
monitoring_roi = (incidents_prevented * avg_incident_cost) / annual_monitoring_cost

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
incidents_prevented_str = fmt(incidents_prevented, precision=0, commas=False)
avg_incident_cost_str = fmt(avg_incident_cost, precision=0, commas=True)
annual_monitoring_cost_str = fmt(annual_monitoring_cost, precision=0, commas=True)
@@ -2675,9 +2675,9 @@ Debugging ML systems requires both systematic methodology and domain expertise.
|
||||
|
||||
#### On-Call Practices for ML Systems {#sec-ml-operations-oncall-practices-ml-systems-5191}
|
||||
|
||||
\index{On-Call Rotation!ML-specific requirements}The debugging techniques above work when an engineer is actively investigating an issue during business hours. But production systems fail at 3:00 AM on weekends, and the person responding may not be the one who built the model. Debugging resolves individual incidents; on-call practices\index{On-Call Practices!ML systems} sustain operational health over time by ensuring that *someone* with appropriate expertise is always available and equipped to respond. On-call rotation for ML systems requires specialized practices beyond traditional software operations, since ML incidents often manifest as gradual degradation rather than hard failures. A traditional software engineer responding to an alert can typically trace a stack trace to a root cause within minutes. An ML engineer facing a 3% accuracy drop must first determine whether the change represents statistical noise, legitimate concept drift, or a critical failure requiring immediate rollback. This distinction demands statistical context rather than simple log analysis.
|
||||
\index{On-Call Rotation!ML-specific requirements}The debugging techniques above work when an engineer is actively investigating an issue during business hours. Production systems, however, fail at 3:00 AM on weekends, and the person responding may not be the one who built the model. Debugging resolves individual incidents; on-call practices\index{On-Call Practices!ML systems} sustain operational health over time by ensuring that *someone* with appropriate expertise is always available and equipped to respond. On-call rotation for ML systems requires specialized practices beyond traditional software operations, since ML incidents often manifest as gradual degradation rather than hard failures. A traditional software engineer responding to an alert can typically trace a stack trace to a root cause within minutes. An ML engineer facing a 3% accuracy drop must first determine whether the change represents statistical noise, legitimate concept drift, or a critical failure requiring immediate rollback. This distinction demands statistical context rather than simple log analysis.
|
||||
|
||||
This ambiguity compounds with delayed impact visibility. Unlike latency spikes that surface immediately in dashboards, ML degradation may take hours or days to manifest in business metrics. A recommendation model that began serving slightly worse suggestions on Monday might not produce measurable revenue impact until Friday, by which time the window for easy diagnosis has closed. Cross-system dependencies further complicate response: ML issues often originate in upstream data systems owned by different teams, requiring coordination across organizational boundaries during incident response. Perhaps most critically, effective response demands understanding model behavior, not just infrastructure health. A database administrator can restart a crashed service without understanding its business logic, but an ML engineer cannot meaningfully debug accuracy degradation without understanding the model's feature dependencies and expected behavior patterns.
|
||||
This ambiguity compounds with delayed impact visibility. Unlike latency spikes that surface immediately in dashboards, ML degradation may take hours or days to manifest in business metrics. A recommendation model that began serving slightly worse suggestions on Monday might not produce measurable revenue impact until Friday, by which time the window for easy diagnosis has closed. Cross-system dependencies further complicate response: ML issues often originate in upstream data systems owned by different teams, requiring coordination across organizational boundaries during incident response. The deepest challenge is that effective response demands understanding model behavior, not just infrastructure health. A database administrator can restart a crashed service without understanding its business logic, but an ML engineer cannot meaningfully debug accuracy degradation without understanding the model's feature dependencies and expected behavior patterns.
|
||||
|
||||
These challenges motivate tiered escalation structures\index{Tiered Escalation!incident response} that match expertise to incident complexity. @tbl-oncall-structure illustrates a recommended on-call structure for ML teams, where primary responders handle routine issues using standardized runbooks while escalation paths connect to specialists capable of deeper investigation.
|
||||
|
||||
@@ -2820,18 +2820,18 @@ Resource justification requires translating technical requirements into business
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FraudDetectionImprovement:
|
||||
"""Business framing of a 92% → 94% fraud detection accuracy improvement."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
current_rate = 0.92
|
||||
target_rate = 0.94
|
||||
infra_cost_increase = 0.30 # 30% higher infrastructure cost
|
||||
annual_loss_prevented = 2_000_000
|
||||
false_positives_reduced = 50_000 # fewer false alerts/month
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
current_rate_pct_str = f"{current_rate * 100:.0f}"
|
||||
target_rate_pct_str = f"{target_rate * 100:.0f}"
|
||||
infra_cost_increase_pct_str = f"{infra_cost_increase * 100:.0f}"
|
||||
@@ -2905,7 +2905,7 @@ Checking boxes is necessary but not sufficient. Production readiness requires un

## Design and Maturity Framework {#sec-ml-operations-system-design-maturity-framework-9901}

The infrastructure and practices examined above are not adopted all at once. Organizations evolve through distinct maturity stages, from ad hoc experimentation to fully automated operations. Understanding where a team stands on this continuum — and what investments yield the highest returns at each stage — is as important as knowing the technical components themselves [@paleyes2022challenges]. This section first defines *operational maturity* as the systemic integration of practices, then identifies concrete *maturity levels* that describe stages of organizational evolution. With these levels established, we examine how maturity shapes system design, identify recurring design patterns, contextualize MLOps within domain-specific constraints through two case studies, and conclude with the investment economics that govern how organizations should prioritize their progression.
A startup deploys its first ML model with a Jupyter notebook, a cron job, and a prayer. A Fortune 500 company runs thousands of models through automated pipelines with drift detection, canary deployments, and continuous validation. Both are doing "MLOps," yet the gap between them spans orders of magnitude in reliability, cost efficiency, and engineering velocity. Organizations evolve through distinct maturity stages, from ad hoc experimentation to fully automated operations, and understanding where a team stands on this continuum — and what investments yield the highest returns at each stage — is as important as knowing the technical components themselves [@paleyes2022challenges]. This section first defines *operational maturity* as the systemic integration of practices, then identifies concrete *maturity levels* that describe stages of organizational evolution. With these levels established, we examine how maturity shapes system design, identify recurring design patterns, contextualize MLOps within domain-specific constraints through two case studies, and conclude with the investment economics that govern how organizations should prioritize their progression.

### Operational Maturity {#sec-ml-operations-operational-maturity-3d14}

@@ -3084,23 +3084,23 @@ $$\text{Annual ROI} = \frac{\text{Incidents Avoided} \times \text{Avg Incident C
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class SingleModelROI:
|
||||
"""Annual ROI for a $30K MLOps investment on a $1M revenue model."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
incidents_avoided = 4 # incidents/year prevented
|
||||
incident_cost = 25_000 # $25K per incident
|
||||
hours_saved_monthly = 20 # deployment time saved per month
|
||||
hourly_cost = 150 # $150/hr engineering cost
|
||||
mlops_investment = 30_000 # $30K/year MLOps spend
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
incident_savings = incidents_avoided * incident_cost
|
||||
time_savings = hours_saved_monthly * 12 * hourly_cost
|
||||
single_model_roi = (incident_savings + time_savings) / mlops_investment
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
incidents_avoided_str = f"{incidents_avoided}"
|
||||
incident_cost_str = f"{incident_cost // 1000:.0f}K"
|
||||
hours_saved_monthly_str = f"{hours_saved_monthly}"
|
||||
@@ -3141,7 +3141,7 @@ The technical infrastructure and economic framework above provide the foundation
|
||||
|
||||
## Case Studies {#sec-ml-operations-case-studies-641d}
|
||||
|
||||
The principles, patterns, and infrastructure examined throughout this chapter converge in real-world implementations. We examine two cases representing distinct deployment contexts: the Oura Ring, where pipeline debt and configuration management challenge resource-constrained edge environments, and ClinAIOps, where feedback loops and governance requirements drive specialized healthcare operations. The following *principle mapping guide* structures the comparison.
|
||||
A sleep-tracking ring with 16 KB of RAM and a blood-pressure monitor governed by FDA regulations both run ML models in production, yet their operational constraints share almost nothing in common. The principles, patterns, and infrastructure examined throughout this chapter converge differently depending on the deployment context. We examine two cases: the Oura Ring, where pipeline debt and configuration management challenge resource-constrained edge environments, and ClinAIOps, where feedback loops and governance requirements drive specialized healthcare operations. The following *principle mapping guide* structures the comparison.
|
||||
|
||||
::: {.callout-example title="Principle Mapping Guide"}
|
||||
As you read these case studies, look for how each implements the five foundational MLOps principles:
|
||||
@@ -3224,7 +3224,7 @@ This case exemplifies how MLOps principles adapt to domain-specific constraints.
|
||||
|
||||
[^fn-ctm-clinical-ops]: **Continuous Therapeutic Monitoring (CTM)**: Healthcare approach using wearable sensors for real-time physiological data collection and personalized treatment adjustments. CTM forces MLOps to confront constraints absent in typical deployments: feedback loops must include human-in-the-loop approval for safety-critical decisions, retraining requires clinician-validated labels rather than implicit signals, and model updates must satisfy regulatory compliance before deployment. These constraints reshape every MLOps principle, making CTM a stress test for operational maturity. \index{CTM!clinical MLOps constraints}
|
||||
|
||||
CTM leverages wearable sensors to collect real-time physiological and behavioral data from patients. AI systems must be integrated into clinical workflows, aligned with regulatory requirements, and designed to augment rather than replace human decision-making. The traditional MLOps paradigm does not adequately account for patient safety, clinician judgment, and ethical constraints.
|
||||
CTM uses wearable sensors to collect real-time physiological and behavioral data from patients. AI systems must be integrated into clinical workflows, aligned with regulatory requirements, and designed to augment rather than replace human decision-making. The traditional MLOps paradigm does not adequately account for patient safety, clinician judgment, and ethical constraints.
|
||||
|
||||
ClinAIOps\index{ClinAIOps!healthcare ML operations} [@chen2023framework], a framework for operationalizing AI in clinical environments, shows how MLOps principles must evolve for regulatory and human-centered requirements. Unlike conventional MLOps, ClinAIOps directly addresses **feedback loop** challenges by designing them into the system architecture. The framework's structured coordination between patients, clinicians, and AI systems represents practical implementation of **governance and collaboration** principles.
|
||||
|
||||
@@ -3724,3 +3724,10 @@ We have built a system that is efficient, scalable, and reliable. A system can a

::: {.quiz-end}
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:ml_ops")
```

@@ -90,14 +90,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MLSystemsSetup:
|
||||
"""
|
||||
Namespace for ML Systems chapter overview statistics.
|
||||
Scenario: Deployment paradigms (Cloud/Edge/Mobile/Tiny) and Lighthouse Models.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Tiers
|
||||
t_mobile = Tiers.Mobile
|
||||
t_cloud = Tiers.Cloud
|
||||
@@ -136,14 +136,14 @@ class MLSystemsSetup:
|
||||
mobile_npu_tops = h_phone.peak_flops.m_as(TFLOPs/second)
|
||||
phone_battery_wh = h_phone.battery_capacity.m_as('Wh') if h_phone.battery_capacity else 15
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# GPT-3 Petaflop-days calculation using standardized units
|
||||
gpt3_petaflop_days = (m_gpt3.training_ops / (PFLOPs * SEC_PER_DAY)).to_base_units().m_as(ureg.dimensionless)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(gpt3_petaflop_days >= 3000, f"GPT-3 training should be >=3000 PF-days, got {gpt3_petaflop_days:.0f}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
mobile_ram_range_str = mobile_ram_range
|
||||
mobile_storage_range_str = mobile_storage_range
|
||||
mobile_bw_range_str = mobile_bw_range
|
||||
@@ -174,7 +174,7 @@ class MLSystemsSetup:
|
||||
# DLRM Embedding (using Models Twin)
|
||||
dlrm_embedding_str = fmt(Models.DLRM.model_size.m_as(GB), precision=0)
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ThrottlingScenario:
|
||||
"""
|
||||
Namespace for illustrative mobile thermal throttling.
|
||||
@@ -228,7 +228,7 @@ What makes these systems so different? The physical constraints that govern each
|
||||
|
||||
These physical constraints interact with the **Iron Law of ML Systems** (@sec-introduction-iron-law-ml-systems-c32a), which decomposes end-to-end latency into data movement, computation, and overhead. Different deployment environments stress different terms of this equation: cloud systems are typically compute-bound, mobile systems hit power walls, and TinyML devices are memory-capacity-limited. By pairing the physical constraints with the Iron Law, we develop a quantitative vocabulary for reasoning about *which* paradigm fits a given workload and *why*. To anchor this analysis concretely, the chapter introduces five **Lighthouse Models**—ResNet-50, GPT-2, DLRM, MobileNet, and a Keyword Spotter—that span the deployment spectrum and isolate distinct system bottlenecks. These reference workloads recur throughout the book, providing a consistent basis for comparing optimization techniques across chapters.
|
||||
|
||||
The chapter proceeds in three stages. First, we examine the physics that creates the paradigm boundaries and develop the analytical tools (Iron Law, Bottleneck Principle, Workload Archetypes) for mapping workloads to deployment targets. Second, we trace each paradigm in depth, analyzing the infrastructure, trade-offs, and representative applications that define each regime. Third, we develop a comparative decision framework and explore the hybrid architectures that combine paradigms when no single deployment target satisfies all requirements.
|
||||
The physics that creates these paradigm boundaries comes first, followed by the analytical tools (Iron Law, Bottleneck Principle, Workload Archetypes) for mapping workloads to deployment targets. Each paradigm then receives an in-depth treatment covering infrastructure, trade-offs, and representative applications. The chapter closes with a comparative decision framework and the hybrid architectures that combine paradigms when no single deployment target satisfies all requirements.
|
||||
|
||||
These four paradigms function as distinct operating envelopes, each defined by how much power, memory, and network connectivity is available. Every ML application must fit within at least one of these envelopes, and that fit determines which algorithms, hardware, and engineering trade-offs apply. The four paradigms span a continuous spectrum from centralized cloud infrastructure to distributed ultra-low-power devices. @fig-cloud-edge-TinyML-comparison traces this spectrum visually, mapping where each paradigm sits along the centralization axis, while @tbl-deployment-paradigms-overview pins down the quantitative trade-offs.
|
||||
|
||||
@@ -445,27 +445,27 @@ $$\text{Latency}_{\min} = \frac{2 \times \text{Distance}}{c_{\text{fiber}}} \app
|
||||
from mlsys.constants import SPEED_OF_LIGHT_FIBER_KM_S, ureg
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LightLatency:
|
||||
"""
|
||||
Namespace for Light-Speed Latency calculation.
|
||||
Scenario: Cross-country packet transmission (CA to VA) vs 10ms budget.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
distance_km = 3600 * ureg.km # California to Virginia (straight-line)
|
||||
safety_budget_ms = 10
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Latency = (Distance * 2) / Speed of Light (Round-trip time)
|
||||
min_latency = (distance_km * 2) / SPEED_OF_LIGHT_FIBER_KM_S
|
||||
min_latency_ms = min_latency.m_as(ureg.ms)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(min_latency_ms > safety_budget_ms,
|
||||
f"Physics allows cloud ({min_latency_ms:.1f}ms) within {safety_budget_ms}ms budget!")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
min_latency_str = fmt(min_latency_ms, precision=0, commas=False)
|
||||
distance_str = f"{distance_km.m_as('km'):,}"
|
||||
|
||||
@@ -505,24 +505,24 @@ Doubling clock frequency required approximately 8$\times$ more power. The breakd
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MemoryWall:
|
||||
"""
|
||||
Namespace for the Memory Wall calculation.
|
||||
Scenario: Comparing annual growth rates of Compute vs Memory Bandwidth.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
compute_growth_annual = 1.6 # 60% increase/year
|
||||
mem_bw_growth_annual = 1.2 # 20% increase/year
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
divergence_ratio = compute_growth_annual / mem_bw_growth_annual
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(divergence_ratio > 1.0, "Memory is keeping up with Compute (Gap <= 1x).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
compute_growth_str = fmt(compute_growth_annual, precision=1, commas=False)
|
||||
mem_bw_growth_str = fmt(mem_bw_growth_annual, precision=1, commas=False)
|
||||
mem_wall_ratio_str = fmt(divergence_ratio, precision=2, commas=False)
|
||||
@@ -594,28 +594,28 @@ This principle dictates that if your system is **Memory Bound**\index{memory-bou
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EnergyTransmission:
|
||||
"""
|
||||
Namespace for Energy of Transmission vs Compute.
|
||||
Scenario: Cost of sending 1MB to cloud vs running MobileNet locally.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
data_size_mb = 1.0 # 1 sec audio
|
||||
tx_energy_per_mb = 100.0 # mJ/MB (Wi-Fi/LTE)
|
||||
local_energy_op = 0.1 # mJ/inference (MobileNet on NPU)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
cloud_energy_total = data_size_mb * tx_energy_per_mb
|
||||
local_energy_total = local_energy_op
|
||||
|
||||
ratio = cloud_energy_total / local_energy_total
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(ratio >= 500, f"Transmission ({cloud_energy_total}mJ) is not expensive enough vs Compute ({local_energy_total}mJ). Ratio: {ratio:.1f}x")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
data_mb_str = fmt(data_size_mb, precision=0, commas=False)
|
||||
tx_energy_str = fmt(tx_energy_per_mb, precision=0, commas=False)
|
||||
compute_energy_str = fmt(local_energy_op, precision=1, commas=False)
|
||||
@@ -663,22 +663,22 @@ $$T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot \eta} + L_{lat}$$

:::

To summarize: the Iron Law tells you the *cost of each ingredient*; the Bottleneck Principle tells you the *speed of the assembly line*. As a rule of thumb, use the **additive form** (@eq-iron-law) when analyzing the **latency** of a single task, and the **max form** (@eq-bottleneck) when analyzing the **throughput** of a continuous stream of tasks.
The Iron Law tells you the *cost of each ingredient*; the Bottleneck Principle tells you the *speed of the assembly line*. As a rule of thumb, use the **additive form** (@eq-iron-law) when analyzing the **latency** of a single task, and the **max form** (@eq-bottleneck) when analyzing the **throughput** of a continuous stream of tasks.

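A small worked example shows how the two forms differ in practice. The numbers below are assumptions chosen for arithmetic convenience, not measurements of any particular system in this chapter.

```python
# Assumed single-task workload (illustrative values only).
data_bytes  = 0.5e9    # 0.5 GB moved per task
bandwidth   = 25e9     # 25 GB/s effective bandwidth
ops         = 8e9      # 8 GFLOP of work per task
peak_flops  = 4e12     # 4 TFLOP/s peak compute
efficiency  = 0.5      # fraction of peak actually achieved
fixed_lat_s = 0.005    # 5 ms of launch/queue/network overhead

t_data     = data_bytes / bandwidth            # 20 ms
t_compute  = ops / (peak_flops * efficiency)   # 4 ms
t_overhead = fixed_lat_s                       # 5 ms

# Additive form: latency of one task (all terms paid in sequence).
latency_s = t_data + t_compute + t_overhead    # ~29 ms

# Max form: steady-state throughput of a pipelined stream is limited by the
# slowest stage, here the 20 ms data-movement term.
bottleneck_s = max(t_data, t_compute)
throughput_per_s = 1 / bottleneck_s            # ~50 tasks/s

print(f"latency ~ {latency_s * 1e3:.0f} ms, throughput ~ {throughput_per_s:.0f} tasks/s")
```
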
### Workload Archetypes {#sec-ml-systems-workload-archetypes-fd10}
|
||||
|
||||
\index{D·A·M taxonomy!workload classification}
|
||||
Now that we understand bottlenecks, we can classify workloads by which constraint dominates. Recall the **D·A·M taxonomy** from @sec-introduction: every ML system comprises **Data**, **Algorithm**, and **Machine**. Different deployment environments create different bottlenecks along these axes—a cloud server with terabytes of memory faces Algorithm constraints, while a microcontroller with kilobytes faces Machine constraints.
|
||||
The Bottleneck Principle raises an immediate question: for a given workload, which constraint dominates? The answer depends on the **D·A·M taxonomy** from @sec-introduction, which decomposes every ML system into **Data**, **Algorithm**, and **Machine**. Different deployment environments create different bottlenecks along these axes—a cloud server with terabytes of memory faces Algorithm constraints, while a microcontroller with kilobytes faces Machine constraints.
|
||||
|
||||
To navigate these constraints systematically, we categorize ML workloads into four **Archetypes**\index{Workload Archetypes}[^fn-archetype-bottleneck]. These represent the primary physical bottlenecks, not just specific model architectures. We introduce each archetype briefly here; the Lighthouse Models that follow will ground each category in concrete, recurring examples.
|
||||
|
||||
**Archetype I: The Compute Beast**\index{arithmetic intensity!high intensity workloads}. These workloads perform many calculations per byte of data loaded. The binding constraint is raw computational throughput. Training large neural networks falls into this category.
|
||||
The first archetype, the **Compute Beast**\index{arithmetic intensity!high intensity workloads}, describes workloads that perform many calculations per byte of data loaded. The binding constraint is raw computational throughput. Training large neural networks falls into this category.
|
||||
|
||||
**Archetype II: The Bandwidth Hog**\index{autoregressive generation!memory-bound}. These workloads spend more time loading data than computing. Memory bandwidth becomes the binding constraint. Autoregressive text generation (like ChatGPT producing one token at a time) falls into this category.
|
||||
The second archetype, the **Bandwidth Hog**\index{autoregressive generation!memory-bound}, describes workloads that spend more time loading data than computing. Memory bandwidth becomes the binding constraint. Autoregressive text generation (like ChatGPT producing one token at a time) falls into this category.
|
||||
|
||||
**Archetype III: The Sparse Scatter**\index{embedding tables!memory capacity bound}. Irregular memory access patterns with poor cache locality define this archetype. Memory capacity and access latency constrain performance. Recommendation systems with massive embedding tables are canonical examples.
|
||||
The third archetype, the **Sparse Scatter**\index{embedding tables!memory capacity bound}, describes workloads with irregular memory access patterns and poor cache locality. Memory capacity and access latency constrain performance. Recommendation systems with massive embedding tables are canonical examples.
|
||||
|
||||
**Archetype IV: The Tiny Constraint**\index{energy per inference!binding constraint}\index{always-on sensing!power constraints}. Extreme power envelopes ($< 1$ mW) and memory limits ($< 256$ KB) characterize these workloads. The binding constraint is energy per inference—efficiency, not raw speed. Always-on sensing operates in this regime.
|
||||
The fourth archetype, the **Tiny Constraint**\index{energy per inference!binding constraint}\index{always-on sensing!power constraints}, describes workloads operating under extreme power envelopes ($< 1$ mW) and memory limits ($< 256$ KB). The binding constraint is energy per inference—efficiency, not raw speed. Always-on sensing operates in this regime.
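
The four labels can be read as a simple decision procedure: compare the workload's arithmetic intensity to the machine's ridge point, its working set to the machine's memory, and its power budget to the TinyML envelope. The sketch below encodes that procedure with assumed thresholds and order-of-magnitude example profiles; real workloads are messier, so treat it as an intuition aid rather than a classifier.

```python
# Minimal sketch (assumed thresholds and example numbers, not book constants):
# map a workload's D·A·M profile to the four archetypes described above.
def classify(ai_flops_per_byte, ridge_flops_per_byte,
             working_set_gb, device_mem_gb, power_budget_mw):
    """Return the archetype whose binding constraint dominates this workload."""
    if power_budget_mw < 1:                        # sub-milliwatt envelope
        return "IV: Tiny Constraint"
    if working_set_gb > device_mem_gb:             # working set exceeds device memory
        return "III: Sparse Scatter"
    if ai_flops_per_byte >= ridge_flops_per_byte:  # above the ridge point: compute bound
        return "I: Compute Beast"
    return "II: Bandwidth Hog"                     # below the ridge point: memory bound

# Example profiles (illustrative guesses only):
print(classify(200, 160, 40, 80, 300_000))    # dense training step      -> Compute Beast
print(classify(1, 160, 14, 80, 300_000))      # token-by-token decoding  -> Bandwidth Hog
print(classify(5, 160, 400, 80, 300_000))     # 400 GB embedding tables  -> Sparse Scatter
print(classify(50, 160, 0.0003, 0.001, 0.5))  # always-on keyword spotting -> Tiny Constraint
```
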
These archetypes map naturally to deployment paradigms: **Compute Beasts** and **Sparse Scatter** workloads gravitate toward **Cloud ML** where resources are abundant. **Bandwidth Hogs** span Cloud and Edge depending on latency requirements. **Tiny Constraint** workloads are exclusively **TinyML** territory. To make these abstractions concrete, we anchor each archetype to a specific model that recurs throughout this book as one of *five reference workloads*.
|
||||
|
||||
@@ -735,14 +735,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LighthouseModels:
|
||||
"""
|
||||
Namespace for Lighthouse Models statistics.
|
||||
Scenario: Quantifying the 5 reference workloads.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
m_resnet = Models.ResNet50
|
||||
m_gpt2 = Models.GPT2
|
||||
m_llama = Models.Language.Llama2_70B
|
||||
@@ -750,7 +750,7 @@ class LighthouseModels:
|
||||
m_mobilenet = Models.MobileNetV2
|
||||
m_kws = Models.Tiny.DS_CNN
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
resnet_flops_g = RESNET50_FLOPs.m_as(GFLOPs)
|
||||
resnet_params_m = m_resnet.parameters.m_as(Mparam)
|
||||
resnet_fp32_mb = m_resnet.size_in_bytes(4 * byte).m_as(MB)
|
||||
@@ -768,11 +768,11 @@ class LighthouseModels:
|
||||
kws_params = m_kws.parameters.m_as(Kparam)
|
||||
kws_size_kb = m_kws.size_in_bytes(4 * byte).m_as(KB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(resnet_fp32_mb >= 90, f"ResNet50 size should be ~98MB, got {resnet_fp32_mb:.0f}MB")
|
||||
check(mobilenet_flops_reduction > 10, "MobileNet reduction should be >10x vs ResNet.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet_gflops_str = fmt(resnet_flops_g, precision=1)
|
||||
resnet_params_m_str = fmt(resnet_params_m, precision=1)
|
||||
resnet_fp32_mb_str = fmt(resnet_fp32_mb, precision=0)
|
||||
@@ -817,15 +817,15 @@ Throughout this book, we use five Lighthouse Models introduced in @sec-introduct
|
||||
|
||||
To ground the abstract interdependencies of the Iron Law in concrete practice, we analyze the Lighthouse Models introduced in @sec-introduction. The following summaries recap each workload from a systems perspective, connecting them to the specific Iron Law bottlenecks they exemplify.
|
||||
|
||||
**ResNet-50**\index{ResNet-50!systems characteristics} classifies images into 1,000 categories, processing each image through approximately `{python} resnet_gflops_str` billion floating-point operations using `{python} resnet_params_m_str` million parameters (`{python} resnet_fp32_mb_str` MB at FP32). Used in medical imaging diagnostics, autonomous vehicle perception pipelines, and as the backbone for content moderation systems, its regular, compute-dense structure makes it the canonical benchmark for hardware accelerator performance.
|
||||
The first lighthouse, **ResNet-50**\index{ResNet-50!systems characteristics}, classifies images into 1,000 categories, processing each image through approximately `{python} resnet_gflops_str` billion floating-point operations using `{python} resnet_params_m_str` million parameters (`{python} resnet_fp32_mb_str` MB at FP32). Used in medical imaging diagnostics, autonomous vehicle perception pipelines, and as the backbone for content moderation systems, its regular, compute-dense structure makes it the canonical benchmark for hardware accelerator performance.
|
||||
|
||||
**GPT-2 / Llama**\index{GPT-2!autoregressive bottleneck}\index{Llama!memory-bound inference} power chatbots, code assistants, and content generation tools. These models generate text one token at a time, requiring the model to read its full parameter set (`{python} gpt2_params_b_str` billion for GPT-2, `{python} llama_range_str` billion for Llama) from memory for each output token. This sequential memory access pattern creates the autoregressive bottleneck that dominates serving costs.
|
||||
The language models **GPT-2 / Llama**\index{GPT-2!autoregressive bottleneck}\index{Llama!memory-bound inference} power chatbots, code assistants, and content generation tools. These models generate text one token at a time, requiring the model to read its full parameter set (`{python} gpt2_params_b_str` billion for GPT-2, `{python} llama_range_str` billion for Llama) from memory for each output token. This sequential memory access pattern creates the autoregressive bottleneck that dominates serving costs.
|
||||
|
||||
**DLRM**\index{DLRM!memory capacity bound}\index{recommendation systems!DLRM} (Deep Learning Recommendation Model) powers the "You might also like" recommendations on platforms like Meta and Netflix. It maps users and items to embedding vectors stored in tables that can exceed `{python} dlrm_embedding_str` GB, making memory capacity rather than computation the binding constraint.
|
||||
The recommendation lighthouse, **DLRM**\index{DLRM!memory capacity bound}\index{recommendation systems!DLRM} (Deep Learning Recommendation Model), powers the "You might also like" recommendations on platforms like Meta and Netflix. It maps users and items to embedding vectors stored in tables that can exceed `{python} dlrm_embedding_str` GB, making memory capacity rather than computation the binding constraint.
|
||||
|
||||
**MobileNet**\index{MobileNet!depthwise separable convolutions}\index{MobileNet!efficiency gains} runs in smartphone camera apps for real-time photo categorization and on-device visual search. It performs the same image classification task as ResNet but uses depthwise separable convolutions to reduce computation by `{python} mobilenet_flops_reduction_str`$\times$, enabling real-time inference on smartphones at `{python} mobile_tdp_range_str` watts.
|
||||
The mobile lighthouse, **MobileNet**\index{MobileNet!depthwise separable convolutions}\index{MobileNet!efficiency gains}, runs in smartphone camera apps for real-time photo categorization and on-device visual search. It performs the same image classification task as ResNet but uses depthwise separable convolutions to reduce computation by `{python} mobilenet_flops_reduction_str`$\times$, enabling real-time inference on smartphones at `{python} mobile_tdp_range_str` watts.
|
||||
|
||||
**Keyword Spotting (KWS)**\index{Keyword Spotting (KWS)!TinyML archetype} represents the always-on TinyML archetype. Used in applications like Smart Doorbells, it detects wake words ("Ding Dong", "Hello") using a depthwise separable CNN with approximately `{python} kws_params_str` parameters (small variants; the DS-CNN benchmark in MLPerf Tiny uses ~200K) fitting in under `{python} kws_size_kb_str` KB, running continuously at under 1 milliwatt.
|
||||
The TinyML lighthouse, **Keyword Spotting (KWS)**\index{Keyword Spotting (KWS)!TinyML archetype}, represents the always-on sensing archetype. Used in applications like Smart Doorbells, it detects wake words ("Ding Dong", "Hello") using a depthwise separable CNN with approximately `{python} kws_params_str` parameters (small variants; the DS-CNN benchmark in MLPerf Tiny uses ~200K) fitting in under `{python} kws_size_kb_str` KB, running continuously at under 1 milliwatt.
|
||||
|
||||
The range in compute requirements (20 MFLOPs → 4 GFLOPs, a factor of 200) and memory (800 KB → 100 GB, a factor of more than 100,000) explains why no single deployment paradigm fits all workloads. A keyword spotter runs comfortably on a \$2 microcontroller; a recommendation system requires a warehouse-scale computer. These five Lighthouse Models will serve as concrete anchors throughout the book, each isolating a distinct system bottleneck that we will revisit in every chapter.
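
The spread is worth computing explicitly. The following sketch uses the approximate figures quoted in this section (they are not the canonical mlsys constants) to show how far apart the lightest and heaviest Lighthouse workloads sit.

```python
# Minimal sketch using the approximate figures quoted above (not canonical constants).
compute_flops = {"KWS": 20e6, "MobileNet": 300e6, "ResNet-50": 4e9}
memory_bytes  = {"KWS": 800e3, "ResNet-50": 98e6, "DLRM": 100e9}

compute_spread = max(compute_flops.values()) / min(compute_flops.values())
memory_spread  = max(memory_bytes.values()) / min(memory_bytes.values())

print(f"compute spread: {compute_spread:,.0f}x")   # 200x      (20 MFLOPs -> 4 GFLOPs)
print(f"memory spread:  {memory_spread:,.0f}x")    # 125,000x  (800 KB -> 100 GB)
```
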
@@ -857,26 +857,26 @@ With the analytical tools (Iron Law, Bottleneck Principle, Workload Archetypes)
|
||||
# │ lat_kws_str, lat_face_str, lat_gpt4_str, lat_train_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LatencyConstants:
|
||||
"""Namespace for Latency Constants."""
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
lat_compute_str = "~1 ns" # GPU matrix multiply (per op)
|
||||
lat_npu_str = "5–20 ms" # NPU inference (MobileNet)
|
||||
lat_llm_str = "20–100 ms" # LLM token generation
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
lat_l1_str = "~1 ns" # L1 cache hit
|
||||
lat_hbm_str = "20–50 ns" # HBM read (GPU)
|
||||
lat_dram_str = "50–100 ns" # DRAM read (mobile)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
lat_net_dc_str = "0.5 ms" # same datacenter
|
||||
lat_net_region_str = "1–5 ms" # same region
|
||||
lat_net_cross_str = "50–150 ms" # cross-region
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
lat_kws_str = "100 μs" # wake-word detection (TinyML)
|
||||
lat_face_str = "10–30 ms" # face detection (mobile)
|
||||
lat_gpt4_str = "200–500 ms" # GPT-4 first token
|
||||
@@ -979,18 +979,18 @@ The following worked example demonstrates how to apply this analysis quantitativ
|
||||
from mlsys.constants import RESNET50_FLOPs, RESNET50_PARAMS, GFLOPs, Mparam, byte, MB
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResnetSetup:
|
||||
"""Namespace for Resnet Setup."""
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
resnet_fp32_bytes_value = RESNET50_PARAMS.m_as('param') * 4 * byte # 4 bytes per FP32 param
|
||||
resnet_fp16_bytes_value = RESNET50_PARAMS.m_as('param') * 2 * byte # 2 bytes per FP16 param
|
||||
resnet_int8_bytes_value = RESNET50_PARAMS.m_as('param') * 1 * byte # 1 byte per INT8 param
|
||||
resnet_gflops_value = RESNET50_FLOPs.m_as(GFLOPs)
|
||||
resnet_params_m_value = RESNET50_PARAMS.m_as(Mparam)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
resnet_gflops_str = fmt(resnet_gflops_value, precision=1, commas=False) # e.g. "4.1" GFLOPs
|
||||
resnet_params_m_str = fmt(resnet_params_m_value, precision=1, commas=False) # e.g. "25.6" M
|
||||
resnet_fp32_mb_str = fmt(resnet_fp32_bytes_value.m_as(MB), precision=0, commas=False) # e.g. "102" MB
|
||||
@@ -1033,11 +1033,11 @@ from mlsys.constants import (
|
||||
from mlsys.formulas import calc_bottleneck
|
||||
from mlsys.formatting import sci, fmt, sci_latex, md_frac
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResnetCloud:
|
||||
"""Namespace for Resnet Cloud."""
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
h_a100 = Hardware.A100
|
||||
cloud_stats = calc_bottleneck(
|
||||
ops=RESNET50_FLOPs,
|
||||
@@ -1062,7 +1062,7 @@ class ResnetCloud:
|
||||
cloud_memory_frac = md_frac(resnet_fp16_bytes_latex, a100_bw_latex, f"{cloud_memory_ms_value:.3f}", "ms")
|
||||
cloud_ai_frac = md_frac(resnet_flops_latex, resnet_fp16_bytes_latex, f"{cloud_ai_value:.0f}", "FLOPs/byte")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
a100_tflops_str = fmt(a100_tflops_value, precision=0, commas=False) # e.g. "312" TFLOPS
|
||||
a100_bw_tbs_str = fmt(a100_bw_tbs_value, precision=0, commas=False) # e.g. "2" TB/s
|
||||
cloud_compute_ms_str = fmt(cloud_compute_ms_value, precision=3, commas=False)
|
||||
@@ -1110,11 +1110,11 @@ from mlsys.constants import (
|
||||
from mlsys.formulas import calc_bottleneck
|
||||
from mlsys.formatting import sci_latex, md_frac, fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResnetMobile:
|
||||
"""Namespace for Resnet Mobile."""
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
h_phone = Hardware.Edge.Generic_Phone
|
||||
m_resnet = Models.ResNet50
|
||||
h_a100 = Hardware.A100
|
||||
@@ -1143,7 +1143,7 @@ class ResnetMobile:
|
||||
mobile_compute_frac = md_frac(resnet_flops_latex, mobile_npu_flops_latex, f"{mobile_compute_ms_value:.2f}", "ms")
|
||||
mobile_memory_frac = md_frac(resnet_int8_bytes_latex, mobile_npu_bw_latex, f"{mobile_memory_ms_value:.2f}", "ms")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
mobile_tops_str = fmt(mobile_tops_value, precision=0, commas=False) # e.g. "10" TOPS
|
||||
mobile_bw_gbs_str = fmt(mobile_bw_gbs_value, precision=0, commas=False) # e.g. "50" GB/s
|
||||
mobile_ratio_x_str = fmt(mobile_ratio_x_value, precision=0, commas=False) # memory/compute ratio
|
||||
@@ -1238,30 +1238,30 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class HardwareSpectrumSetup:
|
||||
"""Namespace for Hardware Spectrum Setup."""
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tpu_chips_str = f"{TPU_POD_CHIPS:,}" # e.g. "4,096" chips
|
||||
cloud_mem_tb_str = fmt(TPU_POD_MEM.m_as(TB), precision=0, commas=False) # e.g. "131" TB
|
||||
cloud_pwr_mw_str = fmt(TPU_POD_POWER.m_as("megawatt"), precision=0, commas=False) # e.g. "4" MW
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
edge_mem_gb_str = fmt(DGX_RAM.m_as(GB), precision=0, commas=False) # e.g. "128" GB
|
||||
edge_stor_tb_str = fmt(DGX_STORAGE.m_as(TB), precision=0, commas=False) # e.g. "4" TB
|
||||
edge_pwr_w_str = fmt(DGX_POWER.m_as(watt), precision=0, commas=False) # e.g. "500" W
|
||||
edge_price_min_str = f"{DGX_PRICE_MIN.m_as(USD):,.0f}" # e.g. "3,000"
|
||||
edge_price_max_str = f"{DGX_PRICE_MAX.m_as(USD):,.0f}" # e.g. "5,000"
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tiny_ram_kb_str = fmt(ESP32_RAM.m_as(KiB), precision=0, commas=False) # e.g. "520" KB
|
||||
tiny_flash_mb_str = fmt(ESP32_FLASH.m_as(MB), precision=0, commas=False) # e.g. "4" MB
|
||||
tiny_pwr_min_str = f"{ESP32_POWER_MIN.m_as(watt)}" # e.g. "0.1" W
|
||||
tiny_pwr_max_str = f"{ESP32_POWER_MAX.m_as(watt)}" # e.g. "0.5" W
|
||||
tiny_price_str = f"{ESP32_PRICE.m_as(USD)}" # e.g. "10" USD
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cloud_thresh_tflops_str = "1000" # TFLOPS threshold for cloud
|
||||
cloud_thresh_bw_str = "100" # GB/s memory bandwidth
|
||||
edge_thresh_pflops_str = "1" # PFLOPS AI compute threshold
|
||||
@@ -1442,19 +1442,19 @@ from mlsys.constants import SPEED_OF_LIGHT_FIBER_KM_S
|
||||
from mlsys.formulas import calc_network_latency_ms
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DistancePenalty:
|
||||
"""Namespace for Distance Penalty."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
distance_km_value = 1500 # km to cloud datacenter
|
||||
safety_budget_ms_value = 10 # ms safety requirement
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
round_trip_ms_value = calc_network_latency_ms(distance_km_value)
|
||||
deficit_ms_value = safety_budget_ms_value - round_trip_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
sol_kms_str = f"{SPEED_OF_LIGHT_FIBER_KM_S.m_as('km/s'):,.0f}" # e.g. "200,000" km/s
|
||||
rtt_formatted_str = fmt(round_trip_ms_value, precision=0, commas=False) # e.g. "15" ms
|
||||
deficit_str = fmt(deficit_ms_value, precision=0, commas=False) # e.g. "-5" ms
|
||||
@@ -1544,14 +1544,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CloudEdgeTCO:
|
||||
"""
|
||||
Namespace for Cloud vs. Edge TCO comparison.
|
||||
Scenario: 1M req/day inference service cost analysis.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Scenario
|
||||
requests_per_day = 1_000_000
|
||||
inference_ms = 10
|
||||
@@ -1576,7 +1576,7 @@ class CloudEdgeTCO:
|
||||
devops_fte = 0.1
|
||||
devops_salary = 150000
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Cloud
|
||||
c_gpu = gpu_instances * HOURS_PER_YEAR * gpu_price_per_hr
|
||||
egress_gb_per_day = (requests_per_day * response_kb) / MIB_TO_BYTES
|
||||
@@ -1596,10 +1596,10 @@ class CloudEdgeTCO:
|
||||
edge_savings_pct = ((c_total - e_total) / c_total) * 100
|
||||
labor_pct = (e_labor / e_total) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(c_total >= e_total, f"Edge should be cheaper at 1M volume. Cloud=${c_total:.0f}, Edge=${e_total:.0f}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
requests_str = f"{requests_per_day/MILLION:.0f}M"
|
||||
inference_str = f"{inference_ms}ms"
|
||||
response_str = f"{response_kb}KB"
|
||||
@@ -1739,14 +1739,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class VoiceAssistantWall:
|
||||
"""
|
||||
Namespace for Voice Assistant Scaling logic.
|
||||
Scenario: 1 Billion devices, economics vs infrastructure limits.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Economics
|
||||
ww_devices_b = 1
|
||||
ww_cloud_cost_per_device = 0.50
|
||||
@@ -1764,7 +1764,7 @@ class VoiceAssistantWall:
|
||||
vi_waking_hours = 16
|
||||
vi_peak_multiplier = 3
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Economics
|
||||
ww_total_cloud_cost = ww_devices_b * BILLION * ww_cloud_cost_per_device
|
||||
|
||||
@@ -1782,10 +1782,10 @@ class VoiceAssistantWall:
|
||||
# Total audio bandwidth across 1B devices
|
||||
vi_total_audio_tb_per_sec = (vi_audio_bytes_per_sec * vi_devices_b * BILLION) / TRILLION
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(vi_datacenters_peak >= 20, f"Infrastructure wall ({vi_datacenters_peak:.0f} DCs) unexpectedly low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
# Economics Strings
|
||||
ww_devices_b_str = fmt(ww_devices_b, precision=0, commas=False)
|
||||
ww_cloud_cost_str = fmt(ww_cloud_cost_per_device, precision=2, commas=False)
|
||||
@@ -1874,7 +1874,7 @@ We define this paradigm formally as *Edge ML*.
|
||||
***Edge Machine Learning***\index{Edge ML!definition} is the deployment paradigm optimized for **Latency Determinism** and **Data Locality** by locating computation physically adjacent to data sources.
|
||||
|
||||
1. **Significance (Quantitative):** It circumvents the **Distance Penalty** ($L_{lat}$) of the cloud, trading elastic scale for a fixed **Local Compute Capacity** ($R_{peak}$).
|
||||
2. **Distinction (Durable):** Unlike **Cloud ML**, which prioritizes **Throughput**, Edge ML prioritizes **Determinism** and privacy. Unlike **TinyML**, Edge ML may still utilize workstation-class accelerators (GPGPUs).
|
||||
2. **Distinction (Durable):** Unlike **Cloud ML**, which prioritizes **Throughput**, Edge ML prioritizes **Determinism** and privacy. Unlike **TinyML**, Edge ML may still use workstation-class accelerators (GPGPUs).
|
||||
3. **Common Pitfall:** A frequent misconception is that Edge ML refers to a specific hardware class. In reality, it is a **Location Paradigm**: it spans from IoT gateways to on-premise servers, unified by physical proximity to the data source.
|
||||
|
||||
:::
|
||||
@@ -1977,14 +1977,14 @@ from mlsys.constants import (
|
||||
VIDEO_FPS_STANDARD, CLOUD_EGRESS_PER_GB, MB, GB, second, MILLION,
|
||||
)
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BandwidthBottleneck:
|
||||
"""
|
||||
Namespace for Bandwidth Bottleneck calculation.
|
||||
Scenario: 100 cameras at 1080p saturating a 10 Gbps link.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
num_cameras = 100
|
||||
fps = VIDEO_FPS_STANDARD
|
||||
width = VIDEO_1080P_WIDTH
|
||||
@@ -1992,7 +1992,7 @@ class BandwidthBottleneck:
|
||||
bpp = VIDEO_BYTES_PER_PIXEL_RGB
|
||||
network = Hardware.Networks.Ethernet_10G
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
bytes_per_frame = width * height * bpp
|
||||
bytes_per_sec_single = bytes_per_frame * fps
|
||||
|
||||
@@ -2004,11 +2004,11 @@ class BandwidthBottleneck:
|
||||
# Cost (using helper formula)
|
||||
monthly_cost = calc_monthly_egress_cost(total_bytes_per_sec, CLOUD_EGRESS_PER_GB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_bytes_per_sec > network_cap_bytes, f"Bandwidth ({total_bytes_per_sec}) fits within Network ({network_cap_bytes})! No bottleneck.")
|
||||
check(shortfall_ratio >= 2, f"Shortfall ({shortfall_ratio:.1f}x) is too small to be a 'crisis'.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
cam_rate_mbs_str = fmt(bytes_per_sec_single.m_as(MB/second), precision=0, commas=False)
|
||||
total_rate_gbs_str = fmt(total_bytes_per_sec.m_as(GB/second), precision=1, commas=False)
|
||||
monthly_cost_m_str = fmt(monthly_cost / MILLION, precision=1, commas=False)
|
||||
@@ -2180,14 +2180,14 @@ from mlsys.constants import GFLOPs, CLOUD_ELECTRICITY_PER_KWH, HOURS_PER_YEAR, T
|
||||
from mlsys.formulas import calc_fleet_tco
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EdgeSizing:
|
||||
"""
|
||||
Namespace for Edge Inference Sizing.
|
||||
Scenario: Hardware selection for retail chain (500 stores).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Scenario
|
||||
stores = 500
|
||||
cameras_per_store = 20
|
||||
@@ -2208,7 +2208,7 @@ class EdgeSizing:
|
||||
nuc_cost = 400
|
||||
years = 3
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Throughput
|
||||
inf_per_sec = cameras_per_store * fps
|
||||
# YOLOv8 Nano Inference FLOPs from Models Twin
|
||||
@@ -2225,13 +2225,13 @@ class EdgeSizing:
|
||||
coral_fleet_capex = coral_cost * stores
|
||||
coral_power_opex = coral_tco - coral_fleet_capex
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
if required_tflops > coral.peak_flops.m_as(TFLOPs/second):
|
||||
# Note: Coral is 4 TOPS (INT8). YOLO is FP32/INT8?
|
||||
# The original code used 4 TOPS vs 2 TFLOPS required.
|
||||
pass
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
stores_str = f"{stores}"
|
||||
cameras_per_store_str = f"{cameras_per_store}"
|
||||
fps_str = f"{fps}"
|
||||
@@ -2482,26 +2482,26 @@ from mlsys import Hardware
|
||||
from mlsys.constants import OBJECT_DETECTOR_POWER_W, ureg
|
||||
from mlsys.formatting import md_frac, fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BatteryTax:
|
||||
"""
|
||||
Namespace for Battery Tax calculation.
|
||||
Scenario: Always-on object detection draining a phone battery.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
phone = Hardware.Edge.Generic_Phone
|
||||
power_draw = OBJECT_DETECTOR_POWER_W
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
battery_wh = phone.battery_capacity.to(ureg.Wh)
|
||||
runtime_hours = (battery_wh / power_draw).to(ureg.hour)
|
||||
daily_budget_pct = (power_draw * runtime_hours) / battery_wh * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(runtime_hours.m_as(ureg.hour) <= 24, f"Always-on ML should drain battery fast, but got {runtime_hours:.1f} hours.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
runtime_str = fmt(runtime_hours.m_as(ureg.hour), precision=1, commas=False)
|
||||
pwr_w_str = fmt(power_draw.m_as(ureg.watt), precision=0, commas=False)
|
||||
batt_wh_str = fmt(battery_wh.m_as(ureg.Wh), precision=0, commas=False)
|
||||
@@ -2550,7 +2550,7 @@ The battery constraint limits total energy consumption over time. However, even
|
||||
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ThermalQuantCalc:
|
||||
"""Namespace for Thermal Quant Calc."""
|
||||
|
||||
@@ -2674,14 +2674,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class EnergyInference:
|
||||
"""
|
||||
Namespace for Energy Per Inference comparison.
|
||||
Scenario: Battery life across Cloud vs. Edge vs. TinyML paradigms.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
batt_energy_j = BATTERY_ENERGY_J
|
||||
|
||||
# Energy per inference (full-system estimates)
|
||||
@@ -2691,7 +2691,7 @@ class EnergyInference:
|
||||
e_mobilenet_j = 0.05 * ureg.joule # ~50 mJ mobile MobileNet
|
||||
e_kws_j = 0.00001 * ureg.joule # ~10 µJ TinyML keyword spotting
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Queries per full battery charge
|
||||
q_gpt4 = batt_energy_j / e_gpt4_j
|
||||
q_resnet_cloud = batt_energy_j / e_resnet_cloud_j
|
||||
@@ -2699,7 +2699,7 @@ class EnergyInference:
|
||||
q_mobilenet = batt_energy_j / e_mobilenet_j
|
||||
q_kws = batt_energy_j / e_kws_j
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
e_gpt4_str = "~1 kJ"
|
||||
e_resnet_cloud_str = "~10 J"
|
||||
e_resnet_edge_str = "~500 mJ"
|
||||
@@ -2835,7 +2835,7 @@ Beyond these technical constraints, operational challenges compound the difficul
|
||||
|
||||
\index{TinyML!wake-word detection} \index{TinyML!precision agriculture} \index{TinyML!medical wearables}TinyML succeeds across domains where ultra-low power, low per-node cost, and local processing enable applications that no other paradigm can sustain.
|
||||
|
||||
Wake-word detection is perhaps the most familiar consumer application of TinyML. These systems listen continuously at sub-milliwatt power consumption, processing audio streams locally and activating higher-power components only when a wake phrase is detected—a design that dramatically reduces average device power draw[^fn-wearable-always-on].
|
||||
Wake-word detection is the most familiar consumer application of TinyML. These systems listen continuously at sub-milliwatt power consumption, processing audio streams locally and activating higher-power components only when a wake phrase is detected—a design that dramatically reduces average device power draw[^fn-wearable-always-on].
|
||||
|
||||
Precision agriculture exploits TinyML's economic advantages where traditional solutions prove cost-prohibitive. Deployments can instrument thousands of monitoring points with multi-year battery operation, transmitting summaries instead of raw sensor streams to reduce connectivity costs.
|
||||
|
||||
@@ -2876,7 +2876,7 @@ Each paradigm emerged as a response to specific physical constraints: Cloud ML a
|
||||
# │ Exports: cloud_*_str, edge_*_str, mobile_*_str, tiny_*_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ParadigmsTable:
|
||||
"""Namespace for Paradigms Table."""
|
||||
|
||||
@@ -3246,11 +3246,11 @@ Successful deployment balances technical optimization against organizational cap
|
||||
|
||||
\index{Hybrid ML!train-serve split} \index{Hybrid ML!hierarchical processing} \index{Hybrid ML!progressive deployment}Three essential patterns address common integration challenges:
|
||||
|
||||
**Train-Serve Split**\index{train-serve split!economics}: Training occurs in the cloud while inference happens on edge, mobile, or tiny devices. This pattern uses cloud scale for training while benefiting from local inference latency and privacy. Training costs may reach millions of dollars for large models, while inference costs mere cents per query when deployed efficiently.[^fn-train-serve-cost-asymmetry]
|
||||
The **Train-Serve Split**\index{train-serve split!economics} places training in the cloud while inference happens on edge, mobile, or tiny devices. This pattern exploits cloud scale for training while benefiting from local inference latency and privacy. Training costs may reach millions of dollars for large models, while inference costs mere cents per query when deployed efficiently.[^fn-train-serve-cost-asymmetry]
|
||||
|
||||
**Hierarchical Processing**\index{hierarchical processing!data flow}: Data and intelligence flow between computational tiers. TinyML sensors perform basic anomaly detection, edge devices aggregate and analyze data from multiple sensors, and cloud systems handle complex analytics and model updates. Each tier handles tasks appropriate to its capabilities.
|
||||
In **Hierarchical Processing**\index{hierarchical processing!data flow}, data and intelligence flow between computational tiers. TinyML sensors perform basic anomaly detection, edge devices aggregate and analyze data from multiple sensors, and cloud systems handle complex analytics and model updates. Each tier handles tasks appropriate to its capabilities.
|
||||
|
||||
**Progressive Deployment**\index{progressive deployment!model compression}: Models are systematically compressed for deployment across tiers. A large cloud model becomes progressively optimized versions for edge servers, mobile devices, and tiny sensors. Amazon Alexa exemplifies this: wake-word detection uses <1 KB models consuming <1 mW, while complex natural language understanding requires GB+ models in cloud infrastructure.
|
||||
**Progressive Deployment**\index{progressive deployment!model compression} systematically compresses models for deployment across tiers. A large cloud model becomes progressively optimized versions for edge servers, mobile devices, and tiny sensors. Amazon Alexa exemplifies this pattern: wake-word detection uses <1 KB models consuming <1 mW, while complex natural language understanding requires GB+ models in cloud infrastructure.
|
||||
|
||||
[^fn-train-serve-cost-asymmetry]: **Train-Serve Cost Asymmetry**: Training is a one-time, compute-intensive search for model parameters, while inference is a single, cheap forward pass using those parameters. This creates the economic rationale for the split, as the massive fixed training cost is amortized over billions of subsequent low-cost inference queries. The resulting cost gap between a multi-million dollar training run and a sub-cent inference can exceed 1,000,000x. \index{Train-Serve Split!cost asymmetry}
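
The asymmetry in the footnote is easy to sanity-check with a back-of-the-envelope calculation. The sketch below uses assumed figures (the training cost, per-query cost, and lifetime query volume are placeholders, not measured numbers) to show both the fixed-versus-marginal cost gap and how amortization shrinks training's share of each query's cost.

```python
# Minimal sketch (assumed figures, not measured costs) of the train-serve
# cost asymmetry: a one-time training run amortized over many queries.
training_cost_usd  = 5_000_000       # assumed one-time training cost
inference_cost_usd = 0.001           # assumed marginal cost per query (~0.1 cent)
lifetime_queries   = 10_000_000_000  # assumed queries served over the model's life

cost_gap  = training_cost_usd / inference_cost_usd   # fixed vs. marginal cost gap
amortized = training_cost_usd / lifetime_queries     # training cost per query
total_per_query = amortized + inference_cost_usd

print(f"fixed/marginal cost gap ~ {cost_gap:,.0f}x")  # far beyond 1,000,000x
print(f"training adds ${amortized:.4f} per query "
      f"({amortized / total_per_query:.0%} of the per-query cost)")
```
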
@@ -3502,7 +3502,7 @@ A related misconception holds that moving computation closer to the user always
|
||||
|
||||
from mlsys.formatting import fmt, check, md_frac
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MobilePowerFallacyCalc:
|
||||
"""Namespace for Mobile Power Fallacy Calc."""
|
||||
|
||||
@@ -3560,7 +3560,7 @@ The difference is qualitative, not just quantitative. As @sec-ml-systems-tinyml-
|
||||
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TcoPitfallCalc:
|
||||
"""Namespace for Tco Pitfall Calc."""
|
||||
|
||||
@@ -3619,17 +3619,17 @@ Teams optimize per-unit resource consumption while ignoring operational overhead
|
||||
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AmdahlCameraCalc:
|
||||
"""Namespace for Amdahl Camera Calc."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cam_isp_ms_value = 100 # ms, ISP + auto-exposure
|
||||
cam_ml_ms_value = 60 # ms, ML scene classification
|
||||
cam_post_ms_value = 40 # ms, tone mapping + HDR merge
|
||||
cam_ml_speedup_value = 10 # 10× faster ML model
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cam_total_ms_value = cam_isp_ms_value + cam_ml_ms_value + cam_post_ms_value # 200 ms
|
||||
cam_ml_frac_value = cam_ml_ms_value / cam_total_ms_value # 0.30
|
||||
cam_non_ml_frac_value = 1 - cam_ml_frac_value # 0.70
|
||||
@@ -3641,7 +3641,7 @@ class AmdahlCameraCalc:
|
||||
cam_ml_optimized_ms_value +
|
||||
cam_post_ms_value) # 146 ms
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cam_isp_str = fmt(cam_isp_ms_value, precision=0, commas=False) # "100"
|
||||
cam_ml_str = fmt(cam_ml_ms_value, precision=0, commas=False) # "60"
|
||||
cam_post_str = fmt(cam_post_ms_value, precision=0, commas=False) # "40"
|
||||
@@ -3707,3 +3707,10 @@ Understanding *where* ML systems run provides the foundation for understanding *
|
||||
|
||||
::: {.quiz-end}
|
||||
:::
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:ml_systems")
|
||||
```
|
||||
|
||||
@@ -87,7 +87,7 @@ Consider what happens without orchestration. Day 1: "Build a diagnostic model fo
|
||||
|
||||
The model's accuracy was excellent. The team's machine learning skills were excellent. The failure was a *workflow* failure. A deployment constraint that should have shaped every decision from day one was discovered only after the work was done. The tablet's memory limit should have propagated *backward* to the first architecture meeting, constraining which models were even worth considering. Instead, the team optimized each component in isolation—data collection, architecture selection, training—and the integration failure appeared only when the pieces were assembled. This is the default outcome when ML development lacks systematic orchestration.
|
||||
|
||||
This chapter introduces the **ML Workflow**\index{Workflow!systematic framework}, an engineering framework that prevents such failures by making constraints explicit at each development stage and tracing how they propagate across stages. The Workflow marks a transition from model researcher to systems engineer. A researcher optimizes individual elements -- a better architecture, a cleaner dataset, a faster accelerator. A systems engineer orchestrates those elements into production systems that reliably deliver value. Why present this framework *before* the detailed technical chapters? Because understanding how the pieces fit together changes how you learn each piece. A data engineer who understands that preprocessing decisions constrain model architectures approaches data pipelines differently than one who treats data preparation as an isolated task. A model developer who knows the deployment target from day one makes different architecture choices than one optimizing accuracy in a vacuum. The Workflow provides the mental map that makes each subsequent chapter's contributions legible within the larger system.
|
||||
This chapter introduces the **ML Workflow**\index{Workflow!systematic framework}, an engineering framework that prevents such failures by making constraints explicit at each development stage and tracing how they propagate across stages. The Workflow marks a transition from model researcher to systems engineer. A researcher optimizes individual elements: a better architecture, a cleaner dataset, a faster accelerator. A systems engineer orchestrates those elements into production systems that reliably deliver value. Why present this framework *before* the detailed technical chapters? Because understanding how the pieces fit together changes how you learn each piece. A data engineer who understands that preprocessing decisions constrain model architectures approaches data pipelines differently than one who treats data preparation as an isolated task. A model developer who knows the deployment target from day one makes different architecture choices than one optimizing accuracy in a vacuum. The Workflow provides the mental map that makes each subsequent chapter's contributions legible within the larger system.
|
||||
|
||||
\index{CRISP-DM!origin and influence}
|
||||
The orchestration framework is what we call the *machine learning lifecycle*\index{Systems Thinking!principles in ML}—a structured, iterative process[^fn-crisp-dm-lifecycle] that guides the development, evaluation, and improvement of ML systems [@amershi2019software]. We define it formally:
|
||||
@@ -257,7 +257,7 @@ These proportions explain *why* data engineering capabilities often determine pr
|
||||
\index{Stage Interface Contracts!specification}
|
||||
The cost of late discovery follows an exponential pattern[^fn-boehm-cost-curve] that we formalize as the **Constraint Propagation Principle** in @sec-ml-workflow-integrating-systems-thinking-principles-24c0. Late-stage constraint discoveries create exponential cost escalation because violations must be corrected across multiple preceding stages. This exponential cost structure motivates the stage interface contracts in @tbl-stage-interface: validating outputs at each stage transition catches violations early when correction costs remain manageable.
|
||||
|
||||
[^fn-boehm-cost-curve]: **Exponential Cost Escalation**: Barry Boehm's 1981 *Software Engineering Economics* first quantified this pattern for traditional software, showing that defects found post-deployment cost up to 100$\times$ more to fix than those caught during requirements. ML systems exhibit even steeper escalation because late-discovered constraints invalidate learned model weights, not just code -- retraining has no counterpart in traditional software remediation, and the data engineering rework it triggers compounds across every preceding pipeline stage. \index{Boehm!cost escalation curve}
|
||||
[^fn-boehm-cost-curve]: **Exponential Cost Escalation**: Barry Boehm's 1981 *Software Engineering Economics* first quantified this pattern for traditional software, showing that defects found post-deployment cost up to 100$\times$ more to fix than those caught during requirements. ML systems exhibit even steeper escalation because late-discovered constraints invalidate learned model weights, not just code — retraining has no counterpart in traditional software remediation, and the data engineering rework it triggers compounds across every preceding pipeline stage. \index{Boehm!cost escalation curve}
|
||||
|
||||
| **Stage** | **Input Contract** | **Output Contract** | **Quality Invariant** |
|
||||
|:-----------------------|:-------------------------------------------|:---------------------------------------------------------------------------|:------------------------------------------------------------------------------|
|
||||
@@ -296,7 +296,7 @@ This compounding cost of slow iteration creates what we call the *iteration tax*
|
||||
# │ small_potential_iters_str, large_final_str, small_final_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class IterationTax:
|
||||
"""
|
||||
Namespace for Iteration Tax calculation.
|
||||
@@ -304,7 +304,7 @@ class IterationTax:
|
||||
over a fixed 6-month development window.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
weeks_total = 26
|
||||
hours_per_week = 168
|
||||
|
||||
@@ -319,7 +319,7 @@ class IterationTax:
|
||||
small_cycle_time_hours = 1 # 1 hour
|
||||
small_effective_iters = 100 # Realistic cap on useful experiments
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Large model: 1 iter/week -> 26 iters
|
||||
large_iters = weeks_total * (hours_per_week / large_cycle_time_hours)
|
||||
large_final_acc = min(large_start_acc + (large_iters * large_gain_per_iter), 99.0)
|
||||
@@ -328,11 +328,11 @@ class IterationTax:
|
||||
# We allow small model to reach same ceiling
|
||||
small_final_acc = min(small_start_acc + (small_effective_iters * small_gain_per_iter), 99.0)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
# Invariant: Small model must CATCH UP or BEAT Large model
|
||||
check(small_final_acc >= large_final_acc, f"Small model ({small_final_acc}%) failed to beat Large model ({large_final_acc}%) despite speed.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
weeks_str = fmt(weeks_total, precision=0, commas=False)
|
||||
hours_week_str = fmt(hours_per_week, precision=0, commas=False)
|
||||
|
||||
@@ -374,7 +374,7 @@ small_model_experiments_str = IterationTax.small_total_capacity_str
|
||||
1. **Large Model**: `{python} weeks_in_6mo_str` experiments at `{python} large_train_time_str` each. Each experiment improves accuracy by ~`{python} large_gain_str`% (diminishing returns).
|
||||
2. **Small Model**: `{python} weeks_in_6mo_str`$\times$ `{python} hours_per_week_str` = `{python} small_model_experiments_str` experiments at `{python} small_train_time_str` each. Even with smaller gains per iteration, the compound effect is substantial.
|
||||
|
||||
**The Systems Insight**: If each iteration improves accuracy by `{python} small_gain_str`% on average, the small model reaches: `{python} small_accuracy_str`% + (`{python} small_potential_iters_str`$\times$ `{python} small_gain_str`%) = `{python} small_final_str`% theoretical ceiling. The large model reaches: `{python} large_accuracy_str`% + (`{python} weeks_in_6mo_str`$\times$ `{python} large_gain_str`%) = `{python} large_final_str`% (capped at ceiling). Even assuming we utilize only a fraction of the theoretical capacity (e.g., 100 effective iterations out of thousands possible), the compound effect dominates. In practice, the small model's rapid iteration enables discovering better architectures, data augmentations, and hyperparameters.
|
||||
**The Systems Insight**: If each iteration improves accuracy by `{python} small_gain_str`% on average, the small model reaches: `{python} small_accuracy_str`% + (`{python} small_potential_iters_str`$\times$ `{python} small_gain_str`%) = `{python} small_final_str`% theoretical ceiling. The large model reaches: `{python} large_accuracy_str`% + (`{python} weeks_in_6mo_str`$\times$ `{python} large_gain_str`%) = `{python} large_final_str`% (capped at ceiling). Even assuming we use only a fraction of the theoretical capacity (e.g., 100 effective iterations out of thousands possible), the compound effect dominates. In practice, the small model's rapid iteration enables discovering better architectures, data augmentations, and hyperparameters.
|
||||
|
||||
**Conclusion**: **Iteration Velocity is a Feature.** A system that allows 10 experiments/day will almost always eventually outperform a system that allows 1 experiment/week, even if the latter starts with a better model. This "iteration tax" explains why startups with fast iteration often outperform larger teams with slower cycles. For our DR screening scenario, the lightweight model's rapid iteration cycle enables the team to experiment with data augmentations, preprocessing pipelines, and architecture variations far more quickly, ultimately converging on a more robust screening system despite starting at lower accuracy.
|
||||
:::
|
||||
@@ -422,7 +422,7 @@ This shift from code-centric to data-centric development erodes more than just p
|
||||
|
||||
ML workflows violate these abstractions at scale. A multi-terabyte dataset being randomly shuffled during every training epoch presents a "worst-case" workload for traditional file system buffers and virtual memory prefetchers. When every "instruction" (a sample) is fetched stochastically from a massive pool, the OS's predictive caching logic fails, and the system defaults to expensive disk I/O or network transfers. A systems engineer must acknowledge that the "Abstractions of the 1970s"—once designed to hide hardware latency—are often the primary sources of the **Overhead Term ($L_{lat}$)** in the **Iron Law** for Software 2.0. Bridging this gap requires the specialized data engineering and hardware-aware optimizations we examine in the following Parts.
|
||||
|
||||
These distinctions translate directly into the structured six-stage framework that organizes how ML projects unfold, each stage presenting unique challenges that traditional software methodologies cannot adequately address. Before moving to that framework, verify that you can articulate the differences just covered.
|
||||
These distinctions translate directly into the structured six-stage framework that organizes how ML projects unfold, each stage presenting unique challenges that traditional software methodologies cannot address. Verify that you can articulate the differences just covered before examining that framework.
|
||||
|
||||
::: {.callout-checkpoint title="ML vs. Traditional DevOps" collapse="false"}
|
||||
MLOps is not just DevOps for models. Ensure you grasp the key differences:
|
||||
@@ -499,15 +499,15 @@ from mlsys.constants import MOBILENETV2_PARAMS, MOBILENETV2_FLOPs, MB, MFLOPs, b
|
||||
|
||||
class MobilenetSpecs:
|
||||
"""MobileNetV2 model size and FLOPs from canonical constants."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# MobileNetV2: 3.4M params, 300M FLOPs (inference)
|
||||
# Assuming FP32 (4 bytes per param) for size estimate
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
mobilenet_size_mb = (MOBILENETV2_PARAMS * 4 * byte).m_as(MB)
|
||||
mobilenet_flops_m = MOBILENETV2_FLOPs.m_as(MFLOPs)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
mobilenet_size_str = f"{mobilenet_size_mb:.0f}" # e.g. "14" MB
|
||||
mobilenet_flops_str = f"{mobilenet_flops_m:.0f}" # e.g. "300" MFLOPs
|
||||
|
||||
@@ -554,9 +554,9 @@ The binding constraint differs dramatically across workload archetypes, causing
|
||||
|
||||
| **Stage** | **ResNet-50 (Compute Beast)** | **DLRM (Sparse Scatter)** | **Keyword Spotting (Tiny Constraint)** |
|
||||
|:-------------|:-----------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------|
|
||||
| **Data Eng** | *Throughput*: Target **> 80% GPU** utilization via prefetching and compiled augmentation | *Latency*: Feature store lookups **< 2ms**; embedding tables dominate storage costs | *Capacity*: Curate data to fit **256KB** RAM; aggressive filtering over accumulation |
|
||||
| **Data Eng** | *Throughput*: Target **> 80% GPU** utilization via prefetching and compiled augmentation | *Latency*: Feature store lookups **< 2 ms**; embedding tables dominate storage costs | *Capacity*: Curate data to fit **256 KB** RAM; aggressive filtering over accumulation |
|
||||
| **Training** | *Compute Bound*: Maximize Model FLOPs Utilization ($\eta$); mixed precision to saturate Tensor Cores | *I/O Bound*: Optimize sparse embedding lookups; memory bandwidth ($BW$) limits throughput | *Model Search*: Neural Architecture Search (NAS) for smallest architecture; quantization-aware training (QAT) required |
|
||||
| **Deploy** | *Batching*: Batch size **> 128** to maximize throughput; latency secondary to cost | *SLA*: Strict **< 10ms p99** latency; feature freshness requirements | *Energy*: **< 1mW** budget; always-on inference without battery drain |
|
||||
| **Deploy** | *Batching*: Batch size **> 128** to maximize throughput; latency secondary to cost | *SLA*: Strict **< 10 ms p99** latency; feature freshness requirements | *Energy*: **< 1 mW** budget; always-on inference without battery drain |
|
||||
|
||||
: **Workflow Variations by Lighthouse Model**: The same lifecycle stages target different Iron Law terms depending on the workload's binding constraint. ResNet-50 optimizes for Throughput ($O/s$); DLRM is bound by Memory Bandwidth ($D_{vol}/BW$); TinyML is strictly bound by Energy ($J$) and Memory Capacity. {#tbl-lighthouse-workflow-comparison}
|
||||
|
||||
@@ -606,14 +606,14 @@ from IPython.display import Markdown
|
||||
|
||||
class ConstraintPropagation:
|
||||
"""Exponential cost multiplier for late constraint discovery across lifecycle stages."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
stage_deployment = 5 # Deployment stage
|
||||
stage_definition = 1 # Problem Definition stage
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_factor = 2 ** (stage_deployment - stage_definition)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cost_factor_str = f"{cost_factor}" # e.g. "16" (2^4 = 16×)
|
||||
constraint_math = Markdown(
|
||||
f"$2^{{{stage_deployment - stage_definition}}} = {cost_factor}\\times$"
|
||||
@@ -647,16 +647,16 @@ constraint_math = ConstraintPropagation.constraint_math
|
||||
|
||||
**Time saved**: 2–4 iteration cycles avoided, approximately 8–16 weeks of rework prevented.
|
||||
|
||||
**The same pattern applies to MobileNetV2**: if Problem Definition specifies "mobile deployment" without the specific constraints established earlier (model size and FLOP budget), the team might develop a 200 MB ResNet-50 variant that achieves state-of-the-art accuracy, only to discover at Deployment that it violates every mobile constraint.
|
||||
**The same pattern applies to MobileNetV2**: if Problem Definition specifies "mobile deployment" without the specific constraints established earlier (model size and FLOP budget), the team might develop a 200 MB ResNet-50 variant optimized for accuracy, only to discover at Deployment that it violates every mobile constraint.
|
||||
|
||||
:::
|
||||
|
||||
With the DR case study providing concrete context and the Stage Interface Specification establishing formal contracts, we now examine each lifecycle stage in detail.
|
||||
The DR case study and Stage Interface Specification provide the concrete context and formal contracts that ground each lifecycle stage. The first stage — Problem Definition — determines every constraint that subsequent stages must satisfy.
|
||||
|
||||
## Problem Definition {#sec-ml-workflow-problem-definition-stage-5974}
|
||||
|
||||
\index{Problem Definition!ML vs traditional}
|
||||
Machine learning system development begins with a challenge distinct from traditional software development: define not just *what* the system should do, but *how* it should learn to do it. Conventional software requirements translate directly into implementation rules, while ML systems require teams to consider *how* the system will learn from data while operating within real-world constraints. This first stage—the leftmost box in @fig-lifecycle-overview—lays the foundation for all subsequent phases in the ML lifecycle.
|
||||
A product manager writes: "Build a model that detects diabetic retinopathy." That single sentence conceals a dozen engineering decisions. What sensitivity threshold protects patient safety? What hardware runs in a rural clinic? What latency keeps a clinician engaged? What regulatory framework governs approval? In traditional software, requirements translate directly into implementation rules. In ML systems, defining *what* the system should do is inseparable from defining *how* it will learn to do it — and the physical constraints under which it must operate. This first stage, the leftmost box in @fig-lifecycle-overview, lays the foundation for all subsequent phases in the ML lifecycle.
|
||||
|
||||
\index{Problem Definition!multi-constraint optimization}
|
||||
The DR screening case makes this concrete. What appears to be a straightforward classification task (detect disease in retinal photographs) actually requires balancing five competing constraints: diagnostic accuracy (patient safety), computational efficiency (rural clinic hardware), workflow integration (clinical adoption), regulatory compliance (FDA approval), and cost-effectiveness (sustainable deployment in resource-limited settings). Each constraint tightens the feasible design space for the others: pursuing higher accuracy through larger models conflicts with the hardware budget; achieving regulatory compliance demands annotation protocols that increase data collection costs. This multi-constraint optimization problem has no analogue in traditional software development.
|
||||
@@ -712,14 +712,14 @@ High-resolution retinal scans can generate tens of megabytes per image, creating
|
||||
# │ bw_summary_kb_str, bw_reduction_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BandwidthCompute:
|
||||
"""
|
||||
Namespace for Bandwidth vs Compute calculation.
|
||||
Scenario: Rural clinic with 2Mbps uplink trying to upload raw retinal scans.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
patients_day = 150 # Increased from 100 to ensure bandwidth saturation
|
||||
photos_per_patient = 10
|
||||
mb_per_photo = 5.0
|
||||
@@ -727,7 +727,7 @@ class BandwidthCompute:
|
||||
uplink_mbps = 2.0
|
||||
summary_kb = 10.0 # Size of edge-processed result
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Data Volume
|
||||
daily_mb = patients_day * photos_per_patient * mb_per_photo
|
||||
daily_gb = (daily_mb * MB).m_as(GB)
|
||||
@@ -745,11 +745,11 @@ class BandwidthCompute:
|
||||
summary_total_kb = patients_day * summary_kb
|
||||
reduction_factor = original_kb / summary_total_kb
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(upload_hours >= clinic_hours, f"Upload fits in clinic day ({upload_hours:.1f}h < {clinic_hours}h). Edge not required.")
|
||||
check(reduction_factor >= 1000, "Edge compression ratio too small to justify complexity.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
bw_patients_str = fmt(patients_day, precision=0, commas=False)
|
||||
bw_photos_str = fmt(photos_per_patient, precision=0, commas=False)
|
||||
bw_mb_per_photo_str = fmt(mb_per_photo, precision=0, commas=False)
|
||||
@@ -830,15 +830,15 @@ from mlsys.constants import STORAGE_COST_NVME_LOW, STORAGE_COST_S3_STD, USD, GB,
|
||||
|
||||
class StorageCosts:
|
||||
"""Hot vs. cold storage cost comparison for tiered storage footnote."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# NVMe SSD: ~$0.10/GB/month (AWS EBS gp2 baseline)
|
||||
# S3 Standard: ~$0.023/GB/month
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cost_nvme_gb_mo = STORAGE_COST_NVME_LOW.m_as(USD / GB / ureg.month)
|
||||
cost_s3_gb_mo = STORAGE_COST_S3_STD.m_as(USD / GB / ureg.month)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cost_nvme_str = f"{cost_nvme_gb_mo:.2f}" # e.g. "0.10" $/GB/mo
|
||||
cost_s3_str = f"{cost_s3_gb_mo:.3f}" # e.g. "0.023" $/GB/mo
|
||||
|
||||
@@ -930,7 +930,7 @@ These feedback pathways reinforce a central point: data collection does not end
|
||||
|
||||
## Model Development {#sec-ml-workflow-model-development-training-stage-d901}
|
||||
|
||||
With validated datasets and preprocessing pipelines in place, the workflow advances to model creation. In Iron Law terms, this stage defines the Operations ($O$) term: architectural choices set the computational floor that hardware must sustain. Model development and training form the core of machine learning systems, yet the challenges extend well beyond selecting algorithms and tuning hyperparameters[^fn-hyperparameter-search-cost]. @sec-model-training covers the training methodologies, infrastructure requirements, and distributed training strategies in detail. In high-stakes domains like healthcare, every design decision affects clinical outcomes, so technical performance and operational constraints must be integrated from the start.
|
||||
The DR team has 128,000 labeled retinal images, a validated preprocessing pipeline, and a target: >90% sensitivity on edge hardware with <50 ms inference latency. The question is no longer *what data* to collect but *what model* to build — and that question has no answer independent of the deployment constraints already established. In Iron Law terms, this stage defines the Operations ($O$) term: architectural choices set the computational floor that hardware must sustain. The challenges extend well beyond selecting algorithms and tuning hyperparameters[^fn-hyperparameter-search-cost]. @sec-model-training covers the training methodologies, infrastructure requirements, and distributed training strategies in detail. In high-stakes domains like healthcare, every design decision affects clinical outcomes, so technical performance and operational constraints must be integrated from the start.
[^fn-hyperparameter-search-cost]: **Hyperparameter**: These architectural and optimizer choices (e.g., learning rate, network depth) directly define the computational operations ($O$) for each training run. Because each combination requires a full and independent training run, the search for an optimal configuration incurs a multiplicative, not additive, cost. A naive grid search over just five hyperparameters with four values each requires 1,024 ($4^5$) complete training experiments, making it economically infeasible. \index{Hyperparameter!search cost}
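To make the multiplicative cost tangible, the short sketch below enumerates a hypothetical five-dimensional search space; the per-run cost and the random-search budget are assumed figures for illustration, not values taken from the chapter.

```python
# Back-of-envelope sketch (assumed costs): hyperparameter search cost is
# multiplicative in the number of options per dimension.
from itertools import product

search_space = {                      # hypothetical search space
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size":    [32, 64, 128, 256],
    "depth":         [18, 34, 50, 101],
    "dropout":       [0.0, 0.1, 0.2, 0.3],
    "weight_decay":  [0.0, 1e-5, 1e-4, 1e-3],
}
grid = list(product(*search_space.values()))
cost_per_run_usd = 500                # assumed cost of one full training run

print(f"grid search runs:   {len(grid)}")                         # 4^5 = 1024
print(f"grid search cost:   ${len(grid) * cost_per_run_usd:,}")
random_budget = 60                    # assumed random-search budget
print(f"random search cost: ${random_budget * cost_per_run_usd:,}")
```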
@@ -946,7 +946,7 @@ Using transfer learning combined with a meticulously labeled dataset of 128,000
|
||||
Achieving high accuracy is only the first challenge. Edge deployment constraints impose strict efficiency requirements: models may need to fit within tens to hundreds of megabytes, complete inference in tens of milliseconds, and operate within tight memory budgets.
|
||||
|
||||
\index{Ensemble Learning!accuracy vs deployment trade-off}
From a workflow perspective, accuracy gains must always be weighed against deployment feasibility. Ensemble learning[^fn-ensemble-deployment-cost] illustrates this trade-off clearly: combining predictions from multiple models often yields better performance than any individual model, but at the cost of multiplied inference time and memory usage. Common ensemble methods include bagging (training multiple models on different data subsets), boosting (sequentially training models to correct previous errors), and stacking (using a meta-model to combine base model predictions). Winning entries in ML competitions[^fn-competition-production-gap] typically ensemble 10 to 50 models, achieving impressive accuracy that proves difficult to deploy under real-world latency and memory constraints.
From a workflow perspective, accuracy gains must always be weighed against deployment feasibility. Ensemble learning[^fn-ensemble-deployment-cost] illustrates this trade-off: combining predictions from multiple models often yields better performance than any individual model, but at the cost of multiplied inference time and memory usage. Common ensemble methods include bagging (training multiple models on different data subsets), boosting (sequentially training models to correct previous errors), and stacking (using a meta-model to combine base model predictions). Winning entries in ML competitions[^fn-competition-production-gap] typically ensemble 10 to 50 models, achieving impressive accuracy that proves difficult to deploy under real-world latency and memory constraints.
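A back-of-envelope sketch shows how quickly an ensemble exhausts a deployment budget. The latency, size, and budget numbers below are assumptions chosen to echo the edge constraints discussed above, not measured values.

```python
# Minimal sketch (assumed, illustrative numbers): ensembling multiplies the serving
# footprint even when each member model is individually deployable.
single_latency_ms = 40       # assumed per-model inference latency on clinic hardware
single_size_mb    = 25       # assumed per-model size after quantization
latency_slo_ms    = 50
memory_budget_mb  = 200

for n_models in (1, 5, 10, 50):
    # Sequential execution on one accelerator: latency and memory both scale with N.
    latency_ms = n_models * single_latency_ms
    size_mb    = n_models * single_size_mb
    ok = latency_ms <= latency_slo_ms and size_mb <= memory_budget_mb
    print(f"{n_models:>2} models: {latency_ms:>5} ms, {size_mb:>5} MB "
          f"-> {'fits' if ok else 'violates deployment constraints'}")
```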
[^fn-competition-production-gap]: **Competition-Production Gap**: The Netflix Prize (2006--2009) is the canonical example: the winning BellKor ensemble improved RMSE by 10.06% over Netflix's baseline and earned a $\$$1M prize, but Netflix never deployed it because the engineering complexity of serving 800+ constituent models exceeded the business value of the accuracy gain. Netflix engineers later found that simpler models plus better data infrastructure delivered more production value, validating this chapter's thesis that iteration velocity and deployment feasibility outweigh isolated accuracy optimization. \index{Netflix Prize!production gap}
|
||||
|
||||
@@ -985,7 +985,7 @@ The ensemble trade-off illustrates a broader pattern: choosing an ensemble of li
|
||||
|
||||
Real-world constraints shape model development from initial exploration through final optimization, demanding systematic experimentation. Development begins when data scientists collaborate with domain experts — ophthalmologists in the DR case — to identify characteristics indicative of target conditions. An ophthalmologist knows that microaneurysms smaller than 125 micrometers are the earliest sign of retinopathy; without that domain knowledge, a model architect might choose a resolution or receptive field that makes these features invisible to the network. This interdisciplinary approach ensures that model architectures capture clinically relevant features while respecting the computational constraints identified during data collection.
|
||||
|
||||
Computational constraints profoundly shape experimental approaches. Production ML workflows create multiplicative costs: multiple model variants, multiple hyperparameter sweeps, and multiple preprocessing approaches can quickly translate into on the order of \(10^2\) training runs. When each run costs hundreds to thousands of dollars in compute, iteration costs can reach six figures per experiment cycle. This economic reality drives investments in efficient experimentation — better job scheduling, caching of intermediate results, early stopping, and automated resource optimization. Systematic hyperparameter optimization dramatically reduces computational costs compared to exhaustive search; @sec-model-training presents techniques that can substantially reduce experiment counts while achieving comparable or better results. Teams that invest in optimization infrastructure early recover the investment within the first few experiment cycles.
Computational constraints profoundly shape experimental approaches. Production ML workflows create multiplicative costs: multiple model variants, multiple hyperparameter sweeps, and multiple preprocessing approaches can quickly translate into on the order of \(10^2\) training runs. When each run costs hundreds to thousands of dollars in compute, iteration costs can reach six figures per experiment cycle. This economic reality drives investments in efficient experimentation — better job scheduling, caching of intermediate results, early stopping, and automated resource optimization. Systematic hyperparameter optimization dramatically reduces computational costs compared to exhaustive search; @sec-model-training presents techniques that can reduce experiment counts by 10--100$\times$ while achieving comparable or better results. Teams that invest in optimization infrastructure early recover the investment within the first few experiment cycles.
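As a rough illustration of where that investment pays off, the sketch below estimates the per-cycle savings from an early-stopping scheduler; the run count, per-run cost, and pruning fractions are assumptions, not benchmarks.

```python
# Illustrative estimate (assumed fractions): how early stopping changes the cost of
# an experiment cycle with on the order of 10^2 training runs.
runs_per_cycle   = 120
cost_per_run_usd = 800           # assumed cost of one full-length training run
full_cost = runs_per_cycle * cost_per_run_usd

stopped_fraction = 0.7           # assumed: runs pruned early by a scheduler such as ASHA
stopped_at       = 0.2           # pruned runs consume ~20% of a full run's compute
early_cost = runs_per_cycle * cost_per_run_usd * (
    stopped_fraction * stopped_at + (1 - stopped_fraction) * 1.0
)

print(f"exhaustive cycle cost: ${full_cost:,.0f}")
print(f"with early stopping:   ${early_cost:,.0f}")
print(f"savings per cycle:     {100 * (1 - early_cost / full_cost):.0f}%")
```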
\index{Ablation Studies!definition}
|
||||
\index{A/B Testing!ML deployment}
|
||||
@@ -1106,7 +1106,7 @@ Evaluation and validation address different questions. Evaluation measures model
|
||||
|
||||
Effective evaluation begins with metrics that align with problem definition objectives. For our DR screening system, standard classification metrics like accuracy prove insufficient. Clinical requirements demand specific sensitivity and specificity thresholds: sensitivity above 90% ensures few cases of disease-causing retinopathy are missed, while specificity above 80% prevents overwhelming referral systems with false positives.
|
||||
|
||||
Beyond aggregate metrics, stratified evaluation reveals performance variations across patient subgroups. A model achieving 94% overall accuracy might show significantly lower performance for patients with specific comorbidities, particular age groups, or images captured under certain lighting conditions. These disparities, invisible in aggregate metrics, become critical in production where every patient deserves reliable predictions. @sec-benchmarking provides systematic treatment of these evaluation methodologies.
Beyond aggregate metrics, stratified evaluation reveals performance variations across patient subgroups. A model achieving 94% overall accuracy might drop below 80% for patients with specific comorbidities, particular age groups, or images captured under certain lighting conditions. These disparities, invisible in aggregate metrics, become critical in production where every patient deserves reliable predictions. @sec-benchmarking provides systematic treatment of these evaluation methodologies.
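The computation itself is simple; what matters is performing it per subgroup rather than in aggregate. A minimal sketch with synthetic records follows (the subgroups and outcomes are made up for illustration).

```python
# Sketch (synthetic records): stratified sensitivity, the per-subgroup check that
# aggregate accuracy hides.
from collections import defaultdict

# (subgroup, ground_truth_positive, predicted_positive) -- hypothetical evaluation records
records = [
    ("age<60",  True, True),  ("age<60",  True, True),  ("age<60",  False, False),
    ("age>=60", True, False), ("age>=60", True, True),  ("age>=60", False, False),
    ("age>=60", True, False),
]

tp, fn = defaultdict(int), defaultdict(int)
for group, is_positive, predicted_positive in records:
    if is_positive:
        if predicted_positive:
            tp[group] += 1
        else:
            fn[group] += 1

for group in sorted(tp.keys() | fn.keys()):
    sensitivity = tp[group] / (tp[group] + fn[group])
    flag = "" if sensitivity >= 0.90 else "  <-- below the 90% clinical target"
    print(f"{group:>8}: sensitivity = {sensitivity:.2f}{flag}")
```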
\index{Calibration!definition and etymology}
Evaluation must also address calibration[^fn-calibration-clinical-trust]: when the model predicts 80% confidence, does the prediction prove correct 80% of the time? Poorly calibrated models undermine clinical trust even when accuracy metrics appear strong. Clinicians relying on confidence scores for triage decisions need those scores to reflect true uncertainty.
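Expected calibration error (ECE) is one standard way to quantify this gap between stated confidence and empirical accuracy. The sketch below uses synthetic predictions; the confidence values are illustrative, not model outputs.

```python
# Sketch (synthetic predictions): expected calibration error -- does a stated 80%
# confidence correspond to roughly 80% empirical accuracy?
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece, total = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model outputs: systematically overconfident around 0.8
confidences = [0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.6, 0.6]
correct     = [1,   0,   1,   0,   1,   1,   1,   0]
print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```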
@@ -1181,14 +1181,14 @@ These requirements influence deployment strategies. The edge deployment decision
|
||||
# │ edge_capex_str, edge_maintenance_str, payback_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DeploymentEconomics:
|
||||
"""
|
||||
Namespace for Cloud vs Edge Deployment Economics.
|
||||
Scenario: 500 clinics processing 1M images/month total.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
n_clinics = 500
|
||||
patients_day = 50
|
||||
days_year = 365
|
||||
@@ -1204,7 +1204,7 @@ class DeploymentEconomics:
|
||||
edge_inf_cost = 0.001 # $/image (Electricity)
|
||||
edge_latency_ms = 50
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Volume
|
||||
total_images_year = n_clinics * patients_day * days_year
|
||||
|
||||
@@ -1221,13 +1221,13 @@ class DeploymentEconomics:
|
||||
annual_savings = cloud_total_year - edge_opex_year
|
||||
payback_years = edge_capex / annual_savings
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(payback_years <= 3.0, f"Payback period ({payback_years:.1f} years) is too long to justify Edge CapEx.")
|
||||
if edge_capex < cloud_total_year:
|
||||
# Edge should be expensive upfront but cheap later
|
||||
pass
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
n_clinics_str = f"{n_clinics:,}"
|
||||
patients_per_day_str = fmt(patients_day, precision=0, commas=False)
|
||||
days_per_year_str = fmt(days_year, precision=0, commas=False)
|
||||
@@ -1293,9 +1293,9 @@ Integration with existing systems poses additional challenges. The ML system mus
|
||||
|
||||
Deployment proceeds through phases that progressively expose the system to real-world complexity, because each phase catches different failure modes. Simulated environments catch integration issues before any real users are affected. Pilot sites reveal real-world variability invisible in simulation: equipment differences, operator skill levels, patient population diversity. Full deployment exposes scale effects that pilot sites cannot replicate: network contention, storage bottlenecks, and rare edge cases that appear only at volume.
|
||||
|
||||
Scaling across multiple sites compounds these challenges. Each clinic presents unique constraints -- different imaging equipment, varying network reliability, diverse operator expertise levels, and distinct workflow patterns -- creating data quality inconsistencies that force preprocessing adjustments no pilot could have anticipated. The deployment paradigm itself constrains solutions: edge deployment minimizes latency but imposes strict model complexity limits, while cloud deployment enables flexibility but introduces network latency that may violate clinical workflow requirements.
|
||||
Scaling across multiple sites compounds these challenges. Each clinic presents unique constraints — different imaging equipment, varying network reliability, diverse operator expertise levels, and distinct workflow patterns — creating data quality inconsistencies that force preprocessing adjustments no pilot could have anticipated. The deployment paradigm itself constrains solutions: edge deployment minimizes latency but imposes strict model complexity limits, while cloud deployment enables flexibility but introduces network latency that may violate clinical workflow requirements.
|
||||
|
||||
Successful deployment requires more than technical optimization. Clinician feedback often reveals that initial interfaces need significant redesign before adoption. User trust and proficiency matter as much as algorithmic performance. Reliability mechanisms -- automated image quality checks, fallback workflows for errors, stress testing for peak volumes -- keep systems operating robustly across conditions.
|
||||
Successful deployment requires more than technical optimization. Clinician feedback often reveals that initial interfaces need significant redesign before adoption. User trust and proficiency matter as much as algorithmic performance. Reliability mechanisms — automated image quality checks, fallback workflows for errors, stress testing for peak volumes — keep systems operating robustly across conditions.
|
||||
|
||||
Managing improvements across distributed deployments requires centralized version control and automated update pipelines. Deployment feedback — usability concerns, performance regressions, integration surprises — shapes the monitoring strategies that keep the system healthy over time. Deployment is not an endpoint but a transition into continuous operations, where the system's behavior must be watched as carefully as any patient it screens.
|
||||
|
||||
@@ -1335,11 +1335,11 @@ A DR screening system — where missed diagnoses cause blindness — demands rea
|
||||
|
||||
class MonitoringThresholds:
|
||||
"""Production monitoring targets and alert thresholds for the DR screening system."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# Sensitivity/specificity targets from Problem Definition stage
|
||||
# Latency targets from deployment paradigm constraints
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
sens_target = 90 # target sensitivity %
|
||||
spec_target = 80 # target specificity %
|
||||
sens_alert = 88 # alert threshold (2% margin)
|
||||
@@ -1360,7 +1360,7 @@ A production DR system tracks several categories of metrics across a hierarchy d
|
||||
|
||||
- **Model performance metrics** (requiring ground truth, available with delay): sensitivity (target >`{python} sens_target`%, alert if 7-day rolling average drops below `{python} sens_alert`%), specificity (target >`{python} spec_target`%, alert if drops below `{python} spec_alert`%), and subgroup performance (alert if any demographic drops >5% below baseline).
|
||||
- **Proxy metrics** (available immediately, without ground truth): prediction confidence distribution (alert if mean confidence drops >10%), referral rate (alert if rate changes >15% from baseline), and image quality rejection rate (alert if >20% of images fail quality checks).
|
||||
- **Operational metrics**: inference latency (P95 <`{python} latency_p95_target`ms, alert if >`{python} latency_alert`ms), throughput (alert if queue depth >50 images), and error rate (alert if >0.1% of requests fail).
|
||||
- **Operational metrics**: inference latency (P95 <`{python} latency_p95_target` ms, alert if >`{python} latency_alert` ms), throughput (alert if queue depth >50 images), and error rate (alert if >0.1% of requests fail).
|
||||
- **Data drift detection**: Population Stability Index (PSI >0.2 indicates significant drift) and feature distribution changes (Kolmogorov-Smirnov test, alert if p<0.01).
The hierarchy matters: operational metrics catch immediate problems (seconds), proxy metrics catch model issues without waiting for ground truth (hours), and performance metrics catch accuracy degradation requiring labeled data (weeks).
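For reference, the PSI alert in the list above reduces to a few lines of arithmetic. The sketch below uses assumed histogram fractions purely to illustrate the calculation and the conventional 0.2 threshold.

```python
# Sketch (synthetic distributions): Population Stability Index over a binned feature.
import math

def psi(expected_fractions, observed_fractions, eps=1e-6):
    """PSI = sum over bins of (observed - expected) * ln(observed / expected)."""
    total = 0.0
    for e, o in zip(expected_fractions, observed_fractions):
        e, o = max(e, eps), max(o, eps)   # guard against empty bins
        total += (o - e) * math.log(o / e)
    return total

baseline   = [0.10, 0.20, 0.40, 0.20, 0.10]   # training-time feature histogram (assumed)
production = [0.05, 0.10, 0.30, 0.30, 0.25]   # this week's production histogram (assumed)
value = psi(baseline, production)
print(f"PSI = {value:.3f} -> {'ALERT: significant drift' if value > 0.2 else 'stable'}")
```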
@@ -1376,7 +1376,7 @@ Scaling from pilot sites to hundreds of clinics causes monitoring complexity to
[^fn-data-lineage-audit]: **Data Lineage**: The automated recording of metadata linking each clinic's production logs to the exact data, code, and model version that generated them. Without this explicit trail, correlating a site-specific accuracy drop with a training experiment requires a manual forensic analysis across hundreds of gigabytes of logs, turning a minutes-long metadata query into a multi-week engineering task. \index{Data Lineage!audit trail}
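Concretely, lineage capture amounts to writing a small structured record alongside every prediction. The field names and identifiers in this sketch are hypothetical; the point is that the model version, code commit, and data snapshot travel with each log entry.

```python
# Minimal sketch (hypothetical fields): the lineage record that makes a site-specific
# accuracy drop traceable to the exact model, code, and data that produced it.
import datetime
import hashlib
import json

def lineage_record(clinic_id, image_bytes, model_version, code_commit, dataset_snapshot):
    return {
        "timestamp":        datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "clinic_id":        clinic_id,
        "input_sha256":     hashlib.sha256(image_bytes).hexdigest(),
        "model_version":    model_version,       # registry tag of the deployed weights
        "code_commit":      code_commit,         # git SHA of the serving code
        "dataset_snapshot": dataset_snapshot,    # identifier of the training-data version
    }

record = lineage_record("clinic-041", b"...jpeg bytes...",
                        "dr-screen-v2.3", "9f1c2ab", "retina-2024-07")
print(json.dumps(record, indent=2))
```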
Proactive maintenance closes the lifecycle loop: predictive models identify potential problems from operational patterns, continuous learning pipelines retrain on new data, and production insights feed back to refine problem definitions, data quality standards, and architectural decisions. The patterns underlying these dynamics -- why constraints propagate, why feedback operates at multiple timescales, why system-level behavior diverges from component-level behavior -- are the subject of the next section.
|
||||
Proactive maintenance closes the lifecycle loop: predictive models identify potential problems from operational patterns, continuous learning pipelines retrain on new data, and production insights feed back to refine problem definitions, data quality standards, and architectural decisions. The patterns underlying these dynamics — why constraints propagate, why feedback operates at multiple timescales, why system-level behavior diverges from component-level behavior — are the subject of the next section.
|
||||
|
||||
## Systems Thinking {#sec-ml-workflow-integrating-systems-thinking-principles-24c0}
|
||||
|
||||
@@ -1429,7 +1429,7 @@ A team discovers during monitoring (Stage 6) that their DR model fails for patie
|
||||
*The Lesson: Define demographic and deployment constraints before collecting data.*
|
||||
:::
|
||||
|
||||
With these principles established, we can now identify the most common ways teams violate them.
|
||||
These principles predict specific failure modes. The following fallacies and pitfalls capture the most common ways teams violate them.
|
||||
|
||||
## Fallacies and Pitfalls {#sec-ml-workflow-fallacies-pitfalls-4d91}
|
||||
|
||||
@@ -1437,7 +1437,7 @@ ML workflows introduce counterintuitive complexities that lead teams to apply fa
|
||||
|
||||
**Fallacy:** *ML development can follow traditional software workflows without modification.*
|
||||
|
||||
Engineers assume waterfall or standard agile processes will work for ML projects. In production, ML replaces deterministic specifications with probabilistic optimization, static behavior with dynamic adaptation, and isolated development with continuous feedback loops (@tbl-sw-ml-cycles). Traditional approaches treat requirements as fixed and testing as binary pass/fail, but ML systems require iterative experimentation where problem definitions evolve through exploration. Industry estimates suggest ML projects fail at several$\times$ the rate of traditional software, with a majority never reaching deployment. Projects forced into rigid phase gates miss the 4–8 iteration cycles that production-ready systems require. Organizations that adapt workflows to accommodate ML's experimental nature have reported significantly shorter time-to-deployment.
|
||||
Engineers assume waterfall or standard agile processes will work for ML projects. In production, ML replaces deterministic specifications with probabilistic optimization, static behavior with dynamic adaptation, and isolated development with continuous feedback loops (@tbl-sw-ml-cycles). Traditional approaches treat requirements as fixed and testing as binary pass/fail, but ML systems require iterative experimentation where problem definitions evolve through exploration. Industry surveys report that 60--80% of ML projects never reach production deployment. Projects forced into rigid phase gates miss the 4–8 iteration cycles that production-ready systems require. Organizations that adapt workflows to accommodate ML's experimental nature report 2--3$\times$ shorter time-to-deployment.
|
||||
|
||||
**Pitfall:** *Treating data preparation as a one-time preprocessing step.*
|
||||
|
||||
@@ -1453,7 +1453,7 @@ Teams assume that scaling dataset size is the most reliable path to accuracy gai
|
||||
|
||||
**Pitfall:** *Skipping validation stages to accelerate timelines.*
|
||||
|
||||
Teams assume cutting validation time ships faster. In production, the multi-stage validation process exists because each stage catches different failure modes (@sec-ml-workflow-evaluation-validation-stage-b47d). Skipping shadow mode testing causes integration issues with 10--50$\times$ latency spikes (@sec-ml-workflow-validation-production-conditions-a351). Bypassing canary deployment leads to incidents affecting millions of users. Post-deployment fixes cost 10--100$\times$ more than catching issues during validation. Inadequate validation extends time-to-production by 2–5 months through unplanned remediation. A team that "saves" 2 weeks by skipping validation spends 6–8 weeks on emergency remediation. Organizations investing in systematic validation infrastructure achieve substantially fewer production incidents and higher first-deployment success rates.
|
||||
Teams assume cutting validation time ships faster. In production, the multi-stage validation process exists because each stage catches different failure modes (@sec-ml-workflow-evaluation-validation-stage-b47d). Skipping shadow mode testing causes integration issues with 10--50$\times$ latency spikes (@sec-ml-workflow-validation-production-conditions-a351). Bypassing canary deployment leads to incidents affecting millions of users. Post-deployment fixes cost 10--100$\times$ more than catching issues during validation. Inadequate validation extends time-to-production by 2–5 months through unplanned remediation. A team that "saves" 2 weeks by skipping validation spends 6–8 weeks on emergency remediation. Organizations investing in systematic validation infrastructure report 3--5$\times$ fewer production incidents and higher first-deployment success rates.
|
||||
|
||||
**Pitfall:** *Deferring deployment paradigm selection until after model development.*
|
||||
|
||||
@@ -1495,3 +1495,10 @@ The workflow framework established here provides the organizing structure for ev
|
||||
|
||||
::: { .quiz-end }
|
||||
:::
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:workflow")
|
||||
```
|
||||
|
||||
@@ -82,12 +82,12 @@ from mlsys.formatting import fmt
|
||||
class GpuSpecs:
|
||||
"""Hardware specifications for V100, A100, and H100 GPUs."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
h_v100 = Hardware.Cloud.V100
|
||||
h_a100 = Hardware.Cloud.A100
|
||||
h_h100 = Hardware.Cloud.H100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
v100_tflops_fp32_value = h_v100.peak_flops_fp32.m_as(TFLOPs / second)
|
||||
v100_bw_value = h_v100.memory_bw.m_as(GB / second)
|
||||
a100_bw_tbs_value = h_a100.memory_bw.m_as(TB / second)
|
||||
@@ -95,7 +95,7 @@ class GpuSpecs:
|
||||
h100_mem_value = h_h100.memory_capacity.m_as(GiB)
|
||||
h100_tdp_value = h_h100.tdp.m_as(watt)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
v100_tflops_fp32 = f"{v100_tflops_fp32_value:.1f}" # e.g. "14.1" TFLOPS
|
||||
v100_bw = f"{v100_bw_value:.0f}" # e.g. "900" GB/s
|
||||
a100_bw_tbs = f"{a100_bw_tbs_value:.1f}" # e.g. "2.0" TB/s
|
||||
@@ -199,14 +199,14 @@ The consequences of ignoring this inversion become apparent during a *traffic sp
|
||||
class BlackFridayCalc:
|
||||
"""Models the nonlinear failure mode of serving systems under a 10× traffic spike."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
bf_latency_ms_value = 50 # normal operation latency (ms)
|
||||
bf_qps_normal_value = 1000 # normal queries per second
|
||||
bf_qps_spike_value = 10000 # Black Friday peak QPS
|
||||
bf_spike_factor_value = 10 # spike multiplier (10x)
|
||||
bf_collapse_latency_s_value = 10 # latency during collapse (seconds)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
bf_latency_ms_str = f"{bf_latency_ms_value}" # e.g. "50" ms
|
||||
bf_qps_normal_str = f"{bf_qps_normal_value:,}" # e.g. "1,000" QPS
|
||||
bf_qps_spike_str = f"{bf_qps_spike_value:,}" # e.g. "10,000" QPS
|
||||
@@ -383,7 +383,7 @@ This chapter develops the engineering principles needed to orchestrate this pipe
|
||||
|
||||
### Static vs Dynamic Inference {#sec-model-serving-static-vs-dynamic-inference-e864}
|
||||
|
||||
The preceding examples explain *why* serving systems must maintain capacity headroom. But before diving into *how* to optimize inference latency, we must address a prior question: *when* should predictions be computed at all? The first architectural decision in any serving system is whether predictions happen before or during user requests [@google2024staticdynamic]. This choice shapes system design, cost structure, and capability boundaries.
|
||||
The preceding examples explain *why* serving systems must maintain capacity headroom. However, before optimizing *how* to reduce inference latency, a prior question must be addressed: *when* should predictions be computed at all? The first architectural decision in any serving system is whether predictions happen before or during user requests [@google2024staticdynamic]. This choice shapes system design, cost structure, and capability boundaries.
|
||||
|
||||
#### Static Inference {#sec-model-serving-static-inference-35f4}
|
||||
|
||||
@@ -414,15 +414,15 @@ from mlsys.formatting import fmt
|
||||
class StaticBatchCalc:
|
||||
"""Contrasts static vs. dynamic inference economics for photo classification."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
n_photos_value = 10_000 # photos in user library
|
||||
inference_ms_value = 5 # ResNet-50 inference time (ms)
|
||||
dynamic_latency_budget_ms_value = 100 # real-time latency budget (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
batch_total_s_value = n_photos_value * inference_ms_value / 1000
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
n_photos_str = f"{n_photos_value:,}" # e.g. "10,000" photos
|
||||
inference_ms_str = f"{inference_ms_value}" # e.g. "5" ms
|
||||
batch_total_s_str = fmt(batch_total_s_value, precision=0, commas=False)# e.g. "50" seconds
|
||||
@@ -462,14 +462,14 @@ from mlsys.formatting import fmt, check
|
||||
class CostLatencyCalc:
|
||||
"""Quantifies the economic tradeoff between latency and hardware cost per million queries."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
gpu_cost_per_hour_value = 4.0 # GPU rental cost ($/hour)
|
||||
latency_a_ms_value = 5 # Scenario A: low latency (ms)
|
||||
throughput_a_rps_value = 200 # Scenario A: throughput (req/s)
|
||||
latency_b_ms_value = 10 # Scenario B: higher latency (ms)
|
||||
throughput_b_rps_value = 800 # Scenario B: throughput (req/s)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
queries_per_hour_a_value = throughput_a_rps_value * SEC_PER_HOUR
|
||||
cost_per_million_a_value = gpu_cost_per_hour_value / (queries_per_hour_a_value / MILLION)
|
||||
queries_per_hour_b_value = throughput_b_rps_value * SEC_PER_HOUR
|
||||
@@ -477,10 +477,10 @@ class CostLatencyCalc:
|
||||
cost_increase_pct_value = (cost_per_million_a_value / cost_per_million_b_value - 1) * 100
|
||||
cost_ratio_value = cost_per_million_a_value / cost_per_million_b_value
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(cost_ratio_value > 1, "Scenario A (low latency) must cost more per query than Scenario B.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
gpu_cost_per_hour_str = fmt(gpu_cost_per_hour_value, precision=0, commas=False)
|
||||
latency_a_ms_str = f"{latency_a_ms_value}"
|
||||
throughput_a_rps_str = f"{throughput_a_rps_value}"
|
||||
@@ -596,14 +596,14 @@ from mlsys import Models, Tiers
|
||||
from mlsys.constants import BYTES_FP16, BYTES_INT8
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetServingSpectrum:
|
||||
"""
|
||||
Namespace for ResNet-50 Serving Spectrum comparison.
|
||||
Scenario: Mapping the same architecture (or alternatives) to Cloud, Mobile, TinyML.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
m_resnet = Models.ResNet50
|
||||
m_mobilenet = Models.MobileNetV2
|
||||
|
||||
@@ -628,7 +628,7 @@ class ResNetServingSpectrum:
|
||||
tiny_inf_ms = 120.0
|
||||
tiny_energy_mj = 12.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Calculate sizes using the Digital Twins
|
||||
cloud_size_mb = m_resnet.size_in_bytes(BYTES_FP16).m_as('MB')
|
||||
mobile_size_mb = m_resnet.size_in_bytes(BYTES_INT8).m_as('MB')
|
||||
@@ -638,13 +638,13 @@ class ResNetServingSpectrum:
|
||||
tiny_limit_mb = t_tiny.storage.m_as('MB')
|
||||
tiny_feasibility = tiny_original_mb < tiny_limit_mb
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(not tiny_feasibility,
|
||||
f"ResNet-50 ({tiny_original_mb:.1f}MB) should NOT fit on TinyML (<{tiny_limit_mb:.1f}MB).")
|
||||
check(mobile_energy_cpu_mj >= mobile_energy_npu_mj * 3,
|
||||
"NPU should be significantly more energy efficient than CPU.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
cloud_model_mb_str = fmt(cloud_size_mb, precision=0)
|
||||
cloud_inf_b1_ms_str = f"{cloud_inf_b1_ms}"
|
||||
cloud_inf_b16_ms_str = f"{cloud_inf_b16_ms}"
|
||||
@@ -743,11 +743,11 @@ Memory locking (`mlock`)\index{Memory Locking!mlock} addresses a related but dis
The third technique, interrupt shielding\index{Interrupt Shielding!latency isolation}, completes the isolation picture. Network and storage interrupts routed to inference cores can preempt GPU command submission at unpredictable moments. Steering these interrupts to non-inference cores ensures that bursts of incoming traffic do not disrupt the GPU's command stream, which is particularly important for maintaining stable tail latency under load.
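In practice, the process-side half of this isolation is a few lines of setup. The sketch below is Linux-specific and the core IDs are assumptions (the cores would typically be reserved at boot, for example via `isolcpus`); steering interrupts themselves happens outside the process through `/proc/irq/*/smp_affinity` on the host.

```python
# Linux-only sketch (illustrative core IDs): pin the serving process to a set of
# isolated cores so OS housekeeping and interrupt handling stay elsewhere.
import os

INFERENCE_CORES = {2, 3, 4, 5}        # assumed: cores reserved for inference at boot

def pin_to_inference_cores():
    os.sched_setaffinity(0, INFERENCE_CORES)   # 0 = the calling process
    print(f"pinned to cores: {sorted(os.sched_getaffinity(0))}")

if __name__ == "__main__":
    pin_to_inference_cores()
```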
These isolation principles transform a simple "model script" into a **deterministic service**, a transition essential for safety-critical applications like autonomous driving or real-time industrial control. With the deployment spectrum, load balancing, and resource isolation established, we have defined *where* models serve and *what* infrastructure supports them. The next question is *how* the serving software itself is organized: what components comprise an inference server, and how do they coordinate to turn irregular user traffic into efficient hardware utilization?
These isolation principles transform a simple "model script" into a **deterministic service**, a transition essential for safety-critical applications like autonomous driving or real-time industrial control. The deployment spectrum, load balancing, and resource isolation define *where* models serve and *what* infrastructure supports them. The remaining question is *how* the serving software itself is organized: what components comprise an inference server, and how do they coordinate to turn irregular user traffic into efficient hardware utilization?
|
||||
|
||||
## Serving System Architecture {#sec-model-serving-serving-system-architecture-4879}
|
||||
|
||||
The serving paradigm establishes *where* models execute; now we examine *how* the serving software itself is organized. A modern inference server must bridge the gap between irregular user traffic and the batch-oriented requirements of accelerators—a challenge that requires careful architectural decomposition.
User requests arrive in unpredictable bursts while accelerators demand steady, uniformly-sized batches. Bridging this gap requires more than a Python script calling `model.predict()`; it requires a specialized software architecture that absorbs traffic variability, forms efficient batches, and keeps hardware saturated without violating latency SLOs.
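The core mechanism that absorbs this mismatch is a batching queue with a size limit and a wait deadline. The sketch below is a simplified, single-threaded illustration; the batch size and the 5 ms window are assumed values, and a real server would run this check concurrently with request ingestion.

```python
# Sketch (simplified): flush a batch when it is full OR when the oldest queued
# request has waited out its batching window.
import time
from collections import deque

MAX_BATCH = 32
MAX_WAIT_S = 0.005          # 5 ms batching window (assumed)

queue = deque()             # (arrival_time, request) pairs fed by the HTTP front end

def maybe_form_batch(now):
    """Return a batch to run, or None if we should keep waiting for more requests."""
    if not queue:
        return None
    full = len(queue) >= MAX_BATCH
    oldest_wait = now - queue[0][0]
    if full or oldest_wait >= MAX_WAIT_S:
        return [queue.popleft()[1] for _ in range(min(MAX_BATCH, len(queue)))]
    return None

# Example: 10 requests trickle in; the batch flushes on the timeout, not on size.
for i in range(10):
    queue.append((time.monotonic(), f"req-{i}"))
time.sleep(MAX_WAIT_S)
print(maybe_form_batch(time.monotonic()))
```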
### Internal Architecture and Request Flow {#sec-model-serving-anatomy-inference-server-f12e}
|
||||
|
||||
@@ -850,25 +850,25 @@ The following example compares *JSON vs Protobuf serialization*.
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class SerializationEfficiency:
|
||||
"""
|
||||
Namespace for Serialization Efficiency calculation.
|
||||
Scenario: Comparing JSON vs Protobuf for a 1000-float payload.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
floats_count = 1000
|
||||
json_parse_us = 50.0
|
||||
proto_parse_us = 5.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
efficiency_gain = json_parse_us / proto_parse_us
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(efficiency_gain >= 5, f"Protobuf gain ({efficiency_gain:.1f}x) is too small to justify switching.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
serial_floats_str = f"{floats_count:,}"
|
||||
json_size_str = "9"
|
||||
json_parse_str = f"{int(json_parse_us)}"
|
||||
@@ -904,7 +904,7 @@ The architectural components and protocols examined so far describe *how* servin
|
||||
|
||||
## Request Lifecycle {#sec-model-serving-request-lifecycle-d9c6}
|
||||
|
||||
With the serving architecture established, we now trace *what* happens to a single request as it flows through the system. Understanding *where* time goes within each request is essential for effective optimization: one cannot improve what one does not measure.
|
||||
A single HTTP request carrying a 224$\times$224 JPEG image arrives at an inference server. Between the moment the first byte enters the network stack and the moment the classification result leaves, that request traverses six pipeline stages, each consuming milliseconds that the user experiences as wait time. Understanding *where* time goes within each request is essential for effective optimization: one cannot improve what one does not measure.
|
||||
|
||||
### The Latency Budget {#sec-model-serving-latency-budget-ef40}
|
||||
|
||||
@@ -930,14 +930,14 @@ from mlsys.formatting import fmt
|
||||
class TailLatencyRatioCalc:
|
||||
"""Demonstrates why mean latency misleads by showing the p99-to-mean ratio."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
mean_latency_ms_value = 50 # mean latency (ms)
|
||||
p99_latency_ms_value = 2000 # p99 latency (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
tail_ratio_value = p99_latency_ms_value / mean_latency_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tail_ratio_str = fmt(tail_ratio_value, precision=0, commas=False) # e.g. "40" times
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -1006,7 +1006,7 @@ Understanding *where* time goes requires instrumenting each phase independently.
|
||||
class LatencyTableCalc:
|
||||
"""Decomposes the ResNet-50 request lifecycle into processing phases with percentages."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
l_jpeg_value = 3.0 # JPEG decode (ms)
|
||||
l_resize_value = 1.0 # resize to 224×224 (ms)
|
||||
l_norm_value = 0.5 # normalize (mean/std) (ms)
|
||||
@@ -1014,7 +1014,7 @@ class LatencyTableCalc:
|
||||
l_inf_value = 5.0 # ResNet-50 forward pass (ms)
|
||||
l_post_value = 0.1 # softmax + top-5 (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
l_total_value = l_jpeg_value + l_resize_value + l_norm_value + l_transfer_value + l_inf_value + l_post_value
|
||||
p_jpeg_value = l_jpeg_value / l_total_value * 100
|
||||
p_resize_value = l_resize_value / l_total_value * 100
|
||||
@@ -1023,7 +1023,7 @@ class LatencyTableCalc:
|
||||
p_inf_value = l_inf_value / l_total_value * 100
|
||||
p_post_value = l_post_value / l_total_value * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
l_jpeg_str = f"{l_jpeg_value:.1f}ms"
|
||||
l_resize_str = f"{l_resize_value:.1f}ms"
|
||||
l_norm_str = f"{l_norm_value:.1f}ms"
|
||||
@@ -1077,7 +1077,7 @@ from mlsys.formatting import fmt
|
||||
class LatencyBudgetCalc:
|
||||
"""Demonstrates the shifting bottleneck from inference to preprocessing after TensorRT optimization."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
jpeg_decode_ms_value = 3.0 # JPEG decode (ms)
|
||||
resize_ms_value = 1.0 # resize (ms)
|
||||
normalize_ms_value = 0.5 # normalize (ms)
|
||||
@@ -1086,14 +1086,14 @@ class LatencyBudgetCalc:
|
||||
postprocess_ms_value = 0.1 # postprocessing (ms)
|
||||
tensorrt_inference_ms_value = 2.0 # TensorRT optimized inference (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
preprocess_ms_value = jpeg_decode_ms_value + resize_ms_value + normalize_ms_value
|
||||
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value
|
||||
preprocess_pct_value = preprocess_ms_value / total_latency_ms_value * 100
|
||||
tensorrt_total_ms_value = preprocess_ms_value + cpu_gpu_ms_value + tensorrt_inference_ms_value + postprocess_ms_value
|
||||
tensorrt_preprocess_pct_value = preprocess_ms_value / tensorrt_total_ms_value * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
preprocess_ms_str = fmt(preprocess_ms_value, precision=1, commas=False)
|
||||
cpu_gpu_ms_str = fmt(cpu_gpu_ms_value, precision=1, commas=False)
|
||||
resnet_inference_ms_str = fmt(resnet_inference_ms_value, precision=0, commas=False)
|
||||
@@ -1154,16 +1154,16 @@ The ResNet example represents compute-bound inference where math dominates. Reco
|
||||
class DlrmLatencyCalc:
|
||||
"""Models DLRM serving latency to contrast I/O-bound vs. compute-bound bottlenecks."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
dlrm_input_ms_value = 0.5 # request parsing (CPU) (ms)
|
||||
dlrm_embed_ms_value = 6.0 # embedding lookups (memory BW) (ms)
|
||||
dlrm_mlp_ms_value = 1.5 # MLP forward pass (compute) (ms)
|
||||
dlrm_post_ms_value = 1.0 # ranking & filtering (CPU) (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
dlrm_total_ms_value = dlrm_input_ms_value + dlrm_embed_ms_value + dlrm_mlp_ms_value + dlrm_post_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
dlrm_input_str = f"{dlrm_input_ms_value}ms"
|
||||
dlrm_embed_str = f"{dlrm_embed_ms_value}ms"
|
||||
dlrm_mlp_str = f"{dlrm_mlp_ms_value}ms"
|
||||
@@ -1218,14 +1218,14 @@ from mlsys.formatting import fmt
|
||||
class AmdahlServingCalc:
|
||||
"""Applies Amdahl's Law to show that 10× model speedup yields only ~1.8× end-to-end improvement."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# Re-derived from latency-budget-calc constants
|
||||
preprocess_ms_value = 3.0 + 1.0 + 0.5 # JPEG decode + resize + normalize
|
||||
cpu_gpu_ms_value = 0.5 # CPU→GPU transfer
|
||||
resnet_inference_ms_value = 5.0 # PyTorch inference
|
||||
postprocess_ms_value = 0.1 # postprocessing
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value
|
||||
non_model_ms_value = preprocess_ms_value + cpu_gpu_ms_value
|
||||
non_model_pct_value = non_model_ms_value / total_latency_ms_value * 100
|
||||
@@ -1233,7 +1233,7 @@ class AmdahlServingCalc:
|
||||
optimized_total_ms_value = non_model_ms_value + model_10x_ms_value + postprocess_ms_value
|
||||
amdahl_speedup_value = total_latency_ms_value / optimized_total_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
non_model_pct_str = fmt(non_model_pct_value, precision=0, commas=False)
|
||||
optimized_total_str = fmt(optimized_total_ms_value, precision=1, commas=False)
|
||||
amdahl_speedup_str = fmt(amdahl_speedup_value, precision=1, commas=False)
|
||||
@@ -1305,15 +1305,15 @@ from mlsys.formatting import fmt
|
||||
class ResolutionScalingCalc:
|
||||
"""Quantifies the quadratic relationship between input resolution and inference slowdown."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
r1_value = 224 # original resolution
|
||||
r2_value = 448 # doubled resolution
|
||||
measured_slowdown_value = 3.6 # actual measured slowdown
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
theoretical_slowdown_value = (r2_value / r1_value) ** 2
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
r1_str = f"{r1_value}"
|
||||
r2_str = f"{r2_value}"
|
||||
theoretical_str = fmt(theoretical_slowdown_value, precision=0, commas=False)
|
||||
@@ -1348,7 +1348,7 @@ from mlsys.formatting import fmt
|
||||
class ResolutionBottleneckCalc:
|
||||
"""Shows that increasing resolution decreases arithmetic intensity, shifting from compute- to memory-bound."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
act_224_mb_value = 12.5 # 224×224 activation size (MB)
|
||||
act_384_mb_value = 36.8 # 384×384 activation size (MB)
|
||||
act_512_mb_value = 65.5 # 512×512 activation size (MB)
|
||||
@@ -1361,7 +1361,7 @@ class ResolutionBottleneckCalc:
|
||||
|
||||
ridge_point_value = 16 # V100 ridge point (FLOPs/byte)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
act_224_mb_str = f"{act_224_mb_value}"
|
||||
act_384_mb_str = f"{act_384_mb_value}"
|
||||
act_512_mb_str = f"{act_512_mb_value}"
|
||||
@@ -1421,11 +1421,11 @@ from mlsys.formatting import fmt
|
||||
class AdaptiveResolutionCalc:
|
||||
"""Demonstrates 1.4× throughput gain from content-aware resolution selection."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
adaptive_throughput_improvement_value = 1.4 # throughput gain factor
|
||||
adaptive_accuracy_retention_value = 99.2 # accuracy retention (%)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
adaptive_throughput_improvement_str = fmt(adaptive_throughput_improvement_value, precision=1, commas=False)
|
||||
adaptive_accuracy_retention_str = fmt(adaptive_accuracy_retention_value, precision=1, commas=False)
|
||||
|
||||
@@ -1570,7 +1570,7 @@ Serving engineers routinely face a concrete question: given a latency SLO\index{
|
||||
from mlsys.constants import MS_PER_SEC
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CapacityPlanningAnchor:
|
||||
"""
|
||||
Namespace for serving capacity anchor.
|
||||
@@ -1588,25 +1588,25 @@ serving_qps_str = CapacityPlanningAnchor.qps_str
|
||||
serving_slo_str = CapacityPlanningAnchor.slo_str
|
||||
serving_concurrency_slots_str = CapacityPlanningAnchor.slots_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CapacityPlanning:
|
||||
"""
|
||||
Namespace for Little's Law Capacity calculation.
|
||||
Scenario: Determining concurrency requirements for a 1000 QPS target.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
lambda_qps = 1000.0
|
||||
latency_slo_s = 0.050 # 50ms
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# L = lambda * W
|
||||
concurrency = lambda_qps * latency_slo_s
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(concurrency == 50, f"Math broken: 1000 * 0.05 should be 50, got {concurrency}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
littles_lambda_str = f"{lambda_qps:,.0f}"
|
||||
littles_w_ms_str = f"{int(latency_slo_s * MS_PER_SEC)}"
|
||||
littles_w_str = fmt(latency_slo_s, precision=2, commas=False)
|
||||
@@ -1670,21 +1670,21 @@ L = `{python} littles_lambda_str`$\times$ `{python} littles_w_str` = **`{python}
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BatchingTax:
|
||||
"""
|
||||
Namespace for The Batching Tax calculation.
|
||||
Scenario: Comparing wait times for B=1 vs B=32 at 500 QPS.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
lambda_qps = 500.0
|
||||
|
||||
# Inference times (ms)
|
||||
t_inf_b1 = 2.0
|
||||
t_inf_b32 = 15.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Batch 1
|
||||
w_form_b1 = (1-1) / (2 * lambda_qps) * 1000 # 0ms
|
||||
lat_b1 = w_form_b1 + t_inf_b1
|
||||
@@ -1696,10 +1696,10 @@ class BatchingTax:
|
||||
|
||||
penalty_ratio = lat_b32 / lat_b1
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(lat_b32 > lat_b1 * 10, f"Batch-32 penalty ({lat_b32:.1f}ms) should be significant.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
wait_time_b1_ms_str = f"{int(w_form_b1)}"
|
||||
wait_time_b32_ms_str = f"{int(w_form_b32)}"
|
||||
lat_b1_ms_str = f"{lat_b1:.1f}"
|
||||
@@ -1812,7 +1812,7 @@ from mlsys.formatting import fmt
|
||||
class CapacityPlanningCalc:
|
||||
"""ResNet-50 capacity planning: GPUs needed for 5000 QPS at 50ms p99 with N+1 redundancy."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cp_peak_qps_value = 5000 # peak traffic (QPS)
|
||||
cp_service_ms_value = 5 # TensorRT FP16 service time (ms)
|
||||
cp_p99_target_ms_value = 50 # p99 latency SLO (ms)
|
||||
@@ -1824,7 +1824,7 @@ class CapacityPlanningCalc:
|
||||
mm1_p99_factor_value = 4.6 # p99 multiplier for M/M/1
|
||||
mm1_rho_example_value = 0.7 # example utilization
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cp_mu_required_value = cp_peak_qps_value / cp_rho_safe_value
|
||||
cp_gpus_raw_value = cp_mu_required_value / cp_v100_throughput_value
|
||||
cp_gpus_ceil_value = math.ceil(cp_gpus_raw_value)
|
||||
@@ -1835,7 +1835,7 @@ class CapacityPlanningCalc:
|
||||
cp_precision_ratio_value = cp_fp32_bits_value // cp_int8_bits_value
|
||||
mm1_wait_factor_value = mm1_p99_factor_value / (1 - mm1_rho_example_value)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cp_mu_required_str = fmt(cp_mu_required_value, precision=0, commas=False)
|
||||
cp_gpus_raw_str = fmt(cp_gpus_raw_value, precision=1, commas=False)
|
||||
cp_gpus_ceil_str = f"{cp_gpus_ceil_value}"
|
||||
@@ -2038,7 +2038,7 @@ Cold start\index{Cold Start!anatomy} latency compounds from multiple sources, ea
|
||||
class ColdStartCalc:
|
||||
"""Decomposes cold start latency showing pre-compiled TensorRT reduces startup from ~35s to ~1.5s."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cs_ssd_value = 0.5 # weight loading from SSD (s)
|
||||
cs_s3_value = 4.0 # weight loading from S3 (s)
|
||||
cs_cuda_value = 0.4 # CUDA context initialization (s)
|
||||
@@ -2046,11 +2046,11 @@ class ColdStartCalc:
|
||||
cs_warmup_value = 0.2 # warmup inferences (s)
|
||||
cs_runtime_overhead_value = 0.4 # runtime overhead (s)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cs_local_total_value = cs_ssd_value + cs_cuda_value + cs_warmup_value + cs_runtime_overhead_value
|
||||
cs_cloud_total_value = cs_s3_value + cs_cuda_value + cs_compile_value + cs_warmup_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cs_local_str = f"~{cs_local_total_value:.1f}s"
|
||||
cs_cloud_str = f"~{cs_cloud_total_value:.0f}s"
|
||||
cs_ssd_str = f"{cs_ssd_value}s"
|
||||
@@ -2150,14 +2150,14 @@ from mlsys.formatting import fmt
|
||||
class ModelSwapCalc:
|
||||
"""Quantifies the latency cost of swapping a 10 GB model over PCIe Gen4."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
model_size_gb_value = 10 # model size (GB)
|
||||
pcie_bw_gbs_value = PCIE_GEN4_BW.m_as(GB / second) # PCIe Gen4 x16 bandwidth
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
model_swap_ms_value = model_size_gb_value / pcie_bw_gbs_value * 1000
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
model_size_gb_str = f"{model_size_gb_value}"
|
||||
pcie_bw_gbs_str = fmt(pcie_bw_gbs_value, precision=0, commas=False)
|
||||
model_swap_ms_str = fmt(model_swap_ms_value, precision=0, commas=False)
|
||||
@@ -2222,18 +2222,18 @@ from mlsys.formatting import fmt
|
||||
class BatchThroughputCalc:
|
||||
"""Quantifies the 6.4× throughput gain of batch-32 over batch-1 and its latency cost."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
batch1_throughput_value = 200 # batch-1 throughput (img/s)
|
||||
batch32_throughput_value = 1280 # batch-32 throughput (img/s)
|
||||
batch32_inference_ms_value = 25.0 # batch-32 inference time (ms)
|
||||
batch_window_ms_value = 10.0 # batching window (ms)
|
||||
batch1_inference_total_ms_value = 5.0 # batch-1 total latency (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
throughput_ratio_value = batch32_throughput_value / batch1_throughput_value
|
||||
batch32_total_ms_value = batch_window_ms_value + batch32_inference_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
throughput_ratio_str = fmt(throughput_ratio_value, precision=1, commas=False)
|
||||
batch_window_ms_str = fmt(batch_window_ms_value, precision=0, commas=False)
|
||||
batch32_inference_ms_str = fmt(batch32_inference_ms_value, precision=0, commas=False)
|
||||
@@ -2288,18 +2288,18 @@ from mlsys.formatting import fmt
|
||||
class BatchingSweetspotCalc:
|
||||
"""Demonstrates that batch-8 yields ~3× throughput within a 20ms SLO budget."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
batch1_ms_value = 5.0 # batch-1 inference (ms)
|
||||
batch1_imgs_value = 200 # batch-1 throughput (img/s)
|
||||
batch8_wait_ms_value = 5.0 # batch-8 wait time (ms)
|
||||
batch8_inference_ms_value = 9.0 # batch-8 inference time (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
batch8_user_latency_ms_value = batch8_wait_ms_value + batch8_inference_ms_value
|
||||
batch8_throughput_value = 8 / (batch8_user_latency_ms_value / 1000)
|
||||
latency_increase_value = batch8_user_latency_ms_value / batch1_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
batch1_ms_str = fmt(batch1_ms_value, precision=0, commas=False)
|
||||
batch1_imgs_str = f"{batch1_imgs_value}"
|
||||
batch8_wait_ms_str = fmt(batch8_wait_ms_value, precision=0, commas=False)
|
||||
@@ -2452,16 +2452,16 @@ from mlsys.formatting import fmt
|
||||
class BatchingBudgetCalc:
|
||||
"""Shows that a 20ms batching window consumes 20% of a 50ms SLO before any computation."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
batch_window_ms_value = 20 # batching window (ms)
|
||||
slo_ms_value = 50 # latency SLO (ms)
|
||||
inference_ms_value = 5 # inference time (ms)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
avg_wait_ms_value = batch_window_ms_value / 2
|
||||
budget_pct_value = avg_wait_ms_value / slo_ms_value * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
avg_wait_str = fmt(avg_wait_ms_value, precision=0, commas=False)
|
||||
budget_pct_str = fmt(budget_pct_value, precision=0, commas=False)
|
||||
|
||||
@@ -2530,7 +2530,7 @@ def throughput_value(b, _T=10.0, _fixed=5.0, _per_image=0.6):
|
||||
class BatchingAnalysisCalc:
|
||||
"""Quantifies batching efficiency: batch-32 achieves 14.6× throughput gain over batch-1."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
T_window_value = 10.0 # batching window (ms)
|
||||
fixed_overhead_ms_value = 5.0 # fixed overhead (ms)
|
||||
per_image_ms_value = 0.6 # per-image compute (ms)
|
||||
@@ -2540,14 +2540,14 @@ class BatchingAnalysisCalc:
|
||||
il_compute_b32_ms_value = 19.2 # batch-32 compute (ms)
|
||||
il_threshold_pct_value = 10 # efficiency threshold (%)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
throughputs_value = {b: throughput_value(b) for b in batch_sizes_value}
|
||||
latencies_value = {b: total_latency_value(b) for b in batch_sizes_value}
|
||||
throughput_increase_value = throughputs_value[32] / throughputs_value[1]
|
||||
il_eff_b1_pct_value = int(il_compute_b1_ms_value / (il_overhead_ms_value + il_compute_b1_ms_value) * 100)
|
||||
il_eff_b32_pct_value = int(il_compute_b32_ms_value / (il_overhead_ms_value + il_compute_b32_ms_value) * 100)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
throughput_increase_str = fmt(throughput_increase_value, precision=1, commas=False)
|
||||
b1_throughput_str = fmt(throughputs_value[1], precision=0, commas=False)
|
||||
b32_throughput_str = fmt(throughputs_value[32], precision=0, commas=False)
|
||||
@@ -2636,14 +2636,14 @@ from mlsys.formatting import fmt
|
||||
# │ latency_p99_increase_ms_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class BatchingOptimization:
|
||||
"""
|
||||
Namespace for Latency-Constrained Batching Optimization.
|
||||
Scenario: Comparing 5ms (Conservative) vs 25ms (Aggressive) batching windows.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Scenario 1 (Conservative)
|
||||
s1_window = 5.0
|
||||
s1_batch = 32
|
||||
@@ -2654,7 +2654,7 @@ class BatchingOptimization:
|
||||
s2_batch = 48
|
||||
s2_tput = 1280.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Avg wait = Window / 2
|
||||
s1_wait = s1_window / 2
|
||||
s2_wait = s2_window / 2
|
||||
@@ -2667,11 +2667,11 @@ class BatchingOptimization:
|
||||
tput_gain = ((s2_tput / s1_tput) - 1) * 100
|
||||
latency_increase = s2_wait - s1_wait
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(tput_gain <= 25, f"Aggressive batching gained too much throughput ({tput_gain:.1f}%). Diminishing returns not shown.")
|
||||
check(latency_increase >= 5, "Latency penalty is too small to be a concern.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
s1_window_ms_str = f"{int(s1_window)}"
|
||||
s1_wait_ms_str = f"{s1_wait}"
|
||||
s1_budget_ms_str = f"{s1_budget}"
|
||||
@@ -2750,12 +2750,12 @@ from mlsys.formatting import fmt
|
||||
class SloViolationCalc:
|
||||
"""p99 latency is 2.2× the mean due to batch size variance from Poisson arrivals."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
qps_value = 500
|
||||
T_slo_value = 10.0
|
||||
p99_batch_value = 11
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# service_time_value is a module-level function from batching-analysis-calc
|
||||
mean_batch_value = qps_value * (T_slo_value / 1000)
|
||||
mean_wait_value = T_slo_value / 2
|
||||
@@ -2765,7 +2765,7 @@ class SloViolationCalc:
|
||||
p99_latency_value = T_slo_value + p99_service_value
|
||||
p99_to_mean_value = p99_latency_value / mean_latency_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
qps_str = f"{qps_value}"
|
||||
T_slo_str = fmt(T_slo_value, precision=0, commas=False)
|
||||
mean_wait_str = fmt(mean_wait_value, precision=0, commas=False)
|
||||
@@ -2893,7 +2893,7 @@ from mlsys.formatting import fmt
|
||||
class PracticalConfigCalc:
|
||||
"""Derives a production batching config: 30% SLO budget yields a 12ms window for 500 QPS."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
pc_slo_ms_value = 50
|
||||
pc_qps_value = 500
|
||||
pc_budget_pct_value = 0.3
|
||||
@@ -2903,12 +2903,12 @@ class PracticalConfigCalc:
|
||||
pc_predicted_p99_ms_value = 43
|
||||
pc_predicted_throughput_value = 1180
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
pc_batch_budget_ms_value = pc_slo_ms_value * pc_budget_pct_value
|
||||
pc_max_window_ms_value = pc_batch_budget_ms_value
|
||||
pc_expected_batch_value = pc_qps_value * (pc_max_window_ms_value / 1000)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
pc_slo_ms_str = f"{pc_slo_ms_value}"
|
||||
pc_qps_str = f"{pc_qps_value}"
|
||||
pc_batch_budget_ms_str = f"{pc_batch_budget_ms_value:.0f}"
|
||||
@@ -2979,17 +2979,15 @@ Continuous batching represents the state of the art for LLM serving, yet not all

#### When Not to Batch {#sec-model-serving-batch-12a4}

Some\index{Batching!when to avoid} scenarios require single-request processing. Ultra-low latency requirements\index{Ultra-Low Latency!no batching}, where p99 latency must stay under 10 ms, make any batching delay unacceptable. Highly variable request sizes create padding overhead that wastes compute, since the smallest input in a batch must be padded to match the largest. And memory constraints become binding when models already consume most GPU memory, since batch activations scale linearly with batch size and can trigger out-of-memory errors.
Some\index{Batching!when to avoid} scenarios require single-request processing. Ultra-low latency requirements\index{Ultra-Low Latency!no batching}, where p99 latency must stay under 10 ms, make any batching delay unacceptable. Highly variable request sizes create padding overhead that wastes compute, since the smallest input in a batch must be padded to match the largest. Memory constraints also become binding when models already consume most GPU memory, since batch activations scale linearly with batch size and can trigger out-of-memory errors.
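
The three disqualifying conditions can be read as a simple predicate. The sketch below is illustrative only; the function name, the thresholds, and the inputs (`p99_budget_ms`, `input_size_cv`, `free_gpu_mem_frac`) are assumptions chosen for this example, not part of any serving framework.

```python
def should_batch(p99_budget_ms: float,
                 input_size_cv: float,
                 free_gpu_mem_frac: float) -> bool:
    """Return True when dynamic batching is likely worthwhile (illustrative heuristic)."""
    if p99_budget_ms < 10:        # ultra-low latency: any batching window is unacceptable
        return False
    if input_size_cv > 0.5:       # highly variable sizes: padding waste dominates
        return False
    if free_gpu_mem_frac < 0.2:   # little headroom: batch activations risk OOM
        return False
    return True

print(should_batch(50, 0.3, 0.4))   # True: room for a window, modest variance, memory headroom
print(should_batch(8, 0.3, 0.4))    # False: sub-10 ms p99 budget
```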

### Session Affinity Constraints {#sec-model-serving-session-affinity-constraints-8b1f}

When requests from the same user or session should route to the same replica, batching becomes constrained. Session affinity, also called sticky sessions, matters for three main reasons.

**KV-Cache Reuse**\index{KV Cache!session reuse}\index{KV Cache!multi-turn conversations}: For conversational AI, the key-value cache from previous turns dramatically speeds up multi-turn conversations. Routing a follow-up request to a different replica forfeits this cached context, increasing latency by 2 to 5 times for long conversations.
The most impactful case is KV-cache reuse\index{KV Cache!session reuse}\index{KV Cache!multi-turn conversations} in conversational AI, where the key-value cache from previous turns dramatically speeds up multi-turn conversations. Routing a follow-up request to a different replica forfeits this cached context, increasing latency by 2 to 5 times for long conversations.

**User-Specific Models**\index{Personalized Models!user adapters}: Some systems serve personalized models or adapters per user. Routing requests to the replica that has already loaded that user's adapter avoids repeated loading overhead.

**Stateful Preprocessing**: When preprocessing maintains state through tokenizer caches or session-specific normalization, routing to a different replica requires rebuilding this state.
A second driver is user-specific models\index{Personalized Models!user adapters}: some systems serve personalized models or adapters per user, and routing requests to the replica that has already loaded that user's adapter avoids repeated loading overhead. Similarly, stateful preprocessing that maintains tokenizer caches or session-specific normalization requires rebuilding state when requests route to a different replica.

The tension with batching is clear since strict affinity\index{Session Affinity!sticky sessions} constrains which requests can be batched together, potentially reducing batch sizes and GPU utilization. Production systems often implement soft affinity\index{Soft Affinity!load balancing} where requests prefer their assigned replica but can overflow to others when that replica is overloaded. This preserves most affinity benefits while maintaining load balance.
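
A minimal sketch of the soft-affinity idea: each session hashes to a preferred "home" replica, and a request overflows to the least-loaded replica only when its home replica is saturated. The function, the `max_inflight` threshold, and the load representation are assumptions made for this example, not the API of any particular serving system.

```python
import hashlib

def pick_replica(session_id: str, inflight: list[int], max_inflight: int = 32) -> int:
    """Soft affinity: prefer the session's hashed home replica, overflow when it is saturated."""
    n = len(inflight)
    home = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % n
    if inflight[home] < max_inflight:
        return home                                  # keep KV-cache / adapter locality
    return min(range(n), key=lambda i: inflight[i])  # overflow: least-loaded replica

inflight = [31, 4, 32, 7]                 # in-flight requests per replica
print(pick_replica("user-1234:conv-9", inflight))
```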
@@ -3078,7 +3076,7 @@ Mobile\index{Single-User Traffic!mobile serving}\index{SingleStream!MLPerf scena
|
||||
class MobileServingCalc:
|
||||
"""Mobile vision pipeline: JPEG decode dominates energy; NPU handles inference at 82% utilization."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
m_cam_ms_value = 8
|
||||
m_jpeg_ms_value = 15
|
||||
m_resize_ms_value = 5
|
||||
@@ -3090,7 +3088,7 @@ class MobileServingCalc:
|
||||
m_npu_mj_value = 0.8
|
||||
m_ui_mj_value = 0.2
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
m_total_ms_value = (
|
||||
m_cam_ms_value + m_jpeg_ms_value + m_resize_ms_value + m_npu_ms_value + m_ui_ms_value
|
||||
)
|
||||
@@ -3098,7 +3096,7 @@ class MobileServingCalc:
|
||||
m_cam_mj_value + m_jpeg_mj_value + m_resize_mj_value + m_npu_mj_value + m_ui_mj_value
|
||||
)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
m_cam_ms_str = f"{m_cam_ms_value}ms"
|
||||
m_jpeg_ms_str = f"{m_jpeg_ms_value}ms"
|
||||
m_resize_ms_str = f"{m_resize_ms_value}ms"
|
||||
@@ -3181,7 +3179,7 @@ Batching is the primary lever for serving economics, but the optimal strategy de

- [ ] **Adaptive windows**: Can you explain why the optimal batching window *decreases* as traffic *increases*, even though batch sizes grow? (A worked sketch follows this checklist.)
::::
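
As a worked answer to the adaptive-windows question above: with Poisson arrivals, the expected batch collected in a window of $T$ ms at $\lambda$ QPS is $\lambda \cdot T / 1000$, so for a fixed target batch size the window that fills it shrinks as traffic grows. The target batch size below is an illustrative assumption.

```python
target_batch = 32            # desired batch size (illustrative)

for qps in (250, 500, 1000, 2000):
    # Expected batch in a window of T ms is qps * T / 1000,
    # so the window needed to collect target_batch requests is:
    window_ms = target_batch / qps * 1000
    print(f"{qps:5d} QPS -> {window_ms:5.1f} ms window fills a batch of {target_batch}")
# 128.0, 64.0, 32.0, 16.0 ms: the optimal window shrinks as traffic rises.
```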

The batching strategies examined so far share a critical assumption: each request produces a single, fixed-size output---one classification label, one bounding box, one embedding vector. This assumption governs the queuing math, the Pareto frontier analysis, and the traffic-adaptive window tuning. But the fastest-growing category of serving workloads violates this assumption entirely. Large language models generate outputs token by token, with each token depending on every previous one. A single request may produce hundreds or thousands of tokens over seconds of elapsed time, yet must feel responsive from the first token onward. This fundamental shift from fixed-output to variable-length, streaming-output serving demands new metrics, new memory management strategies, and new batching techniques that build on---but substantially extend---the foundations established above.
The batching strategies examined so far share a critical assumption: each request produces a single, fixed-size output---one classification label, one bounding box, one embedding vector. This assumption governs the queuing math, the Pareto frontier analysis, and the traffic-adaptive window tuning. The fastest-growing category of serving workloads, however, violates this assumption entirely. Large language models generate outputs token by token, with each token depending on every previous one. A single request may produce hundreds or thousands of tokens over seconds of elapsed time, yet must feel responsive from the first token onward. This fundamental shift from fixed-output to variable-length, streaming-output serving demands new metrics, new memory management strategies, and new batching techniques that build on---but substantially extend---the foundations established above.

## LLM Serving {#sec-model-serving-llm-serving-b8bf}

@@ -3269,7 +3267,7 @@ from mlsys.formatting import fmt
|
||||
class CarbonCostCalc:
|
||||
"""Energy cost per LLM token: poor utilization causes 10× higher Joules/token."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cc_concurrent_req_value = 114 # concurrent requests per H100
|
||||
cc_tokens_per_sec_req_value = 7.5 # tokens/sec per request (decode phase)
|
||||
cc_host_overhead_w_value = 300 # host server power overhead (W)
|
||||
@@ -3277,7 +3275,7 @@ class CarbonCostCalc:
|
||||
cc_low_util_pct_value = 10 # poor utilization scenario (%)
|
||||
cc_idle_power_w_value = 300 # GPU idle power (W)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
_h100_tdp_w = H100_TDP.m_as(watt)
|
||||
cc_total_tokens_sec_value = cc_concurrent_req_value * cc_tokens_per_sec_req_value
|
||||
cc_total_power_w_value = _h100_tdp_w + cc_host_overhead_w_value
|
||||
@@ -3288,7 +3286,7 @@ class CarbonCostCalc:
|
||||
cc_low_util_tokens_sec_value = cc_total_tokens_sec_value * (cc_low_util_pct_value / 100)
|
||||
cc_low_util_joules_value = cc_idle_power_w_value / cc_low_util_tokens_sec_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cc_concurrent_str = fmt(cc_concurrent_req_value, precision=0, commas=False)
|
||||
cc_tokens_req_str = fmt(cc_tokens_per_sec_req_value, precision=1, commas=False)
|
||||
cc_total_tokens_str = fmt(cc_total_tokens_sec_value, precision=0, commas=False)
|
||||
@@ -3405,7 +3403,7 @@ from mlsys.formatting import fmt
|
||||
class RuntimeComparisonCalc:
|
||||
"""ResNet-50 runtime comparison: TensorRT INT8 achieves up to 9× speedup over eager PyTorch."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
rt_pytorch_ms_value = 8.5
|
||||
rt_torchscript_ms_value = 6.2
|
||||
rt_onnx_ms_value = 5.1
|
||||
@@ -3413,7 +3411,7 @@ class RuntimeComparisonCalc:
|
||||
rt_trt_fp16_ms_value = 1.4
|
||||
rt_trt_int8_ms_value = 0.9
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
rt_pytorch_speedup_value = 1.0
|
||||
rt_torchscript_speedup_value = rt_pytorch_ms_value / rt_torchscript_ms_value
|
||||
rt_onnx_speedup_value = rt_pytorch_ms_value / rt_onnx_ms_value
|
||||
@@ -3421,7 +3419,7 @@ class RuntimeComparisonCalc:
|
||||
rt_trt_fp16_speedup_value = rt_pytorch_ms_value / rt_trt_fp16_ms_value
|
||||
rt_trt_int8_speedup_value = rt_pytorch_ms_value / rt_trt_int8_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
rt_pytorch_ms_str = f"{rt_pytorch_ms_value}"
|
||||
rt_torchscript_ms_str = f"{rt_torchscript_ms_value}"
|
||||
rt_onnx_ms_str = f"{rt_onnx_ms_value}"
|
||||
@@ -3512,7 +3510,7 @@ from mlsys.formatting import fmt
|
||||
class PrecisionTradeoffCalc:
|
||||
"""ResNet-50 precision tradeoffs: FP16 gives 2× free speedup; INT8 gives 3× with <0.4pp accuracy loss."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
pt_fp32_ms_value = 2.8
|
||||
pt_fp32_mem_mb_value = 98
|
||||
pt_fp32_acc_value = 76.13
|
||||
@@ -3526,12 +3524,12 @@ class PrecisionTradeoffCalc:
|
||||
pt_int8_qat_acc_value = 76.05
|
||||
pt_int8_util_value = 92
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
pt_int8_speedup_value = pt_fp32_ms_value / pt_int8_ms_value
|
||||
pt_fp16_speedup_value = pt_fp32_ms_value / pt_fp16_ms_value
|
||||
pt_int8_acc_loss_value = pt_fp32_acc_value - pt_int8_ptq_acc_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
pt_fp32_ms_str = f"{pt_fp32_ms_value}"
|
||||
pt_fp32_mem_mb_str = f"{pt_fp32_mem_mb_value}"
|
||||
pt_fp32_acc_str = f"{pt_fp32_acc_value}"
|
||||
@@ -3601,7 +3599,7 @@ Advanced\index{Dynamic Precision!adaptive quality} serving systems select precis

The precision decision has direct infrastructure consequences: INT8 inference achieves roughly 3$\times$ higher throughput than FP32, meaning a workload requiring 30 GPUs at FP32 needs only 10 at INT8. This 3$\times$ reduction in hardware translates directly to a 3$\times$ reduction in operating costs. The connection between model-level optimization and infrastructure economics is why precision selection cannot be treated as purely a model concern.
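
The fleet-sizing arithmetic behind that claim is short enough to spell out. The 30-GPU fleet and the 3$\times$ ratio come from the paragraph above; the hourly rate is an illustrative assumption.

```python
speedup = 3            # INT8 over FP32 throughput ratio (from the text)
fp32_gpus = 30         # fleet size required at FP32 (from the text)
rate = 3.00            # $/GPU-hour, illustrative
hours = 730            # hours per month

int8_gpus = -(-fp32_gpus // speedup)   # ceiling division -> 10
print(f"FP32 fleet: {fp32_gpus} GPUs  ${fp32_gpus * rate * hours:,.0f}/month")
print(f"INT8 fleet: {int8_gpus} GPUs  ${int8_gpus * rate * hours:,.0f}/month")
```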

Runtime selection and precision tuning operate at the model level: they determine *what* computation runs and at *what* numerical format. But between the model and the silicon lies another optimization layer—the mechanics of how computation graphs compile to kernels, how bytes move from disk to memory, and how the CPU and GPU coordinate their work. These node-level techniques often yield the final 2--5$\times$ that separates a functional prototype from a production-grade serving node.
Runtime selection and precision tuning operate at the model level: they determine *what* computation runs and at *what* numerical format. Between the model and the silicon, however, lies another optimization layer—the mechanics of how computation graphs compile to kernels, how bytes move from disk to memory, and how the CPU and GPU coordinate their work. These node-level techniques often yield the final 2--5$\times$ that separates a functional prototype from a production-grade serving node.

## Node-Level Optimization {#sec-model-serving-nodelevel-optimization-3d9d}
|
||||
|
||||
@@ -3722,7 +3720,7 @@ The optimization techniques examined so far—batching, runtime selection, preci

## Economics and Planning {#sec-model-serving-economics-capacity-planning-3e7e}

Every optimization technique examined so far—batching, precision tuning, operator fusion, graph compilation—reduces a single number: the cost of one inference on one machine. But production deployment requires answering a different question: how many machines, of what type, at what total cost? A team that achieves 1,200 images/second on a V100 still needs to know whether 8 V100s at \$3/hour each or 24 T4s at \$0.53/hour each yields lower total cost of ownership for their 5,000 QPS target. Serving costs\index{Serving Economics!infrastructure costs}\index{Serving Costs!request volume scaling} scale with request volume, unlike training costs that scale with dataset size and model complexity [@zhang2019mark]. The intelligence deflation trend shown in @fig-intelligence-deflation intensifies this pressure: as per-token prices collapse by orders of magnitude, the margin on each inference shrinks, making infrastructure efficiency the primary lever for economic viability.
Every optimization technique examined so far—batching, precision tuning, operator fusion, graph compilation—reduces a single number: the cost of one inference on one machine. Production deployment, however, requires answering a different question: how many machines, of what type, at what total cost? A team that achieves 1,200 images/second on a V100 still needs to know whether 8 V100s at \$3/hour each or 24 T4s at \$0.53/hour each yields lower total cost of ownership for their 5,000 QPS target. Serving costs\index{Serving Economics!infrastructure costs}\index{Serving Costs!request volume scaling} scale with request volume, unlike training costs that scale with dataset size and model complexity [@zhang2019mark]. The intelligence deflation trend shown in @fig-intelligence-deflation intensifies this pressure: as per-token prices collapse by orders of magnitude, the margin on each inference shrinks, making infrastructure efficiency the primary lever for economic viability.

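The 8-V100-versus-24-T4 question above reduces to hourly arithmetic once both fleets are assumed to meet the 5,000 QPS target. The fleet sizes and hourly rates come from the paragraph; the hours-per-month figure is an illustrative assumption.

```python
v100_fleet, v100_rate = 8, 3.00     # GPUs, $/hour (from the text)
t4_fleet,   t4_rate   = 24, 0.53

hours_per_month = 730
v100_monthly = v100_fleet * v100_rate * hours_per_month   # ~$17,520
t4_monthly   = t4_fleet * t4_rate * hours_per_month       # ~$9,286
print(f"V100 fleet: ${v100_monthly:,.0f}/month  |  T4 fleet: ${t4_monthly:,.0f}/month")
print(f"T4 fleet costs {t4_monthly / v100_monthly:.0%} of the V100 fleet")
```
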
### Cost Per Inference {#sec-model-serving-cost-per-inference-27fc}
|
||||
|
||||
@@ -3752,7 +3750,7 @@ from mlsys.formatting import fmt
|
||||
class CostAnalysisCalc:
|
||||
"""ResNet-50 cost analysis: T4 achieves lowest cost-per-image despite higher hourly rate than CPU."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
ca_cpu_cost_value = 0.17
|
||||
ca_cpu_throughput_value = 50
|
||||
ca_t4_cost_value = 0.53
|
||||
@@ -3760,13 +3758,13 @@ class CostAnalysisCalc:
|
||||
ca_v100_cost_value = 3.06
|
||||
ca_v100_throughput_value = 1200
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
ca_cpu_cpm_value = ca_cpu_cost_value / (ca_cpu_throughput_value * SEC_PER_HOUR / MILLION)
|
||||
ca_t4_cpm_value = ca_t4_cost_value / (ca_t4_throughput_value * SEC_PER_HOUR / MILLION)
|
||||
ca_v100_cpm_value = ca_v100_cost_value / (ca_v100_throughput_value * SEC_PER_HOUR / MILLION)
|
||||
ca_v100_price_increase_value = ca_v100_cost_value / ca_t4_cost_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
ca_cpu_cost_str = fmt(ca_cpu_cost_value, precision=2, commas=False)
|
||||
ca_cpu_throughput_str = f"{ca_cpu_throughput_value}"
|
||||
ca_cpu_cpm_str = fmt(ca_cpu_cpm_value, precision=2, commas=False)
|
||||
@@ -3954,7 +3952,7 @@ from mlsys.formatting import fmt
|
||||
class LlmServingCalc:
|
||||
"""Llama-3-8B serving economics: memory capacity bounds throughput, bandwidth bounds latency."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
model_weight_gb_value = 3.5
|
||||
realized_tpot_ms_value = 10 # conservative production target (theoretical min ~1-2ms)
|
||||
decode_tokens_value = 256
|
||||
@@ -3964,7 +3962,7 @@ class LlmServingCalc:
|
||||
ttft_s_value = 0.12
|
||||
hourly_cost_value = 3.00
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
_h100_bw_tb = H100_MEM_BW.m_as(TB / second)
|
||||
token_time_theoretical_ms_value = model_weight_gb_value / (_h100_bw_tb * 1000) * 1000
|
||||
total_decode_s_value = decode_tokens_value * realized_tpot_ms_value / 1000
|
||||
@@ -3975,7 +3973,7 @@ class LlmServingCalc:
|
||||
cost_per_m_tokens_value = hourly_cost_value / (tokens_per_hour_value / MILLION)
|
||||
remaining_vram_gb_value = int(80 - model_weight_gb_value)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
model_weight_gb_str = f"{model_weight_gb_value}"
|
||||
h100_bw_tb_str = fmt(_h100_bw_tb, precision=1, commas=False)
|
||||
token_time_theoretical_ms_str = fmt(token_time_theoretical_ms_value, precision=0, commas=False)
|
||||
@@ -4068,12 +4066,12 @@ from mlsys.formatting import fmt
|
||||
class FallacyLatencyCalc:
|
||||
"""Shows system-level speedup far exceeds model-level speedup due to nonlinear queuing dynamics."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fl_utilization_high_value = 0.8
|
||||
fl_service_slow_ms_value = 5
|
||||
fl_service_fast_ms_value = 2
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# M/M/1: wait = service * rho / (1 - rho)
|
||||
fl_wait_slow_ms_value = fl_service_slow_ms_value * fl_utilization_high_value / (1 - fl_utilization_high_value)
|
||||
fl_total_slow_ms_value = fl_wait_slow_ms_value + fl_service_slow_ms_value
|
||||
@@ -4086,7 +4084,7 @@ class FallacyLatencyCalc:
|
||||
fl_queuing_improvement_value = fl_wait_slow_ms_value / fl_wait_fast_ms_value
|
||||
fl_inference_gain_ms_value = fl_service_slow_ms_value - fl_service_fast_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fl_utilization_high_pct_str = f"{fl_utilization_high_value * 100:.0f}"
|
||||
fl_service_slow_ms_str = f"{fl_service_slow_ms_value}"
|
||||
fl_wait_slow_ms_str = f"{fl_wait_slow_ms_value:.0f}"
|
||||
@@ -4143,12 +4141,12 @@ from mlsys.formatting import fmt
|
||||
class FallacyUtilizationCalc:
|
||||
"""Moving from 70% to 90% utilization cuts costs 22% but triples average latency."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fu_util_high_value = 0.9
|
||||
fu_util_mod_value = 0.7
|
||||
fu_service_ms_value = 5
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# M/M/1: wait factor = rho / (1 - rho); total time = service / (1 - rho)
|
||||
fu_wait_high_factor_value = fu_util_high_value / (1 - fu_util_high_value)
|
||||
fu_cost_reduction_pct_value = (1 - fu_util_mod_value / fu_util_high_value) * 100
|
||||
@@ -4158,7 +4156,7 @@ class FallacyUtilizationCalc:
|
||||
fu_p99_mod_value = 4.6 * fu_service_ms_value / (1 - fu_util_mod_value)
|
||||
fu_p99_high_value = 4.6 * fu_service_ms_value / (1 - fu_util_high_value)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fu_util_high_pct_str = f"{fu_util_high_value * 100:.0f}"
|
||||
fu_util_mod_pct_str = f"{fu_util_mod_value * 100:.0f}"
|
||||
fu_wait_high_factor_str = f"{fu_wait_high_factor_value:.0f}"
|
||||
@@ -4207,16 +4205,16 @@ from mlsys.formatting import fmt
|
||||
class FallacySkewCalc:
|
||||
"""Training-serving skew: 95% validation accuracy drops to 90% from preprocessing mismatches."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fs_val_acc_value = 95.0
|
||||
fs_prod_acc_value = 90.0
|
||||
fs_resize_drop_min_value = 0.5
|
||||
fs_resize_drop_max_value = 1.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fs_acc_drop_value = fs_val_acc_value - fs_prod_acc_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fs_val_acc_str = f"{fs_val_acc_value:.0f}"
|
||||
fs_prod_acc_str = f"{fs_prod_acc_value:.0f}"
|
||||
fs_acc_drop_str = f"{fs_acc_drop_value:.0f}"
|
||||
@@ -4256,18 +4254,18 @@ from mlsys.formatting import fmt
|
||||
class TailLatencyCalc:
|
||||
"""At 70% utilization, p99 latency is 4.6× the mean — invisible to average-based monitoring."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
tl_util_value = 0.7
|
||||
tl_service_ms_value = 5
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# M/M/1: avg time in system = service / (1 - rho)
|
||||
tl_avg_ms_value = tl_service_ms_value / (1 - tl_util_value)
|
||||
# M/M/1 p99 approximation: 4.6 * avg
|
||||
tl_p99_ms_value = 4.6 * tl_avg_ms_value
|
||||
tl_gap_value = tl_p99_ms_value / tl_avg_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tl_util_pct_str = f"{tl_util_value * 100:.0f}"
|
||||
tl_service_ms_str = f"{tl_service_ms_value}"
|
||||
tl_avg_ms_str = f"{tl_avg_ms_value:.0f}"
|
||||
@@ -4310,7 +4308,7 @@ from mlsys.formatting import fmt
|
||||
class FallacyBatchingCalc:
|
||||
"""Batch-16 to batch-32 yields only ~12% more throughput while nearly doubling inference time."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fb_batch_small_value = 16
|
||||
fb_batch_large_value = 32
|
||||
fb_throughput_small_value = 1143 # from earlier table
|
||||
@@ -4320,10 +4318,10 @@ class FallacyBatchingCalc:
|
||||
fb_padding_waste_min_pct_value = 15
|
||||
fb_padding_waste_max_pct_value = 30
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fb_throughput_gain_pct_value = (fb_throughput_large_value / fb_throughput_small_value - 1) * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fb_batch_small_str = f"{fb_batch_small_value}"
|
||||
fb_batch_large_str = f"{fb_batch_large_value}"
|
||||
fb_throughput_gain_str = f"{fb_throughput_gain_pct_value:.0f}"
|
||||
@@ -4368,14 +4366,14 @@ from mlsys.formatting import fmt
|
||||
class FallacyCalibrationCalc:
|
||||
"""INT8 model calibrated on ImageNet drops 3.2pp when serving out-of-distribution wildlife images."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fc_acc_loss_pct_value = 3.2
|
||||
fc_imagenet_acc_value = 76.1 # ResNet-50 INT8 on ImageNet
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
fc_ood_acc_value = fc_imagenet_acc_value - fc_acc_loss_pct_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fc_acc_loss_str = f"{fc_acc_loss_pct_value}"
|
||||
fc_imagenet_acc_str = f"{fc_imagenet_acc_value:.1f}"
|
||||
fc_ood_acc_str = f"{fc_ood_acc_value:.1f}"
|
||||
@@ -4412,17 +4410,17 @@ from mlsys.formatting import fmt
|
||||
class FallacyColdstartCalc:
|
||||
"""Cold starts compound: 10 new instances at 30s compile time = 300s aggregate user-facing delay."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
cs_new_instances_value = 10
|
||||
cs_compile_time_s_value = 30 # TensorRT compilation per instance
|
||||
cs_steady_latency_ms_value = 5
|
||||
cs_cold_latency_ms_value = 500 # first request during cold start
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
cs_aggregate_cold_s_value = cs_new_instances_value * cs_compile_time_s_value
|
||||
cs_cold_multiplier_value = cs_cold_latency_ms_value / cs_steady_latency_ms_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
cs_new_instances_str = f"{cs_new_instances_value}"
|
||||
cs_compile_time_str = f"{cs_compile_time_s_value}"
|
||||
cs_aggregate_cold_str = f"{cs_aggregate_cold_s_value}"
|
||||
@@ -4465,10 +4463,17 @@ The serving principles established here (queuing theory for capacity planning, p

::: {.callout-chapter-connection title="From Node to Factory"}

This chapter engineered the single serving node: latency budgets decomposed each request, queuing theory sized the hardware, batching strategies maximized throughput, and runtime optimization extracted every available microsecond. But a single node is fragile. Models drift as the world changes. Deployments must roll out without downtime. Monitoring must detect the silent accuracy degradation that training-serving skew causes. Scaling events demand orchestration across dozens or hundreds of replicas. In @sec-ml-operations, we scale our perspective from the single request to the full system lifecycle—building the automated machinery (CI/CD pipelines, feature stores, model registries, and observability platforms) that keeps production ML systems running reliably through crashes, model drift, and continuous updates.
This chapter engineered the single serving node: latency budgets decomposed each request, queuing theory sized the hardware, batching strategies maximized throughput, and runtime optimization extracted every available microsecond. A single node, however, is fragile. Models drift as the world changes. Deployments must roll out without downtime. Monitoring must detect the silent accuracy degradation that training-serving skew causes. Scaling events demand orchestration across dozens or hundreds of replicas. In @sec-ml-operations, we scale our perspective from the single request to the full system lifecycle—building the automated machinery (CI/CD pipelines, feature stores, model registries, and observability platforms) that keeps production ML systems running reliably through crashes, model drift, and continuous updates.

:::

<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:serving")
```

@@ -192,7 +192,7 @@ Before using these models as engineering benchmarks, we review their historical
\index{Skip Connection!gradient flow}
\index{He, Kaiming}
**ResNet-50 (Microsoft Research, 2015)**\index{ResNet-50}
The Residual Network (ResNet) [@he2016deep] solved the "vanishing gradient" problem that prevented training very deep networks. By introducing "skip connections" that allow gradients to flow unimpeded, it enabled networks of 50, 100, or even 1000 layers. It won the ImageNet 2015 competition and became the standard "backbone" for computer vision. From a systems perspective, it is a highly regular, compute-intensive workload composed almost entirely of dense convolutions, making it the ideal test for GPU floating-point throughput.
The Residual Network (ResNet) [@he2016deep] solved the "vanishing gradient" problem that prevented training networks beyond ~20 layers. By introducing "skip connections" that allow gradients to flow unimpeded, it enabled networks of 50, 100, or even 1000 layers. It won the ImageNet 2015 competition and became the standard "backbone" for computer vision. From a systems perspective, it is a highly regular, compute-intensive workload composed almost entirely of dense convolutions, making it the ideal test for GPU floating-point throughput.

**GPT-2 (OpenAI, 2019)**\index{GPT-2}
Generative Pre-trained Transformer 2 (GPT-2) demonstrated that scaling up a simple architecture (the Transformer Decoder) on massive datasets could produce coherent text generation. Unlike BERT (which processes text bidirectionally), GPT-2 generates text sequentially (autoregressively[^fn-autoregressive-inference]\index{Autoregressive Generation}), creating a unique memory bandwidth bottleneck where the entire model must be loaded to generate just one token. It serves as our archetype for modern Large Language Models (LLMs)\index{Large Language Model (LLM)} like Llama and ChatGPT.
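
To see why autoregressive decoding is bandwidth-bound, a rough floor on per-token latency follows from dividing the weight bytes streamed per token by the accelerator's memory bandwidth. The sketch below assumes GPT-2's ~1.5 B parameters in FP16 and an illustrative 900 GB/s of HBM bandwidth; these are assumptions for this example, not measured values.

```python
params = 1.5e9            # GPT-2 parameter count (approximate)
bytes_per_param = 2       # FP16 weights
hbm_bw = 900e9            # memory bandwidth in bytes/s (illustrative GPU)

weight_bytes = params * bytes_per_param            # ~3 GB streamed per token
floor_ms = weight_bytes / hbm_bw * 1000
print(f"Lower bound: {floor_ms:.1f} ms per generated token "
      f"(~{1000 / floor_ms:.0f} tokens/s), regardless of available FLOPs")
```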
@@ -230,14 +230,14 @@ The quantitative characteristics of these Lighthouse models expose a critical en
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class WorkloadSignatures:
|
||||
"""
|
||||
Namespace for Workload Signature calculation.
|
||||
Scenario: Comparing Arithmetic Intensity across architectures (FP32).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Values from LighthouseSpecs (approximated for pedagogical clarity)
|
||||
resnet_flops = 4.1e9
|
||||
resnet_bytes = 102e6
|
||||
@@ -248,15 +248,15 @@ class WorkloadSignatures:
|
||||
mobilenet_flops = 300e6
|
||||
mobilenet_bytes = 14e6
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
resnet_i = resnet_flops / resnet_bytes
|
||||
gpt2_i = gpt2_flops_token / gpt2_bytes
|
||||
mobilenet_i = mobilenet_flops / mobilenet_bytes
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(resnet_i > gpt2_i * 50, f"ResNet intensity ({resnet_i:.1f}) should be >> GPT-2 ({gpt2_i:.1f}).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet_intensity_str = fmt(resnet_i, precision=1, commas=False)
|
||||
gpt2_intensity_str = fmt(gpt2_i, precision=2, commas=False)
|
||||
mobilenet_intensity_str = fmt(mobilenet_i, precision=1, commas=False)
|
||||
@@ -307,14 +307,14 @@ from mlsys.constants import (
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LighthouseSpecs:
|
||||
"""
|
||||
Namespace for Lighthouse Model Comparison Table.
|
||||
Aggregates specs for ResNet, GPT-2, DLRM, MobileNet, and KWS.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
m_resnet = Models.ResNet50
|
||||
m_gpt2 = Models.GPT2
|
||||
m_dlrm = Models.DLRM
|
||||
@@ -323,7 +323,7 @@ class LighthouseSpecs:
|
||||
|
||||
hw_a100 = Hardware.Cloud.A100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# ResNet-50
|
||||
resnet_params = m_resnet.parameters.m_as(Mparam)
|
||||
resnet_flops = m_resnet.inference_flops.m_as(GFLOPs)
|
||||
@@ -355,13 +355,13 @@ class LighthouseSpecs:
|
||||
# Reference Hardware
|
||||
a100_mem_gib = hw_a100.memory_capacity.m_as(GiB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
# Ensure numbers match the book's narrative
|
||||
check(abs(resnet_flops - 4.1) <= 0.1, f"ResNet FLOPs {resnet_flops} != 4.1")
|
||||
check(abs(gpt2_params - 1.5) <= 0.1, f"GPT-2 Params {gpt2_params} != 1.5B")
|
||||
check(abs(mobilenet_params - 3.5) <= 0.1, f"MobileNet Params {mobilenet_params} != 3.5M")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet_params_m_str = fmt(resnet_params, precision=1)
|
||||
resnet_gflops_str = fmt(resnet_flops, precision=1)
|
||||
resnet_fp32_mb_str = fmt(resnet_mem_mb, precision=0)
|
||||
@@ -394,18 +394,18 @@ class TransformerScaling:
|
||||
Namespace for Transformer Scaling Laws.
|
||||
Scenario: Memory scaling vs sequence length.
|
||||
"""
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
seq_len_base = 512
|
||||
seq_len_doubled = 1024
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Attention memory scales quadratically with sequence length O(N^2)
|
||||
scaling_ratio = (seq_len_doubled / seq_len_base)**2
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(scaling_ratio == 4.0, f"Quadratic scaling should yield 4x, got {scaling_ratio}x")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
transformer_scaling_ratio_str = fmt(scaling_ratio, precision=0)
|
||||
|
||||
# Note: Use LighthouseSpecs.variable or TransformerScaling.variable directly.
|
||||
@@ -428,7 +428,7 @@ The "Bottleneck" column in @tbl-lighthouse-comparison deserves particular attent
|
||||
\index{GPT-2!arithmetic intensity}
|
||||
Architecture selection is ultimately an engineering trade-off between **Math** ($O$) and **Memory Movement** ($D_{vol}$). By comparing our Lighthouses, we can see how architectural choices shift a model's position on the intensity spectrum:
|
||||
|
||||
- **ResNet-50 (Compute-Bound)**: High intensity ($\approx 50\text{--}200+$ FLOPs/byte, varying by layer). Convolutional layers reuse each weight many times across the spatial dimensions of an image. Deep bottleneck layers achieve very high intensity, while early layers are lower. Its performance is limited by how fast the hardware can do math.
|
||||
- **ResNet-50 (Compute-Bound)**: High intensity ($\approx 50\text{--}200+$ FLOPs/byte, varying by layer). Convolutional layers reuse each weight many times across the spatial dimensions of an image. Deep bottleneck layers achieve intensity above 200, while early layers are lower. Its performance is limited by how fast the hardware can do math.
|
||||
- **GPT-2 (Bandwidth-Bound)**: Low intensity ($\approx 1$ FLOPs/byte). Each token produces only a matrix-vector multiplication rather than the matrix-matrix operations of batch processing, so the system must load massive weights from memory for a single token’s math. Its performance is limited by how fast memory can move bits.
|
||||
- **MobileNet (Memory-Bound on GPUs)**: Low intensity ($\approx 1\text{--}10$ FLOPs/byte, with depthwise layers at the low end). MobileNet reduces total $O$ through depthwise separable convolutions, but it moves more data relative to that work. It fits mobile hardware perfectly but often "starves" high-end GPUs optimized for dense math.
|
||||
|
||||
@@ -442,7 +442,7 @@ Match the architectural choice to its systems implication:
|
||||
- [ ] **Sequential Attention (GPT)**: Decreases arithmetic intensity by loading weights per-token rather than per-batch.
|
||||
:::
|
||||
|
||||
With these quantitative reference points established, we now examine each architectural family in detail, starting with the foundational Multi-Layer Perceptron, the architecture that established the computational patterns underlying all modern neural networks. From there, we progress through increasingly specialized designs: CNNs that exploit spatial structure, RNNs that capture temporal dependencies, attention mechanisms that enable dynamic relevance weighting, Transformers that build entire architectures from attention, and finally DLRM that handles massive categorical features. Each architecture represents a different answer to the same fundamental question: *how* should we structure computation to match the patterns in our data?
|
||||
The quantitative reference points above set the stage for a detailed examination of each architectural family, starting with the foundational Multi-Layer Perceptron, the architecture that established the computational patterns underlying all modern neural networks. From there, we progress through increasingly specialized designs: CNNs that exploit spatial structure, RNNs that capture temporal dependencies, attention mechanisms that enable dynamic relevance weighting, Transformers that build entire architectures from attention, and finally DLRM that handles massive categorical features. Each architecture represents a different answer to the same fundamental question: *how* should we structure computation to match the patterns in our data?
|
||||
|
||||
For each family, we follow a consistent analysis: what data patterns the architecture targets (*Pattern Processing Needs*), how it computes (*Algorithmic Structure*), how those computations map to hardware (*Computational Mapping*), and what system bottlenecks emerge (*System Implications*). This four-part lens ensures that every architecture is evaluated not just for what it learns, but for what it costs to run.
|
||||
|
||||
@@ -495,25 +495,25 @@ In practice, the UAT explains *why* MLPs succeed across diverse tasks while reve
|
||||
from mlsys.constants import MNIST_IMAGE_WIDTH, MNIST_IMAGE_HEIGHT
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MLPDim:
|
||||
"""
|
||||
Namespace for MLP Dimensionality Examples.
|
||||
Scenario: Input vector size for flattened images.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
mnist_w = MNIST_IMAGE_WIDTH
|
||||
mnist_h = MNIST_IMAGE_HEIGHT
|
||||
|
||||
std_w = 256
|
||||
std_h = 256
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
mnist_dim = mnist_w * mnist_h
|
||||
std_dim = std_w * std_h
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
mlp_28_dim_str = fmt(mnist_dim, precision=0, commas=True)
|
||||
mlp_256_dim_str = fmt(std_dim, precision=0, commas=True)
|
||||
|
||||
@@ -560,14 +560,14 @@ The classic *MNIST* handwritten digit benchmark illustrates this gap between *re
|
||||
|
||||
from mlsys.constants import param, Mparam, Kparam
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MLPvsCNN:
|
||||
"""
|
||||
Namespace for MNIST Parameter Comparison.
|
||||
Scenario: Comparing a naive MLP vs a CNN for the same task.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# MLP: 784 -> 4096 -> 4096 -> 10
|
||||
mlp_in = 784
|
||||
mlp_h = 4096
|
||||
@@ -578,7 +578,7 @@ class MLPvsCNN:
|
||||
c2_k, c2_out = 3, 64
|
||||
fc1_in, fc1_out = 64*7*7, 128
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# MLP Params
|
||||
mlp_p = (mlp_in * mlp_h) + (mlp_h * mlp_h) + (mlp_h * mlp_out)
|
||||
|
||||
@@ -590,10 +590,10 @@ class MLPvsCNN:
|
||||
|
||||
ratio = mlp_p // cnn_p
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
check(ratio >= 10, f"MLP ({mlp_p}) isn't significantly larger than CNN ({cnn_p}). Ratio: {ratio}x")
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(ratio >= 10, f"MLP ({mlp_p}) is not significantly larger than CNN ({cnn_p}). Ratio: {ratio}x")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
mlp_params_str = f"{(mlp_p * param).m_as(Mparam):.0f}M"
|
||||
cnn_params_str = f"{(cnn_p * param).m_as(Kparam):.0f}K"
|
||||
param_ratio_str = f"{ratio}"
|
||||
@@ -635,7 +635,7 @@ The learnability gap motivates the core design principle of this chapter: embed
|
||||
|
||||
[^fn-nfl-theorem]: **No Free Lunch Theorem**: Wolpert and Macready's 1997 result proved that no optimization algorithm outperforms random search across *all* possible problems -- averaged over every conceivable function, all algorithms are equivalent. The ML systems consequence: every inductive bias (locality, equivariance, attention) improves performance on problems matching that bias while necessarily degrading performance on problems that violate it, making architecture selection an irreversible engineering commitment to a problem class. \index{No Free Lunch Theorem!architecture selection}
|
||||
|
||||
These theoretical insights translate directly into engineering decisions. Appropriate inductive biases reduce parameter counts (enabling edge deployment), accelerate convergence (reducing training costs), and produce structured computation patterns that map efficiently to specialized hardware (@sec-hardware-acceleration). A `{python} MLPvsCNN.mlp_params_str`-parameter MLP infeasible for edge deployment becomes a `{python} MLPvsCNN.cnn_params_str`-parameter CNN that fits comfortably, a `{python} MLPvsCNN.param_ratio_str`$\times$ reduction achieved by matching architecture to data structure. With this motivation established, we now examine the specific pattern processing requirements that dense architectures address.
|
||||
These theoretical insights translate directly into engineering decisions. Appropriate inductive biases reduce parameter counts (enabling edge deployment), accelerate convergence (reducing training costs), and produce structured computation patterns that map efficiently to specialized hardware (@sec-hardware-acceleration). A `{python} MLPvsCNN.mlp_params_str`-parameter MLP infeasible for edge deployment becomes a `{python} MLPvsCNN.cnn_params_str`-parameter CNN that fits comfortably, a `{python} MLPvsCNN.param_ratio_str`$\times$ reduction achieved by matching architecture to data structure. The next question is what specific pattern processing requirements dense architectures address.
|
||||
|
||||
\index{Fully Connected Layer}
|
||||
|
||||
@@ -854,18 +854,18 @@ The algorithmic structure above defines *what* an MLP computes; computational ma
|
||||
from mlsys.constants import MNIST_IMAGE_WIDTH, MNIST_IMAGE_HEIGHT
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MLPLayerStats:
|
||||
"""
|
||||
Namespace for MNIST MLP Computation Costs.
|
||||
Scenario: Single dense layer (784 -> 100).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
input_dim = MNIST_IMAGE_WIDTH * MNIST_IMAGE_HEIGHT # 784
|
||||
hidden_dim = 100
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# MACs = Input * Output
|
||||
macs = input_dim * hidden_dim
|
||||
|
||||
@@ -873,7 +873,7 @@ class MLPLayerStats:
|
||||
# Note: Output write is negligible compared to reads
|
||||
mem_acc_per_neuron = input_dim + input_dim # 2 * input_dim
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
mnist_mlp_macs_str = fmt(macs, precision=0, commas=True)
|
||||
mnist_neuron_mem_acc_str = fmt(mem_acc_per_neuron, precision=0, commas=True)
|
||||
|
||||
@@ -954,21 +954,21 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class A100Specs:
|
||||
"""
|
||||
Namespace for A100 Tensor Core Specs.
|
||||
Scenario: Comparing mixed-precision throughput.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# A100 performance at various precisions
|
||||
fp16_tensor = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs/second)
|
||||
int8_tensor = A100_FLOPS_INT8.m_as(TFLOPs/second)
|
||||
fp32_cuda = A100_FLOPS_FP32.m_as(TFLOPs/second)
|
||||
tf32_tensor = A100_FLOPS_TF32.m_as(TFLOPs/second)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
a100_tflops_fp16_str = fmt(fp16_tensor, precision=0, commas=False)
|
||||
a100_tflops_int8_str = fmt(int8_tensor, precision=0, commas=False)
|
||||
a100_tflops_fp32_str = fmt(fp32_cuda, precision=1, commas=False)
|
||||
@@ -1006,24 +1006,24 @@ from mlsys.constants import param, BYTES_FP32, MB, MILLION
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LargeMLP:
|
||||
"""
|
||||
Namespace for Large MLP Memory Scaling.
|
||||
Scenario: The O(N^2) cost of dense layers (2048 -> 2048).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
width = 2048
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
params = width * width
|
||||
mem_mb = model_memory(params * param, BYTES_FP32, MB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(params >= MILLION, f"{width}x{width} layer should be large (>1M params).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
dim_str = f"{width}"
|
||||
params_str = fmt(params, precision=0, commas=False)
|
||||
mem_str = fmt(mem_mb, precision=0, commas=False)
|
||||
@@ -1047,7 +1047,7 @@ $$
|
||||
\text{Intensity} \approx \frac{2 \cdot M \cdot N \text{ (Ops)}}{4 \cdot M \cdot N \text{ (Bytes)}} = 0.5 \text{ FLOPs/byte}
|
||||
$$ {#eq-dense-intensity}
|
||||
|
||||
Since modern accelerators (like the A100) require intensities >100 FLOPs/byte to saturate compute units, dense layers are almost always memory-bandwidth-bound unless batch sizes are very large. This explains why "fully connected" layers are often the performance bottleneck in inference workloads, despite performing fewer total FLOPs than convolutional layers.
|
||||
Since modern accelerators (like the A100) require intensities >100 FLOPs/byte to saturate compute units, dense layers are almost always memory-bandwidth-bound unless batch sizes exceed several hundred. This explains why "fully connected" layers are often the performance bottleneck in inference workloads, despite performing fewer total FLOPs than convolutional layers.
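The crossover point can be checked with a short back-of-the-envelope script. This is a sketch rather than one of the chapter's computed cells: the 2048 × 2048 layer and the FP32 traffic model (weight matrix plus input and output activations, each counted once) are illustrative assumptions.

```python
# Sketch: arithmetic intensity of a dense layer (M inputs -> N outputs) vs. batch size.
# Assumes FP32 operands; traffic counts the weight matrix plus input/output activations once each.
def dense_intensity(batch, m=2048, n=2048, bytes_per_elem=4):
    flops = 2 * batch * m * n                                   # one multiply + one add per weight per sample
    traffic = bytes_per_elem * (m * n + batch * m + batch * n)  # weights + inputs + outputs
    return flops / traffic

for b in (1, 8, 64, 256, 512):
    print(f"batch={b:4d}  intensity ~ {dense_intensity(b):6.1f} FLOPs/byte")
# batch=1 sits near 0.5 FLOPs/byte; the layer only approaches the ~100 FLOPs/byte an
# A100-class accelerator needs to stay compute-bound once the batch reaches the mid-hundreds.
```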
|
||||
Dense connectivity thus moves maximum data for minimum compute. For data with inherent structure, spatial locality in images or temporal order in sequences, specialized architectures can exploit that structure for both better accuracy and better efficiency. The most established such architecture is the convolutional neural network.
|
||||
|
||||
@@ -1320,14 +1320,14 @@ at($(8-CH3.70)!0.3!(8-CH3.290)$){};
|
||||
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CNNSharing:
|
||||
"""
|
||||
Namespace for CNN Parameter Sharing Comparison.
|
||||
Scenario: MLP vs CNN params for 224 × 224 × 3 input.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# ImageNet input dimensions
|
||||
img_dim = 224
|
||||
img_channels = 3 # RGB
|
||||
@@ -1335,7 +1335,7 @@ class CNNSharing:
|
||||
# CNN: 3x3 filter
|
||||
cnn_k = 3
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# MLP: fully connected to all pixels
|
||||
mlp_img_params = img_dim * img_dim * img_channels
|
||||
|
||||
@@ -1345,7 +1345,7 @@ class CNNSharing:
|
||||
# Parameter reduction
|
||||
reduction = mlp_img_params / cnn_filter_params
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
mlp_img_params_str = fmt(mlp_img_params, precision=0, commas=True)
|
||||
cnn_filter_params_str = f"{cnn_filter_params}"
|
||||
cnn_param_reduction_str = fmt(reduction, precision=0, commas=True)
|
||||
@@ -1381,7 +1381,7 @@ To illustrate, consider applying a CNN to the same MNIST images used in our MLP
|
||||
|
||||
This algorithmic structure directly implements the requirements for spatial pattern processing, creating distinct computational patterns that influence system design. Unlike MLPs, convolutional networks preserve spatial locality, using the hierarchical feature extraction principles established above. These properties drive architectural optimizations in AI accelerators, where operations such as data reuse, tiling, and parallel filter computation are important for performance.
|
||||
|
||||
**Translation equivariance** is central to understanding why CNNs work effectively for spatial data: shifting the input shifts the output feature map correspondingly. We examine this property in four stages: the equivariance-invariance distinction, the mathematical formulation, the group theory generalization, and the systems implications for deployment.
|
||||
The property of **translation equivariance** is central to understanding why CNNs work effectively for spatial data: shifting the input shifts the output feature map correspondingly. We examine this property in four stages: the equivariance-invariance distinction, the mathematical formulation, the group theory generalization, and the systems implications for deployment.
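Before the formal treatment, a short NumPy check makes the property concrete. The signal and kernel are arbitrary illustrative values, and the nonzero pattern is kept away from the array edges so zero padding does not disturb the comparison.

```python
import numpy as np

# Sketch: translation equivariance of a 1-D convolution.
signal = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
kernel = np.array([1., 0., -1.])
conv = lambda x: np.convolve(x, kernel, mode="same")

shifted_input = np.roll(signal, 2)           # shift the input by two positions
print(np.allclose(np.roll(conv(signal), 2),  # shift-after-convolve ...
                  conv(shifted_input)))      # ... equals convolve-after-shift -> True
```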
|
||||
Equivariance and invariance are related but distinct concepts that determine how architectures handle transformations. Equivariance means that transforming the input produces the same transformation in the output, as defined in @eq-equivariance:
|
||||
|
||||
@@ -1740,20 +1740,20 @@ The sliding window and im2col transformations above reveal *how* CNNs compute; t
|
||||
from mlsys.constants import IMAGE_DIM_RESNET
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CNNSysImpl:
|
||||
"""
|
||||
Namespace for CNN System Implications calculations.
|
||||
Scenario: ImageNet (224×224) with 3×3 filters and 64 output channels.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
img_dim = IMAGE_DIM_RESNET # 224
|
||||
kernel_size = 3
|
||||
c_out = 64 # output channels
|
||||
c_in_single = 1 # single input channel (for weight illustration)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Weight parameters for single input channel: K × K × C_in × C_out
|
||||
weights_single_ch = kernel_size * kernel_size * c_in_single * c_out # 576
|
||||
|
||||
@@ -1763,7 +1763,7 @@ class CNNSysImpl:
|
||||
# Activation values: H × W × C_out
|
||||
activations = img_dim * img_dim * c_out # 3,211,264
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(weights_single_ch == 576,
|
||||
f"Expected 576 weight params, got {weights_single_ch}")
|
||||
check(spatial_positions == 50_176,
|
||||
@@ -1771,7 +1771,7 @@ class CNNSysImpl:
|
||||
check(activations == 3_211_264,
|
||||
f"Expected 3,211,264 activations, got {activations}")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
weights_single_ch_str = fmt(weights_single_ch, precision=0, commas=True)
|
||||
spatial_positions_str = fmt(spatial_positions, precision=0, commas=True)
|
||||
activations_str = fmt(activations / 1e6, precision=1) # "3.2" (millions)
|
||||
@@ -1969,18 +1969,18 @@ RNN sequential processing creates computational patterns different from both MLP
|
||||
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class RNNCompute:
|
||||
"""
|
||||
Namespace for RNN Computation Costs.
|
||||
Scenario: Per-step MACs for a standard RNN layer (100 input, 128 hidden).
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
input_dim = 100
|
||||
hidden_dim = 128
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Recurrent: h_prev x W_hh (H x H)
|
||||
macs_recurrent = hidden_dim * hidden_dim
|
||||
|
||||
@@ -1989,10 +1989,10 @@ class RNNCompute:
|
||||
|
||||
macs_total = macs_recurrent + macs_input
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(macs_recurrent > macs_input, f"Recurrent cost ({macs_recurrent}) should dominate Input cost ({macs_input}) for large hidden states.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
recurrent_str = fmt(macs_recurrent, precision=0, commas=True)
|
||||
input_str = fmt(macs_input, precision=0, commas=True)
|
||||
total_str = fmt(macs_total, precision=0, commas=True)
|
||||
@@ -2482,17 +2482,17 @@ from mlsys.formatting import fmt, check
|
||||
class AttentionComputeCosts:
|
||||
"""Demonstrate quadratic compute cost of self-attention at sequence length 512."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
seq_len = 512 # sequence length
|
||||
head_dim = 64 # dimension per head
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
score_macs = seq_len * seq_len * head_dim
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(score_macs > MILLION, "Attention MACs should exceed 1M for seq_len=512.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
attn_score_macs_m_str = fmt(score_macs / MILLION, precision=1, commas=False) # e.g. "16.8"
|
||||
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
@@ -2573,7 +2573,7 @@ Attention mechanisms require storage for attention weights, key-query-value proj
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import BYTES_FP16, byte, GB
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TransformerComplexityAnchor:
|
||||
"""
|
||||
Namespace for Transformer quadratic scaling anchor.
|
||||
@@ -2584,20 +2584,20 @@ class TransformerComplexityAnchor:
|
||||
# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
|
||||
transformer_scaling_ratio_str = TransformerComplexityAnchor.scaling_ratio_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AttentionMemory:
|
||||
"""
|
||||
Namespace for Quadratic Attention Memory Calculation.
|
||||
Scenario: 100K context window memory cost.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
seq_len = 100_000
|
||||
bytes_per_element = BYTES_FP16.m_as(byte)
|
||||
num_layers = 32
|
||||
num_heads = 12
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# N^2 * heads
|
||||
elements = seq_len * seq_len * num_heads
|
||||
|
||||
@@ -2607,10 +2607,10 @@ class AttentionMemory:
|
||||
single_layer_gb = (single_layer_bytes / 1e9)
|
||||
total_gb = single_layer_gb * num_layers
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_gb >= 100, f"Attention memory ({total_gb:.1f} GB) is too small for a 'bottleneck'.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
seq_len_str = f"{seq_len:,}"
|
||||
elements_str = f"{elements:.1e}".replace("+", "")
|
||||
bytes_per_element_str = f"{bytes_per_element}"
|
||||
@@ -2656,7 +2656,7 @@ Attention computation divides into two main phases: generating attention weights
|
||||
|
||||
Data movement in attention mechanisms presents challenges distinct from all previous architectures. Each attention operation requires projecting and moving query, key, and value vectors for every position in the sequence, then storing and accessing the full $N \times N$ attention weight matrix, and finally coordinating value vector movement during the weighted combination phase. These intermediate attention weights become a major factor in system bandwidth requirements. Unlike the predictable spatial access patterns of CNNs or the sequential access of RNNs, attention operations require frequent movement of dynamically computed weights across the memory hierarchy—a pattern that defeats simple caching strategies.
|
||||
|
||||
These distinctive memory, computation, and data movement characteristics shape system design in fundamental ways. But they also raise a natural question: if attention provides such powerful dynamic connectivity, could it replace other architectural components entirely?
|
||||
These distinctive memory, computation, and data movement characteristics shape system design in fundamental ways. They also raise a natural question: if attention provides such effective dynamic connectivity, could it replace other architectural components entirely?
|
||||
|
||||
::: {.callout-checkpoint title="Quadratic Scaling Intuition" collapse="false"}
|
||||
Modern AI scaling is defined by the cost of Attention. Verify your intuition:
|
||||
@@ -2666,7 +2666,7 @@ Modern AI scaling is defined by the cost of Attention. Verify your intuition:
|
||||
:::
|
||||
|
||||
::: {.callout-war-story title="The Quadratic Wall"}
|
||||
**The Context**: When Google released BERT in 2018, it revolutionized NLP. However, the engineering team strictly limited the input sequence length to 512 tokens, despite users clamoring for longer context to process full documents.
|
||||
**The Context**: When Google released BERT in 2018, it set new accuracy records across 11 NLP benchmarks. However, the engineering team strictly limited the input sequence length to 512 tokens, despite users demanding longer context to process full documents.
|
||||
|
||||
**The Failure**: This was not a product decision; it was a physics decision. The self-attention mechanism's memory requirement scales quadratically ($O(N^2)$). Doubling the context from 512 to 1024 would quadruple the memory; increasing it to a modest 4,000 tokens (for a short article) would increase memory usage by $64 \times$.
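The ratios are easy to verify with a two-line sketch; 4,096 tokens stands in here for the "short article" length, and only the $O(N^2)$ score matrix is assumed to change with context length.

```python
# Sketch: attention-memory cost relative to BERT's 512-token limit.
base = 512
for n in (512, 1024, 4096):
    print(f"{n:5d} tokens -> {(n / base) ** 2:3.0f}x the attention memory of {base} tokens")
```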
|
||||
@@ -2686,7 +2686,7 @@ The attention mechanism analyzed above provides the computational primitive—dy
|
||||
While the attention mechanisms examined above introduced dynamic connectivity, they were initially applied as additions to existing architectures, particularly RNNs for sequence-to-sequence tasks. This hybrid approach still suffered from the inherent limitations of recurrent architectures: sequential processing constraints that prevented efficient parallelization and difficulties with very long sequences. The breakthrough insight was recognizing that attention mechanisms alone could replace both convolutional and recurrent processing entirely -- eliminating the sequential bottleneck while preserving dynamic pattern processing.
|
||||
|
||||
\index{Vaswani, Ashish}
|
||||
Transformers, introduced in the landmark "Attention is All You Need" paper by @vaswani2017attention, embody a revolutionary inductive bias: **they assume no prior structure but allow the model to learn all pairwise relationships dynamically based on content**. Rather than adding attention to RNNs, Transformers built the entire architecture around attention mechanisms, introducing self-attention as the primary computational pattern. This architectural decision traded the parameter efficiency of CNNs and the sequential coherence of RNNs for maximum flexibility and parallelizability.
|
||||
Transformers, introduced in the "Attention is All You Need" paper by @vaswani2017attention, embody a fundamentally different inductive bias: **they assume no prior structure but allow the model to learn all pairwise relationships dynamically based on content**. Rather than adding attention to RNNs, Transformers built the entire architecture around attention mechanisms, introducing self-attention as the primary computational pattern. This architectural decision traded the parameter efficiency of CNNs and the sequential coherence of RNNs for maximum flexibility and parallelizability.
|
||||
|
||||
The progression from MLPs that connect everything, to CNNs that connect locally, to RNNs that connect sequentially, to Transformers that connect dynamically based on learned content relationships illustrates how each iteration refined the balance between flexibility and efficiency.
|
||||
|
||||
@@ -2861,7 +2861,7 @@ def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads, d_k):
|
||||
```
|
||||
:::
|
||||
|
||||
The self-attention implementation above shows how Transformers process entire sequences in parallel. But what happens at inference time, when the model generates tokens one at a time? GPT-2 XL reveals the answer.
|
||||
The self-attention implementation above shows how Transformers process entire sequences in parallel. The picture changes at inference time, when the model generates tokens one at a time. GPT-2 XL reveals the answer.
|
||||
|
||||
::: {.callout-lighthouse title="GPT-2 XL (Bandwidth Lighthouse)"}
|
||||
**Why it matters:** GPT-2 XL exemplifies **memory-bandwidth-bound** workloads. During autoregressive inference, the model must load all `{python} LighthouseSpecs.gpt2_fp32_gb_str` GB of weights from HBM for *every generated token*, while performing only a single matrix-vector multiply per layer. The arithmetic intensity drops to $\approx 1$ Op/Byte, leaving compute cores idle while waiting for memory. This contrasts with ResNet-50 (compute-bound, high weight reuse) and DLRM (capacity-bound, random access).
|
||||
@@ -3005,26 +3005,26 @@ The **DLRM** architecture [@naumov2019deep] standardizes this pattern, combining
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import BYTES_FP32, byte, GB
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class DLRMEmbedding:
|
||||
"""
|
||||
Namespace for DLRM Embedding Table calculation.
|
||||
Scenario: 1 Billion users x 128 dim x FP32 = Capacity Wall.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
num_users = 1_000_000_000
|
||||
embed_dim = 128
|
||||
bytes_per_param = 4 # FP32
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
table_bytes = num_users * embed_dim * bytes_per_param
|
||||
table_gb = (table_bytes * byte).m_as(GB)
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(table_gb >= 80, f"DLRM table ({table_gb:.1f} GB) fits on an A100. It must be larger to justify model parallelism.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
embed_table_gb_str = fmt(table_gb, precision=0, commas=False)
|
||||
|
||||
# Note: Use DLRMEmbedding.embed_table_gb_str directly.
|
||||
@@ -3087,25 +3087,25 @@ This dependency creates an **All-to-All**\index{All-to-All Communication} commun
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import BYTES_FP32, byte, GB, A100_MEM_CAPACITY, GiB
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CapacityWall:
|
||||
"""
|
||||
Namespace for Capacity Wall Calculation.
|
||||
Scenario: 100M items x 128 dim embedding.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
num_items = 100_000_000
|
||||
embed_dim = 128
|
||||
bytes_per_param = BYTES_FP32.m_as(byte)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
table_bytes = num_items * embed_dim * bytes_per_param
|
||||
table_gb = (table_bytes * byte).m_as(GB)
|
||||
a100_capacity_gb = A100_MEM_CAPACITY.m_as(GB)
|
||||
utilization_pct = (table_gb / a100_capacity_gb) * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
cw_num_items_str = fmt(num_items / MILLION, precision=0, commas=False)
|
||||
cw_embed_dim_str = f"{embed_dim}"
|
||||
cw_table_gb_str = fmt(table_gb, precision=1, commas=False)
|
||||
@@ -3175,7 +3175,7 @@ Dense connectivity also established the cost baseline that every subsequent arch
|
||||
|
||||
Parameter sharing (born in CNNs) made deep networks efficient, but efficiency alone could not solve the challenges of training them. As practitioners attempted to build deeper CNNs for more complex tasks, they encountered a barrier that now confronts *every* deep architecture: the gradient flow problem.
|
||||
|
||||
Before examining the architectural innovations that enabled training very deep networks, we must understand the challenge that depth creates: the gradient flow problem. This subsection provides the mathematical foundations for understanding why skip connections became essential, covering vanishing gradients, exploding gradients, the limitations of ReLU, and the residual solution that enabled networks exceeding 100 layers.
|
||||
Before examining the architectural innovations that made networks beyond 100 layers trainable, we must understand the challenge that depth creates: the gradient flow problem. This subsection provides the mathematical foundations for understanding why skip connections became essential, covering vanishing gradients, exploding gradients, the limitations of ReLU, and the residual solution that makes such depth practical.
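As a preview of that argument, a deliberately crude sketch shows the effect. It models each layer as scaling the backward signal by a constant factor (a stand-in for the product of weight and activation derivatives at that layer), which is enough to show why depth starves early layers of gradient and why an identity path changes the picture.

```python
# Sketch: gradient magnitude reaching the input after `depth` layers, assuming each
# layer multiplies the backward signal by a constant factor g < 1 (crude model).
def gradient_at_input(depth, g=0.9):
    return g ** depth

for depth in (10, 50, 100):
    plain = gradient_at_input(depth)
    print(f"{depth:3d} layers: plain chain ~ {plain:.1e}, "
          f"with an identity shortcut around the stack ~ {1 + plain:.2f}")
# A residual path backpropagates through (1 + dF/dx), so the identity term keeps the
# signal near 1 even when the stacked transformations alone shrink it toward zero;
# per-block residuals compound the same effect.
```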
|
||||
#### The Problem of Depth {#sec-network-architectures-problem-depth-15bc}
|
||||
|
||||
@@ -3302,18 +3302,18 @@ from mlsys.formatting import fmt, check
|
||||
class ResNetSkipOverhead:
|
||||
"""Quantify systems cost of residual connections: ~20% memory overhead."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
memory_overhead_pct = 20 # activation storage
|
||||
epoch_cost_pct = 10 # per-epoch compute
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# Values are empirical anchors; no derived calculation needed.
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(0 < memory_overhead_pct < 100, "Memory overhead must be a valid percentage.")
|
||||
check(0 < epoch_cost_pct < 100, "Epoch cost must be a valid percentage.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
skip_memory_overhead_pct_str = fmt(memory_overhead_pct, precision=0, commas=False) # e.g. "20"
|
||||
skip_epoch_cost_pct_str = fmt(epoch_cost_pct, precision=0, commas=False) # e.g. "10"
|
||||
|
||||
@@ -3387,7 +3387,7 @@ Normalization enables significantly higher learning rates. Networks with batch n
|
||||
|
||||
\index{Layer Normalization}\index{Normalization!layer}
|
||||
\index{Ba, Jimmy Lei}
|
||||
While batch normalization proved transformative for CNNs, it introduced a problematic dependency on batch statistics. This creates issues for small batch sizes (noisy statistics), varying sequence lengths (incompatible batch dimensions), and inference (requires running mean/variance estimation). Layer normalization addresses these limitations by normalizing across features rather than across the batch [@ba2016layer].
|
||||
While batch normalization enabled training of much deeper CNNs, it introduced a problematic dependency on batch statistics. This creates issues for small batch sizes (noisy statistics), varying sequence lengths (incompatible batch dimensions), and inference (requires running mean/variance estimation). Layer normalization addresses these limitations by normalizing across features rather than across the batch [@ba2016layer].
|
||||
|
||||
For an input vector $\mathbf{x} \in \mathbb{R}^H$ with $H$ features:
|
||||
|
||||
@@ -3419,7 +3419,7 @@ The choice between normalization variants depends on computational context. @tbl
|
||||
|
||||
: **Normalization Variant Comparison**\index{RMSNorm}\index{Normalization!RMSNorm}: Different normalization techniques trade off between computational efficiency, batch size sensitivity, and architectural compatibility. RMSNorm [@zhang2019root], used in LLaMA and other efficient architectures, omits mean centering: $\text{RMSNorm}(\mathbf{x}) = \mathbf{x} / \sqrt{\frac{1}{H}\sum_i x_i^2 + \epsilon} \cdot \boldsymbol{\gamma}$. {#tbl-normalization-comparison}
|
||||
|
||||
Batch size constraints emerge because batch normalization requires sufficiently large batches for stable statistics. Empirically, batch sizes below 16 degrade performance noticeably, and sizes below 8 can cause training instability. This constraint impacts memory-limited scenarios such as high-resolution images or very large models.
|
||||
Batch size constraints emerge because batch normalization requires sufficiently large batches for stable statistics. Empirically, batch sizes below 16 degrade performance noticeably, and sizes below 8 can cause training instability. This constraint impacts memory-limited scenarios such as high-resolution images or billion-parameter models.
|
||||
|
||||
The computational cost of computing mean and variance adds $O(m \times H)$ operations per batch normalization layer for batch size $m$ and feature dimension $H$. For layer normalization, the cost is $O(H)$ per sample. RMSNorm reduces this further by eliminating the mean computation.
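A minimal NumPy sketch of the per-sample computations (with the learned scale and shift folded out for brevity, and an illustrative epsilon) shows where RMSNorm saves work relative to layer normalization:

```python
import numpy as np

# Sketch: per-sample LayerNorm vs. RMSNorm over H features (gamma/beta omitted).
def layer_norm(x, eps=1e-5):
    mu = x.mean()                         # first pass: mean
    var = ((x - mu) ** 2).mean()          # second pass: variance
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    rms = np.sqrt((x ** 2).mean() + eps)  # single statistic, no mean-centering pass
    return x / rms

x = np.random.randn(4096).astype(np.float32)
print(layer_norm(x).std(), rms_norm(x).std())  # both come out near 1
```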
|
||||
@@ -3507,7 +3507,7 @@ draw=BrownLine,fill=BrownL,line width=0.75pt] (add) {};
|
||||
```
|
||||
:::
|
||||
|
||||
This recombination is not accidental. The transition from RNNs to Transformers represents a decisive engineering shift from sequential to parallel state management. By replacing time-step dependencies with global, data-dependent routing (attention), we moved from $O(n)$ sequential complexity to $O(1)$ sequential steps for information flow between any two positions, enabling full use of the massive parallel processing power of modern accelerators. But the *other* building blocks—GEMM, skip connections, normalization—carried over unchanged.
|
||||
This recombination is not accidental. The transition from RNNs to Transformers represents a decisive engineering shift from sequential to parallel state management. By replacing time-step dependencies with global, data-dependent routing (attention), we moved from $O(n)$ sequential complexity to $O(1)$ sequential steps for information flow between any two positions, enabling full use of the massive parallel processing capacity of modern accelerators. The *other* building blocks, however, carried over unchanged: GEMM, skip connections, and normalization remain essential across all families.
|
||||
|
||||
This portability is the central lesson. Recent innovations continue the same pattern: Vision Transformers[^fn-vision-transformers] adapt the Transformer to images while maintaining all four building blocks [@dosovitskiy2021image]. Large language models scale up these patterns while introducing refinements like grouped-query attention or sliding window attention, yet still rely on the same core primitives [@brown2020language]. Practical implementation challenges and optimizations are explored in @sec-model-compression.
|
||||
|
||||
@@ -3530,7 +3530,7 @@ With the architectural building blocks established, we now examine the lower-lev
|
||||
|
||||
## Computational Primitives {#sec-network-architectures-systemlevel-building-blocks-41c5}
|
||||
|
||||
The preceding section identified the *architectural* building blocks that practitioners choose when designing models—GEMM layers, skip connections, normalization, gating. This section drops one level deeper to the *system* building blocks: the computational, memory access, and data movement primitives that hardware and software must actually execute. While earlier sections analyzed each architecture's system implications individually, here we synthesize those insights into a unified view that reveals common optimization opportunities.
|
||||
A ResNet-50 forward pass executes billions of multiply-accumulate operations; a Transformer attention layer moves gigabytes through memory hierarchies; a DLRM lookup scatters random reads across terabyte-scale tables. Despite their architectural differences, all three reduce to a small set of *computational primitives* that hardware and software must actually execute. Synthesizing the per-architecture system implications from earlier sections into a unified view reveals common optimization opportunities.
|
||||
|
||||
These primitives represent operations that cannot be decomposed further while maintaining their essential characteristics. Understanding them reveals *where* performance bottlenecks arise on specific hardware and guides the optimization strategies detailed in @sec-hardware-acceleration.
|
||||
|
||||
@@ -3807,17 +3807,17 @@ from mlsys.formatting import fmt, check
|
||||
class EnergyConsumptionAnalysis:
|
||||
"""Contrast energy cost of compute vs. data movement: DRAM access is ~5x more costly."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
mac_pj = 4.6 # pJ per MAC (Horowitz 2014, 45nm)
|
||||
dram_pj = ENERGY_DRAM_ACCESS_PJ.m_as(ureg.picojoule) # pJ per 32-bit access
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
dram_to_mac_ratio = dram_pj / mac_pj
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(dram_to_mac_ratio > 1, "DRAM access must cost more energy than a MAC.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
energy_mac_pj_str = f"{mac_pj}" # e.g. "4.6"
|
||||
energy_dram_str = fmt(dram_pj, precision=0, commas=False) # e.g. "26"
|
||||
|
||||
@@ -3906,18 +3906,18 @@ from mlsys.formatting import fmt, check
|
||||
class WinogradCalc:
|
||||
"""Demonstrate 2.25x multiplication reduction of Winograd F(2,3) vs standard 3x3 conv."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
std_muls_3x3 = 9 # 3x3 = 9 multiplies
|
||||
winograd_muls = 4 # Winograd F(2,3) multiplies
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
winograd_reduction = std_muls_3x3 / winograd_muls
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(winograd_reduction > 1, "Winograd must reduce multiply count.")
|
||||
check(abs(winograd_reduction - 2.25) < 0.01, "Winograd F(2,3) must yield 2.25x reduction.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
winograd_reduction_str = fmt(winograd_reduction, precision=2, commas=False) # e.g. "2.25"
|
||||
std_muls_3x3_str = f"{std_muls_3x3}" # e.g. "9"
|
||||
winograd_muls_str = f"{winograd_muls}" # e.g. "4"
|
||||
@@ -3936,7 +3936,7 @@ Transformer attention demands specialized optimizations that reduce memory usage
|
||||
|
||||
[^fn-flashattention]: **FlashAttention**\index{FlashAttention!IO-aware}: An IO-aware algorithm (Dao et al., 2022) that avoids materializing the full $N \times N$ attention matrix in HBM by fusing computation into a single kernel tiled to fit in SRAM. The result: 2--4$\times$ wall-clock speedup and memory reduction from $O(N^2)$ to $O(N)$, enabling training on sequences 4--16$\times$ longer than standard attention. FlashAttention demonstrates that algorithmic optimization of data movement ($D_{vol}$) can yield larger speedups than increasing raw compute ($R_{peak}$) -- a concrete validation of the Iron Law's data term. \index{FlashAttention!memory reduction}
|
||||
|
||||
These complexity patterns — detailed in each architecture's System Implications section — define optimal domains for each architecture. MLPs excel when parameter efficiency is not critical, CNNs dominate for moderate-resolution spatial data, RNNs remain viable for very long sequences where memory is constrained, and Transformers excel for complex relational tasks where their computational cost is justified through superior performance. With these quantitative foundations established, we can now formalize the architecture selection process into a systematic decision framework.
|
||||
These complexity patterns — detailed in each architecture's System Implications section — define optimal domains for each architecture. MLPs excel when parameter efficiency is not critical, CNNs dominate for moderate-resolution spatial data, RNNs remain viable for very long sequences where memory is constrained, and Transformers excel for complex relational tasks where their computational cost is justified through superior performance. These quantitative foundations lead directly to a systematic decision framework for architecture selection.
|
||||
|
||||
### Decision Framework {#sec-network-architectures-decision-framework-a889}
|
||||
|
||||
@@ -4025,11 +4025,11 @@ Different architectures form a hierarchy of decreasing inductive bias\index{Indu
|
||||
|
||||
All successful architectures implement hierarchical representation learning\index{Hierarchical Representation Learning}, but through different mechanisms: CNNs through progressive receptive field expansion (@sec-network-architectures-cnns-spatial-pattern-processing-5b8d), RNNs through hidden state evolution (@sec-network-architectures-rnns-sequential-pattern-processing-f804), and Transformers through multi-head attention (@sec-network-architectures-attention-mechanisms-dynamic-pattern-processing-22df). This hierarchical organization reflects a general principle: complex patterns can be efficiently represented through composition of simpler components. For systems engineering, this means that computational patterns must efficiently compose lower-level features into higher-level abstractions, that memory hierarchies must align with representational hierarchies to minimize data movement, that parallelization strategies must respect hierarchical dependency structure, and that hardware accelerators must efficiently support the matrix operations implementing feature composition.
|
||||
|
||||
With these theoretical foundations and the practical decision framework established, we now walk through a complete architecture selection exercise to demonstrate the full decision process.
|
||||
A complete architecture selection exercise demonstrates how the theoretical foundations and decision framework apply in practice.
|
||||
|
||||
### Architecture Selection in Practice {#sec-network-architectures-putting-together-architecture-selection-practice-052f}
|
||||
|
||||
This section synthesizes the chapter's concepts through a complete architecture selection exercise. Rather than isolated examples, we walk through the full decision process an ML systems engineer would follow, using a *real-time wildlife monitoring* scenario as the integrating case study. Before diving into the case study, a quick back-of-the-napkin calculation reveals *the throughput ceiling* that drives the hardware selection.
|
||||
The exercise that follows synthesizes the chapter's concepts: we walk through the full decision process an ML systems engineer would follow, using a *real-time wildlife monitoring* scenario as the integrating case study. First, a back-of-the-napkin calculation reveals *the throughput ceiling* that drives the hardware selection.
|
||||
|
||||
```{python}
|
||||
#| label: throughput-ceiling-calc
|
||||
@@ -4056,12 +4056,12 @@ from mlsys.constants import RESNET50_FLOPs, GFLOPs, TFLOPs
|
||||
class ThroughputCeilingCalc:
|
||||
"""Evaluate real-time vision feasibility: ResNet-50 at 30 FPS leaves ample headroom."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fps = 30 # target frame rate
|
||||
midrange_gpu_tflops = 10 # reference mid-range GPU (TFLOPS)
|
||||
objdet_gflops = 100 # object detection model (GFLOPs)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
resnet_gflops = RESNET50_FLOPs.m_as(GFLOPs)
|
||||
sustained_gflops = fps * resnet_gflops
|
||||
effective_tflops_low = midrange_gpu_tflops * 0.50 # 50% utilization
|
||||
@@ -4071,10 +4071,10 @@ class ThroughputCeilingCalc:
|
||||
objdet_sustained_tflops = (fps * objdet_gflops * GFLOPs).m_as(TFLOPs)
|
||||
objdet_headroom = effective_tflops_low / objdet_sustained_tflops
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
|
||||
check(headroom > 1, "ResNet-50 at 30 FPS must leave compute headroom on a mid-range GPU.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
tc_fps_str = f"{fps}" # e.g. "30"
|
||||
tc_resnet_gflops_str = fmt(resnet_gflops, precision=0, commas=False) # e.g. "4"
|
||||
tc_sustained_gflops_str = fmt(sustained_gflops, precision=0, commas=False) # e.g. "123"
|
||||
@@ -4109,7 +4109,7 @@ tc_objdet_headroom_str = ThroughputCeilingCalc.tc_objdet_headroom_str
|
||||
2. **Frame Rate**: `{python} tc_fps_str` FPS required.
|
||||
3. **Sustained Throughput**: `{python} tc_fps_str`$\times$ `{python} tc_resnet_gflops_str` GFLOPs = **`{python} tc_sustained_gflops_str` GFLOPs/sec**.
|
||||
|
||||
**The Systems Conclusion**: A mid-range GPU delivering `{python} tc_gpu_tflops_str` TFLOPS theoretical peak achieves ~50--60% utilization in practice, yielding **`{python} tc_effective_low_str`--`{python} tc_effective_high_str` TFLOPS effective**. For ResNet-50 at `{python} tc_fps_str` FPS, you have **`{python} tc_headroom_str`$\times$ headroom**, easily achievable. But switch to an object detection model at `{python} tc_objdet_gflops_str` GFLOPs per frame, and you need **`{python} tc_objdet_sustained_str` TFLOPS sustained**, leaving only **`{python} tc_objdet_headroom_str`$\times$ headroom**. Add batch size constraints or multi-stream processing, and you quickly approach the compute ceiling. ResNet-50 is **Compute-Bound**, but with comfortable margins on modern hardware.
|
||||
**The Systems Conclusion**: A mid-range GPU delivering `{python} tc_gpu_tflops_str` TFLOPS theoretical peak achieves ~50--60% utilization in practice, yielding **`{python} tc_effective_low_str`--`{python} tc_effective_high_str` TFLOPS effective**. For ResNet-50 at `{python} tc_fps_str` FPS, you have **`{python} tc_headroom_str`$\times$ headroom**, easily achievable. Switch to an object detection model at `{python} tc_objdet_gflops_str` GFLOPs per frame, however, and you need **`{python} tc_objdet_sustained_str` TFLOPS sustained**, leaving only **`{python} tc_objdet_headroom_str`$\times$ headroom**. Add batch size constraints or multi-stream processing, and you quickly approach the compute ceiling. ResNet-50 is **Compute-Bound**, yet with comfortable margins on modern hardware.
:::

```{python}
@@ -4135,7 +4135,7 @@ from mlsys.constants import KWS_DSCNN_PARAMS, KWS_DSCNN_FLOPs, Kparam, MFLOPs
class WildlifeModelSizing:
"""Select model architecture for constrained edge deployment: MobileNetV2 fits 512 MB."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
# MobileNetV1 specs
mnv1_params_m = 4.2 # millions of params
mnv1_flops_mflops = 569 # MFLOPs at 224x224
@@ -4149,7 +4149,7 @@ class WildlifeModelSizing:
inference_latency_ms = 75 # ms per inference
inferences_per_day = 100 # trigger-based

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# Memory footprints
mnv1_fp32_mb = mnv1_params_m * 4 # FP32: 4 bytes/param
mnv1_int8_mb = mnv1_params_m * 1 # INT8: 1 byte/param
@@ -4164,10 +4164,10 @@ class WildlifeModelSizing:
energy_per_inf_mj = inference_power_mw * inference_latency_ms / 1000
energy_per_day_j = inferences_per_day * energy_per_inf_mj / 1000

# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
check(mnv2_int8_mb < 512, "MobileNetV2 INT8 must fit in 512 MB edge RAM.")

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
mnv1_params_str = fmt(mnv1_params_m, precision=1, commas=False) # e.g. "4.2"
mnv1_flops_str = fmt(mnv1_flops_mflops, precision=0, commas=False) # e.g. "569"
mnv1_fp32_str = fmt(mnv1_fp32_mb, precision=0, commas=False) # e.g. "17"
@@ -4308,16 +4308,16 @@ from mlsys.constants import A100_MEM_CAPACITY, GiB
class A100ClusterMemory:
"""Contrast datacenter and edge memory: 8-GPU A100 node vs 4 GB edge device."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
n_gpus = 8

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
a100_8x_mem = int(A100_MEM_CAPACITY.m_as(GiB)) * n_gpus

# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
check(a100_8x_mem > 400, "8x A100 cluster should provide >400 GiB memory.")

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
a100_8x_mem_str = f"{a100_8x_mem}" # e.g. "640"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
@@ -4332,7 +4332,7 @@ Teams design for high-end GPU clusters, then discover deployment failures on tar
Teams budget Transformer deployment based on model weight memory alone, overlooking the key-value (KV) cache that self-attention requires during autoregressive generation. The KV cache scales as $O(\text{batch} \times \text{layers} \times \text{heads} \times \text{seq\_len} \times \text{head\_dim})$, and for large models this overhead dominates serving memory. Consider a Transformer with 32 layers and 32 attention heads, each with a 128-dimensional head, serving sequences of length 2048 in FP16. Each concurrent request stores $32 \times 32 \times 2048 \times 128 \times 2$ bytes $\approx$ 537 MB of KV cache. At even modest concurrency of 2--4 users, the KV cache alone consumes 1--2 GB, rivaling or exceeding the memory occupied by model weights. As the quadratic memory analysis in @sec-network-architectures-system-implications-77ac establishes, attention memory grows with sequence length, making the KV cache the binding constraint on serving throughput. Teams that size infrastructure based solely on weight memory discover at deployment that halving the batch size or truncating context length is the only way to fit within device memory, degrading either throughput or output quality.
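
To make that scaling concrete, here is a minimal sketch that reproduces the per-request arithmetic quoted above (plain Python with the figures from this paragraph; it does not use the chapter's mlsys constants):

```python
# Minimal sketch of the KV-cache arithmetic quoted above
# (hypothetical 32-layer, 32-head, 128-dim, seq-len-2048 model in FP16).
layers, heads, seq_len, head_dim = 32, 32, 2048, 128
bytes_fp16 = 2

kv_bytes_per_request = layers * heads * seq_len * head_dim * bytes_fp16
print(f"KV cache per request: {kv_bytes_per_request / 2**20:.0f} MiB")  # ~512 MiB (~537 MB)

for concurrent_requests in (2, 4):
    total_gb = concurrent_requests * kv_bytes_per_request / 1e9
    print(f"{concurrent_requests} concurrent requests: {total_gb:.1f} GB of KV cache")
```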
With these cautionary notes in mind, we now synthesize the key concepts that practitioners should carry forward from this chapter's systematic tour of architectural families, shared building blocks, computational primitives, and selection methodology.
These cautionary notes reinforce a recurring theme: architectural decisions are infrastructure commitments. The key concepts from this chapter's systematic tour of architectural families, shared building blocks, computational primitives, and selection methodology follow.

## Summary {#sec-network-architectures-summary-e642}

@@ -4363,3 +4363,10 @@ We have the blueprints. Now we need the tools to build them.
<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:nn_architectures")
```

@@ -79,20 +79,20 @@ from mlsys.constants import *
from mlsys.formatting import fmt, sci
from mlsys.formulas import model_memory

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MNISTInference:
"""
Namespace for MNIST running example (784→128→64→10).
Establishes the base 'Arithmetic' profile for neural computation.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
input_dim = 28 * 28 # 784
h1_dim = 128
h2_dim = 64
out_dim = 10

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Multiply-Accumulate (MAC) count for each layer
mac_l1 = input_dim * h1_dim
mac_l2 = h1_dim * h2_dim
@@ -105,7 +105,7 @@ class MNISTInference:
params_l3 = (h2_dim + 1) * out_dim
total_params = params_l1 + params_l2 + params_l3

# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
total_macs_str = f"{total_macs:,}"
total_params_str = f"{total_params:,}"

@@ -117,11 +117,11 @@ inf_params_total_str = MNISTInference.total_params_str
## From Logic to Arithmetic {#sec-neural-computation-deep-learning-systems-engineering-foundation-597f}

\index{Deep Learning!definition}
The preceding chapters established the surrounding infrastructure: the ML workflow (@sec-ml-workflow) defined how projects progress from problem definition through deployment, and data engineering (@sec-data-engineering) covered how to prepare the raw material that models consume. Now we turn inward, to what happens inside the model itself.
A model that runs correctly on one GPU and crashes on another is not suffering from a hardware bug. The matrix dimensions in its attention layer exceed the memory available for intermediate activations, and the crash is a direct consequence of the mathematics inside the model, not the code around it. The ML workflow (@sec-ml-workflow) defined how projects progress from problem definition through deployment, and data engineering (@sec-data-engineering) covered how to prepare the raw material that models consume. The question remaining is what happens inside the model itself.
The **Silicon Contract** (@sec-introduction-iron-law-ml-systems-c32a) established that every model architecture makes a computational bargain with the hardware it runs on. The architecture's mathematical operators set the terms of that bargain: they determine how much memory the model consumes, how long each computation takes, and how much energy the system expends. To honor the contract, a systems engineer must understand those operators.
This chapter examines those operators not as abstract theory but as a specification for computational workloads. Neural computation represents a qualitative shift in how we process information: instead of executing a sequence of explicit logical instructions (if-then-else), we execute a massive sequence of continuous mathematical transformations (multiply-add-accumulate). This shift from *Logic* to *Arithmetic* changes everything for the systems engineer, creating the **Compute-Bound** workloads characterized in the **Iron Law** (@sec-introduction-iron-law-ml-systems-c32a). It implies that the "bug" in your system is rarely a syntax error; it is a numerical instability, a vanishing gradient, or a saturated activation function. Concretely, recognizing a single handwritten digit in the MNIST network we use throughout this chapter requires `{python} inf_madd_total_str` MAC operations—not one of which is a logical branch.
The operators that follow are not abstract theory but a specification for computational workloads. Neural computation represents a qualitative shift in how we process information: instead of executing a sequence of explicit logical instructions (if-then-else), we execute a massive sequence of continuous mathematical transformations (multiply-add-accumulate). This shift from *Logic* to *Arithmetic* changes everything for the systems engineer, creating the **Compute-Bound** workloads characterized in the **Iron Law** (@sec-introduction-iron-law-ml-systems-c32a). It implies that the "bug" in your system is rarely a syntax error; it is a numerical instability, a vanishing gradient, or a saturated activation function. Concretely, recognizing a single handwritten digit in the MNIST network we use throughout this chapter requires `{python} inf_madd_total_str` MAC operations—not one of which is a logical branch.
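
A back-of-envelope sketch, mirroring the MNISTInference cell shown above, recovers that MAC count directly from the layer dimensions:

```python
# Back-of-envelope MAC and parameter count for the 784 -> 128 -> 64 -> 10
# MNIST network used throughout this chapter.
dims = [784, 128, 64, 10]

macs = sum(n_in * n_out for n_in, n_out in zip(dims, dims[1:]))
params = sum((n_in + 1) * n_out for n_in, n_out in zip(dims, dims[1:]))  # +1 bias per output

print(f"MACs per digit:  {macs:,}")    # 109,184 multiply-accumulates, zero branches
print(f"Parameters:      {params:,}")  # 109,386 weights and biases
```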

::: {.callout-definition title="Deep Learning"}

@@ -145,13 +145,13 @@ Classical machine learning required human experts to design feature extractors f
\index{Gradient Instabilities!training failure mode}
This paradigm shift creates an engineering problem with no precedent in traditional software. When conventional software fails, an error message points to a line of code. When deep learning fails, the symptoms are subtler: gradient instabilities[^fn-gradient-instabilities] that silently prevent learning, numerical precision errors that corrupt model weights over thousands of iterations, or memory access patterns in tensor operations[^fn-tensor-operations] that leave GPUs idle for most of each training step. These are not algorithmic bugs that a debugger can catch. They are systems problems that require understanding the mathematical machinery underneath.
[^fn-gradient-instabilities]: **Gradient Instabilities**: In a 20-layer sigmoid network, gradient magnitude after backpropagation is approximately $0.25^{20} \approx 10^{-12}$ -- effectively zero, making learning a mathematical impossibility without architectural intervention. These failures are invisible in standard logs (loss simply plateaus or becomes NaN), making them among the hardest bugs to diagnose. ReLU activations (gradient of 1 for positive inputs) and residual connections (direct gradient highways that bypass layers) were the two architectural breakthroughs that made deep networks tractable (see @sec-model-training). \index{Gradient Instabilities!training failure}
[^fn-gradient-instabilities]: **Gradient Instabilities**: In a 20-layer sigmoid network, gradient magnitude after backpropagation is approximately $0.25^{20} \approx 10^{-12}$—effectively zero, making learning a mathematical impossibility without architectural intervention. These failures are invisible in standard logs (loss simply plateaus or becomes NaN), making them among the hardest bugs to diagnose. ReLU activations (gradient of 1 for positive inputs) and residual connections (direct gradient highways that bypass layers) were the two architectural breakthroughs that made deep networks tractable (see @sec-model-training). \index{Gradient Instabilities!training failure}
[^fn-tensor-operations]: **Tensor Operations**: The logical structure of a tensor (e.g., a 4D image batch) often requires non-sequential memory access patterns to retrieve elements from its flat, 1D physical storage. A concrete example: PyTorch defaults to NCHW (channel-first) layout while most mobile hardware and ARM processors prefer NHWC (channel-last). Transposing a $224\times224\times3$ ImageNet tensor between formats requires reading and rewriting ~150 KB -- a pure memory operation that adds 0.3--1 ms per inference call with no arithmetic benefit. At 1,000 requests/second, this layout mismatch alone can consume 20--30% of total inference latency. \index{Tensor!n-dimensional arrays}\index{Tensor!memory layout}
[^fn-tensor-operations]: **Tensor Operations**: The logical structure of a tensor (e.g., a 4D image batch) often requires non-sequential memory access patterns to retrieve elements from its flat, 1D physical storage. A concrete example: PyTorch defaults to NCHW (channel-first) layout while most mobile hardware and ARM processors prefer NHWC (channel-last). Transposing a $224\times224\times3$ ImageNet tensor between formats requires reading and rewriting ~150 KB—a pure memory operation that adds 0.3--1 ms per inference call with no arithmetic benefit. At 1,000 requests/second, this layout mismatch alone can consume 20--30% of total inference latency. \index{Tensor!n-dimensional arrays}\index{Tensor!memory layout}
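
A two-line sketch makes the gradient-instability footnote's $0.25^{20}$ estimate tangible; 0.25 is the maximum derivative of the sigmoid, so this is already a best-case bound:

```python
# Best-case surviving gradient after 20 stacked sigmoid layers.
max_sigmoid_grad = 0.25     # sigmoid'(x) peaks at 0.25 (at x = 0)
depth = 20

print(f"Upper bound on gradient scale: {max_sigmoid_grad ** depth:.2e}")  # ~9.1e-13
```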
This chapter builds the mathematical literacy needed to diagnose and solve such problems. It traces how **learning paradigms** evolved from explicit rules to handcrafted features to learned representations, establishing *why* deep learning demands qualitatively different system infrastructure than classical machine learning. It then examines **neural network fundamentals**---neurons, layers, activation functions, and tensor operations---treating each component as both a mathematical operation and a computational workload, with particular attention to the memory access patterns and arithmetic intensity that determine hardware utilization.
The chapter then covers the **learning process**: the forward pass that produces predictions, the backpropagation algorithm that computes gradients, the loss functions that define optimization objectives, and the optimization algorithms that navigate loss landscapes. Each connects directly to system engineering decisions: matrix multiplication illuminates memory bandwidth requirements (the **Memory Wall** explored in @sec-hardware-acceleration), gradient computation explains numerical precision constraints, and optimization dynamics inform resource allocation. We follow the learning process with the **inference pipeline**—the transformation of a trained model into a production system that answers queries—where engineering concerns shift from throughput to latency and from training stability to deployment efficiency. A historical **case study** (USPS digit recognition) grounds these concepts in a real deployment, and the chapter closes by mapping its content onto the **D·A·M taxonomy** (Data, Algorithm, Machine)—the framework that explains why deep learning systems succeed only when all three components align.
The **learning process** then takes center stage: the forward pass that produces predictions, the backpropagation algorithm that computes gradients, the loss functions that define optimization objectives, and the optimization algorithms that navigate loss landscapes. Each connects directly to system engineering decisions: matrix multiplication illuminates memory bandwidth requirements (the **Memory Wall** explored in @sec-hardware-acceleration), gradient computation explains numerical precision constraints, and optimization dynamics inform resource allocation. The **inference pipeline** shifts the engineering concerns from throughput to latency and from training stability to deployment efficiency. A historical **case study** (USPS digit recognition) grounds these concepts in a real deployment, and the **D·A·M taxonomy** (Data, Algorithm, Machine) closes the arc by explaining why deep learning systems succeed only when all three components align.
To ground this arc in a concrete systems story, we start by following a single MNIST digit through three computational paradigms and quantify how each step changes the workload profile.

@@ -185,7 +185,7 @@ To ground this arc in a concrete systems story, we start by following a single M
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES, MILLION, THOUSAND

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class ParadigmSystemsCost:
"""Namespace for Paradigm Systems Cost."""

@@ -491,7 +491,7 @@ plt.show()
\index{Overparameterization!generalization benefit}
Notice the counterintuitive shape: test error initially follows the expected U-curve, but then *decreases again* in the overparameterized regime. This scaling behavior resolves the central paradox of deep learning. Classical statistical theory predicted that models should be sized to match data complexity: too small and they underfit, too large and they overfit by memorizing noise. This Bias-Variance Tradeoff[^fn-bias-variance]\index{Bias-Variance Tradeoff!classical theory} suggested that massive models would inevitably fail on new data. Instead, we observe a 'Double Descent'\index{Double Descent!overparameterized regime} [@belkin2019reconciling] where larger models, trained on sufficient data, find smoother solutions that generalize better than smaller ones. This insight—that *bigger is better* when properly regularized—drives the race for 100B+ parameter foundation models.
[^fn-bias-variance]: **Bias-Variance Tradeoff**\index{Bias-Variance Tradeoff!double descent}: In overparameterized networks (parameter count >> training samples), the classical bias-variance tradeoff breaks down: test error decreases again after the interpolation threshold, the Double Descent phenomenon. The systems consequence is that larger models trained longer are often *more* stable than smaller models stopped early, inverting the conventional wisdom that regularization is always the right response to overfitting. This insight drives the engineering decision to scale model size rather than constrain it -- bigger networks with more compute often generalize better, not worse. \index{Bias-Variance Tradeoff!overparameterization}
[^fn-bias-variance]: **Bias-Variance Tradeoff**\index{Bias-Variance Tradeoff!double descent}: In overparameterized networks (parameter count >> training samples), the classical bias-variance tradeoff breaks down: test error decreases again after the interpolation threshold, the Double Descent phenomenon. The systems consequence is that larger models trained longer are often *more* stable than smaller models stopped early, inverting the conventional wisdom that regularization is always the right response to overfitting. This insight drives the engineering decision to scale model size rather than constrain it—bigger networks with more compute often generalize better, not worse. \index{Bias-Variance Tradeoff!overparameterization}
Neural network performance often follows empirical scaling relationships that impact system design. One durable scale anchor is that frontier model sizes and training compute budgets have increased by multiple orders of magnitude over the past decade. In broad terms, modern AI systems frequently trade off model size, data, and compute budgets rather than relying on a single “train longer” axis. Memory bandwidth and storage capacity can become primary constraints rather than raw computational power, depending on the workload and platform. The detailed formulations and quantitative analysis of scaling behavior are covered in @sec-model-training, while @sec-model-compression explores practical implementation.

@@ -864,25 +864,25 @@ else:
from mlsys.constants import GPT3_PARAMS, Bparam, THOUSAND
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class HistoricalScale:
"""
Namespace for Historical Model Scale.
Scenario: Comparing GPT-3 vs GPT-4 parameter counts.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
gpt3_params_b = GPT3_PARAMS.m_as(Bparam)
gpt4_params_t = 1.8 # Estimate (MoE)

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
gpt4_params_b = gpt4_params_t * THOUSAND
scale_factor = gpt4_params_b / gpt3_params_b

# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(scale_factor >= 5, f"GPT-4 ({gpt4_params_t}T) should be significantly larger than GPT-3 ({gpt3_params_b}B). Ratio: {scale_factor:.1f}x")

# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
gpt3_params_b_str = fmt(gpt3_params_b, precision=0, commas=False)
gpt4_params_t_str = fmt(gpt4_params_t, precision=1, commas=False)

@@ -969,13 +969,13 @@ The historical trajectory from Perceptrons through AI winters to the GPU-driven

## Neural Network Fundamentals {#sec-neural-computation-neural-network-fundamentals-07e4}

The preceding section traced *what* happened: compute grew exponentially, algorithms matured, and data became abundant. This section explains *why* the computational demands are so extreme by examining the mathematical operations neural networks actually perform. To understand why a GPU processes neural networks faster than a CPU, or why training requires more memory than inference, we must open the box and examine the operations themselves. This section develops that mathematical foundation, showing how simple operations on individual neurons compound into the infrastructure requirements that shaped modern AI.
Compute grew exponentially, algorithms matured, and data became abundant. The question now is *why* the computational demands are so extreme. A GPU processes neural networks faster than a CPU not because of raw clock speed but because of the specific mathematical operations neural networks perform. Training requires more memory than inference not because of software overhead but because the chain rule demands storing every intermediate result. Understanding these operations reveals how simple arithmetic on individual neurons compounds into the infrastructure requirements that shaped modern AI.
The concepts here apply to all neural networks, from simple classifiers to large language models. While architectures evolve and new paradigms emerge, these fundamentals remain constant: weighted sums, nonlinear activations, gradient-based learning. Mastering these operations and their computational characteristics enables reasoning about any neural network's resource requirements.

### Why Depth Matters: The Power of Hierarchical Representations {#sec-neural-computation-depth-matters-power-hierarchical-representations-f83c}

Before the mathematical machinery of neurons and layers, we preview why "deep" learning earns its name\index{Deep Learning!network depth}\index{Hierarchical Representation!depth advantage}. The detailed mechanics of layers and connections follow in subsequent sections; here we establish the intuition for why depth provides such dramatic representational advantages. We introduced hierarchical feature learning conceptually earlier; now we formalize that intuition with a concrete example that grounds all subsequent mathematical development.
A single-layer network attempting to classify handwritten digits must map raw pixels directly to labels, essentially memorizing every variation of every digit\index{Deep Learning!network depth}\index{Hierarchical Representation!depth advantage}. A network with three layers solves the same problem with far fewer parameters by decomposing it hierarchically. The question is *why* depth provides such dramatic representational advantages, and the answer grounds all the mathematical development that follows.
Deep networks succeed because they use **compositionality**\index{Compositionality!pattern decomposition}: complex patterns decompose into simpler patterns that themselves decompose further. In image recognition, pixels combine into edges, edges into textures, textures into parts, and parts into objects. This hierarchical decomposition reflects the structure of the world itself and explains why "deep" learning earns its name.

@@ -998,7 +998,7 @@ However, depth introduces engineering challenges. Each additional layer:
- Increases gradient path length, risking vanishing/exploding gradients
- Requires storing intermediate activations for backpropagation

Modern architectures balance depth (representational power) against width (parallelism). A network with 10 layers of 100 neurons has the same 1,000 total hidden neurons as one with 2 layers of 500 neurons, but very different computational characteristics. The deeper network can represent more complex functions; the wider network can compute all neurons in a layer simultaneously.
Modern architectures balance depth (representational power) against width (parallelism). A network with 10 layers of 100 neurons has the same 1,000 total hidden neurons as one with 2 layers of 500 neurons, but fundamentally different computational characteristics. The deeper network can represent more complex functions; the wider network can compute all neurons in a layer simultaneously.
:::
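
To put numbers on that depth-versus-width trade-off, a small sketch compares the two hypothetical layouts; the 784-input and 10-output endpoints are assumed here for illustration, since the comparison above leaves them unspecified:

```python
# Hedged comparison of the two layouts: 10 layers x 100 neurons versus
# 2 layers x 500 neurons, with assumed 784-in / 10-out endpoints.
def layer_shapes(hidden_layers, width, n_in=784, n_out=10):
    dims = [n_in] + [width] * hidden_layers + [n_out]
    return list(zip(dims, dims[1:]))

for name, layers in (("deep (10 x 100)", layer_shapes(10, 100)),
                     ("wide (2 x 500)", layer_shapes(2, 500))):
    params = sum(i * o + o for i, o in layers)
    largest = max(i * o for i, o in layers)
    print(f"{name}: {len(layers)} sequential matmuls, "
          f"largest weight matrix {largest:,}, total params {params:,}")
```

Under these assumed endpoints the deep stack performs eleven small matrix multiplies one after another, while the wide network performs three much larger ones that map more readily onto parallel hardware, which is the trade-off described above.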

@@ -1042,20 +1042,20 @@ To ground these concepts in a concrete example, we use handwritten digit recogni

from mlsys.constants import MNIST_IMAGE_WIDTH, MNIST_IMAGE_HEIGHT

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MnistArchitectureConstants:
"""Namespace for Mnist Architecture Constants."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
mnist_l1_dim = 784 # input: 28×28 pixels
mnist_l2_dim = 128 # hidden layer 1
mnist_l3_dim = 64 # hidden layer 2
mnist_l4_dim = 10 # output: 10 digit classes

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
mnist_input_neurons_value = MNIST_IMAGE_WIDTH * MNIST_IMAGE_HEIGHT # 28×28 = 784

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
mnist_arch_str = f"{mnist_l1_dim}→{mnist_l2_dim}→{mnist_l3_dim}→{mnist_l4_dim}" # e.g. "784→128→64→10"
mnist_input_str = f"{mnist_input_neurons_value}" # e.g. "784"

@@ -1277,11 +1277,11 @@ The choice of activation function affects both learning effectiveness and comput
\index{Sigmoid!etymology}
The sigmoid function\index{Activation Function!sigmoid}\index{Sigmoid!bounded output}[^fn-sigmoid-etymology] maps any input value to a bounded range between 0 and 1, as defined in @eq-sigmoid:
[^fn-sigmoid-etymology]: **Sigmoid**: From Greek *sigma* + *eidos* ("sigma-shaped"), referring to the S-curve that maps inputs to the bounded (0, 1) range. The mapping requires a floating-point exponential ($e^{-x}$), which costs ~2,500 transistors and 20--40 CPU cycles per evaluation, versus ReLU's single comparator at ~50 transistors and 1 cycle -- a 50$\times$ silicon cost difference per activation. This arithmetic penalty scales with every neuron in every layer, making sigmoid's replacement by ReLU as much a hardware efficiency decision as a gradient stability one. \index{Sigmoid!computational cost}
[^fn-sigmoid-etymology]: **Sigmoid**: From Greek *sigma* + *eidos* ("sigma-shaped"), referring to the S-curve that maps inputs to the bounded (0, 1) range. The mapping requires a floating-point exponential ($e^{-x}$), which costs ~2,500 transistors and 20--40 CPU cycles per evaluation, versus ReLU's single comparator at ~50 transistors and 1 cycle—a 50$\times$ silicon cost difference per activation. This arithmetic penalty scales with every neuron in every layer, making sigmoid's replacement by ReLU as much a hardware efficiency decision as a gradient stability one. \index{Sigmoid!computational cost}

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$ {#eq-sigmoid}

The S-shaped curve produces outputs interpretable as probabilities, making sigmoid particularly useful for binary classification tasks. For very large positive inputs, the function approaches 1; for very large negative inputs, it approaches 0. The smooth, continuous nature of sigmoid makes it differentiable everywhere, which is necessary for gradient-based learning.
The S-shaped curve produces outputs interpretable as probabilities, making sigmoid particularly useful for binary classification tasks. For large positive inputs, the function approaches 1; for large negative inputs, it approaches 0. The smooth, continuous nature of sigmoid makes it differentiable everywhere, which is necessary for gradient-based learning.
Sigmoid has a significant limitation: for inputs with large absolute values (far from zero), the gradient becomes extremely small, a phenomenon called the **vanishing gradient problem**\index{Vanishing Gradient Problem!activation saturation}[^fn-vanishing-gradient-depth]. During backpropagation, these small gradients multiply together across layers, causing gradients in early layers to become exponentially tiny. This effectively prevents learning in deep networks, as weight updates become negligible.

@@ -1300,7 +1300,7 @@ $$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$ {#eq-tanh}

Tanh produces an S-shaped curve similar to sigmoid but centered at zero: negative inputs map to negative outputs and positive inputs to positive outputs. This symmetry balances gradient flow during training, often yielding faster convergence than sigmoid.
Like sigmoid, tanh is smooth and differentiable everywhere, and it still suffers from the vanishing gradient problem for inputs with large magnitudes. When the function saturates (approaches -1 or 1), gradients become very small. Despite this limitation, tanh's zero-centered outputs make it preferable to sigmoid for hidden layers in many architectures, particularly in recurrent neural networks where maintaining balanced activations across time steps is important.
Like sigmoid, tanh is smooth and differentiable everywhere, and it still suffers from the vanishing gradient problem for inputs with large magnitudes. When the function saturates (approaches -1 or 1), gradients shrink toward zero. Despite this limitation, tanh's zero-centered outputs make it preferable to sigmoid for hidden layers in many architectures, particularly in recurrent neural networks where maintaining balanced activations across time steps is important.
Both sigmoid and tanh share a critical limitation: gradient saturation at extreme input values. The search for an activation function that avoids this problem while remaining computationally efficient led to one of deep learning's most important innovations.

@@ -1342,7 +1342,7 @@ Softmax is almost exclusively used in the output layer for multi-class classific
The mathematical relationship between input logits and output probabilities is differentiable, allowing gradients to flow back through softmax during training. When combined with cross-entropy loss (discussed in @sec-neural-computation-loss-functions-6fc2), softmax produces particularly clean gradient expressions that guide learning effectively. Beyond their mathematical properties, the choice of *activation functions* has direct consequences for *hardware* efficiency.

::: {.callout-perspective title="Activation Functions and Hardware"}
**Why ReLU Dominates in Practice**: Beyond its mathematical benefits like avoiding vanishing gradients, ReLU's hardware efficiency explains its widespread adoption. Computing $\max(0,x)$ requires a single comparison operation, while sigmoid and tanh require computing exponentials—operations that are orders of magnitude more expensive in both time and energy. This computational simplicity means ReLU can be executed faster on any processor and consumes significantly less power, a critical consideration for battery-powered devices. The computational and hardware implications of activation functions, including performance benchmarks and implementation strategies for modern accelerators, are explored in @sec-hardware-acceleration.
**Why ReLU Dominates in Practice**: Beyond its mathematical benefits like avoiding vanishing gradients, ReLU's hardware efficiency explains its widespread adoption. Computing $\max(0,x)$ requires a single comparison operation, while sigmoid and tanh require computing exponentials—operations that are orders of magnitude more expensive in both time and energy. This computational simplicity means ReLU can be executed faster on any processor and consumes 10--100$\times$ less power, a critical consideration for battery-powered devices. The computational and hardware implications of activation functions, including performance benchmarks and implementation strategies for modern accelerators, are explored in @sec-hardware-acceleration.
:::

```{python}
@@ -1362,14 +1362,14 @@ The mathematical relationship between input logits and output probabilities is d
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class ActivationLogic:
"""
Namespace for Activation Logic Cost calculation.
Scenario: Estimating transistor-level complexity for ALUs.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
# ReLU is a comparator + mux
relu_transistors = 50

@@ -1377,13 +1377,13 @@ class ActivationLogic:
# High-precision floating point exponential unit
sigmoid_transistors = 2500

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
ratio = sigmoid_transistors / relu_transistors

# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(ratio >= 40, f"Sigmoid should be much more expensive than ReLU. Ratio: {ratio:.1f}x")

# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
relu_transistor_str = f"{relu_transistors}"
sigmoid_transistor_str = f"{sigmoid_transistors:,}"
activation_ratio_str = f"{int(ratio)}"
@@ -1581,7 +1581,7 @@ As data flows through the network, it is transformed at each layer to extract me

The learnable parameters[^fn-parameter-memory-cost] of neural networks, weights and biases, determine how information flows through the network and how transformations are applied to input data. Their organization directly impacts both learning capacity and computational requirements.
[^fn-parameter-memory-cost]: **Parameter Memory Cost**: Parameter count is a misleading proxy for memory importance. Normalization layer parameters (BatchNorm's $\gamma$ and $\beta$, LayerNorm's scale and bias) add only 2 parameters per feature dimension, making them negligible for memory budgeting. Yet freezing them during fine-tuning causes 5--15% accuracy degradation -- they punch far above their weight. Conversely, the bulk of parameters (dense weight matrices) each require 12 bytes during Adam training (weight + gradient + two moment vectors), so a model that fits in memory for inference may require 3$\times$ more for training. \index{Parameter!memory multiplier}
[^fn-parameter-memory-cost]: **Parameter Memory Cost**: Parameter count is a misleading proxy for memory importance. Normalization layer parameters (BatchNorm's $\gamma$ and $\beta$, LayerNorm's scale and bias) add only 2 parameters per feature dimension, making them negligible for memory budgeting. Yet freezing them during fine-tuning causes 5--15% accuracy degradation—they punch far above their weight. Conversely, the bulk of parameters (dense weight matrices) each require 12 bytes during Adam training (weight + gradient + two moment vectors), so a model that fits in memory for inference may require 3$\times$ more for training. \index{Parameter!memory multiplier}

#### Weight Matrices {#sec-neural-computation-weight-matrices-9f9a}

@@ -1779,11 +1779,11 @@ This simple network demonstrates how hidden layers enable learning non-linear pa
from mlsys.constants import BYTES_FP32, MB, KiB, param, Mparam, Kparam
from mlsys.formulas import model_memory

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MnistScaleComparison:
"""Namespace for Mnist Scale Comparison."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
mnist_large_l1 = 1000 # hidden layer 1 width
mnist_large_l2 = 1000 # hidden layer 2 width
mnist_large_arch = [(mnist_l1_dim, mnist_large_l1),
@@ -1797,7 +1797,7 @@ class MnistScaleComparison:
(mnist_small_l1, mnist_small_l2),
(mnist_small_l2, mnist_l4_dim)]

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# Use explicit for-loops: generator expressions skip the class namespace in Python 3
_large_params = 0
for _i, _o in mnist_large_arch:
@@ -1811,7 +1811,7 @@ class MnistScaleComparison:
mnist_small_params = _small_params
mnist_small_mem_kb = model_memory(mnist_small_params, BYTES_FP32, KiB)

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
mnist_large_params_m_str = f"{(mnist_large_params * param).m_as(Mparam):.1f}" # e.g. "1.8"
mnist_large_mem_mb_str = f"{mnist_large_mem_mb:.0f}" # e.g. "7"
mnist_small_params_k_str = f"{(mnist_small_params * param).m_as(Kparam):.0f}" # e.g. "89"
@@ -2078,19 +2078,19 @@ These connection patterns have significant implications for both the theoretical
from mlsys.formatting import fmt, check
from mlsys.constants import BYTES_FP32, flop, MFLOPs, KFLOPs, MILLION, THOUSAND, KIB_TO_BYTES

# ┌── P.I.C.O. ISOLATED SCENARIO: CANONICAL MNIST ──────────────────────────────
# ┌── LEGO: CANONICAL MNIST ──────────────────────────────
class MNISTMemory:
"""
Namespace for Canonical MNIST (784->128->64->10).
Calculates Memory, FLOPs, and Arithmetic Intensity.
"""

# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
layers_dims = [784, 128, 64, 10]
batch_size = 32
bytes_per_param = 4 # FP32

# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# A. Weights & Biases
weights = []
biases = []
@@ -2176,7 +2176,7 @@ class MNISTMemory:
total_inf_act_str = f"{inf_act_elements}"


# ┌── P.I.C.O. SCENARIO: BACKPROP EXAMPLE (Wider Network) ──────────────────────
# ┌── LEGO: BACKPROP EXAMPLE (Wider Network) ──────────────────────
class BackpropMemory:
"""
Namespace for 'Backpropagation Mechanics' callout.
@@ -2321,7 +2321,7 @@ Parameter count grows with network width and depth. For our MNIST example, consi

from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MnistTrainingMemoryCalc:
"""Namespace for Mnist Training Memory Calc."""

@@ -2454,7 +2454,7 @@ from mlsys.constants import GPT2_PARAMS, BYTES_FP32, param, Kparam, Bparam, KiB,
from mlsys.formulas import model_memory
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MemoryExplosionCalc:
"""Namespace for Memory Explosion Calc."""

@@ -2468,14 +2468,14 @@ class MemoryExplosionCalc:
(mnist_l3_dim, mnist_l4_dim)]
mnist_params_value = sum(i * o + o for i, o in mnist_arch_value) # weights + biases

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
mnist_mem_kb_value = model_memory(mnist_params_value, BYTES_FP32, KiB) # Uses 1024 base
gpt2_params_count_value = GPT2_PARAMS.m_as(param)
gpt2_params_b_value = GPT2_PARAMS.m_as(Bparam)
gpt2_mem_gb_value = model_memory(GPT2_PARAMS, BYTES_FP32, GB)
mem_jump_value = gpt2_params_count_value / mnist_params_value

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
mnist_params_count_str = f"{mnist_params_value:,}" # e.g. "109,386"
mnist_params_k_str = fmt((mnist_params_value * param).m_as(Kparam),
precision=0, commas=False) # e.g. "109"
@@ -2531,23 +2531,23 @@ The memory calculations above are precise but slow. Experienced engineers develo
from mlsys.formatting import fmt, check
from mlsys.constants import byte, GB

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MentalMathCalc:
"""Namespace for Mental Math Calc."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
mm_params_m_value = 100
mm_bytes_value = 4
mm_overhead_value = 4 # params + grads + optimizer states
mm_gpu_gb_value = 16

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
mm_model_gb_value = (
mm_params_m_value * MILLION * mm_bytes_value * mm_overhead_value * byte
).m_as(GB)
mm_remaining_gb_value = mm_gpu_gb_value - mm_model_gb_value

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
mm_model_str = fmt(mm_model_gb_value, precision=1, commas=False)
mm_remaining_str = fmt(mm_remaining_gb_value, precision=0, commas=False)
mm_params_m_str = str(mm_params_m_value)
@@ -2604,11 +2604,11 @@ Network architecture, neurons, and parameters are now in place, but a central qu

## Learning Process {#sec-neural-computation-learning-process-0b83}

Our MNIST network currently holds `{python} MNISTMemory.total_params_str` randomly initialized parameters—numbers that encode no knowledge at all. How do these random values become a digit classifier that achieves over 95% accuracy? Neural networks learn through training on examples, iteratively adjusting weights to reduce prediction errors. This section traces that process from the first random guess to a trained model, covering the four operations that constitute each training step: forward propagation, loss computation, backpropagation, and weight update.
Our MNIST network currently holds `{python} MNISTMemory.total_params_str` randomly initialized parameters—numbers that encode no knowledge at all. How do these random values become a digit classifier that achieves over 95% accuracy? The answer lies in four operations repeated millions of times: forward propagation computes a prediction, a loss function measures the error, backpropagation assigns blame to each weight, and an optimizer adjusts those weights to reduce the error.

### Supervised Learning from Labeled Examples {#sec-neural-computation-supervised-learning-labeled-examples-5e6d}

Drawing from our architectural foundation, the core principle of neural network training is supervised learning\index{Supervised Learning!labeled examples}\index{Training!supervised learning} from labeled examples. Consider our MNIST digit recognition task: we have a dataset of 60,000 training images, each a $28\times 28$ pixel grayscale image paired with its correct digit label. The network must learn the relationship between these images and their corresponding digits through an iterative process of prediction and weight adjustment. Ensuring the quality and integrity of training data is essential to model success, as established in @sec-data-engineering.
A randomly initialized network classifies digits no better than a coin flip. Transforming it into a 95%-accurate classifier requires supervised learning\index{Supervised Learning!labeled examples}\index{Training!supervised learning}: showing the network labeled examples and adjusting its weights based on the errors it makes. Consider our MNIST digit recognition task: we have a dataset of 60,000 training images, each a $28\times 28$ pixel grayscale image paired with its correct digit label. The network must learn the relationship between these images and their corresponding digits through an iterative process of prediction and weight adjustment. Ensuring the quality and integrity of training data is essential to model success, as established in @sec-data-engineering.
This relationship between inputs and outputs drives the training methodology. Training operates as a loop where each iteration processes a subset of training examples called a batch\index{Batch!training iteration}[^fn-batch-processing]. For each batch, the network performs four operations: forward computation through the network layers generates predictions, a loss function evaluates prediction accuracy, weight adjustments are computed based on prediction errors, and network weights are updated to improve future predictions.
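
A minimal runnable sketch of that four-step loop, using a single-layer softmax classifier on random stand-in data rather than the chapter's MNIST pipeline, looks like this:

```python
# Schematic but runnable version of the batch training loop described above:
# forward pass, loss, gradient computation, weight update (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 784)).astype(np.float32)   # stand-in "images"
y = rng.integers(0, 10, size=256)                     # stand-in labels
W = np.zeros((784, 10), dtype=np.float32)
b = np.zeros(10, dtype=np.float32)
lr, batch_size = 0.1, 32

for epoch in range(3):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        logits = xb @ W + b                                   # 1. forward pass
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(len(yb)), yb]).mean()      # 2. loss evaluation
        grad = p.copy()
        grad[np.arange(len(yb)), yb] -= 1                     # 3. error signal (dL/dlogits)
        W -= lr * (xb.T @ grad) / len(yb)                     # 4. weight update
        b -= lr * grad.mean(axis=0)
    print(f"epoch {epoch}: last batch loss {loss:.3f}")
```

Real frameworks hide steps 3 and 4 behind automatic differentiation and an optimizer object, but the shape of the loop is the same.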

@@ -2624,7 +2624,7 @@ $$ \text{loss} = \mathcal{L}(\hat{y}, y) $$ {#eq-loss-general}

\index{Statistical Decision Theory!loss function origin}

[^fn-loss-function]: **Loss Function**\index{Loss Function!etymology}: Formalized by Abraham Wald in statistical decision theory as the "cost" of an incorrect decision, $\mathcal{L}$ quantifies the gap between prediction $\hat{y}$ and ground truth $y$. The choice of loss function shapes the optimization geometry: it determines the gradient landscape that backpropagation must navigate. A loss with flat regions near incorrect predictions produces weak gradients that stall learning, while a loss with steep gradients near the decision boundary accelerates convergence where it matters most -- a systems consequence explored in the cross-entropy discussion below. \index{Loss Function!landscape geometry}
[^fn-loss-function]: **Loss Function**\index{Loss Function!etymology}: Formalized by Abraham Wald in statistical decision theory as the "cost" of an incorrect decision, $\mathcal{L}$ quantifies the gap between prediction $\hat{y}$ and ground truth $y$. The choice of loss function shapes the optimization geometry: it determines the gradient landscape that backpropagation must navigate. A loss with flat regions near incorrect predictions produces weak gradients that stall learning, while a loss with steep gradients near the decision boundary accelerates convergence where it matters most—a systems consequence explored in the cross-entropy discussion below. \index{Loss Function!landscape geometry}
This error measurement drives the adjustment of network parameters through backpropagation, which we examine in detail below.

@@ -2853,11 +2853,11 @@ For each image in the batch, this produces a probability distribution over the p
from mlsys.formatting import fmt, check
from mlsys.constants import flop, MFLOPs

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MnistFlopsCalc:
"""Namespace for Mnist Flops Calc."""

# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
batch_size_value = 32
in_dim_value = 784
h1_value = 128
@@ -2866,7 +2866,7 @@ class MnistFlopsCalc:
double_h1_value = 256
double_h2_value = 128

# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
flops_l1_mm_value = 2 * batch_size_value * in_dim_value * h1_value
flops_l1_bias_value = 2 * (batch_size_value * h1_value) # Bias add + ReLU

@@ -2903,7 +2903,7 @@ class MnistFlopsCalc:
double_total_mops_value = (double_total_flops_value * flop).m_as(MFLOPs)
double_ratio_value = double_total_mops_value / total_mops_value

# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
l1_mm_str = f"{flops_l1_mm_value:,}"
l1_bias_str = f"{flops_l1_bias_value:,}"
l2_mm_str = f"{flops_l2_mm_value:,}"
@@ -2974,7 +2974,7 @@ The organization of computations also affects performance. Matrix operations can

The computational characteristics of neural networks favor parallel processing architectures\index{GPU!parallel computation}\index{Parallel Processing!neural networks}. While traditional CPUs can execute these operations, GPUs designed for parallel computation achieve substantial speedups, often 10--100$\times$ faster for matrix operations. Specialized AI accelerators achieve even better efficiency through reduced precision arithmetic, specialized memory architectures, and dataflow optimizations tailored for neural network computation patterns.
Energy consumption also varies significantly across hardware platforms\index{Energy Consumption!hardware platforms}\index{Throughput!accelerator parallelism}. CPUs offer flexibility but consume more energy per operation. GPUs provide high throughput at higher power consumption. Specialized edge accelerators optimize for energy efficiency, achieving the same computations with orders of magnitude less power, which is important for mobile and embedded deployments. This energy disparity stems from the memory hierarchy constraints where data movement dominates computation costs.
Energy consumption also varies by orders of magnitude across hardware platforms\index{Energy Consumption!hardware platforms}\index{Throughput!accelerator parallelism}. CPUs offer flexibility but consume more energy per operation. GPUs provide high throughput at higher power consumption. Specialized edge accelerators optimize for energy efficiency, achieving the same computations with orders of magnitude less power, which is important for mobile and embedded deployments. This energy disparity stems from the memory hierarchy constraints where data movement dominates computation costs.
These considerations recur throughout subsequent chapters, particularly in @sec-network-architectures where architecture-specific optimizations introduce additional trade-offs.

@@ -3037,7 +3037,7 @@ For a batch of B examples, the cross-entropy loss becomes @eq-batch-cross-entrop

$$ \mathcal{L}_{\text{batch}} = -\frac{1}{B}\sum_{i=1}^B \sum_{j=1}^{10} y_{ij} \log(\hat{y}_{ij}) $$ {#eq-batch-cross-entropy}

Computing this loss efficiently requires careful consideration of numerical precision. Taking the logarithm of very small probabilities can lead to numerical instability. Consider a case where our network predicts a probability of 0.0001 for the correct class. Computing $\log(0.0001)$ directly might cause underflow or result in imprecise values.
Computing this loss efficiently requires careful consideration of numerical precision. Taking the logarithm of near-zero probabilities can lead to numerical instability. Consider a case where our network predicts a probability of 0.0001 for the correct class. Computing $\log(0.0001)$ directly might cause underflow or result in imprecise values.
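
A short sketch with assumed logit values illustrates the failure mode; the log-sum-exp formulation shown for contrast is one common remedy, not necessarily the exact form adopted in the modifications below:

```python
# Forming probabilities first and then taking the log can underflow in FP32;
# computing log-probabilities directly from the logits does not.
import numpy as np

logits = np.array([30.0, 0.0, -90.0], dtype=np.float32)

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)          # smallest entry underflows to exactly 0.0 in FP32
print(np.log(probs))  # ... so its log is -inf

log_probs = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
print(log_probs)      # finite values, approximately [0, -30, -120]
```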
To address this, we typically implement the loss computation with two key modifications:

@@ -3063,7 +3063,7 @@ During each training iteration, the loss value serves multiple purposes. As a pe

For our MNIST classifier, monitoring the loss during training reveals the network's learning trajectory. A typical pattern begins with high loss ($\sim 2.3$, equivalent to random guessing among 10 classes), followed by rapid decrease in early iterations as the network discovers the most salient features. Progress then slows to gradual improvement as the network fine-tunes its predictions for harder cases, eventually stabilizing at a lower loss ($\sim 0.1$, indicating confident correct predictions).
The loss function's gradients with respect to the network's outputs provide the initial error signal that drives backpropagation. For cross-entropy loss, these gradients have a particularly simple form: the difference between predicted and true probabilities. This mathematical property makes cross-entropy loss especially suitable for classification tasks, as it provides strong gradients even when predictions are very wrong.
The loss function's gradients with respect to the network's outputs provide the initial error signal that drives backpropagation. For cross-entropy loss, these gradients have a particularly simple form: the difference between predicted and true probabilities. This mathematical property makes cross-entropy loss especially suitable for classification tasks, as it provides strong gradients even when predictions are far from the target.
The choice of loss function also influences other training decisions. Larger loss gradients may require smaller learning rates to prevent overshooting, while loss averaging across batches affects gradient stability and thus optimal batch size. The loss landscape's curvature shapes which optimization algorithms work best, and the loss value's trajectory determines when training has converged.
@@ -3131,7 +3131,7 @@ This computation cascades backward through the network, with each layer's gradie
[^fn-chain-rule-depth]: **Chain Rule**: The calculus identity $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_1}{\partial w}$ becomes a *product* of $n$ terms for an $n$-layer network. If each partial derivative is slightly less than 1, the product vanishes exponentially; if slightly greater, it explodes. This multiplicative structure is why depth is a systems constraint, not just a design choice: it dictates the numerical precision requirements and initialization strategies (e.g., Glorot, He) needed to keep training stable. \index{Chain Rule!depth constraint}
This process faces challenges in deep networks. As gradients flow backward through many layers, they can either vanish or explode\index{Exploding Gradients!training instability}\index{Gradient!vanishing and exploding}. When gradients are repeatedly multiplied through many layers, they can become exponentially small, particularly with sigmoid or tanh activation functions. This causes early layers to learn very slowly or not at all, as they receive negligible updates. Conversely, if gradient values are consistently greater than 1, they can grow exponentially, leading to unstable training and destructive weight updates.
This process faces challenges in deep networks. As gradients flow backward through many layers, they can either vanish or explode\index{Exploding Gradients!training instability}\index{Gradient!vanishing and exploding}. When gradients are repeatedly multiplied through many layers, they can become exponentially small, particularly with sigmoid or tanh activation functions. This causes early layers to learn at negligible rates or not at all, as the updates reaching them are vanishingly small. Conversely, if gradient values are consistently greater than 1, they can grow exponentially, leading to unstable training and destructive weight updates.
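The arithmetic is easy to see with made-up per-layer factors: multiplying fifty of them together either collapses toward zero or blows up, depending on whether each factor sits slightly below or slightly above one.

```python
depth = 50
print(f"0.9^{depth} = {0.9 ** depth:.2e}")   # ~5.2e-03: signal vanishes
print(f"1.1^{depth} = {1.1 ** depth:.2e}")   # ~1.2e+02: updates explode
```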
#### Derivative Calculation Process {#sec-neural-computation-derivative-calculation-process-edaa}
@@ -3247,11 +3247,11 @@ The "Credit Assignment Problem" asks: which weight caused this error? Now that y
### Weight Update and Optimization {#sec-neural-computation-weight-update-optimization-3e00}
Training neural networks\index{Training!weight optimization}\index{Weight Update!optimization} requires systematic adjustment of weights and biases to minimize prediction errors through iterative optimization. Building on the computational foundations established earlier, this section explores the core mechanisms of neural network optimization, from gradient-based parameter updates to practical training implementations.
Backpropagation computes *what* each weight should change, but not *how much*\index{Training!weight optimization}\index{Weight Update!optimization}. The step size, the direction refinement, and the momentum across iterations are all governed by the optimizer—the algorithm that converts raw gradients into weight updates. The choice of optimizer determines whether training converges in hours or diverges in minutes.
#### Parameter Update Algorithms {#sec-neural-computation-parameter-update-algorithms-b592}
The optimization process adjusts network weights through **gradient descent**, a systematic method that implements the learning principles derived from our biological neural network analysis.
The optimization process adjusts network weights through **gradient descent**, a systematic method that uses the error signal from backpropagation to determine the direction and magnitude of each weight update.
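In code, the basic update is a single line per parameter tensor. The sketch below applies one vanilla gradient-descent step to a dictionary of weights; the learning rate and values are illustrative, and real frameworks perform this update inside their optimizer objects.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Move each parameter opposite its gradient, scaled by the learning rate."""
    return {name: p - lr * grads[name] for name, p in params.items()}

params = {"W1": np.array([[0.5, -0.3], [0.1, 0.8]]), "b1": np.zeros(2)}
grads  = {"W1": np.array([[0.2,  0.4], [-0.1, 0.0]]), "b1": np.array([0.05, -0.02])}
print(sgd_step(params, grads)["W1"])
```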
::: {.callout-definition title="Gradient Descent"}
@@ -3336,7 +3336,7 @@ During training, we monitor several key metrics: training loss tracks the averag
#### Convergence and Stability Considerations {#sec-neural-computation-convergence-stability-considerations-a9b7}
Successful neural network training requires attention to several practical aspects that significantly impact learning effectiveness. These considerations bridge theoretical understanding and practical implementation, beginning with the central risk of *overfitting*\index{Convergence!training stability}.
A network that achieves 99.5% accuracy on training data but only 85% on new data has not learned the underlying patterns—it has memorized the training set\index{Convergence!training stability}. This failure mode, *overfitting*, is the central risk in practical training.
::: {.callout-definition title="Overfitting"}
@@ -3348,7 +3348,7 @@ Successful neural network training requires attention to several practical aspec
:::
Learning rate selection\index{Learning Rate!selection criteria}\index{Hyperparameter!tuning} is perhaps the most critical parameter affecting training. For our MNIST network, the choice of learning rate dramatically influences the training dynamics. A large learning rate of 0.1 might cause unstable training where the loss oscillates or explodes as weight updates overshoot optimal values. Conversely, a very small learning rate of 0.0001 might result in extremely slow convergence, requiring many more epochs to achieve good performance. A moderate learning rate of 0.01 often provides a good balance between training speed and stability, allowing the network to make steady progress while maintaining stable learning.
Learning rate selection\index{Learning Rate!selection criteria}\index{Hyperparameter!tuning} is perhaps the most critical parameter affecting training. For our MNIST network, the choice of learning rate dramatically influences the training dynamics. A large learning rate of 0.1 might cause unstable training where the loss oscillates or explodes as weight updates overshoot optimal values. Conversely, a learning rate of 0.0001 might result in extremely slow convergence, requiring many more epochs to achieve good performance. A moderate learning rate of 0.01 often provides a good balance between training speed and stability, allowing the network to make steady progress while maintaining stable learning.
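These regimes are visible even on a toy one-dimensional loss. The sketch below runs plain gradient descent on $L(w) = 10w^2$ with the three rates discussed above; the loss function itself is purely illustrative and chosen only so that the same rates exhibit oscillation, smooth convergence, and crawling progress.

```python
# dL/dw = 20 * w for L(w) = 10 * w**2
for lr in (0.1, 0.01, 0.0001):
    w = 1.0
    for _ in range(100):
        w -= lr * 20 * w
    print(f"lr={lr:<6} -> w after 100 steps: {w:+.6f}")
# 0.1    oscillates between +1 and -1 (each update overshoots the minimum)
# 0.01   converges smoothly toward 0
# 0.0001 barely moves: stable but needs far more iterations
```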
Convergence monitoring\index{Convergence!monitoring}\index{Validation!accuracy plateau} provides essential feedback during training—and continues into production deployment, as covered in @sec-ml-operations. As training progresses, the loss value typically stabilizes around a particular value, indicating the network is approaching a local optimum. Validation accuracy often plateaus as well, suggesting the network has extracted most learnable patterns from the data. The gap between training and validation performance reveals whether the network is overfitting\index{Overfitting!training challenge} or generalizing well to new examples. The interplay between batch size, available memory, and computational resources requires careful balancing to achieve efficient training within hardware constraints—the same memory-computation trade-offs established in the backpropagation section above.
@@ -3399,9 +3399,9 @@ Training transforms randomly initialized weights into parameters that encode mea
### Production Deployment and Prediction Pipeline {#sec-neural-computation-production-deployment-prediction-pipeline-19c0}
The core characteristics of inference begin with a systematic comparison to training, then the computational pipeline that transforms inputs into predictions.
A model that achieved 99% accuracy on the test set produces nonsensical outputs three months after deployment, yet no code has changed. The weights are frozen, the architecture is identical, and the inference pipeline runs without error. The problem is that the world moved while the model stood still.
The transition from training to inference introduces a constraint on model adaptability that significantly impacts system design. Trained models generalize to unseen inputs through learned statistical patterns, but parameters remain fixed throughout deployment. Once training concludes, the model applies its learned probability distributions without modification. When operational data distribution diverges from training distributions, the model continues executing its fixed computational pathways regardless of this shift. Consider an autonomous vehicle perception system: if construction zone frequency increases substantially or novel vehicle configurations appear in deployment, the model's responses reflect statistical patterns learned during training rather than adapting to the evolved operational context. Adaptation in ML systems emerges not from runtime model modification but from systematic retraining with updated data, a deliberate engineering process detailed in @sec-model-training.
The transition from training to inference introduces a constraint on model adaptability that fundamentally shapes system design. Trained models generalize to unseen inputs through learned statistical patterns, but parameters remain fixed throughout deployment. Once training concludes, the model applies its learned probability distributions without modification. When operational data distribution diverges from training distributions, the model continues executing its fixed computational pathways regardless of this shift. Consider an autonomous vehicle perception system: if construction zone frequency increases substantially or novel vehicle configurations appear in deployment, the model's responses reflect statistical patterns learned during training rather than adapting to the evolved operational context. Adaptation in ML systems emerges not from runtime model modification but from systematic retraining with updated data, a deliberate engineering process detailed in @sec-model-training.
#### Operational Phase Differences {#sec-neural-computation-operational-phase-differences-3f95}
@@ -3431,16 +3431,16 @@ The transition from training to inference introduces a constraint on model adapt
from mlsys.constants import A100_MEM_CAPACITY, A100_TDP, H100_TDP, GiB, watt
from mlsys.formatting import fmt, check
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class GpuSpecsFootnote:
"""Namespace for Gpu Specs Footnote."""
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
a100_mem_gb_value = A100_MEM_CAPACITY.m_as(GiB) # e.g. 80
a100_tdp_w_value = A100_TDP.m_as(watt) # e.g. 400
h100_tdp_w_value = H100_TDP.m_as(watt) # e.g. 700
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
a100_mem_gb_str = fmt(a100_mem_gb_value, precision=0, commas=False) # e.g. "80"
a100_tdp_w_str = fmt(a100_tdp_w_value, precision=0, commas=False) # e.g. "400"
h100_tdp_w_str = fmt(h100_tdp_w_value, precision=0, commas=False) # e.g. "700"
@@ -3692,19 +3692,19 @@ Computational requirements follow a fixed pattern for each input:
* Output layer: `{python} MNISTMemory.inf_madd_l3_str` multiply-adds
* Total: `{python} MNISTMemory.inf_madd_total_str` multiply-add operations per inference
This resource profile differs markedly from training requirements, where additional memory for gradients and computational overhead for backpropagation significantly increase resource demands (see the worked example in @sec-neural-computation-model-size-computational-complexity-1f0f). The predictable, streamlined nature of inference computations enables various optimization opportunities and efficient hardware utilization.
This resource profile differs markedly from training requirements, where gradient storage and backpropagation overhead multiply resource demands by `{python} MNISTMemory.training_ratio_str`$\times$ or more (see the worked example in @sec-neural-computation-model-size-computational-complexity-1f0f). The predictable, streamlined nature of inference enables optimization opportunities that training cannot exploit.
#### Performance Enhancement Techniques {#sec-neural-computation-performance-enhancement-techniques-692f}
The fixed nature of inference computation presents optimization opportunities\index{Inference!optimization techniques}\index{Performance Optimization!inference} unavailable during training. Once parameters are frozen, the predictable computation pattern allows systematic improvements in both memory usage and computational efficiency.
Batch size selection\index{Batch Size!inference trade-off}\index{Inference!batch size selection} represents a key inference trade-off. During training, large batches stabilized gradient computation, but inference offers more flexibility. Processing single inputs minimizes latency, making it ideal for real-time applications requiring immediate responses. Batch processing, however, significantly improves throughput by using parallel computing capabilities more effectively, particularly on GPUs. For our MNIST network, processing a single image requires storing 202 activation values, while a batch of 32 images requires 6,464 activation values but can process images up to 32 times faster on parallel hardware.
Batch size selection\index{Batch Size!inference trade-off}\index{Inference!batch size selection} represents a key inference trade-off. During training, large batches stabilized gradient computation, but inference offers more flexibility. Processing single inputs minimizes latency, making it ideal for real-time applications requiring immediate responses. Batch processing, however, improves throughput by 10--32$\times$ through more effective use of parallel computing capabilities, particularly on GPUs. For our MNIST network, processing a single image requires storing 202 activation values, while a batch of 32 images requires 6,464 activation values but can process images up to 32 times faster on parallel hardware.
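The activation counts quoted above follow directly from the layer sizes. A quick sketch, assuming the chapter's 128-64-10 hidden and output layers:

```python
layer_sizes = [128, 64, 10]               # hidden1, hidden2, output
per_example = sum(layer_sizes)            # 202 activation values
for batch in (1, 32):
    print(f"batch={batch:>2}: {per_example * batch:,} activation values to store")
# batch= 1: 202
# batch=32: 6,464
```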
Memory management\index{Memory Management!inference efficiency}\index{Memory Reuse!activation buffers} during inference is significantly more efficient than during training. Since intermediate values serve only forward computation, memory buffers can be reused aggressively. Activation values from each layer need only exist until the next layer's computation completes, enabling in-place operations that reduce the total memory footprint. The fixed nature of inference allows precise memory alignment and access patterns optimized for the underlying hardware architecture.
Memory management\index{Memory Management!inference efficiency}\index{Memory Reuse!activation buffers} during inference is far more efficient than during training. Since intermediate values serve only forward computation, memory buffers can be reused aggressively. Activation values from each layer need only exist until the next layer's computation completes, enabling in-place operations that reduce the total memory footprint. The fixed nature of inference allows precise memory alignment and access patterns optimized for the underlying hardware architecture.
Hardware-specific optimizations\index{Hardware Optimization!architecture-specific}\index{SIMD!vector parallelism} become particularly important during inference. On CPUs, computations can be organized to maximize cache utilization and exploit SIMD (single instruction, multiple data) parallelism. GPU deployments benefit from optimized matrix multiplication routines and efficient memory transfer patterns. These optimizations extend beyond computational efficiency to significantly impact power consumption and hardware utilization, critical factors in real-world deployments.
Hardware-specific optimizations\index{Hardware Optimization!architecture-specific}\index{SIMD!vector parallelism} become particularly important during inference. On CPUs, computations can be organized to maximize cache utilization and exploit SIMD (single instruction, multiple data) parallelism. GPU deployments benefit from optimized matrix multiplication routines and efficient memory transfer patterns. These optimizations extend beyond computational efficiency to reduce power consumption and improve hardware utilization, critical factors in real-world deployments.
The predictable nature of inference also enables optimizations like reduced numerical precision\index{Numerical Precision!inference optimization}\index{Quantization}. While training typically requires full floating-point precision to maintain stable learning, inference can often operate with reduced precision while maintaining acceptable accuracy. For our MNIST network, such optimizations could significantly reduce the memory footprint with corresponding improvements in computational efficiency.
The predictable nature of inference also enables optimizations like reduced numerical precision\index{Numerical Precision!inference optimization}\index{Quantization}. While training typically requires full floating-point precision to maintain stable learning, inference can often operate with reduced precision while maintaining acceptable accuracy. For our MNIST network, such optimizations could halve the memory footprint with corresponding improvements in computational efficiency.
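The memory side of that trade-off is simple arithmetic. The sketch below assumes a 784-128-64-10 fully connected network (the exact parameter count is an illustrative assumption, not a figure from the chapter's executed code):

```python
# Weights + biases for an assumed 784-128-64-10 fully connected network.
params = (784 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10)
for name, bytes_per_param in (("FP32", 4), ("FP16", 2), ("INT8", 1)):
    kib = params * bytes_per_param / 1024
    print(f"{name}: {kib:,.0f} KiB")
# FP32 ~427 KiB, FP16 ~214 KiB, INT8 ~107 KiB
```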
These optimization principles, while illustrated through our simple MNIST feedforward network, represent only the foundation of neural network optimization. More sophisticated architectures introduce additional considerations and opportunities, including specialized designs for spatial data processing, sequential computation, and attention-based computation patterns. These architectural variations and their optimizations are explored in @sec-network-architectures and @sec-model-compression. Production deployment considerations, including batching strategies and runtime optimization, are covered in @sec-benchmarking and @sec-ml-operations.
@@ -3714,7 +3714,7 @@ Neural network outputs must be transformed into actionable predictions, which re
The complexity of post-processing extends beyond simple mathematical transformations. Real-world systems must handle uncertainty, validate outputs, and integrate with larger computing systems. In our MNIST example, a digit recognition system might require not just the most likely digit, but also confidence measures to determine when human intervention is needed. This introduces additional computational steps: confidence thresholds, secondary prediction checks, and error handling logic, all of which are implemented in traditional computing frameworks.
The computational requirements of post-processing differ significantly from neural network inference. While inference benefits from parallel processing and specialized hardware, post-processing typically runs on conventional CPUs and follows sequential logic. Operations are more flexible and easier to modify than neural computations, but they can become bottlenecks if not carefully implemented. Computing softmax probabilities for a batch of predictions, for instance, requires different optimization strategies than the matrix multiplications of neural network layers.
The computational requirements of post-processing differ fundamentally from neural network inference. While inference benefits from parallel processing and specialized hardware, post-processing typically runs on conventional CPUs and follows sequential logic. Operations are more flexible and easier to modify than neural computations, but they can become bottlenecks if not carefully implemented. Computing softmax probabilities for a batch of predictions, for instance, requires different optimization strategies than the matrix multiplications of neural network layers.
System integration considerations often dominate post-processing design. Output formats must match downstream system requirements, error handling must align with broader system protocols, and performance must meet system-level constraints. In a complete mail sorting system, the post-processing stage must not only identify digits but also format these predictions for the sorting machinery, handle uncertainty cases appropriately, and maintain processing speeds that match physical mail flow rates.
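A minimal sketch of this kind of confidence gating follows; the threshold, routing labels, and logits are invented for illustration and are not taken from the USPS system.

```python
import numpy as np

def postprocess(logits, threshold=0.90):
    """Convert raw logits into a prediction plus a routing decision."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax probabilities
    digit = int(probs.argmax())
    confident = probs[digit] >= threshold
    return {"digit": digit, "confidence": float(probs[digit]),
            "route": "automatic" if confident else "human_review"}

print(postprocess(np.array([0.1, 9.0, 0.3, 0, 0, 0, 0, 0, 0, 0])))   # clear "1"
print(postprocess(np.array([2.0, 1.9, 1.8, 0, 0, 0, 0, 0, 0, 0])))   # ambiguous -> deferred
```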
@@ -3770,7 +3770,7 @@ The complete neural network lifecycle—from architecture design through trainin
## USPS Digit Recognition {#sec-neural-computation-case-study-usps-digit-recognition-97be}
\index{LeCun, Yann!LeNet deployment}
The concepts we have developed, including forward propagation, backpropagation, loss functions, batch processing, and inference optimization, may feel abstract in isolation. How do they combine in a real system under real constraints? We have traced a single $28\times28$ digit from `{python} rb_ops_str` rule-based comparisons to `{python} dl_total_macs_str` neural-network MACs; now we see what happens when that digit must be classified millions of times per day under production latency constraints. The USPS handwritten digit recognition system\index{USPS System!digit recognition}\index{LeNet!USPS deployment}, an early large-scale neural network deployment, provides the answer. Deployed in the 1990s [@lecun1989backpropagation; @lecun1998gradient], this system gives concrete form to every concept from this chapter: preprocessing normalizes varying handwriting, the neural network performs forward propagation through learned weights, confidence thresholds implement post-processing logic, and the complete pipeline operates under strict latency constraints. This early production deployment established principles still relevant in modern ML systems: the importance of robust preprocessing pipelines, the need for confidence thresholds in automated decision-making, and the challenge of maintaining performance under varying real-world conditions. While today's systems deploy vastly more sophisticated architectures on more capable hardware, this foundational case study reveals how optimization principles combine to create production systems, with lessons that scale from 1990s mail sorting to 2025's edge AI deployments.
In the early 1990s, the United States Postal Service needed to read over 100 million handwritten ZIP codes per day\index{USPS System!digit recognition}\index{LeNet!USPS deployment}. Human operators processed one digit per second at a cost that was becoming untenable. The solution was one of the first large-scale neural network deployments: a system that classified the same $28\times28$ digits we have been analyzing, but millions of times per day under strict latency constraints. Deployed by Yann LeCun and colleagues [@lecun1989backpropagation; @lecun1998gradient], this system gives concrete form to every operation from this chapter: preprocessing normalizes varying handwriting, the neural network performs forward propagation through learned weights, confidence thresholds implement post-processing logic, and the complete pipeline must finish before each mail piece reaches its sorting point. The engineering principles it established—robust preprocessing, confidence-based routing, and end-to-end pipeline optimization—remain the template for production ML systems three decades later.
### The Mail Sorting Challenge {#sec-neural-computation-mail-sorting-challenge-ef8c}
@@ -3786,7 +3786,7 @@ This challenging environment imposed requirements spanning every aspect of neura
### Engineering Process and Design Decisions {#sec-neural-computation-engineering-process-design-decisions-2e8e}
The development of the USPS digit recognition system required careful consideration at every stage, from data collection to deployment. This process illustrates how theoretical principles of neural networks translate into practical engineering decisions.
Recognizing a handwritten "7" on a white envelope is straightforward. Recognizing it on a crumpled package with coffee stains, ballpoint smudges, and overlapping address lines requires engineering decisions at every stage from data collection to deployment.
Data collection presented the first major challenge—and a concrete instance of the data pipeline principles covered in @sec-data-engineering. Unlike controlled laboratory environments, postal facilities processed mail with tremendous variety. The training dataset had to capture this diversity: digits written by people of different ages, educational backgrounds, and writing styles; envelopes in varying colors and textures; and images captured under different lighting conditions and orientations. The data quality, labeling consistency, and distribution coverage that @sec-data-engineering emphasizes were not abstract concerns here; they directly determined whether the system could handle a hurried scrawl as reliably as a carefully printed digit. This extensive data collection effort later contributed to the creation of the MNIST database [@lecun1998gradient] used throughout our examples.
@@ -3877,17 +3877,17 @@ Once captured, the raw images are far from ready for neural network processing.
from mlsys.constants import BYTES_FP32, KiB
from mlsys.formulas import model_memory
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class UspsLenetSpecs:
"""Namespace for Usps Lenet Specs."""
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
lenet_1_params = 10000 # approx params in 1989 LeNet
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
lenet_1_mem_kb = model_memory(lenet_1_params, BYTES_FP32, KiB)
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
lenet_1_params_str = f"{lenet_1_params:,}" # e.g. "10,000"
lenet_1_mem_kb_str = f"{lenet_1_mem_kb:.0f}" # e.g. "39"
@@ -3926,7 +3926,7 @@ Neural network-based ZIP code recognition transformed USPS mail processing opera
:::
Performance metrics validated many of the principles developed earlier in the chapter. The system achieved its highest accuracy on clearly written digits similar to those in the training data, but performance varied significantly with real-world factors: lighting conditions affected preprocessing effectiveness, unusual writing styles occasionally confused the neural network, and environmental vibrations degraded image quality. These challenges led to continuous refinements in both the physical system and the neural network pipeline.
Performance metrics validated many of the principles developed earlier in the chapter. The system achieved its highest accuracy on clearly written digits similar to those in the training data, but performance varied with real-world factors: lighting conditions affected preprocessing effectiveness, unusual writing styles occasionally confused the neural network, and environmental vibrations degraded image quality. These challenges led to continuous refinements in both the physical system and the neural network pipeline.
The economic impact proved substantial. Before automation, manual sorting required operators to read and key in ZIP codes at an average rate of one piece per second. The neural network system processed pieces at ten times this rate\index{Throughput!USPS automation} while reducing labor costs and error rates. The system did not eliminate human operators entirely; their role shifted to handling uncertain cases and maintaining system performance. This hybrid approach, combining artificial and human intelligence, became a model for subsequent automation projects.
@@ -3970,19 +3970,19 @@ The USPS system's success was not merely a triumph of neural network accuracy—
## D·A·M Taxonomy {#sec-neural-computation-deep-learning-ai-triad-09cb}
The USPS case study made concrete what this chapter has developed abstractly: LeNet's architecture matched the digit recognition task (Algorithm), diverse handwriting samples captured real-world variation (Data), and specialized hardware met latency constraints (Machine). The neural network concepts explored throughout this chapter map directly onto this three-part framework, illuminating why deep learning requires rethinking computational architectures and system design from first principles.
The USPS system succeeded because three dimensions aligned: LeNet's architecture matched the digit recognition task (Algorithm), diverse handwriting samples captured real-world variation (Data), and specialized hardware met latency constraints (Machine). This alignment was not coincidental—it reflects the **D·A·M taxonomy** that governs all deep learning deployments, where each component constrains and enables the others.
The mathematical foundations we covered—forward propagation, activation functions, backpropagation, and gradient descent—define the algorithmic core of deep learning systems. The architecture choices we make (layer depths, neuron counts, connection patterns) directly determine the computational complexity, memory requirements, and training dynamics. Each activation function selection, from ReLU's computational efficiency to sigmoid's saturating gradients, represents an algorithmic decision with profound systems implications. The hierarchical feature learning that distinguishes neural networks from classical approaches emerges from these algorithmic building blocks, but success depends critically on the other two triangle components.
Forward propagation, activation functions, backpropagation, and gradient descent define the algorithmic core of deep learning systems. The architecture choices we make (layer depths, neuron counts, connection patterns) directly determine the computational complexity, memory requirements, and training dynamics. Each activation function selection, from ReLU's computational efficiency to sigmoid's saturating gradients, represents an algorithmic decision with profound systems implications. The hierarchical feature learning that distinguishes neural networks from classical approaches emerges from these algorithmic building blocks, but success depends critically on the other two triangle components.
Learning depends entirely on labeled data to calculate loss functions and guide weight updates through backpropagation. Our MNIST example demonstrated how data quality, distribution, and scale directly determine network performance: the algorithms remain identical, but data characteristics govern whether learning succeeds or fails. The shift from manual feature engineering to automatic representation learning does not eliminate data dependency; it transforms the challenge from designing features to curating datasets that capture the full complexity of real-world patterns. Preprocessing, augmentation, and validation strategies become algorithmic design decisions that shape the entire learning process.
The Machine component manages the massive number of matrix multiplications required for forward and backward propagation, revealing why specialized hardware became essential for deep learning success. The memory bandwidth limitations we explored, the parallel computation patterns that favor GPU architectures, and the different computational demands of training versus inference all stem from the mathematical operations we studied. The evolution from CPUs to GPUs to specialized AI accelerators directly responds to the computational patterns inherent in neural network algorithms. Understanding these mathematical foundations enables engineers to make informed decisions about hardware selection, memory hierarchy design, and distributed training strategies.
The Machine component manages the massive number of matrix multiplications required for forward and backward propagation, revealing why specialized hardware became essential for deep learning success. Memory bandwidth limitations, parallel computation patterns that favor GPU architectures, and the different computational demands of training versus inference all stem from the mathematical operations at the core of neural networks. The evolution from CPUs to GPUs to specialized AI accelerators directly responds to the computational patterns inherent in neural network algorithms. Understanding these mathematical foundations enables engineers to make informed decisions about hardware selection, memory hierarchy design, and distributed training strategies.
The interdependence of these three components emerges through our chapter's progression: algorithms define what computations are necessary, data determines whether those computations can learn meaningful patterns, and machines determine whether the system can execute efficiently at scale. Neural networks succeeded not because any single component improved, but because advances in all three areas aligned. More sophisticated algorithms, larger datasets, and specialized hardware created a synergistic effect that transformed artificial intelligence.
The interdependence of these three components is the central lesson: algorithms define what computations are necessary, data determines whether those computations can learn meaningful patterns, and machines determine whether the system can execute efficiently at scale. Neural networks succeeded not because any single component improved, but because advances in all three areas aligned. More sophisticated algorithms, larger datasets, and specialized hardware created a synergistic effect that transformed artificial intelligence.
This D·A·M perspective explains why deep learning engineering requires systems thinking that extends well beyond traditional software development. Optimizing any single axis without considering the others leads to suboptimal outcomes: the most elegant algorithms fail without quality data, the best datasets remain unusable without adequate machines, and the most powerful machines achieve nothing without algorithms that can learn from data. When performance stalls, ask: where is the flow blocked? Check the D·A·M.
The mathematical foundations, systems trade-offs, and deployment principles developed throughout this chapter equip engineers to reason about neural networks from first principles. Yet conceptual understanding alone is insufficient—practitioners must also recognize the recurring misconceptions that derail real-world projects.
These foundations equip engineers to reason about neural networks from first principles. Yet conceptual understanding alone is insufficient—practitioners must also recognize the recurring misconceptions that derail real-world projects.
## Fallacies and Pitfalls {#sec-neural-computation-fallacies-pitfalls-3422}
@@ -4029,7 +4029,7 @@ These fallacies and pitfalls share a common root: applying intuitions from deter
We opened this chapter with a question: why do deep learning systems engineers need mathematical understanding rather than treating neural networks as black-box components? The answer emerges through every section. When a production model fails, the problem lies not in the code but in the mathematics: a misconfigured learning rate causes gradients to explode during backpropagation, an activation function saturates and blocks learning in deep layers, or memory requirements during training exceed GPU capacity because of stored activations and optimizer states. Engineers who understand forward propagation can trace which layer produces anomalous activations. Engineers who understand backpropagation can diagnose vanishing gradients. Engineers who understand the distinction between training and inference can predict memory consumption before deployment surprises them.
Neural networks transform computational approaches by replacing rule-based programming with adaptive systems that learn patterns from data. Building on the biological-to-artificial neuron mappings explored throughout this chapter, these systems process complex information and improve performance through experience.
Neural networks transform computational approaches by replacing rule-based programming with adaptive systems that learn patterns from data. The biological-to-artificial neuron mapping—weighted sums, nonlinear activations, and gradient-based learning—provides the atomic operations from which all modern architectures are composed.
Neural network architecture demonstrates hierarchical processing, where each layer extracts progressively more abstract patterns from raw data. Training adjusts connection weights through iterative optimization to minimize prediction errors, while inference applies learned knowledge to make predictions on new data. This separation between learning and application phases creates distinct system requirements for computational resources, memory usage, and processing latency that shape system design and deployment strategies. Training requires ~`{python} MNISTMemory.training_ratio_str`$\times$ more memory than inference because gradients, optimizer state, and activations must be stored and updated. The USPS digit recognition case study demonstrated that these mathematical principles combine into production systems where the complete pipeline—preprocessing, neural inference, and post-processing—must operate within real-world latency and reliability constraints.
@@ -4056,18 +4056,18 @@ Neural network architecture demonstrates hierarchical processing, where each lay
# │ Exports: fc1_weights_str, mnist_pixels_str, fc1_neurons_str
# └─────────────────────────────────────────────────────────────────────────────
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class MnistWeightsCalc:
"""Namespace for Mnist Weights Calc."""
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
mnist_pixels_value = 784
fc1_neurons_value = 128
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
fc1_weights_value = mnist_pixels_value * fc1_neurons_value
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fc1_weights_str = f"{fc1_weights_value:,}"
mnist_pixels_str = f"{mnist_pixels_value}"
fc1_neurons_str = f"{fc1_neurons_value}"
@@ -4105,3 +4105,10 @@ Real-world problems exhibit structure that generic fully-connected networks cann
::: { .quiz-end }
:::
```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:nn_computation")
```
@@ -80,7 +80,7 @@ from mlsys.formatting import fmt, check, sci
class CompressionSetup:
"""Chapter-wide constants: GPU specs, energy physics, model sizes, device constraints."""
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
# Illustrative energy/perf values
int8_energy_reduction = 20
mobilenet_int8_mj = 47
@@ -91,7 +91,7 @@ class CompressionSetup:
llm_7b_params = 7
gpt3_training_flops_exp = 23
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# A100 specs
a100_tflops_fp16 = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second)
a100_tflops_int8 = A100_FLOPS_INT8.m_as(TFLOPs / second)
@@ -132,11 +132,11 @@ class CompressionSetup:
smartphone_ram_gb = SMARTPHONE_RAM_GB.m_as(GB)
mcu_ram_kb = MCU_RAM_KIB.m_as(KiB)
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
check(a100_int8_speedup >= 2, "A100 INT8 should be at least 2x faster than FP16.")
check(int8_fp32_energy_ratio > 1, "FP32 MAC must cost more energy than INT8 MAC.")
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
a100_tflops_fp16_str = fmt(a100_tflops_fp16, precision=0, commas=False)
a100_tflops_int8_str = fmt(a100_tflops_int8, precision=0, commas=False)
a100_bw_tbs_str = fmt(a100_bw_tbs, precision=1, commas=False)
@@ -288,7 +288,7 @@ According to the **Iron Law** ($T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot
| **Float Add** | 32-bit | 30$\times$ |
| **DRAM Read** | 64-bit | **40,000$\times$** |
**For Inference**: Moving from FP32 to INT8 doesn't just save 4$\times$ memory; it can reduce the **energy per inference** by up to **`{python} int8_energy_reduction_str`$\times$** on hardware with dedicated INT8 units, depending on the compute-to-memory ratio of the workload. This is the difference between a battery lasting 1 hour or 20 hours.
For inference workloads, moving from FP32 to INT8 does not merely save 4$\times$ memory; it can reduce the energy per inference by up to `{python} int8_energy_reduction_str`$\times$ on hardware with dedicated INT8 units, depending on the compute-to-memory ratio of the workload. This is the difference between a battery lasting 1 hour or 20 hours.
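To see how a per-operation energy ratio becomes hours of battery life, a back-of-the-envelope sketch follows; the battery capacity, per-inference energy, and inference rate are assumed values chosen for illustration, and only the 20$\times$ ratio comes from the text.

```python
battery_j = 10 * 3600          # assumed 10 Wh battery, expressed in joules
fp32_j_per_inference = 0.10    # assumed FP32 energy per inference
inferences_per_second = 100    # assumed workload

for name, energy in (("FP32", fp32_j_per_inference),
                     ("INT8", fp32_j_per_inference / 20)):
    seconds = battery_j / (energy * inferences_per_second)
    print(f"{name}: {seconds / 3600:.0f} h of continuous operation")
# FP32: 1 h, INT8: 20 h
```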
These same physics apply at datacenter scale: distributed training systems use reduced precision to cut gradient communication overhead, a topic covered in @sec-model-training. For a deeper treatment of how silicon architectures exploit these energy differences, see @sec-hardware-acceleration.
:::
@@ -363,14 +363,14 @@ from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import BYTES_FP16, BYTES_INT4, byte, MS_PER_SEC
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class QuantizationSpeedup:
"""
Namespace for Quantization Speedup calculation.
Scenario: Deploying a 7B LLM on a bandwidth-constrained device.
"""
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
params_b = 7
bytes_fp16 = 2.0
bytes_int4 = 0.5
@@ -379,7 +379,7 @@ class QuantizationSpeedup:
mem_bw_gbs = 50.0
kv_cache_gb = 1.0
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Sizes
fp16_size_gb = params_b * bytes_fp16
fp16_total_gb = fp16_size_gb + kv_cache_gb
@@ -396,10 +396,10 @@ class QuantizationSpeedup:
speedup = int4_toks / fp16_toks
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(3.5 <= speedup <= 4.5, f"INT4 should yield ~4x speedup vs FP16, got {speedup:.1f}x")
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
params_b_str = f"{params_b}"
bytes_fp16_str = f"{int(bytes_fp16)}"
bytes_int4_str = f"{bytes_int4}"
@@ -526,7 +526,7 @@ def _get_ratio(model_mem, device_mem):
class ModelDeviceComparison:
"""Contrast model requirements with device memory: 6-order-of-magnitude deployment gap."""
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
# Device capacities
cloud_mem = CLOUD_MEM_GIB
mobile_mem = MOBILE_MEM_GIB
@@ -540,7 +540,7 @@ class ModelDeviceComparison:
mobilenet_int8_mem = 3.5 * MiB
dscnn_mem = 500 * KiB
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
dlrm_mobile = _get_ratio(dlrm_mem, mobile_mem)
dlrm_tiny = _get_ratio(dlrm_mem, tiny_mem)
gpt2_mobile = _get_ratio(gpt2_mem, mobile_mem)
@@ -549,11 +549,11 @@ class ModelDeviceComparison:
mobilenet_tiny = _get_ratio(mobilenet_mem, tiny_mem)
mobilenet_int8_tiny = _get_ratio(mobilenet_int8_mem, tiny_mem)
# ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
# DS-CNN always fits TinyML — sanity check
assert _get_ratio(dscnn_mem, tiny_mem) == "ok", "DS-CNN must fit in TinyML device."
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
dlrm_str = f"{dlrm_mem.m_as(GB):.0f} GB"
gpt2_str = f"{gpt2_mem.m_as(GiB):.0f} GB"
resnet_str = f"{resnet_mem.m_as(MiB):.0f} MB"
@@ -1174,7 +1174,7 @@ where $\tau$ is chosen to ensure that only the largest $(1 - s)$ fraction of wei
The primary advantage of unstructured pruning is memory efficiency. By reducing the number of nonzero parameters, pruned models require less storage, which benefits deployment on embedded or mobile devices with limited memory.
Unstructured pruning does not necessarily improve computational efficiency on modern hardware, however. Standard GPUs and TPUs are optimized for dense matrix multiplications, and a sparse weight matrix often cannot fully utilize hardware acceleration unless specialized sparse computation kernels are available. Unstructured pruning therefore primarily benefits model storage rather than inference acceleration.
Unstructured pruning does not necessarily improve computational efficiency on modern hardware, however. Standard GPUs and TPUs are optimized for dense matrix multiplications, and a sparse weight matrix often cannot fully use hardware acceleration unless specialized sparse computation kernels are available. Unstructured pruning therefore primarily benefits model storage rather than inference acceleration.
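A minimal sketch of unstructured magnitude pruning makes the storage-versus-speed point concrete: the threshold $\tau$ is derived from the target sparsity exactly as described above, while the matrix size and sparsity level here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

sparsity = 0.75                              # zero out the smallest 75% of weights
tau = np.quantile(np.abs(W), sparsity)       # magnitude threshold
W_pruned = np.where(np.abs(W) >= tau, W, 0.0)

print(f"nonzero weights: {np.count_nonzero(W_pruned)} / {W.size}")
# Storage shrinks if only nonzeros are kept, but a dense GPU kernel still
# multiplies the full matrix unless a sparse kernel exploits the zeros.
```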
#### Structured Pruning {#sec-model-compression-structured-pruning-9692}
@@ -1184,11 +1184,11 @@ Where unstructured pruning removes individual weights, structured pruning [@li20
Neurons, filters, and layers vary dramatically in their contribution to a model's predictions. Some units primarily carry redundant or low-impact information, and removing them does not significantly degrade model performance. Identifying which structures can be pruned while preserving accuracy remains the core challenge.
\index{Pruning!regularity-compression trade-off}
Hardware-aware pruning\index{Pruning!hardware-aware} strategies, such as N:M structured sparsity[^fn-nm-sparsity-a100]\index{Sparsity!N:M structured}, enforce specific patterns (e.g., ensuring 2 out of every 4 weights are zero) to align with specialized accelerator capabilities. The hardware implementation details of these patterns, including how they leverage sparse tensor cores, are covered in @sec-hardware-acceleration.
Hardware-aware pruning\index{Pruning!hardware-aware} strategies, such as N:M structured sparsity[^fn-nm-sparsity-a100]\index{Sparsity!N:M structured}, enforce specific patterns (e.g., ensuring 2 out of every 4 weights are zero) to align with specialized accelerator capabilities. The hardware implementation details of these patterns, including how they exploit sparse tensor cores, are covered in @sec-hardware-acceleration.
[^fn-nm-sparsity-a100]: **N:M Structured Sparsity**: Introduced commercially with NVIDIA's A100 GPU (2020), the 2:4 pattern was chosen because it halves multiply-accumulate operations while requiring only a 2-bit index per group to select which elements participate. This fixed ratio is a hardware constraint, not a mathematical optimum: the A100's Sparse Tensor Cores are physically wired for 2:4, yielding up to 2$\times$ speedup over dense execution with no software overhead. Other ratios are not supported by current hardware, illustrating how silicon design constrains which sparsity patterns translate to actual speedup. \index{N:M Sparsity!A100 hardware}
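A short sketch of what enforcing the 2:4 pattern means for a flat weight vector; real kernels also store the 2-bit indices mentioned in the footnote, whereas this illustration only zeroes values.

```python
import numpy as np

def prune_2_of_4(w):
    """Keep the 2 largest-magnitude weights in every group of 4; zero the rest."""
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]     # 2 smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.array([0.9, -0.1, 0.05, 0.7,   0.2, -0.8, 0.3, 0.0])
print(prune_2_of_4(w))   # [ 0.9  0.   0.   0.7  0.  -0.8  0.3  0. ]
```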
To ground these distinctions, examine @fig-structured-unstructured from left to right. On the left, unstructured pruning removes individual weights (depicted as dashed connections), creating a sparse weight matrix. This can disrupt the original network structure, as shown in the fully connected network where certain connections have been randomly pruned. While this reduces the number of active parameters, the resulting sparsity requires specialized execution kernels to fully utilize computational benefits.
To ground these distinctions, examine @fig-structured-unstructured from left to right. On the left, unstructured pruning removes individual weights (depicted as dashed connections), creating a sparse weight matrix. This can disrupt the original network structure, as shown in the fully connected network where certain connections have been randomly pruned. While this reduces the number of active parameters, the resulting sparsity requires specialized execution kernels to fully realize computational benefits.
::: {#fig-structured-unstructured fig-env="figure" fig-pos="htb" fig-cap="**Unstructured vs. Structured Pruning.** Unstructured pruning (left) achieves sparsity by removing individual weights, requiring specialized hardware, while structured pruning (middle, right) removes entire neurons or filters, preserving network structure for standard hardware acceleration. Source: [@qi2021efficient]." fig-alt="Three-panel diagram. Left shows unstructured pruning with dashed connections in a neural network. Middle and right show structured pruning: fully connected network with pruned neurons and CNN with pruned filters shown as dashed squares."}
```{.tikz}
@@ -1328,13 +1328,13 @@ Another strategy is activation-based pruning\index{Pruning!activation-based}, wh
Gradient-based pruning\index{Pruning!gradient-based} uses information from the training process to identify less significant neurons or filters. Units with smaller gradient magnitudes contribute less to reducing the loss function, making them candidates for removal. By ranking neurons based on their gradient values, structured pruning can remove those with the least impact on model optimization. Unlike magnitude-based or activation-based pruning, which rely on static properties of the trained model, gradient-based pruning requires access to gradient computations and is typically applied during training rather than as a post-processing step.
These three methods form a progression from static to dynamic assessment of parameter importance, and each presents distinct trade-offs. Magnitude-based pruning is computationally inexpensive and straightforward to implement, making it the default starting point, but it does not account for how neurons behave across different data distributions. Activation-based pruning captures more of this dynamic behavior by evaluating neurons over representative inputs, though it requires additional computation to estimate neuron importance. Gradient-based pruning leverages training dynamics most directly but may introduce prohibitive complexity for large-scale models. In practice, the choice depends on the specific constraints of the target deployment environment: magnitude-based methods suffice for most production scenarios, while gradient-based approaches justify their overhead only when accuracy preservation is paramount.
These three methods form a progression from static to dynamic assessment of parameter importance, and each presents distinct trade-offs. Magnitude-based pruning is computationally inexpensive and straightforward to implement, making it the default starting point, but it does not account for how neurons behave across different data distributions. Activation-based pruning captures more of this dynamic behavior by evaluating neurons over representative inputs, though it requires additional computation to estimate neuron importance. Gradient-based pruning exploits training dynamics most directly but may introduce prohibitive complexity for large-scale models. In practice, the choice depends on the specific constraints of the target deployment environment: magnitude-based methods suffice for most production scenarios, while gradient-based approaches justify their overhead only when accuracy preservation is paramount.
#### Dynamic Pruning {#sec-model-compression-dynamic-pruning-b794}
Traditional pruning methods, whether unstructured or structured, involve static pruning\index{Pruning!static}: parameters are permanently removed after training or at fixed intervals during training, assuming that parameter importance is fixed. Dynamic pruning\index{Pruning!dynamic} relaxes this assumption by adapting pruning decisions based on input data or training dynamics, allowing the model to adjust its structure in real time.
Dynamic pruning can be implemented using runtime sparsity techniques, where the model actively determines which parameters to utilize based on input characteristics. Activation-conditioned pruning exemplifies this approach by selectively deactivating neurons or channels that exhibit low activation values for specific inputs [@dynamicpruning2023]. This method introduces input-dependent sparsity patterns, effectively reducing the computational workload during inference without permanently modifying the model architecture.
Dynamic pruning can be implemented using runtime sparsity techniques, where the model actively determines which parameters to use based on input characteristics. Activation-conditioned pruning exemplifies this approach by selectively deactivating neurons or channels that exhibit low activation values for specific inputs [@dynamicpruning2023]. This method introduces input-dependent sparsity patterns, effectively reducing the computational workload during inference without permanently modifying the model architecture.
For instance, consider a convolutional neural network processing images with varying complexity. During inference of a simple image containing mostly uniform regions, many convolutional filters may produce negligible activations. Dynamic pruning identifies these low-impact filters and temporarily excludes them from computation, improving efficiency while maintaining accuracy for the current input. This adaptive behavior is particularly advantageous in latency-sensitive applications, where computational resources must be allocated judiciously based on input complexity. @sec-benchmarking presents measurement strategies for evaluating such efficiency gains.
@@ -2059,7 +2059,7 @@ TensorFlow takes a different approach through the TensorFlow Model Optimization
# │ Exports: mobilenet_pruning_pct_str, mobilenet_original_size_str, mobilenet_pruned_size_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MobileNetCompressionAnchor:
|
||||
"""
|
||||
Namespace for MobileNet pruning anchor.
|
||||
@@ -2079,7 +2079,7 @@ These trade-offs become concrete when examining real-world deployments. Several
|
||||
|
||||
[^fn-bert-pruning-scale]: **BERT Pruning**: Structured pruning succeeds here because BERT's 12 attention heads per layer exhibit massive redundancy --- Michel et al. (2019) showed that removing 40% of heads changes GLUE scores by only 1.2%. This redundancy is architectural, not accidental: overparameterization aids pre-training optimization, but at deployment, each unnecessary head consumes memory bandwidth for zero accuracy gain. \index{BERT!pruning redundancy}
|
||||
|
||||
Pruning is powerful but has an inherent limitation: it starts with an existing architecture and carves away pieces. The pruned model inherits its structure from the original—same layer types, same connectivity patterns, just fewer parameters. What if the original architecture itself is inefficient for deployment? What if we want a model with a completely different structure, such as a 6-layer transformer instead of a 12-layer one, that still captures the original model's capabilities?
|
||||
Pruning has an inherent limitation: it starts with an existing architecture and carves away pieces. The pruned model inherits its structure from the original—same layer types, same connectivity patterns, just fewer parameters. What if the original architecture itself is inefficient for deployment? What if we want a model with a completely different structure, such as a 6-layer transformer instead of a 12-layer one, that still captures the original model's capabilities?
|
||||
|
||||
This limitation motivates **knowledge distillation**, a categorically different approach. Rather than modifying an existing model's weights, distillation trains a new, compact "student" model to mimic the behavior of a larger "teacher" model. The student inherits the teacher's learned knowledge without inheriting its computational overhead.
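
The mechanics of "mimicking the teacher" reduce to a modified loss function. Below is a minimal sketch of the standard soft-target objective; the temperature and weighting values are illustrative defaults rather than recommendations.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Sketch of the soft-target distillation objective (Hinton et al., 2015):
    blend KL(student || teacher) with the ordinary hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 keeps the soft-target gradients comparable in scale to the hard term
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```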
|
||||
|
||||
@@ -2287,7 +2287,7 @@ Pruning and distillation both reduce the number of parameters a model carries, b
|
||||
|
||||
### Structured Approximations {#sec-model-compression-structured-approximations-4798}
|
||||
|
||||
Rather than eliminating parameters through pruning or transferring knowledge through distillation, structured approximation methods decompose large weight matrices and tensors into lower-dimensional components. These techniques exploit the mathematical structure of neural network parameters, leveraging the observation that high-dimensional representations often admit compact, low-rank approximations. The following subsections examine low-rank factorization and tensor decomposition as complementary strategies for achieving this compression.
|
||||
Rather than eliminating parameters through pruning or transferring knowledge through distillation, structured approximation methods decompose large weight matrices and tensors into lower-dimensional components. These techniques exploit the mathematical structure of neural network parameters: high-dimensional representations often admit compact, low-rank approximations. The following subsections examine low-rank factorization and tensor decomposition as complementary strategies for achieving this compression.
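
As a rough illustration of what low-rank factorization buys, the sketch below approximates a weight matrix with a truncated SVD and compares parameter counts. The matrix size and rank are arbitrary; trained weight matrices, unlike the random one here, often have rapidly decaying spectra that keep the approximation error small.

```python
import numpy as np

def low_rank_factor(W: np.ndarray, rank: int):
    """Sketch: approximate W (m x n) as A @ B with A (m x k) and B (k x n),
    keeping only the top-k singular components."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # fold singular values into A
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024).astype(np.float32)   # random stand-in for a weight matrix
A, B = low_rank_factor(W, rank=64)
orig_params = W.size                    # ~1.05M parameters
factored_params = A.size + B.size       # ~0.13M parameters (8x fewer)
```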
|
||||
|
||||
```{python}
#| label: lowrank-bandwidth-calc
@@ -2309,19 +2309,19 @@ from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import MIB_TO_BYTES

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class LowRankFactorization:
    """
    Namespace for Low-Rank Factorization Bandwidth calculation.
    Scenario: Factoring a 4096 x 4096 matrix into rank 128 components.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    mat_dim = 4096
    rank_k = 128
    bytes_per_param = 4  # FP32

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Full Matrix: N x N
    full_params = mat_dim * mat_dim
    full_mb = (full_params * bytes_per_param) / MIB_TO_BYTES

@@ -2332,10 +2332,10 @@ class LowRankFactorization:

    data_reduction = full_mb / factored_mb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(data_reduction > 10, f"Low-rank reduction ({data_reduction:.1f}x) is too low for K=128.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    mat_dim_str = f"{mat_dim}"
    rank_k_str = f"{rank_k}"
    full_mb_str = fmt(full_mb, precision=0, commas=False)
|
||||
@@ -2870,12 +2870,12 @@ Search strategies determine how to explore the architecture space efficiently wi
|
||||
| **Evolutionary Algorithms** | 200--500 GPU-days | Parallel infrastructure available | Requires large populations |
| **Gradient-Based (DARTS)** | 1--4 GPU-days | Limited compute budget | May converge to suboptimal local minima |
|
||||
|
||||
: **NAS Search Strategy Comparison**: Trade-offs between search efficiency, use cases, and limitations for different NAS approaches. Reinforcement learning offers unconstrained exploration at high cost, evolutionary methods leverage parallelism, and gradient-based approaches achieve dramatic speedups with potential optimality trade-offs. {#tbl-nas-strategies}
|
||||
: **NAS Search Strategy Comparison**: Trade-offs between search efficiency, use cases, and limitations for different NAS approaches. Reinforcement learning offers unconstrained exploration at high cost, evolutionary methods exploit parallelism, and gradient-based approaches achieve dramatic speedups with potential optimality trade-offs. {#tbl-nas-strategies}
|
||||
|
||||
\index{Reinforcement Learning!NAS application}
|
||||
\index{Neural Architecture Search (NAS)!reinforcement learning strategy}
|
||||
\index{NASNet!RL-discovered architecture}
|
||||
Reinforcement learning based NAS treats architecture search as a sequential decision problem where a controller generates architectures and receives accuracy as reward. The controller (typically an LSTM) learns to propose better architectures over time through policy gradient optimization. While this approach discovered groundbreaking architectures like NASNet, the sequential nature limits parallelism and requires hundreds of GPU-days.
|
||||
Reinforcement learning based NAS treats architecture search as a sequential decision problem where a controller generates architectures and receives accuracy as reward. The controller (typically an LSTM) learns to propose better architectures over time through policy gradient optimization. While this approach discovered high-performing architectures like NASNet, the sequential nature limits parallelism and requires hundreds of GPU-days.
|
||||
|
||||
\index{Evolutionary Algorithms!NAS application}
|
||||
\index{Neural Architecture Search (NAS)!evolutionary strategy}
|
||||
@@ -2895,7 +2895,7 @@ where $L_{\text{lat}}(\alpha)$ is measured latency, $L_{\text{lat,target}}$ is t
|
||||
|
||||
#### When to Use NAS {#sec-model-compression-use-nas-2b47}
|
||||
|
||||
Neural Architecture Search is a powerful tool, but its significant computational cost demands careful consideration of when the investment is justified.
|
||||
Neural Architecture Search can discover architectures that outperform hand-designed alternatives, but its significant computational cost demands careful consideration of when the investment is justified.
|
||||
|
||||
NAS becomes worthwhile for novel hardware platforms with unique constraints (new accelerator architectures, extreme edge devices) where existing architectures are poorly optimized. It also makes sense at massive deployment scale (billions of inferences) where even 1--2% efficiency improvements justify the upfront search cost, or when multiple deployment configurations require architecture families (cloud, edge, mobile) that amortize one search across many variants.
|
||||
|
||||
@@ -2927,7 +2927,7 @@ Test your understanding of the structural optimization techniques covered so far
|
||||
## Quantization and Precision {#sec-model-compression-quantization-precision-cd46}
|
||||
\index{Model Compression!precision optimization}
|
||||
|
||||
A `{python} llm_7b_str` billion parameter language model stored in FP16 consumes `{python} llm_7b_mem_str` GB, yet users expect it to run on a smartphone with `{python} smartphone_ram_str` GB of shared RAM. Structural optimization alone cannot bridge this gap: even aggressive pruning rarely exceeds 50--70% parameter reduction, leaving a model far too large for the target device. The remaining leverage comes from a different dimension entirely: reducing the number of bits used to represent each parameter. *Quantization*, the process of reducing numerical precision, offers one of the most impactful optimizations for deployment, because it trades bits for speed and efficiency with minimal accuracy loss.
|
||||
A `{python} llm_7b_str` billion parameter language model stored in FP16 consumes `{python} llm_7b_mem_str` GB, yet users expect it to run on a smartphone with `{python} smartphone_ram_str` GB of shared RAM. Structural optimization alone cannot bridge this gap: even aggressive pruning rarely exceeds 50--70% parameter reduction, leaving a model far too large for the target device. The remaining gains come from a different dimension entirely: reducing the number of bits used to represent each parameter. *Quantization*, the process of reducing numerical precision, offers one of the most impactful optimizations for deployment, because it trades bits for speed and efficiency with minimal accuracy loss.
|
||||
|
||||
::: {.callout-definition title="Quantization"}
|
||||
|
||||
@@ -2979,26 +2979,26 @@ from mlsys.constants import (
)
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class EnergyDividend:
    """
    Namespace for Energy Dividend calculation.
    Scenario: Comparing energy per operation for accumulator units.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Horowitz (2014) / Sze (2020) values
    e_add_fp32 = ENERGY_ADD_FP32_PJ.m_as('pJ')
    e_add_int8 = ENERGY_ADD_INT8_PJ.m_as('pJ')

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Dividend = FP32_Energy / INT8_Energy
    dividend = e_add_fp32 / e_add_int8

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(dividend >= 25, f"INT8 should be ~30x more efficient than FP32. Ratio: {dividend:.1f}")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    energy_add_fp32_str = fmt(e_add_fp32, precision=2, commas=False)
    energy_add_int8_str = fmt(e_add_int8, precision=2, commas=False)
    energy_dividend_str = f"{int(dividend)}"
|
||||
@@ -3121,7 +3121,7 @@ These energy savings take on a different character for models where memory capac
|
||||
::: {.callout-lighthouse title="DLRM and Embedding Quantization"}
|
||||
**The Memory Capacity Constraint**: Our **DLRM Lighthouse** (@sec-network-architectures) presents a unique compression challenge. Unlike ResNet or GPT, which are constrained by compute or bandwidth, DLRM is constrained by **Memory Capacity**. Its embedding tables can reach terabytes in size, far exceeding GPU memory.
|
||||
|
||||
For DLRM, quantization is not about faster math; it's about **storage density**. Quantizing embedding tables from FP32 to INT8 (or INT4) reduces memory footprint by 4--8$\times$, allowing larger tables to fit on fewer GPUs. This is a pure **Information Density** optimization: we compress the lookup table so the **Machine** (Physics) can hold the **Algorithm** (Logic).
|
||||
For DLRM, quantization is not about faster math; it is about **storage density**. Quantizing embedding tables from FP32 to INT8 (or INT4) reduces memory footprint by 4--8$\times$, allowing larger tables to fit on fewer GPUs. This is a pure **Information Density** optimization: we compress the lookup table so the **Machine** (Physics) can hold the **Algorithm** (Logic).
|
||||
:::
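
A back-of-envelope sketch makes the storage-density argument concrete. The table size, embedding dimension, and device memory below are illustrative assumptions, not a real DLRM configuration.

```python
import math

rows, dim = 1_000_000_000, 128          # one large embedding table (illustrative)
hbm_bytes = 80e9                        # assume an 80 GB accelerator

for name, bytes_per_weight in [("FP32", 4), ("INT8", 1), ("INT4", 0.5)]:
    table_bytes = rows * dim * bytes_per_weight
    print(f"{name}: {table_bytes / 1e9:.0f} GB "
          f"-> at least {math.ceil(table_bytes / hbm_bytes)} device(s)")
# FP32 needs ~512 GB (7 devices); INT8 ~128 GB (2); INT4 ~64 GB (1).
```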
|
||||
|
||||
\index{Keyword Spotting (KWS)!quantization imperative}
|
||||
@@ -3250,28 +3250,28 @@ from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import BYTES_FP16, BYTES_INT4, byte, GB

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class QuantizationSavings:
    """
    Namespace for Quantization Savings calculation.
    Scenario: FP16 vs INT4 storage for an 8B model.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    params_b = 8
    bytes_fp16 = 2.0
    bytes_int4 = 0.5

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    fp16_size_gb = params_b * bytes_fp16
    int4_size_gb = params_b * bytes_int4

    ratio = fp16_size_gb / int4_size_gb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(ratio == 4.0, f"FP16/INT4 ratio should be exactly 4.0, got {ratio}")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    llm_params_b_str = f"{params_b}"
    fp16_bytes_str = f"{int(bytes_fp16)}"
    int4_bytes_str = f"{bytes_int4}"
|
||||
@@ -3327,27 +3327,27 @@ from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import SIMD_REGISTER_BITS, FP32_BITS, INT8_BITS

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class SIMDThroughput:
    """
    Namespace for SIMD Throughput calculation.
    Scenario: Comparing ops per register for FP32 vs INT8.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    register_bits = 512
    fp32_bits = 32
    int8_bits = 8

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    ops_fp32 = register_bits // fp32_bits
    ops_int8 = register_bits // int8_bits
    gain = ops_int8 // ops_fp32

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(gain == 4, f"INT8 vs FP32 should yield 4x ops, got {gain}x")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    simd_fp32_str = f"{ops_fp32}"
    simd_int8_str = f"{ops_int8}"
    simd_gain_str = f"{gain}"
|
||||
@@ -4092,7 +4092,7 @@ LineD/.style={dashed,black,line width=0.75pt},
|
||||
```
|
||||
:::
|
||||
|
||||
Compare the two mapping diagrams side by side in @fig-calibration-ranges. Symmetric calibration (left) maps $[-1, 1]$ to $[-127, 127]$ with zero preserved, making it simpler to implement and well suited for zero-centered weight distributions. Asymmetric calibration (right) uses different ranges ($\alpha = -0.5$, $\beta = 1.5$), better utilizing the quantized range for skewed distributions at the cost of additional complexity. Most frameworks (TensorRT, PyTorch) support both modes. The conceptual difference is clear from the diagrams, but the actual computation of scale and zero-point parameters requires a concrete formula—which the following worked example of *calculating scale and zero-point* derives step by step.
|
||||
Compare the two mapping diagrams side by side in @fig-calibration-ranges. Symmetric calibration (left) maps $[-1, 1]$ to $[-127, 127]$ with zero preserved, making it simpler to implement and well suited for zero-centered weight distributions. Asymmetric calibration (right) uses different ranges ($\alpha = -0.5$, $\beta = 1.5$), better using the quantized range for skewed distributions at the cost of additional complexity. Most frameworks (TensorRT, PyTorch) support both modes. The conceptual difference is clear from the diagrams, but the actual computation of scale and zero-point parameters requires a concrete formula—which the following worked example of *calculating scale and zero-point* derives step by step.
|
||||
|
||||
```{python}
#| label: quantization-math-calc
@@ -4115,13 +4115,13 @@ from mlsys.formatting import fmt, check
class QuantizationMathCalc:
    """Derive affine quantization parameters: scale and zero-point for [-1.0, 3.0] → UINT8."""

    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    alpha = -1.0  # activation range min
    beta = 3.0    # activation range max
    bits = 8      # target bit-width
    x_val = 0.0   # value to quantize

    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # 1. Scale: s = (beta - alpha) / (2^b - 1)
    int_steps = 2**bits - 1
    scale = (beta - alpha) / int_steps

@@ -4136,12 +4136,12 @@ class QuantizationMathCalc:
    # 4. Dequantize: x_recon = (x_q - z) * s
    x_recon = (x_q - zero_point) * scale

    # ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    check(scale > 0, "Scale must be positive.")
    check(0 <= zero_point <= int_steps, "Zero-point must be in valid integer range.")
    check(abs(x_recon - x_val) < scale, "Reconstruction error must be less than one step size.")

    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    alpha_str = fmt(alpha, precision=1, commas=False)         # "-1.0"
    beta_str = fmt(beta, precision=1, commas=False)           # "3.0"
    range_str = fmt(beta - alpha, precision=1, commas=False)  # "4.0"
|
||||
@@ -4542,7 +4542,7 @@ Precision reduction is the most impactful deployment optimization.
|
||||
|
||||
The preceding subsections reveal PTQ's core trade-off: simplicity versus accuracy control. PTQ requires no retraining and can be applied to any pre-trained model in minutes, making it the default starting point for deployment optimization. For rapid deployment scenarios with production deadlines under two weeks and acceptable accuracy loss of 1–2%, PTQ with appropriate calibration often provides a complete solution.
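
A minimal sketch of what "appropriate calibration" means in practice: derive an asymmetric scale and zero-point from activation statistics observed on a small calibration set, then quantize. The helper names and the synthetic calibration data are illustrative.

```python
import numpy as np

def calibrate_minmax(activations: np.ndarray, num_bits: int = 8):
    """Sketch of PTQ calibration: asymmetric scale/zero-point from observed min/max."""
    alpha, beta = float(activations.min()), float(activations.max())
    qmax = 2**num_bits - 1
    scale = (beta - alpha) / qmax
    zero_point = int(round(-alpha / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2**num_bits - 1).astype(np.uint8)

# Synthetic stand-in for a few hundred representative calibration inputs.
calib = np.random.randn(1000).astype(np.float32) * 0.8 + 0.4
scale, zp = calibrate_minmax(calib)
x_q = quantize(calib, scale, zp)
x_hat = (x_q.astype(np.float32) - zp) * scale   # dequantize to inspect error
```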
|
||||
|
||||
The limitation is that PTQ offers no mechanism to recover from accuracy loss. If the quantized model's accuracy drops below the production threshold — a common outcome for transformer-based architectures where attention mechanisms amplify small numerical differences — the only recourse is to choose a less aggressive precision format, which sacrifices the efficiency gains that motivated quantization in the first place. This ceiling on PTQ's accuracy preservation motivates a more powerful approach: rather than applying quantization as a post-hoc transformation, we can integrate precision constraints directly into the training process itself.
|
||||
The limitation is that PTQ offers no mechanism to recover from accuracy loss. If the quantized model's accuracy drops below the production threshold — a common outcome for transformer-based architectures where attention mechanisms amplify small numerical differences — the only recourse is to choose a less aggressive precision format, which sacrifices the efficiency gains that motivated quantization in the first place. This ceiling on PTQ's accuracy preservation motivates a fundamentally different approach: rather than applying quantization as a post-hoc transformation, we can integrate precision constraints directly into the training process itself.
|
||||
|
||||
#### Quantization-Aware Training {#sec-model-compression-quantizationaware-training-4032}
|
||||
|
||||
@@ -4648,7 +4648,7 @@ where $q$ represents the simulated quantized value, $x$ denotes the full-precisi
|
||||
|
||||
\index{Straight-Through Estimator (STE)!etymology}
|
||||
\index{Bengio, Yoshua!straight-through estimator}
|
||||
Although the forward pass utilizes quantized values, gradient calculations during backpropagation remain in full precision. The Straight-Through Estimator (STE) accomplishes this\index{Straight-Through Estimator (STE)}[^fn-ste-gradient-trick], which approximates the gradient of the quantized function by treating the rounding operation as if it had a derivative of one. In effect, the STE pretends quantization is the identity function during backpropagation, allowing gradients to flow unchanged through otherwise non-differentiable operations. This approach prevents the gradient from being obstructed due to the non-differentiable nature of the quantization operation, thereby allowing effective model training [@bengio2013estimating].
|
||||
Although the forward pass uses quantized values, gradient calculations during backpropagation remain in full precision. The Straight-Through Estimator (STE) accomplishes this\index{Straight-Through Estimator (STE)}[^fn-ste-gradient-trick], which approximates the gradient of the quantized function by treating the rounding operation as if it had a derivative of one. In effect, the STE pretends quantization is the identity function during backpropagation, allowing gradients to flow unchanged through otherwise non-differentiable operations. This approach prevents the gradient from being obstructed due to the non-differentiable nature of the quantization operation, thereby allowing effective model training [@bengio2013estimating].
|
||||
|
||||
[^fn-ste-gradient-trick]: **Straight-Through Estimator (STE)**: Proposed by Bengio et al. (2013), the STE substitutes the identity function for the true gradient of rounding, which is zero almost everywhere (rounding is piecewise constant). This approximation is correct in magnitude but wrong in direction for weights near quantization boundaries --- a weight at 0.499 that should round to 0.0 receives the same gradient as one at 0.001, despite their opposite fates after rounding. QAT compensates by letting the model adapt to these systematic gradient errors during training, which is why QAT recovers accuracy that post-training quantization cannot. \index{STE!quantization-aware training}
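
A minimal PyTorch-style sketch of the STE idea, under the assumption of a single fixed scale: the forward pass snaps values to the quantization grid, while the backward pass lets gradients flow through unchanged.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Sketch: round in the forward pass, treat rounding as the identity
    in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, x, scale):
        return torch.round(x / scale) * scale   # simulate the quantization grid

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                # straight-through: dL/dx ≈ dL/dq

w = torch.randn(4, 4, requires_grad=True)
w_q = FakeQuant.apply(w, torch.tensor(0.1))
loss = (w_q ** 2).sum()
loss.backward()                                 # w.grad exists despite the rounding
```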
|
||||
|
||||
@@ -4762,9 +4762,9 @@ The gap arises from several sources. Sparse matrices stored in dense format wast
|
||||
|
||||
## Architectural Efficiency {#sec-model-compression-architectural-efficiency-8dd3}
|
||||
|
||||
Architectural efficiency optimization ensures that computations execute efficiently on target hardware by aligning model operations with processor capabilities and memory hierarchies. Where representation optimization determines *what* computations to perform and precision optimization determines *how precisely* to compute, architectural efficiency addresses *how* operations are scheduled, memory is accessed, and workloads adapt to input characteristics. This third dimension closes the gap between theoretical compression ratios and real-world speedups.
|
||||
A ResNet-50 pruned to 50% sparsity and quantized to INT8 should theoretically run 6$\times$ faster than its dense FP32 baseline. On actual hardware, the measured speedup is often closer to 1.5$\times$. The gap between theoretical and realized gains exposes the third optimization dimension: *architectural efficiency*, which ensures that structural and precision optimizations translate into real-world speedups by aligning computation patterns with hardware capabilities. Where representation optimization determines *what* computations to perform and precision optimization determines *how precisely* to compute, architectural efficiency addresses *how* operations are scheduled, memory is accessed, and workloads adapt to input characteristics.
|
||||
|
||||
Four complementary approaches to architectural efficiency are examined: hardware-aware design principles that proactively integrate deployment constraints during model development, sparsity exploitation techniques that accelerate computation on pruned models, dynamic computation strategies that adapt workload to input complexity, and operator fusion methods that reduce memory traffic by combining operations. These techniques transform algorithmic optimizations into realized performance gains.
|
||||
Four complementary approaches close this gap: hardware-aware design principles that integrate deployment constraints during model development, sparsity exploitation techniques that accelerate computation on pruned models, dynamic computation strategies that adapt workload to input complexity, and operator fusion methods that reduce memory traffic by combining operations.
|
||||
|
||||
### Hardware-Aware Design {#sec-model-compression-hardwareaware-design-c561}
|
||||
|
||||
@@ -4772,7 +4772,7 @@ Closing the gap between theoretical complexity reduction and real-world performa
|
||||
|
||||
#### Efficient Design Principles {#sec-model-compression-efficient-design-principles-b015}
|
||||
|
||||
Designing for hardware efficiency requires structuring architectures to account for computational cost, memory usage, inference latency, and power consumption while maintaining strong predictive performance. A key aspect involves leveraging the strengths of specific hardware platforms (GPUs, TPUs, mobile or edge devices) to maximize parallelism, optimize memory hierarchies, and minimize latency through hardware-optimized operations. @tbl-hardware-efficient-design categorizes these design principles, each addressing a core aspect of computational and system constraints.
|
||||
Designing for hardware efficiency requires structuring architectures to account for computational cost, memory usage, inference latency, and power consumption while maintaining strong predictive performance. A key aspect involves exploiting the strengths of specific hardware platforms (GPUs, TPUs, mobile or edge devices) to maximize parallelism, optimize memory hierarchies, and minimize latency through hardware-optimized operations. @tbl-hardware-efficient-design categorizes these design principles, each addressing a core aspect of computational and system constraints.
|
||||
|
||||
| **Principle** | **Goal** | **Example Networks** |
|
||||
|:--------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
|
||||
@@ -4808,13 +4808,13 @@ $$
|
||||
Here, $\phi$ is a scaling coefficient, and $\alpha$, $\beta$, and $\gamma$ are scaling factors determined based on hardware constraints and empirical data. This approach ensures that models grow in a way that optimizes hardware resource usage, keeping them efficient while improving accuracy.
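
The sketch below shows how compound scaling expands a network as $\phi$ grows; the base coefficients are the ones reported for EfficientNet ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$), used here purely for illustration.

```python
# Compound-scaling sketch: depth, width, and resolution grow together so that
# FLOPs roughly double for each unit increase of phi (alpha * beta^2 * gamma^2 ≈ 2).
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution bases

def scale(phi: int):
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    res_mult = gamma ** phi
    flops_mult = depth_mult * width_mult**2 * res_mult**2   # FLOPs ~ d * w^2 * r^2
    return depth_mult, width_mult, res_mult, flops_mult

for phi in range(4):
    d, w, r, f = scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, res x{r:.2f}, FLOPs x{f:.2f}")
```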
|
||||
|
||||
\index{EfficientNet!compound scaling validation}
|
||||
For example, the NAS-discovered **EfficientNet** (@sec-model-compression-neural-architecture-search-cf12) empirically validated this principle. Its search algorithm found that carefully balancing depth, width, and resolution via **compound scaling** yielded models that were both computationally efficient and high-performing, outperforming architectures that scaled dimensions arbitrarily. Compound scaling reduces computational cost while preserving accuracy, making it a key consideration for hardware-aware model design. This approach is particularly beneficial when deploying models on GPUs or TPUs, where parallelism can be fully leveraged, but memory and power usage need to be carefully managed. @sec-benchmarking examines performance evaluation methods for measuring these efficiency gains.
|
||||
For example, the NAS-discovered **EfficientNet** (@sec-model-compression-neural-architecture-search-cf12) empirically validated this principle. Its search algorithm found that carefully balancing depth, width, and resolution via **compound scaling** yielded models that were both computationally efficient and high-performing, outperforming architectures that scaled dimensions arbitrarily. Compound scaling reduces computational cost while preserving accuracy, making it a key consideration for hardware-aware model design. This approach is particularly beneficial when deploying models on GPUs or TPUs, where parallelism can be fully exploited, but memory and power usage need to be carefully managed. @sec-benchmarking examines performance evaluation methods for measuring these efficiency gains.
|
||||
|
||||
This principle extends beyond convolutional models to other architectures like transformers. Adjusting the number of layers, attention heads, or embedding dimensions impacts computational efficiency similarly. Hardware-aware scaling has become central to optimizing model performance across various computational constraints, especially when working with large models or resource-constrained devices.
|
||||
|
||||
#### Computation Reduction {#sec-model-compression-computation-reduction-13de}
|
||||
|
||||
Modern architectures leverage factorized computations to decompose complex operations into simpler components, reducing computational overhead while maintaining representational power. Standard convolutions apply filters uniformly across all spatial locations and channels, creating computational bottlenecks on resource-constrained hardware. Factorization techniques address this inefficiency by restructuring operations to minimize redundant computation.
|
||||
Modern architectures use factorized computations to decompose complex operations into simpler components, reducing computational overhead while maintaining representational power. Standard convolutions apply filters uniformly across all spatial locations and channels, creating computational bottlenecks on resource-constrained hardware. Factorization techniques address this inefficiency by restructuring operations to minimize redundant computation.
|
||||
|
||||
Depthwise separable convolutions\index{Depthwise Separable Convolution}\index{Model Compression!depthwise separable convolution}, introduced in MobileNet, exemplify this approach by decomposing standard convolutions into two stages: depthwise convolution (applying separate filters to each input channel independently) and pointwise convolution ($1\times1$ convolution mixing outputs across channels). The computational complexity of standard convolution with input size $h \times w$, $C_{\text{in}}$ input channels, and $C_{\text{out}}$ output channels is:
|
||||
$$
|
||||
@@ -4854,7 +4854,7 @@ $$
|
||||
|
||||
By reducing $C_{\text{in}}$ using $1\times 1$ convolutions, SqueezeNet reduces the number of parameters, achieving a 50$\times$ reduction in parameter count compared to AlexNet (from 240 MB to under 5 MB) while maintaining similar performance. This method is well-suited for edge devices that have strict memory and storage constraints.
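
To see the factorization savings numerically, the short sketch below counts multiply-accumulate operations for a standard convolution versus its depthwise separable equivalent; the layer dimensions are illustrative.

```python
# MAC counts for one layer: standard conv vs. depthwise separable (MobileNet-style).
h, w = 56, 56                 # output spatial size (illustrative)
c_in, c_out, k = 128, 128, 3  # channels and kernel size

standard = h * w * c_in * c_out * k * k
depthwise = h * w * c_in * k * k      # one k x k filter per input channel
pointwise = h * w * c_in * c_out      # 1 x 1 channel mixing
separable = depthwise + pointwise

print(f"standard:  {standard / 1e6:.1f} M MACs")
print(f"separable: {separable / 1e6:.1f} M MACs ({standard / separable:.1f}x fewer)")
```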
|
||||
|
||||
Feature reuse, activation checkpointing, and parameter reduction form key components of hardware-aware model design, allowing models to fit within memory limits of modern accelerators while reducing power consumption through fewer memory accesses. Specialized accelerators like TPUs and GPUs leverage memory hierarchies, caching, and high bandwidth memory to efficiently handle sparse or reduced-memory representations, enabling faster inference with minimal overhead.
|
||||
Feature reuse, activation checkpointing, and parameter reduction form key components of hardware-aware model design, allowing models to fit within memory limits of modern accelerators while reducing power consumption through fewer memory accesses. Specialized accelerators like TPUs and GPUs exploit memory hierarchies, caching, and high bandwidth memory to efficiently handle sparse or reduced-memory representations, enabling faster inference with minimal overhead.
|
||||
|
||||
Beyond reducing what data must be stored, substantial efficiency gains emerge from optimizing how operations access memory. The next technique addresses this by combining multiple operations to reduce memory traffic.
|
||||
|
||||
@@ -4884,7 +4884,7 @@ from mlsys.constants import KIB_TO_BYTES, MILLION
class FusionCalc:
    """Quantify latency and bandwidth benefits of Conv-BN-ReLU operator fusion on ResNet-50."""

    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # Conv-BN-ReLU layer geometry
    conv_channels = 256
    conv_spatial = 28

@@ -4903,7 +4903,7 @@ class FusionCalc:
    kernels_fused = 53
    latency_per_kernel_us = 10

    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # Feature map size (SI MB)
    feat_map_mb = conv_channels * conv_spatial * conv_spatial * bytes_per_element / MILLION

@@ -4934,11 +4934,11 @@ class FusionCalc:
    fused_time_us = total_fused_mb / v100_bw_gbs_value * 1000
    fusion_speedup = unfused_time_us / fused_time_us

    # ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    check(bandwidth_reduction_pct > 40, "Fusion should reduce bandwidth by more than 40%.")
    check(fusion_speedup > 1, "Fused execution must be faster than unfused.")

    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    conv_bn_relu_intermediate_mb_str = fmt(conv_bn_relu_intermediate_mb, precision=1, commas=False)
    gemm_intermediate_mb_str = fmt(gemm_intermediate_mb, precision=1, commas=False)
    feat_map_kb_str = fmt(feat_map_mb * 1000, precision=0, commas=False)
|
||||
@@ -5049,17 +5049,17 @@ from mlsys.formatting import fmt, check, md_math
class ConvFusionCalc:
    """Demonstrate 3x memory traffic reduction from Conv-BN-ReLU fusion (6 transfers → 2)."""

    # ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    unfused_transfers = 6  # read/write for Conv, BN, ReLU
    fused_transfers = 2    # read input, write output

    # ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    transfer_reduction = unfused_transfers / fused_transfers

    # ┌── 3. INVARIANTS (Guardrails) ──────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    check(transfer_reduction == 3, "Conv-BN-ReLU fusion must yield exactly 3x transfer reduction.")

    # ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    transfer_reduction_str = fmt(transfer_reduction, precision=0, commas=False)
    conv_bn_relu_mem_md = md_math(
        f"2 \\times 256 \\times 28 \\times 28 \\times 4 \\text{{ bytes}} \\approx \\text{{{conv_bn_relu_intermediate_mb_str} MB}}"
|
||||
@@ -5531,7 +5531,7 @@ Early exit and conditional computation represent discrete choices: exit or conti
|
||||
|
||||
Fast Neural Networks (FNNs) exemplify this approach, adjusting the number of active layers based on real-time complexity estimation. If an input is straightforward, only a subset of layers is activated; if early layers produce low-confidence outputs, additional layers refine the prediction [@wu2019fast]. A related approach, dynamic layer scaling, progressively increases computational depth based on uncertainty estimates, useful for fine-grained classification tasks where some inputs require only coarse-grained processing while others need deeper feature extraction [@wang2021glam].
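
A minimal sketch of confidence-based early exit, assuming batch size one and per-block classifier heads (the module structure and the 0.9 threshold are illustrative): inference walks the blocks in order and stops as soon as a head is sufficiently confident.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    """Sketch: each block has its own classifier head; inference stops early
    once a head's top-class probability clears the confidence threshold."""

    def __init__(self, blocks, heads, threshold: float = 0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.heads = nn.ModuleList(heads)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for depth, (block, head) in enumerate(zip(self.blocks, self.heads)):
            x = block(x)
            probs = F.softmax(head(x), dim=-1)
            if probs.max() >= self.threshold:   # confident enough: exit early
                return probs, depth + 1         # layers actually executed
        return probs, len(self.blocks)          # fell through to full depth

net = EarlyExitNet(
    blocks=[nn.Linear(16, 16) for _ in range(4)],
    heads=[nn.Linear(16, 10) for _ in range(4)],
)
probs, layers_used = net(torch.randn(1, 16))
```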
|
||||
|
||||
Adaptive inference excels in latency-sensitive applications where resource constraints fluctuate dynamically. In autonomous systems, for example, lane detection may require minimal computation while multi-object tracking in dense environments demands additional processing power. On hardware accelerators such as GPUs and TPUs, adaptive inference leverages parallel processing capabilities by distributing workloads dynamically, maximizing throughput while minimizing energy expenditure.
|
||||
Adaptive inference excels in latency-sensitive applications where resource constraints fluctuate dynamically. In autonomous systems, for example, lane detection may require minimal computation while multi-object tracking in dense environments demands additional processing power. On hardware accelerators such as GPUs and TPUs, adaptive inference exploits parallel processing capabilities by distributing workloads dynamically, maximizing throughput while minimizing energy expenditure.
|
||||
|
||||
#### Implementation Challenges {#sec-model-compression-implementation-challenges-1184}
|
||||
|
||||
@@ -5586,7 +5586,7 @@ Structured sparsity involves removing entire components of the network, such as
|
||||
|
||||
#### Sparsity Utilization Methods {#sec-model-compression-sparsity-utilization-methods-04c3}
|
||||
|
||||
With the distinction between unstructured and structured sparsity patterns established, the critical question becomes: how do we translate theoretical zeros into actual speedup? The challenge lies in the gap between theoretical parameter reduction and realized performance: a sparse model with 90% of weights zeroed may still run at nearly full computational cost on hardware not designed for irregular memory access. The processor cannot skip a multiplication unless it *knows* the operand is zero—and discovering that requires loading the operand from memory in the first place. Bridging this gap requires specialized utilization methods and hardware support that can efficiently skip zero-valued computations [@hoefler2021sparsity]. Structured sparsity proves more hardware-efficient, enabling accelerators like GPUs and TPUs to fully exploit regular patterns [@Han2015].
|
||||
A sparse model with 90% of weights zeroed may still run at nearly full computational cost on hardware not designed for irregular memory access. The critical question is how to translate theoretical zeros into actual speedup. The processor cannot skip a multiplication unless it *knows* the operand is zero—and discovering that requires loading the operand from memory in the first place. Bridging this gap requires specialized utilization methods and hardware support that can efficiently skip zero-valued computations [@hoefler2021sparsity]. Structured sparsity proves more hardware-efficient, enabling accelerators like GPUs and TPUs to fully exploit regular patterns [@Han2015].
|
||||
|
||||
The simplest utilization method is sparse matrix operations, which skip zero elements during computation to significantly reduce arithmetic operations. Consider the difference: multiplying a dense $4\times 4$ matrix with a vector typically requires 16 multiplications, while a sparse-aware implementation computes only the 6 nonzero operations:
|
||||
$$
|
||||
@@ -6736,7 +6736,7 @@ The coordination challenges inherent in combining sparsity with other techniques
|
||||
Unlike software functions that compose predictably, optimization techniques interact through shared physical resources: memory bandwidth, cache capacity, and arithmetic units. Pruning changes sparsity patterns that affect quantization's dynamic range. Quantization changes numerical precision that affects fusion's memory traffic assumptions. Operator fusion changes execution schedules that affect dynamic computation's branching decisions. Effective optimization therefore requires treating the model-hardware pair as a coupled system rather than optimizing each dimension independently. This is a systems engineering problem, not merely a machine learning one.
|
||||
:::
|
||||
|
||||
With the three optimization dimensions now fully explored, practitioners need systematic guidance for translating this knowledge into deployment decisions.
|
||||
Knowing *how* each technique works is necessary but not sufficient; the practical question is *which* techniques to apply for a given deployment target and how to sequence them.
|
||||
|
||||
## Technique Selection {#sec-model-compression-technique-selection-ba16}
|
||||
|
||||
@@ -6777,7 +6777,7 @@ This decision framework provides starting points for individual technique select
|
||||
|
||||
## Optimization Strategies {#sec-model-compression-optimization-strategies-f2f6}
|
||||
|
||||
The decision framework above guides individual technique selection, but the largest optimization gains emerge from combining multiple techniques. Because pruning, quantization, and architectural efficiency operate at different levels of the stack, they provide multiplicative benefits when sequenced appropriately.
|
||||
BERT compressed from 440 MB to 28 MB — a 16$\times$ reduction — not through any single technique but through sequential application of pruning, distillation, and quantization. The largest optimization gains emerge from combining techniques, because pruning, quantization, and architectural efficiency operate at different levels of the stack and provide multiplicative benefits when sequenced appropriately.
|
||||
|
||||
Why do certain combinations work? Pruning and quantization create synergistic effects because pruning reduces parameter count while quantization reduces precision, yielding multiplicative compression\index{Model Compression!compression ratio}\index{Compression Ratio!multiplicative effects}. Applying pruning first concentrates important weights into a smaller parameter set, making subsequent quantization more effective and reducing the search space for optimal quantization strategies. This sequential approach achieves compression ratios exceeding either technique alone.
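
A quick sketch of the multiplicative arithmetic, with illustrative numbers: 50% structured pruning followed by INT8 quantization compounds a 2$\times$ and a 4$\times$ reduction into roughly 8$\times$ overall.

```python
# Back-of-envelope compound compression (illustrative BERT-base-scale numbers).
params = 110_000_000            # parameter count
fp32_mb = params * 4 / 1e6      # ~440 MB dense FP32

kept_fraction = 0.5             # 50% structured pruning keeps half the parameters
bits = 8                        # INT8 quantization: 1 byte per remaining parameter

compressed_mb = params * kept_fraction * (bits / 8) / 1e6   # ~55 MB
ratio = fp32_mb / compressed_mb                              # 2x (sparsity) * 4x (precision) = 8x
```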
|
||||
|
||||
@@ -7023,11 +7023,11 @@ When quantizing ResNet-50 from FP32 to INT8, baseline metrics show Top-1 accurac
|
||||
|
||||
These baselines give the measurement framework a reference point for tracking optimization impact systematically. Rather than evaluating techniques in isolation, applying our three-dimensional framework requires understanding how different approaches interact when combined: sequential application can lead to compounding benefits or to unexpected interactions that diminish overall effectiveness. @sec-benchmarking provides additional structured evaluation methods for comprehensive performance assessment.
|
||||
|
||||
Rigorous measurement tells practitioners *whether* their optimizations succeeded, but the measurements themselves require tooling to perform. Profiling, quantization, pruning, and deployment all depend on software frameworks that automate otherwise prohibitively complex workflows. We turn now to the implementation tools that make these techniques practical.
|
||||
Rigorous measurement tells practitioners *whether* their optimizations succeeded, but the measurements themselves require tooling to perform. Profiling, quantization, pruning, and deployment all depend on software frameworks that automate otherwise prohibitively complex workflows.
|
||||
|
||||
## Implementation Tools {#sec-model-compression-implementation-tools-4990}
|
||||
|
||||
Understanding optimization techniques is necessary but not sufficient; practical implementation relies on robust software support. Without framework tooling, quantization would require manual modification of model definitions and careful insertion of quantization operations throughout the network, while pruning would demand direct manipulation of weight tensors. Both become prohibitively complex as models scale.
|
||||
Quantizing a 175-billion parameter model by hand — inserting scale factors at every layer boundary, managing mixed-precision accumulation, and calibrating activation ranges — would require modifying thousands of lines of model code. Without framework tooling, even straightforward INT8 post-training quantization demands manual insertion of quantization operations throughout the network, while pruning requires direct manipulation of weight tensors. Both become prohibitively complex as models scale.
|
||||
|
||||
Modern machine learning frameworks provide high-level APIs and automated workflows that abstract away implementation complexity, making sophisticated optimization techniques accessible to practitioners. Frameworks address key challenges: providing pre-built modules for common optimization techniques, assisting with hyperparameter tuning (pruning schedules, quantization bit-widths), managing accuracy-compression trade-offs through automated evaluation, and ensuring hardware compatibility through device-specific code generation.
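
As one concrete illustration, PyTorch exposes both magnitude pruning and post-training dynamic quantization as a few library calls. API locations and default behaviors vary across framework versions, so treat this as a sketch rather than a recommended recipe.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Magnitude pruning via the framework: zero out 50% of each Linear weight tensor.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the sparsity permanent

# Post-training dynamic quantization of the Linear layers to INT8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```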
|
||||
|
||||
@@ -7107,11 +7107,11 @@ Sparsity heat maps show sparsity distribution across layers (@fig-sparse-heat-ma
|
||||
|
||||
{#fig-sparse-heat-map fig-alt="Heatmap visualization of a pruned neural network with weight matrix blocks. Darker regions indicate higher sparsity where more weights have been removed. Lighter regions show retained weights."}
|
||||
|
||||
With the implementation tools and visualization capabilities established, the natural question is: how do these techniques compare when a practitioner must choose among them? Each optimization approach carries distinct trade-offs in accuracy, training cost, and hardware requirements, and a structured comparison clarifies which to reach for first.
|
||||
Each optimization approach carries distinct trade-offs in accuracy, training cost, and hardware requirements. A structured comparison clarifies which to reach for first when a practitioner must choose among them.
|
||||
|
||||
## Technique Comparison {#sec-model-compression-technique-comparison-3142}
|
||||
|
||||
A comparative analysis across the three major approaches reveals how each addresses distinct aspects of the efficiency-accuracy trade-off. Pruning works best when sparse computation hardware is available and when reducing floating-point operations is critical. Quantization provides the most versatile approach with broad hardware support, making it ideal for diverse deployment scenarios. Knowledge distillation requires significant computational investment but produces consistently high-quality compressed models, making it the right choice when accuracy preservation is paramount. @tbl-optimization-comparison summarizes these trade-offs for systematic technique selection.
|
||||
An engineer with a model that exceeds her deployment budget faces three levers: prune it, quantize it, or distill it into a smaller architecture. Each lever operates on a different physical resource. Pruning reduces floating-point operations and works best when sparse computation hardware is available. Quantization reduces bit-width and provides the most versatile approach with broad hardware support. Knowledge distillation produces dense, compact models at higher training cost but consistently preserves accuracy. @tbl-optimization-comparison summarizes these trade-offs for systematic technique selection.
|
||||
|
||||
| **Technique** | **Primary Goal** | **Accuracy Impact** | **Training Cost** | **Hardware Dependency** | **Best For** |
|
||||
|:-----------------|:--------------------|:--------------------|:------------------------|:------------------------|:--------------------------------------|
|
||||
@@ -7124,7 +7124,7 @@ A comparative analysis across the three major approaches reveals how each addres
|
||||
\index{Model Compression!sequential application}
|
||||
These techniques combine synergistically, with quantization often applied after pruning or distillation to achieve compound compression benefits. Production systems frequently employ sequential application: initial pruning reduces parameter count, quantization optimizes numerical representation, and fine-tuning through distillation principles recovers any accuracy loss. Sequential application enables compression ratios of 10--50$\times$ while maintaining competitive accuracy across diverse deployment scenarios.
|
||||
|
||||
With the complete optimization toolkit now surveyed—from individual techniques through combination strategies—the most instructive lessons often come not from what works but from what fails. The following fallacies and pitfalls capture the most common mistakes engineers make when applying these techniques, each grounded in the quantitative trade-offs we have established throughout the chapter.
|
||||
The most instructive lessons in model compression often come not from what works but from what fails. The following fallacies and pitfalls capture the most common mistakes engineers make when applying these techniques, each grounded in quantitative trade-offs.
|
||||
|
||||
## Fallacies and Pitfalls {#sec-model-compression-fallacies-pitfalls-1b5e}
|
||||
|
||||
@@ -7149,14 +7149,14 @@ With the complete optimization toolkit now surveyed—from individual techniques
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
# ┌── LEGO ───────────────────────────────────────────────
class FallaciesAnalysis:
    """
    Namespace for Fallacies and Pitfalls.
    Scenario: Misinterpreting compression speedups.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Quantization parameters
    expected_bits = 32
    target_bits = 8

@@ -7166,16 +7166,16 @@ class FallaciesAnalysis:
    prune_speedup = 2  # 50% structured sparsity = 2× theoretical
    actual_combined_pct = 28  # real-world end-to-end speedup (%)

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    quant_speedup = expected_bits / target_bits  # 4× from INT8
    combined_expected = quant_speedup * prune_speedup  # 8× theoretical
    quant_after_overhead = quant_speedup * (1 - overhead_pct/100)  # 3.4× actual quant-only

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(quant_after_overhead < quant_speedup, "Actual speedup should be less than theoretical due to overhead.")
    check(actual_combined_pct < combined_expected * 100, "Real-world speedup must be less than theoretical.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    int8_size_reduction_str = f"{int(quant_speedup)}"
    expected_speedup_str = fmt(combined_expected, precision=0, commas=False)
    actual_speedup_str = fmt(actual_combined_pct, precision=0, commas=False)
|
||||
@@ -7247,9 +7247,16 @@ The optimization techniques explored here (pruning, quantization, distillation,
|
||||
|
||||
::: {.callout-chapter-connection title="From Math to Physics"}
|
||||
|
||||
We have compressed the model's logic, shaving off every unnecessary bit. Logic, however, must eventually run on physics. We turn next to @sec-hardware-acceleration, where we explore how GPUs, TPUs, and NPUs are designed to exploit these optimizations and execute compressed models at maximum throughput.
|
||||
We have compressed the model's logic, shaving off every unnecessary bit. Logic, however, must eventually run on physics. @sec-hardware-acceleration examines how GPUs, TPUs, and NPUs are designed to exploit these optimizations and execute compressed models at maximum throughput.
|
||||
|
||||
:::
|
||||
|
||||
::: { .quiz-end }
:::

```{python}
#| echo: false
#| label: chapter-end
from mlsys.registry import end_chapter
end_chapter("vol1:model_compression")
```
|
||||
|
||||
@@ -62,7 +62,7 @@ Traditional software engineering assumes that bugs are local: a defect in one mo
|
||||
|
||||
Engineering responsibility therefore expands what "correct" means for ML systems. Correctness in the traditional sense---reliable, performant, and maintainable---remains necessary, but ML systems must also be correct in a broader sense: fair across user groups, efficient in resource consumption, and transparent in their decision processes. This expansion is not abstract ethics layered on top of engineering. It is engineering itself, applied to failure modes that conventional metrics do not capture. A latency regression is visible in dashboards; a fairness regression is invisible until it harms real users. Both require systematic detection, measurement, and remediation.
|
||||
|
||||
This chapter provides frameworks for diagnosing, preventing, and mitigating these failures. We begin with concrete cases that reveal the *responsibility gap*---the distance between technical performance and responsible outcomes---and the mechanisms (proxy variables, feedback loops, distribution shift) through which it manifests. From there, we develop a responsible engineering checklist that systematizes impact assessment, model documentation, disaggregated testing, and incident response into repeatable engineering processes. The chapter then turns to environmental and cost awareness, connecting the resource consumption quantified throughout this book (training compute, inference energy, carbon footprint) to engineering ethics: efficiency optimization is not just a performance strategy but a responsibility imperative. We then examine the data governance and compliance infrastructure---access control, privacy protection, lineage tracking, and audit systems---that makes responsible practices enforceable at scale, before closing with the fallacies and pitfalls that commonly undermine even well-intentioned efforts.
|
||||
This chapter provides frameworks for diagnosing, preventing, and mitigating these failures. We begin with concrete cases that reveal the *responsibility gap*---the distance between technical performance and responsible outcomes---and the mechanisms (proxy variables, feedback loops, distribution shift) through which it manifests. From there, we develop a responsible engineering checklist that systematizes impact assessment, model documentation, disaggregated testing, and incident response into repeatable engineering processes. The chapter then turns to environmental and cost awareness, connecting the resource consumption quantified throughout this book (training compute, inference energy, carbon footprint) to engineering ethics: efficiency optimization is not just a performance strategy but a responsibility imperative. We then examine the data governance and compliance infrastructure (access control, privacy protection, lineage tracking, and audit systems) that makes responsible practices enforceable at scale, before closing with the fallacies and pitfalls that commonly undermine even well-intentioned efforts.
|
||||
|
||||
We begin with the concrete failure cases that establish *why* engineers must lead on responsibility.
|
||||
|
||||
@@ -70,7 +70,7 @@ We begin with the concrete failure cases that establish *why* engineers must lea
|
||||
|
||||
A loan model that approves 95% of qualified majority-group applicants while rejecting 40% of equally qualified minority-group applicants meets its loss function perfectly. The gap between this *technical correctness* and *responsible outcomes* represents a central challenge in machine learning systems engineering, one that existing testing methodologies were not designed to address.
|
||||
|
||||
Understanding *how* this gap manifests in practice is essential before discussing *how* to prevent it. This section traces the gap through four stages. We begin with concrete cases where optimization succeeded but systems failed, revealing the mechanisms (proxy variables, feedback loops, distribution shift) that cause harm. We then examine the silent failure modes that make these problems invisible to conventional monitoring. Turning from failure to success, we study organizations that closed the gap through systematic engineering practice. Finally, we confront the testing challenge that makes responsibility fundamentally harder to verify than traditional software correctness, and the implications for where responsibility ownership must sit within engineering organizations.\index{Responsibility Gap!technical vs. responsible success}
|
||||
The gap manifests through concrete mechanisms---proxy variables, feedback loops, distribution shift---each producing harm through a distinct pathway. Concrete cases where optimization succeeded but systems failed reveal these mechanisms and the silent failure modes that make them invisible to conventional monitoring. Organizations that closed the gap through systematic engineering practice demonstrate that prevention is feasible. The testing challenge that makes responsibility fundamentally harder to verify than traditional software correctness then determines where responsibility ownership must sit within engineering organizations.\index{Responsibility Gap!technical vs. responsible success}
|
||||
|
||||
### When Optimization Succeeds But Systems Fail {#sec-responsible-engineering-optimization-succeeds-systems-fail-1a22}
|
||||
|
||||
@@ -217,7 +217,7 @@ These failures, however, are preventable. The same engineering capabilities that
|
||||
|
||||
### When Responsible Engineering Succeeds {#sec-responsible-engineering-responsible-engineering-succeeds-29e0}
|
||||
|
||||
The preceding examples emphasize failure, but responsible engineering also produces measurable successes that demonstrate both the feasibility and business value of rigorous responsibility practices.
|
||||
Failure is not inevitable. Responsible engineering produces measurable successes that demonstrate both the feasibility and business value of rigorous responsibility practices.
|
||||
|
||||
\index{Facial Recognition!demographic disparities}Following the Gender Shades findings, Microsoft invested in improving facial recognition performance across demographic groups.\index{Bias!mitigation strategies} The approach combined technical and organizational interventions: targeted data collection to address underrepresented populations, model architecture changes to improve feature extraction for diverse skin tones, and systematic disaggregated evaluation across all demographic intersections. By 2019, Microsoft had reduced error rates for darker-skinned subjects by up to 20 times, bringing error rates below 2% for all demographic groups [@raji2019actionable]. The company published these improvements transparently, enabling external verification. The business outcome: Microsoft's facial recognition API maintained enterprise customer trust while competitors faced regulatory scrutiny and contract cancellations.
|
||||
|
||||
@@ -301,7 +301,7 @@ plt.show()
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TestingConstraintAnchor:
|
||||
"""
|
||||
Namespace for subgroup testing challenges.
|
||||
@@ -322,20 +322,20 @@ subgroup_pct_str = TestingConstraintAnchor.minority_pct_str
|
||||
subgroup_samples_str = TestingConstraintAnchor.minority_samples_str
|
||||
subgroup_data_multiplier_str = TestingConstraintAnchor.multiplier_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GenderShadesDisparity:
|
||||
"""
|
||||
Namespace for Gender Shades Error Disparity analysis.
|
||||
Scenario: Quantifying bias across demographic groups in facial recognition.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
err_light_male = 0.8
|
||||
err_light_female = 7.1
|
||||
err_dark_male = 12.0
|
||||
err_dark_female = 34.7
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
disparity_fold = err_dark_female / err_light_male
|
||||
disparity_light_female = err_light_female / err_light_male
|
||||
disparity_dark_male = err_dark_male / err_light_male
|
||||
@@ -343,10 +343,10 @@ class GenderShadesDisparity:
|
||||
acc_light_male = 100.0 - err_light_male
|
||||
acc_dark_female = 100.0 - err_dark_female
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(disparity_fold >= 40, f"Disparity ({disparity_fold:.1f}x) is too low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
error_light_male_str = fmt(err_light_male, precision=1, commas=False)
|
||||
error_light_female_str = fmt(err_light_female, precision=1, commas=False)
|
||||
error_dark_male_str = fmt(err_dark_male, precision=1, commas=False)
|
||||
@@ -396,7 +396,7 @@ These strategies complement traditional software testing rather than replacing i
|
||||
|
||||
### Engineering Leadership on Responsibility {#sec-responsible-engineering-engineering-leadership-responsibility-e03c}
|
||||
|
||||
Responsible AI Engineering, the engineering-centered practice of imposing safety constraints on stochastic systems, cannot be delegated exclusively to ethics boards or legal departments. These groups provide essential oversight but lack the technical access required to identify problems early in the development process.
|
||||
When Amazon's ethics board finally reviewed the recruiting tool, the model had already encoded proxy signals so deeply that remediation required scrapping the project entirely. The review came too late because the technical decisions that created the problem, made months earlier by engineers, had already constrained every possible fix. Responsible AI Engineering cannot be delegated exclusively to ethics boards or legal departments. These groups provide essential oversight but lack the technical access required to identify problems early in the development process.
|
||||
|
||||
::: {.callout-definition title="Responsible AI Engineering"}
|
||||
|
||||
@@ -420,7 +420,7 @@ Engineering teams do not operate in isolation. As @fig-governance-layers makes c
|
||||
|
||||
{#fig-governance-layers fig-alt="Nested oval diagram showing governance layers from innermost to outermost: Team (reliable systems, software engineering), Organization (safety culture, organizational design), Industry (trustworthy certification, external reviews), and Government Regulation."}
|
||||
|
||||
Understanding where engineering fits within this governance ecosystem leads naturally to the question of scope: what exactly falls under an engineer's responsibility? The answer extends beyond the metrics we have optimized throughout this book, revealing the full cost of the *Iron Law*.
|
||||
The question of scope remains: what exactly falls under an engineer's responsibility? The answer extends beyond the metrics we have optimized throughout this book, revealing the full cost of the *Iron Law*.
|
||||
|
||||
::: {.callout-perspective title="The Full Cost of the Iron Law"}
|
||||
The **Iron Law of ML Systems** (Principle \ref{pri-iron-law}) established in @sec-model-training-iron-law-training-performance-a53f holds that system performance depends on the interaction between data, compute, and system overhead. We have spent previous chapters optimizing each term: compressing models (@sec-model-compression), accelerating hardware (@sec-hardware-acceleration), and automating operations (@sec-ml-operations). Yet every optimization has costs beyond those captured in benchmarks.
|
||||
@@ -436,15 +436,15 @@ Competitive differentiation\index{Trust!competitive differentiation} completes t
|
||||
|
||||
The quantization techniques from @sec-model-compression reduce inference energy by 2--4$\times$, directly supporting sustainable deployment. The monitoring infrastructure from @sec-ml-operations enables disaggregated fairness evaluation across demographic groups. Responsible engineering synthesizes these capabilities into disciplined practice through structured frameworks that translate principles into processes.
|
||||
|
||||
The preceding sections established *why* ML systems fail and *who* must lead on responsibility. Knowing that engineers must lead is insufficient without knowing *how*. The cases above reveal a pattern: every failure could have been prevented by systematic processes applied at the right stage of development. What was missing was not technical capability but disciplined practice: checklists, documentation standards, testing protocols, and monitoring infrastructure that translate responsibility principles into repeatable engineering workflows.
|
||||
Every failure examined above could have been prevented by systematic processes applied at the right stage of development. The missing ingredient was not technical capability but disciplined practice: checklists, documentation standards, testing protocols, and monitoring infrastructure that translate responsibility principles into repeatable engineering workflows.
|
||||
|
||||
## Responsible Engineering Checklist {#sec-responsible-engineering-responsible-engineering-checklist-a038}
|
||||
|
||||
The frameworks that follow integrate responsibility concerns into existing development workflows throughout the ML lifecycle.\index{Responsible Engineering!checklist methodology} Rather than treating responsibility as a separate review stage, the checklist embeds it at three points where engineering decisions have the greatest ethical impact: *pre-deployment assessment* evaluates potential harms before a system reaches users, *fairness evaluation* quantifies whether performance holds equitably across demographic groups, and *documentation standards* create the audit trails that make accountability possible. Each phase builds on the previous one: assessment identifies what to measure, fairness evaluation measures it, and documentation ensures the measurements persist beyond any single team member's tenure.
|
||||
Amazon's recruiting tool could have been caught before deployment by a structured pre-deployment review. COMPAS's error rate disparity would have surfaced through disaggregated testing. Both failures shared a common cause: responsibility was treated as a separate review stage rather than integrated into the development workflow.\index{Responsible Engineering!checklist methodology} A responsible engineering checklist embeds assessment at three points where engineering decisions have the greatest ethical impact: *pre-deployment assessment* evaluates potential harms before a system reaches users, *fairness evaluation* quantifies whether performance holds equitably across demographic groups, and *documentation standards* create the audit trails that make accountability possible. Each phase builds on the previous one: assessment identifies what to measure, fairness evaluation measures it, and documentation ensures the measurements persist beyond any single team member's tenure.
|
||||
|
||||
### Pre-Deployment Assessment {#sec-responsible-engineering-predeployment-assessment-2324}
|
||||
|
||||
Production deployment requires structured evaluation of potential impacts across multiple dimensions. @tbl-pre-deployment-assessment structures this evaluation into five phases, distinguishing critical-path blockers from high-priority items that can proceed with documented risk acceptance.
|
||||
Before a loan approval model reaches production, a team must answer: Where did the training data come from? Who is represented and who is missing? What happens when the model fails, and what recourse do affected users have? @tbl-pre-deployment-assessment structures this evaluation into five phases, distinguishing critical-path blockers from high-priority items that can proceed with documented risk acceptance.
|
||||
|
||||
| **Phase** | **Priority** | **Key Questions** | **Documentation Required** |
|
||||
|:---------------|:--------------|:-----------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------|
|
||||
@@ -477,25 +477,25 @@ The Evaluation row in @tbl-pre-deployment-assessment raises a critical question:
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from IPython.display import Markdown
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class RepresentationStats:
|
||||
"""
|
||||
Namespace for Statistics of Representation.
|
||||
Scenario: Random vs Stratified sampling for a 1% minority group.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
target_imgs = 1000
|
||||
minority_frac = 0.01
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
random_total = target_imgs / minority_frac
|
||||
multiplier = random_total / target_imgs
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(multiplier == 100, "Multiplier should be 100.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
repr_target_images_str = f"{target_imgs:,}"
|
||||
repr_group_fraction_pct_str = f"{int(minority_frac * 100)}"
|
||||
repr_group_fraction_str = f"{minority_frac}"
|
||||
@@ -602,7 +602,7 @@ Google's What-If Tool enables interactive exploration of model behavior across d
|
||||
|
||||
#### Worked Example: Fairness Analysis in Loan Approval {#sec-responsible-engineering-worked-example-fairness-analysis-loan-approval-2c72}
|
||||
|
||||
A concrete example illustrates how fairness metrics reveal disparities invisible in aggregate performance measures.\index{Fairness Metrics!loan approval case study} @tbl-confusion-group-a and @tbl-confusion-group-b present loan approval outcomes for a model evaluated on two demographic groups.
|
||||
A loan approval model reports 85% accuracy across all applicants---a number that satisfies most stakeholders.\index{Fairness Metrics!loan approval case study} @tbl-confusion-group-a and @tbl-confusion-group-b reveal what the aggregate conceals: loan approval outcomes for the same model evaluated separately on two demographic groups.
|
||||
|
||||
```{python}
|
||||
#| label: fairness-metrics-calc
|
||||
@@ -621,20 +621,20 @@ A concrete example illustrates how fairness metrics reveal disparities invisible
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LoanFairness:
|
||||
"""
|
||||
Namespace for Loan Approval Fairness analysis.
|
||||
Scenario: Comparing approval rates and TPR across Majority/Minority groups.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
a_tp, a_fn = 4500, 500
|
||||
a_fp, a_tn = 1000, 4000
|
||||
b_tp, b_fn = 600, 400
|
||||
b_fp, b_tn = 200, 800
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
a_total = a_tp + a_fn + a_fp + a_tn
|
||||
b_total = b_tp + b_fn + b_fp + b_tn
|
||||
|
||||
@@ -651,10 +651,10 @@ class LoanFairness:
|
||||
a_fnr_pct = a_fn / (a_tp + a_fn) * 100
|
||||
b_fnr_pct = b_fn / (b_tp + b_fn) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(tpr_disparity >= 25, f"TPR Disparity ({tpr_disparity:.1f}%) is too low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
a_approval_str = fmt(a_app_pct, precision=0, commas=False)
|
||||
b_approval_str = fmt(b_app_pct, precision=0, commas=False)
|
||||
dp_disparity_str = fmt(dp_disparity, precision=0, commas=False)
|
||||
@@ -804,12 +804,12 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class FairnessPrice:
|
||||
"""Utility cost of closing a TPR gap via threshold adjustment in a hiring model."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
hire_value = 100_000 # Value of a successful hire ($)
|
||||
bad_hire_cost = 50_000 # Cost of a bad hire ($)
|
||||
fp_increase_pp = 5 # FP increase to close 20% TPR gap (%)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# Illustrative estimate: with a 5 pp FP increase, extra bad hires cost
|
||||
# fp_increase_pp/100 * bad_hire_cost per negative applicant, offset
|
||||
# against the full hire_value per positive applicant. Assuming a
|
||||
@@ -825,7 +825,7 @@ class FairnessPrice:
|
||||
# magnitude, not the precise number.
|
||||
utility_loss_pct = 3 # Approximate net utility loss (%)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
hire_value_k_str = f"${hire_value/1000:.0f}k" # e.g. "$100k"
|
||||
bad_hire_cost_k_str = f"${bad_hire_cost/1000:.0f}k" # e.g. "$50k"
|
||||
fp_increase_str = f"{fp_increase_pp}%" # e.g. "5%"
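
The hardcoded 3% above compresses a short expected-utility argument whose intermediate assumptions fall outside this hunk. A minimal sketch of one such calculation, with hypothetical pool numbers chosen only for illustration (the baseline utility, marginal-admit share, and marginal precision below are not the chapter's values), shows how a figure of this magnitude can arise:

```python
# Hypothetical marginal analysis for the threshold adjustment sketched above.
# Pool-composition numbers are illustrative assumptions, not the chapter's.
hire_value = 100_000        # value of a successful hire ($)
bad_hire_cost = 50_000      # cost of a bad hire ($)

# A marginal admit pays off only if their success probability exceeds
# cost / (value + cost).
breakeven_p = bad_hire_cost / (hire_value + bad_hire_cost)   # ~0.33

baseline_utility = 12_000   # assumed $/applicant under the original threshold
marginal_admit_frac = 0.03  # assumed extra admits as a share of all applicants
marginal_precision = 0.25   # assumed success rate among those extra admits

per_admit = marginal_precision * hire_value - (1 - marginal_precision) * bad_hire_cost
delta_per_applicant = marginal_admit_frac * per_admit
loss_pct = -delta_per_applicant / baseline_utility * 100

print(f"Break-even precision for a marginal admit: {breakeven_p:.0%}")
print(f"Net change per applicant: {delta_per_applicant:+,.0f} USD")
print(f"Approximate utility loss: {loss_pct:.1f}%")
```

As the surrounding comments stress, what matters is the order of magnitude: closing the gap costs a few percent of utility, not tens of percent.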
|
||||
@@ -863,7 +863,7 @@ Quantifying disparities through metrics is necessary but not sufficient for resp
|
||||
|
||||
### Explainability Requirements {#sec-responsible-engineering-explainability-requirements-0b67}
|
||||
|
||||
Explainability[^fn-explainability-interpretability]\index{Explainability!definition and purposes} enables human oversight of automated decisions, supports debugging when problems emerge, and satisfies regulatory requirements for decision transparency.\index{Transparency!regulatory requirements}
|
||||
A loan applicant denied credit by an algorithmic system has a right to know *why*---not in aggregate statistical terms, but in terms specific to her application. Explainability[^fn-explainability-interpretability]\index{Explainability!definition and purposes} provides this capability: it enables human oversight of automated decisions, supports debugging when problems emerge, and satisfies regulatory requirements for decision transparency.\index{Transparency!regulatory requirements}
|
||||
|
||||
[^fn-explainability-interpretability]: **Explainability vs. Interpretability**: *Interpretability* is an intrinsic model property---the degree to which a human can understand internal mechanics (linear regression is interpretable; a 100-layer network is not). *Explainability* is a post-hoc capability added without changing the model (LIME, SHAP). The systems implication: interpretable models constrain architecture selection (simpler models, fewer features), while explainability adds 10--100$\times$ inference latency as a separate module. Regulations like the EU AI Act demand "meaningful information about the logic involved" without specifying which approach, leaving the latency-vs-architecture trade-off to engineering teams. \index{Explainability!architecture trade-off}
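
To see where that overhead comes from, consider the simplest possible post-hoc scheme: ablate one feature at a time and record the score change. The sketch below is neither LIME nor SHAP, and the linear stand-in model is purely hypothetical, but it makes the cost structure visible: one extra forward pass per feature.

```python
import numpy as np

def perturbation_attribution(predict, x, baseline=0.0):
    """Crude post-hoc attribution: score drop when each feature is ablated.

    `predict` is any black-box scoring function (hypothetical here); the key
    systems point is the cost: one extra forward pass per feature.
    """
    base_score = predict(x)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):            # one model call per feature
        x_pert = x.copy()
        x_pert[i] = baseline
        attributions[i] = base_score - predict(x_pert)
    return attributions

# Toy stand-in model: a linear scorer (assumption for illustration only).
weights = np.array([0.8, -0.3, 0.05, 0.4])
predict = lambda v: float(v @ weights)

applicant = np.array([1.0, 2.0, 0.5, 1.5])   # hypothetical feature vector
print(perturbation_attribution(predict, applicant))
# 4 features -> 5 forward passes; a 1,000-feature model needs 1,001.
```

A model with a thousand input features therefore pays for a thousand and one forward passes per explanation, which is the regime where the 10--100$\times$ overhead cited above becomes realistic.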
|
||||
|
||||
@@ -899,7 +899,7 @@ The explainability requirements outlined above are not merely engineering best p
|
||||
|
||||
### The Regulatory Landscape {#sec-responsible-engineering-regulatory-landscape-1ec1}
|
||||
|
||||
Responsible engineering now operates within explicit regulatory frameworks\index{Regulatory Compliance!AI governance} that mandate specific technical requirements for transparency, oversight, and accountability. While regulations vary by jurisdiction, several convergent patterns have emerged that engineers must understand.
|
||||
In 2024, the EU AI Act imposed fines up to 35 million EUR or 7% of global turnover for non-compliant high-risk AI systems, and the US Federal Trade Commission brought its first enforcement actions against algorithmic discrimination. Responsible engineering now operates within explicit regulatory frameworks\index{Regulatory Compliance!AI governance} that mandate specific technical requirements for transparency, oversight, and accountability. While regulations vary by jurisdiction, several convergent patterns have emerged that engineers must understand.
|
||||
|
||||
#### The EU AI Act {#sec-responsible-engineering-eu-ai-act-1f56}
|
||||
|
||||
@@ -956,7 +956,7 @@ ML systems create unique maintenance challenges [@sculley2015hidden].\index{Tech
|
||||
|
||||
The monitoring infrastructure from @sec-ml-operations provides the foundation for responsible system operation, extending traditional operational metrics to include outcome quality measures.
|
||||
|
||||
Responsible monitoring extends along several interconnected dimensions. Performance stability tracking detects gradual prediction quality degradation that might not trigger immediate alerts—the slow accuracy decay that accumulates over weeks is far more dangerous than a sudden crash because it evades threshold-based alarms. Subgroup parity monitoring adds a fairness lens to this temporal tracking, comparing error rates across demographic groups to detect emerging disparities before they cause significant harm. These model-level metrics must be complemented by input distribution monitoring that catches population shifts and potential adversarial manipulation at the data layer, and by outcome monitoring that validates whether predictions translate to intended real-world results. Perhaps most critically, user feedback systems close the loop by surfacing complaints and corrections that reveal problems invisible to any automated metric—the kind of harm that only affected users can articulate.
|
||||
Responsible monitoring extends along several interconnected dimensions. Performance stability tracking detects gradual prediction quality degradation that might not trigger immediate alerts---the slow accuracy decay that accumulates over weeks is far more dangerous than a sudden crash because it evades threshold-based alarms. Subgroup parity monitoring adds a fairness lens to this temporal tracking, comparing error rates across demographic groups to detect emerging disparities before they cause significant harm. These model-level metrics must be complemented by input distribution monitoring that catches population shifts and potential adversarial manipulation at the data layer, and by outcome monitoring that validates whether predictions translate to intended real-world results. User feedback systems close the loop by surfacing complaints and corrections that reveal problems invisible to any automated metric---the kind of harm that only affected users can articulate.
|
||||
|
||||
Effective monitoring requires both data collection infrastructure and disciplined review processes. Dashboards that no one examines provide no protection. Engineering teams should establish regular review cadences with clear ownership and escalation procedures.
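
As a concrete illustration of the subgroup parity dimension, the sketch below assumes serving logs can be reduced to (group, correct) pairs and uses an arbitrary 1.5$\times$ disparity threshold; both the log format and the threshold are illustrative choices, not a standard.

```python
from collections import defaultdict

def subgroup_error_rates(records):
    """records: iterable of (group, was_correct) tuples from the serving log."""
    totals, errors = defaultdict(int), defaultdict(int)
    for group, was_correct in records:
        totals[group] += 1
        errors[group] += 0 if was_correct else 1
    return {g: errors[g] / totals[g] for g in totals}

def parity_alert(rates, max_ratio=1.5):
    """Flag any group whose error rate exceeds the best group's by max_ratio.

    The 1.5x threshold is an illustrative policy choice, not a standard.
    """
    best = min(rates.values())
    return {g: r for g, r in rates.items() if best > 0 and r / best > max_ratio}

# Hypothetical week of logged outcomes.
log = ([("A", True)] * 950 + [("A", False)] * 50
       + [("B", True)] * 880 + [("B", False)] * 120)
rates = subgroup_error_rates(log)
print(rates)                 # {'A': 0.05, 'B': 0.12}
print(parity_alert(rates))   # {'B': 0.12} -> page the on-call owner
```

In production this computation would run over a rolling window on the review cadence described above, with alerts routed to the named owner.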
|
||||
|
||||
@@ -968,7 +968,7 @@ In 2020, researchers estimated that training a single large NLP model emitted as
|
||||
|
||||
### Efficiency as Responsibility {#sec-responsible-engineering-efficiency-responsibility-fb99}
|
||||
|
||||
The computational demands of modern ML systems have grown dramatically. Training large language models requires thousands of GPU hours, consuming energy measured in megawatt-hours.\index{Energy Consumption!training costs} Much of this expense, however, is not intrinsic to the learning task but represents *accidental complexity*: training from scratch when fine-tuning would suffice, using larger models than tasks require, and running hyperparameter searches that explore redundant configurations. Computational cost is largely a function of engineering discipline, not just model physics.\index{Green AI!efficiency as metric}[^fn-green-ai-efficiency]
|
||||
Training a single large language model consumes thousands of GPU hours and energy measured in megawatt-hours.\index{Energy Consumption!training costs} Much of this expense, however, is not intrinsic to the learning task but represents *accidental complexity*: training from scratch when fine-tuning would suffice, using larger models than tasks require, and running hyperparameter searches that explore redundant configurations. Computational cost is largely a function of engineering discipline, not just model physics.\index{Green AI!efficiency as metric}[^fn-green-ai-efficiency]
|
||||
|
||||
[^fn-green-ai-efficiency]: **Green AI**: Schwartz et al. (2020) contrasted "Red AI" (performance at any cost) with "Green AI" (efficiency as primary metric), documenting that state-of-the-art accuracy gains from 2012--2018 required a 300,000$\times$ compute increase. Their proposal---reporting FLOPs alongside accuracy for every published result---reframes efficiency from an engineering preference into a scientific reporting obligation, making the resource cost of marginal accuracy gains visible and comparable across research groups. \index{Green AI!compute reporting}
|
||||
|
||||
@@ -1005,7 +1005,7 @@ from mlsys import Hardware, Models
|
||||
|
||||
class EdgeEfficiencyCalc:
|
||||
"""Power and latency constraints across device tiers vs. model architectures."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
h_phone = Hardware.Edge.Generic_Phone
|
||||
|
||||
smart_power = h_phone.tdp if h_phone.tdp else 3 * watt
|
||||
@@ -1033,7 +1033,7 @@ class EdgeEfficiencyCalc:
|
||||
tiny_power = 50 * milliwatt
|
||||
tiny_latency = 200 * ms
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
smart_power_str = f"{smart_power.m_as(watt):.0f} W"
|
||||
smart_latency_str = f"{smart_latency.m_as(ms):.0f} ms"
|
||||
iot_power_str = f"{iot_power.m_as(milliwatt):.0f} mW"
|
||||
@@ -1140,7 +1140,7 @@ from mlsys.constants import (
|
||||
|
||||
class InferenceCostCalc:
|
||||
"""Training vs. inference TCO comparison for a 10M-user recommendation system."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
gpu_rate_value = CLOUD_GPU_TRAINING_PER_HOUR.m_as(USD / hour)
|
||||
data_prep_hrs_value = 100 # Data preparation GPU-hours
|
||||
hyperparam_hrs_value = 500 # Hyperparameter search GPU-hours
|
||||
@@ -1152,7 +1152,7 @@ class InferenceCostCalc:
|
||||
inference_ms_value = 10 # Inference latency (ms)
|
||||
gpu_inf_rate_value = CLOUD_GPU_INFERENCE_PER_HOUR.m_as(USD / hour)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
data_prep_cost_value = data_prep_hrs_value * gpu_rate_value
|
||||
hyperparam_cost_value = hyperparam_hrs_value * gpu_rate_value
|
||||
train_cost_value = train_hrs_value * gpu_rate_value
|
||||
@@ -1167,7 +1167,7 @@ class InferenceCostCalc:
|
||||
lifecycle_train_cost_value = total_train_cost_value * retrain_quarters_value
|
||||
inf_train_ratio_value = lifecycle_inf_cost_value / lifecycle_train_cost_value
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
data_prep_str = fmt(data_prep_cost_value, precision=0, commas=True)
|
||||
hyperparam_str = fmt(hyperparam_cost_value, precision=0, commas=True)
|
||||
train_cost_str = fmt(train_cost_value, precision=0, commas=True)
|
||||
@@ -1248,8 +1248,8 @@ Engineers can estimate three-year total cost of ownership using a structured app
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import MILLION, MS_PER_SEC, HOURS_PER_DAY, SEC_PER_HOUR, CLOUD_GPU_TRAINING_PER_HOUR, USD, hour
|
||||
|
||||
# ┌── P.I.C.O. SCENARIO (Unwrapped for stability) ──────────────────────────────
|
||||
# 1. PARAMETERS (Inputs)
|
||||
# ┌── LEGO (Unwrapped for stability) ──────────────────────────────
|
||||
# 1. LOAD (Constants)
|
||||
gpu_rate = CLOUD_GPU_TRAINING_PER_HOUR.m_as(USD / hour) # $4/hour
|
||||
carbon_per_gpu_hr = 0.16
|
||||
t_data_prep_hrs = 100
|
||||
@@ -1264,7 +1264,7 @@ o_monitor_yr = 50000.0
|
||||
o_oncall_yr = 100000.0
|
||||
o_incident_yr = 20000.0
|
||||
|
||||
# 2. CALCULATION (The Physics)
|
||||
# 2. EXECUTE (The Compute)
|
||||
# A. Training
|
||||
train_cost_cycle = (t_data_prep_hrs * gpu_rate) + (t_hparam_exps * t_hparam_cost_exp) + (t_final_hrs * gpu_rate)
|
||||
train_tco_3yr = train_cost_cycle * t_cycles_3yr
|
||||
@@ -1296,10 +1296,10 @@ p_train = (train_tco_3yr / total_tco) * 100
|
||||
p_inf = (inf_tco_3yr / total_tco) * 100
|
||||
p_ops = (o_total_3yr / total_tco) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
check(inf_tco_3yr >= train_tco_3yr * 5, "Inference TCO doesn't dominate Training.")
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(inf_tco_3yr >= train_tco_3yr * 5, "Inference TCO does not dominate Training.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
users_daily_m_str = f"{i_users // MILLION}"
|
||||
recs_per_user_str = f"{i_recs_per_user}"
|
||||
inference_ms_str = f"{int(i_latency_s * MS_PER_SEC)}"
|
||||
@@ -1437,14 +1437,14 @@ Operational costs encompass infrastructure, personnel, and incident response. @t
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TCOSummary:
|
||||
"""
|
||||
Namespace for TCO Summary and Quantization ROI.
|
||||
Scenario: Quantifying savings from a 20% latency reduction.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
quant_reduction_pct = 0.20 # 20%
|
||||
|
||||
# Get values from upstream LifecycleEconomics class
|
||||
@@ -1452,17 +1452,17 @@ class TCOSummary:
|
||||
train_tco_3yr = LifecycleEconomics.train_tco_3yr
|
||||
inf_carbon_3yr = LifecycleEconomics.inf_carbon_3yr
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
inf_train_ratio = inf_tco_3yr / train_tco_3yr
|
||||
|
||||
# Savings
|
||||
savings_dollars = inf_tco_3yr * quant_reduction_pct
|
||||
savings_carbon_kg = inf_carbon_3yr * quant_reduction_pct
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(savings_dollars >= 100_000, f"Savings (${savings_dollars:,.0f}) too small to justify optimization.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
inf_train_ratio_str = fmt(inf_train_ratio, precision=0, commas=False)
|
||||
quant_savings_str = fmt(savings_dollars / 1000, precision=0, commas=False) # In K$
|
||||
quant_carbon_str = fmt(savings_carbon_kg / 1000, precision=0, commas=False) # In Tons
|
||||
@@ -1520,18 +1520,18 @@ from mlsys.formatting import fmt, check
|
||||
|
||||
class CarbonScaleCalc:
|
||||
"""GPT-3 scale training carbon footprint in tonnes CO2 and passenger-car equivalents."""
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
train_energy_mwh = 1300 # Training energy consumption (MWh)
|
||||
carbon_intensity = 0.4 # US grid average (kg CO2/kWh)
|
||||
car_annual_tons = 4.6 # Passenger car annual emissions (tons CO2)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
train_energy_kwh = train_energy_mwh * 1000
|
||||
total_emissions_kg = train_energy_kwh * carbon_intensity
|
||||
total_emissions_tons = total_emissions_kg / 1000
|
||||
cars_eq = total_emissions_tons / car_annual_tons
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
train_energy_mwh_str = fmt(train_energy_mwh, precision=0, commas=True)
|
||||
carbon_intensity_str = f"{carbon_intensity:.1f}"
|
||||
total_emissions_kg_str = fmt(total_emissions_kg, precision=0, commas=True)
|
||||
@@ -1571,7 +1571,7 @@ The checklists, fairness metrics, explainability mechanisms, and efficiency anal
|
||||
|
||||
In January 2023, Meta received a EUR 390 million fine from the Irish Data Protection Commission for processing user data for behavioral advertising without adequate legal basis---a penalty that stemmed not from a data breach but from insufficient governance infrastructure to demonstrate lawful processing. The storage architectures examined in @sec-data-engineering are not merely technical infrastructure but governance enforcement mechanisms\index{Data Governance!enforcement mechanisms} that determine who accesses data, how usage is tracked, and whether systems comply with regulatory requirements.\index{Compliance!data governance} Every architectural decision, from acquisition strategies through processing pipelines to storage design, carries governance implications that manifest when systems face regulatory audits, privacy violations, or ethical challenges. Data governance transforms from abstract policy into concrete engineering: access control systems that enforce who can read training data, audit infrastructure that tracks every data access for compliance, privacy-preserving techniques that protect individuals while enabling model training, and lineage systems that document how raw audio recordings become production models.
|
||||
|
||||
Data governance encompasses four interconnected domains. Security infrastructure protects data assets through access control and encryption, establishing the perimeter within which all other governance operates. Privacy mechanisms then determine what information is exposed even to authorized users, respecting individual rights while enabling model training. Compliance frameworks translate jurisdiction-specific regulatory requirements into architectural constraints that shape how data flows through the system. Finally, lineage and audit systems create the accountability trails that make the first three domains verifiable—without them, security policies, privacy guarantees, and compliance claims are unenforceable assertions rather than demonstrable properties. We examine each in turn, beginning with a critical framing: compliance is not optional.
|
||||
Data governance encompasses four interconnected domains. Security infrastructure protects data assets through access control and encryption, establishing the perimeter within which all other governance operates. Privacy mechanisms then determine what information is exposed even to authorized users, respecting individual rights while enabling model training. Compliance frameworks translate jurisdiction-specific regulatory requirements into architectural constraints that shape how data flows through the system. Finally, lineage and audit systems create the accountability trails that make the first three domains verifiable---without them, security policies, privacy guarantees, and compliance claims are unenforceable assertions rather than demonstrable properties. The starting point is a critical constraint: compliance is not optional.
|
||||
|
||||
::: {.callout-warning title="Compliance as Engineering Need"}
|
||||
Data governance is not optional. The EU General Data Protection Regulation (GDPR) imposes fines up to 4% of global annual revenue or 20 million euros (whichever is greater) for non-compliance. GDPR mandates specific technical capabilities: the right to erasure (Article 17) requires systems that can locate and delete all data associated with an individual, including derived features and model artifacts. The right to explanation (Article 22) requires systems that can justify automated decisions. California's CCPA, Brazil's LGPD, and China's PIPL impose similar obligations with jurisdiction-specific requirements. For ML systems, these are not legal abstractions but engineering specifications that must be built into data pipelines, storage architectures, and model training workflows from the outset.
|
||||
@@ -1850,7 +1850,7 @@ KWS systems face particularly acute privacy challenges because the always-listen
|
||||
|
||||
### Architecting for Regulatory Compliance {#sec-responsible-engineering-architecting-regulatory-compliance-eb56}
|
||||
|
||||
Security and privacy controls protect data at the technical level, but they operate within a regulatory landscape that specifies *what* must be protected, *for whom*, and *how long*. Compliance requirements transform from legal obligations into system architecture constraints\index{Regulatory Compliance!architecture constraints} that shape pipeline design, storage choices, and operational procedures. GDPR's data minimization principle\index{GDPR!data minimization principle}\index{Data Minimization!privacy by design} requires limiting collection and retention to what is necessary for stated purposes. For KWS systems, this means justifying why voice samples need retention beyond training, documenting retention periods in system design documents, and implementing automated deletion once periods expire. The "right to access" requires systems to retrieve all data associated with a user, consolidating results from distributed storage systems.
|
||||
When a European user invokes the "right to erasure" under GDPR, the voice assistant must locate and delete every recording, derived feature, and model artifact associated with that user across distributed storage systems---within 30 days. This is not a policy aspiration; it is an engineering specification with a deadline. Compliance requirements transform from legal obligations into system architecture constraints\index{Regulatory Compliance!architecture constraints} that shape pipeline design, storage choices, and operational procedures. GDPR's data minimization principle\index{GDPR!data minimization principle}\index{Data Minimization!privacy by design} requires limiting collection and retention to what is necessary for stated purposes. For KWS systems, this means justifying why voice samples need retention beyond training, documenting retention periods in system design documents, and implementing automated deletion once periods expire. The "right to access" requires systems to retrieve all data associated with a user, consolidating results from distributed storage systems.
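
One piece of infrastructure that makes both the right to access and the right to erasure tractable is a user-keyed lineage index that every pipeline updates when it persists user-derived data. The sketch below is a toy version with hypothetical store names; a production system would add authentication, soft-delete windows, and audit logging.

```python
from collections import defaultdict

class UserDataIndex:
    """Toy user-keyed lineage index (all store names here are hypothetical).

    Every pipeline that writes user-derived data registers the artifact here,
    so 'right to access' becomes a lookup and 'right to erasure' becomes a
    bounded fan-out delete instead of a scan over every store.
    """
    def __init__(self):
        self._artifacts = defaultdict(list)   # user_id -> [(store, key), ...]

    def register(self, user_id: str, store: str, key: str) -> None:
        self._artifacts[user_id].append((store, key))

    def access_report(self, user_id: str):
        return list(self._artifacts[user_id])

    def erase(self, user_id: str, delete_fn) -> int:
        """delete_fn(store, key) performs the store-specific deletion."""
        artifacts = self._artifacts.pop(user_id, [])
        for store, key in artifacts:
            delete_fn(store, key)
        return len(artifacts)

index = UserDataIndex()
index.register("user-42", "raw_audio_lake", "s3://raw/2024/05/utt-991.flac")
index.register("user-42", "feature_store", "mfcc/user-42/utt-991")
print(index.access_report("user-42"))
print(index.erase("user-42", delete_fn=lambda s, k: print(f"deleting {k} from {s}")))
```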
|
||||
|
||||
Voice assistants operating globally face overlapping regulatory regimes because compliance requirements vary by jurisdiction and apply differently based on user age and data sensitivity.\index{Data Localization!cross-border transfer} European requirements for cross-border data transfer restrict storing EU users' voice data on servers outside designated countries unless specific safeguards exist, driving architectural decisions about regional data lakes, feature store replication strategies, and processing localization. Standardized documentation frameworks like data cards\index{Data Cards!compliance documentation} [@pushkarna2022data] translate these compliance requirements into operational artifacts. Examine the data card template in @fig-data-card to see how this structured format turns abstract compliance obligations into concrete, machine-checkable fields. Training pipelines check that input datasets have valid data cards before processing, and serving systems enforce that only models trained on compliant data can deploy to production.
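
The machine-checkable aspect can be sketched as a gate that refuses to start training unless the dataset's data card validates. The field names below are illustrative placeholders, not the published data card schema or any production system's.

```python
# Hypothetical compliance gate: field names are illustrative only.
REQUIRED_FIELDS = {
    "collection_consent_basis",   # legal basis for collection
    "retention_period_days",      # drives automated deletion
    "allowed_regions",            # cross-border transfer constraint
    "contains_minors",            # triggers stricter handling
}

def validate_data_card(card: dict) -> list[str]:
    """Return a list of violations; an empty list means the dataset may be used."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - card.keys()]
    if card.get("retention_period_days", 0) <= 0:
        problems.append("retention period must be a positive number of days")
    return problems

card = {
    "collection_consent_basis": "explicit opt-in (voice enrollment flow)",
    "retention_period_days": 180,
    "allowed_regions": ["eu-west-1"],
    "contains_minors": False,
}
violations = validate_data_card(card)
if violations:
    raise RuntimeError("data card rejected: " + "; ".join(violations))
print("data card accepted; training may proceed")
```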
|
||||
|
||||
@@ -2120,3 +2120,10 @@ In @sec-conclusion, we assemble these pieces into a coherent philosophy of engin
|
||||
|
||||
::: { .quiz-end }
|
||||
:::
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:responsible_engr")
|
||||
```
|
||||
|
||||
@@ -96,7 +96,7 @@ from mlsys.constants import *
|
||||
from mlsys.formatting import fmt, sci, md_math, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TrainingEconomicsAnchor:
|
||||
"""
|
||||
Namespace for Training vs Inference cost asymmetry.
|
||||
@@ -111,7 +111,7 @@ class TrainingEconomicsAnchor:
|
||||
gpt2_train_cost_str = TrainingEconomicsAnchor.gpt2_train_cost_str
|
||||
gpt2_inf_cost_str = TrainingEconomicsAnchor.gpt2_inf_cost_str
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TrainingHardware:
|
||||
"""
|
||||
Namespace for Training Hardware Specs.
|
||||
@@ -289,12 +289,12 @@ class TrainingScenarios:
|
||||
grad_range_max_exp = 3
|
||||
small_lr = 2.5e-4
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(tanh_speedup >= 1.5, f"Tanh speedup over sigmoid should be >= 1.5x, got {tanh_speedup:.1f}x")
|
||||
check(total_preprocess_ms == 30, f"Preprocessing time mismatch: {total_preprocess_ms} != 30ms")
|
||||
check(buffer_mem_gb > 2.5, f"Buffer memory ({buffer_mem_gb:.1f} GB) unexpectedly low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpt2_training_cost_2019_str = fmt(gpt2_cost_2019, precision=0, commas=True)
|
||||
gpt4_training_cost_est_str = fmt(gpt4_cost_est / MILLION, precision=0, commas=False)
|
||||
gpt2_fwd_flops_str = sci(gpt2_fwd_flops)
|
||||
@@ -632,7 +632,7 @@ These scaling challenges share a common thread: every bottleneck traces back to
|
||||
|
||||
## Mathematical Foundations {#sec-model-training-mathematical-foundations-d894}
|
||||
|
||||
\index{Training!mathematical foundations}\index{Training!computational cost}@sec-neural-computation established *what* neural network operations compute and *why* they enable learning. This section shifts perspective to *what they cost*---the FLOPs consumed, the memory required, and the bandwidth demanded when these conceptually simple operations execute at scale. A matrix multiplication is just $C = AB$ in notation, but training GPT-2 requires executing that operation billions of times with matrices too large to fit in fast memory. The activation function $f(x) = \max(0, x)$ appears trivial, yet the choice between ReLU and sigmoid determines whether Tensor Cores can accelerate computation.
|
||||
\index{Training!mathematical foundations}\index{Training!computational cost}@sec-neural-computation established *what* neural network operations compute and *why* they enable learning. The question for a systems engineer is *what they cost*---the FLOPs consumed, the memory required, and the bandwidth demanded when these conceptually simple operations execute at scale. A matrix multiplication is just $C = AB$ in notation, but training GPT-2 requires executing that operation billions of times with matrices too large to fit in fast memory. The activation function $f(x) = \max(0, x)$ appears trivial, yet the choice between ReLU and sigmoid determines whether Tensor Cores can accelerate computation.
|
||||
|
||||
Four dimensions structure this cost analysis. First, FLOP counts of matrix operations that dominate training, accounting for 60--90% of training time [@he2016deep]. Second, memory requirements for storing activations and optimizer states simultaneously. Third, bandwidth demands that determine whether operations are compute-bound or memory-bound. Fourth, arithmetic intensity classifications that guide optimization strategy selection. Together, these dimensions provide the vocabulary for analyzing the computational intensity, memory pressure, and data dependencies introduced in @sec-model-training-training-systems-fundamentals-05d2.
|
||||
|
||||
@@ -700,14 +700,14 @@ from mlsys import Hardware, Models
|
||||
from mlsys.constants import TFLOPs, second, GPT2_HIDDEN_DIM, GPT2_LAYERS
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2Compute:
|
||||
"""
|
||||
Namespace for GPT-2 Compute Breakdown.
|
||||
Scenario: Training GPT-2 XL (1.5B) for 50k steps.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Architecture (GPT-2 XL)
|
||||
model = Models.GPT2
|
||||
hidden_dim = GPT2_HIDDEN_DIM
|
||||
@@ -723,7 +723,7 @@ class GPT2Compute:
|
||||
# Hardware (V100)
|
||||
v100_tflops = Hardware.Cloud.V100.peak_flops.m_as(TFLOPs/second)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# A. Attention Layer
|
||||
# QKV: 3 * (Batch * Seq * Hidden * Hidden)
|
||||
macs_qkv = 3 * (batch * seq_len * hidden_dim * hidden_dim)
|
||||
@@ -751,10 +751,10 @@ class GPT2Compute:
|
||||
step_tflops = flops_step_total / TRILLION
|
||||
v100_time_s = step_tflops / v100_tflops
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(flops_training_total >= QUADRILLION, f"Training FLOPs ({flops_training_total:.1e}) too low.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
qkv_billion_str = fmt(flops_qkv/BILLION, precision=0, commas=False)
|
||||
attn_billion_str = fmt(flops_score/BILLION, precision=1, commas=False)
|
||||
|
||||
@@ -810,7 +810,7 @@ $$
|
||||
- With `{python} GPT2Compute.n_layers_gpt2_str` layers in GPT-2: ~`{python} GPT2Compute.per_step_t_str` trillion FLOPs per training step
|
||||
- At `{python} GPT2Compute.training_steps_str` training steps: ~`{python} GPT2Compute.total_peta_str` petaFLOPs of total training computation
|
||||
|
||||
**System Implication:** A V100 GPU (`{python} TrainingHardware.v100_tflops_fp16_str` TFLOPS peak FP16 with Tensor Cores, `{python} TrainingHardware.v100_tflops_fp32_str` TFLOPS FP32 without) would require `{python} GPT2Compute.v100_time_str` seconds just for the attention computations per step at 100% utilization (theoretical peak; practical throughput would be lower). Actual training steps take 180 to 220ms, requiring 8 to 32 GPUs to achieve this throughput depending on utilization and interconnect efficiency.
|
||||
**System Implication:** A V100 GPU (`{python} TrainingHardware.v100_tflops_fp16_str` TFLOPS peak FP16 with Tensor Cores, `{python} TrainingHardware.v100_tflops_fp32_str` TFLOPS FP32 without) would require `{python} GPT2Compute.v100_time_str` seconds just for the attention computations per step at 100% utilization (theoretical peak; practical throughput would be lower). Actual training steps take 180 to 220 ms, requiring 8 to 32 GPUs to achieve this throughput depending on utilization and interconnect efficiency.
|
||||
|
||||
:::
|
||||
|
||||
@@ -961,7 +961,7 @@ Forward pass operations and their computational characteristics establish the wo
|
||||
|
||||
Optimization algorithms answer this question: given a loss value and the gradient information it produces, how should each parameter change to reduce future errors? These algorithms govern the learning trajectory, translating gradients into parameter updates that steer the model toward better performance.
|
||||
|
||||
The selection and design of optimization algorithms have direct system-level implications for computation efficiency, memory requirements, and scalability. While this section covers optimization algorithms used during training, post-training compression techniques (quantization, pruning, knowledge distillation) are detailed in @sec-model-compression, and systematic hyperparameter optimization approaches are covered in @sec-ml-workflow.
|
||||
The selection and design of optimization algorithms have direct system-level implications for computation efficiency, memory requirements, and scalability. The focus here is on the algorithms themselves during training; post-training compression techniques (quantization, pruning, knowledge distillation) are detailed in @sec-model-compression, and systematic hyperparameter optimization approaches are covered in @sec-ml-workflow.
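
A small sketch makes the memory implication concrete before we examine specific methods: plain SGD keeps no state beyond the parameters, while adding momentum already requires one velocity buffer the size of the model. The gradient function here is a stand-in, not backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=1_000_000).astype(np.float32)   # ~4 MB of weights
velocity = np.zeros_like(params)                         # momentum state: +4 MB

def fake_gradient(p):
    """Stand-in gradient (a real system would get this from backprop)."""
    return 0.001 * p + rng.normal(scale=1e-3, size=p.shape).astype(np.float32)

lr, beta = 0.1, 0.9
for step in range(3):
    g = fake_gradient(params)
    # Plain SGD would be: params -= lr * g        (no extra state)
    velocity = beta * velocity + g                # momentum: one extra vector
    params -= lr * velocity

print(f"parameter memory: {params.nbytes / 1e6:.1f} MB")
print(f"optimizer state:  {velocity.nbytes / 1e6:.1f} MB (momentum buffer)")
```

Adam, covered shortly, extends this pattern to two persistent buffers per parameter.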
|
||||
|
||||
#### Gradient-Based Optimization Methods {#sec-model-training-gradientbased-optimization-methods-9798}
|
||||
|
||||
@@ -1033,25 +1033,25 @@ $$\begin{aligned}
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetBatchMemory:
|
||||
"""
|
||||
Namespace for ResNet-50 Batch Memory Scaling.
|
||||
Scenario: Comparing memory footprint at B=32 vs B=64.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Values derived from empirical measurements (hardcoded for narrative consistency)
|
||||
act_mem_b32_gb = 8
|
||||
grad_mem_b32_gb = 4
|
||||
param_mem_b32_mb = 200
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Doubling batch size doubles activation and gradient memory
|
||||
act_mem_b64_gb = act_mem_b32_gb * 2
|
||||
grad_mem_b64_gb = grad_mem_b32_gb * 2
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
resnet50_act_mem_b32_gb_str = fmt(act_mem_b32_gb, precision=0, commas=False)
|
||||
resnet50_grad_mem_b32_gb_str = fmt(grad_mem_b32_gb, precision=0, commas=False)
|
||||
resnet50_param_mem_b32_mb_str = fmt(param_mem_b32_mb, precision=0, commas=False)
|
||||
@@ -1121,23 +1121,23 @@ v_t = \beta_2 v_{t-1} + (1-\beta_2)\big(\nabla \mathcal{L}(\theta_t)\big)^2
|
||||
from mlsys.constants import BYTES_FP32, byte, Mparam
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AdamMemory:
|
||||
"""
|
||||
Namespace for Adam Memory Overhead Calculation.
|
||||
Scenario: Memory cost for a generic 100M parameter model.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
params_m = 100
|
||||
vectors = 2 # m_t, v_t
|
||||
bytes_per_val = BYTES_FP32.m_as(byte)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Overhead = params * 2 vectors * 4 bytes
|
||||
overhead_mb = params_m * vectors * bytes_per_val
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
adam_overhead_str = fmt(overhead_mb, precision=0, commas=False)
|
||||
|
||||
# Note: Use AdamMemory.adam_overhead_str directly.
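
The update rule behind those two vectors, following the $m_t$ and $v_t$ recurrences shown in this section, can be sketched in a few lines. Both moment estimates must persist across steps, which is exactly the 100M $\times$ 2 vectors $\times$ 4 bytes = 800 MB overhead priced above; the toy dimensions and stand-in gradient below are for illustration only.

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; m and v must persist across steps (2 extra vectors)."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate  (m_t)
    v = b2 * v + (1 - b2) * grad**2         # second-moment estimate (v_t)
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy analogue scaled down to 1,000 weights for illustration.
rng = np.random.default_rng(1)
p = rng.normal(size=1000).astype(np.float32)
m = np.zeros_like(p)
v = np.zeros_like(p)
for t in range(1, 4):
    g = 0.01 * p                            # stand-in gradient
    p, m, v = adam_step(p, g, m, v, t)

# Per parameter, m and v each add 4 bytes in FP32, which is where the
# 100M * 2 * 4 bytes overhead computed above comes from.
print(p[:3], m[:3], v[:3])
```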
|
||||
@@ -1188,18 +1188,18 @@ from mlsys.constants import BYTES_FP32, BYTES_FP16, GB, GiB
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2Optimizer:
|
||||
"""
|
||||
Namespace for GPT-2 Optimizer Memory.
|
||||
Scenario: FP32 vs Mixed Precision (AMP) baselines.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
model = Models.GPT2
|
||||
v100_mem_gib = Hardware.Cloud.V100.memory_capacity.m_as(GiB)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# FP32 Baseline
|
||||
param_fp32_gb = model_memory(model.parameters, BYTES_FP32, GB)
|
||||
grad_fp32_gb = model_memory(model.parameters, BYTES_FP32, GB)
|
||||
@@ -1214,7 +1214,7 @@ class GPT2Optimizer:
|
||||
# Optimizer remains FP32
|
||||
total_static_amp = param_fp16_gb + grad_fp16_gb + master_fp32_gb + adam_m_fp32_gb + adam_v_fp32_gb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
param_fp32_str = fmt(param_fp32_gb, precision=1, commas=False)
|
||||
grad_fp32_str = fmt(grad_fp32_gb, precision=1, commas=False)
|
||||
adam_state_str = fmt(adam_m_fp32_gb + adam_v_fp32_gb, precision=1, commas=False)
|
||||
@@ -1439,14 +1439,14 @@ from mlsys.constants import BYTES_FP16, BYTES_ADAM_STATE, GB, MB, GiB, GPT2_HIDD
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2ActivationMemory:
|
||||
"""
|
||||
Namespace for Activation Memory breakdown.
|
||||
Scenario: Comparing Activations vs Parameters for GPT-2 XL training.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Architecture
|
||||
model = Models.GPT2
|
||||
hidden_dim = GPT2_HIDDEN_DIM
|
||||
@@ -1465,7 +1465,7 @@ class GPT2ActivationMemory:
|
||||
# Hardware
|
||||
v100_mem_gb = Hardware.Cloud.V100.memory_capacity.m_as(GiB)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# A. Per-Layer Activations (Forward)
|
||||
# Self-Attention: Q,K,V,Out projections + Scores + Dropout masks
|
||||
# Approx: 4*B*S*H (QKV+Out) + S*S*Heads (Scores)
|
||||
@@ -1498,10 +1498,10 @@ class GPT2ActivationMemory:
|
||||
recompute_overhead = 33
|
||||
act_fp32_gb = total_act_gb * 2
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_act_gb >= params_gb, f"Activations ({total_act_gb:.1f}G) should exceed Params ({params_gb:.1f}G).")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
batch_size_str = fmt(batch_size, precision=0, commas=False)
|
||||
seq_len_str = fmt(seq_len, precision=0, commas=False)
|
||||
n_layers_str = fmt(layers, precision=0, commas=False)
|
||||
@@ -1639,7 +1639,7 @@ Consider @tbl-training-arithmetic-intensity: dense matrix multiplication achieve
|
||||
|
||||
: **Training Operation Classifications.** Different operations in the training pipeline exhibit vastly different arithmetic intensities, determining whether they are limited by compute throughput or memory bandwidth. This classification guides optimization strategy: memory-bound operations benefit from precision reduction and operator fusion, while compute-bound operations benefit from faster hardware and increased parallelism. {#tbl-training-arithmetic-intensity}
|
||||
|
||||
To build intuition for these relationships, study the roofline diagram in @fig-training-roofline, a powerful tool for understanding hardware utilization. The ridge point marks the "knee" where the sloped memory-bound region meets the flat compute-bound ceiling. Operations falling left of this point are starved for data: the GPU could compute faster, but memory bandwidth cannot deliver operands quickly enough. Operations to the right are compute-bound: adding more memory bandwidth would not help because the arithmetic units themselves limit throughput. Notice how GPT-2's training operations distribute across this landscape.
|
||||
To build intuition for these relationships, study the roofline diagram in @fig-training-roofline, the standard diagnostic tool for understanding hardware utilization. The ridge point marks the "knee" where the sloped memory-bound region meets the flat compute-bound ceiling. Operations falling left of this point are starved for data: the GPU could compute faster, but memory bandwidth cannot deliver operands quickly enough. Operations to the right are compute-bound: adding more memory bandwidth would not help because the arithmetic units themselves limit throughput. Notice how GPT-2's training operations distribute across this landscape.
|
||||
|
||||
::: {#fig-training-roofline fig-env="figure" fig-pos="htb" fig-cap="**Training Roofline Model**: GPT-2 training operations mapped against arithmetic intensity on a log-log roofline diagram. Matrix multiplications operate in the compute-bound regime (right of the ridge point), while normalization and activation operations fall in the memory-bound region (left). FlashAttention shifts standard attention from below to above the ridge point, demonstrating how algorithmic redesign can move operations into a more efficient regime." fig-alt="Log-log plot showing roofline model with memory-bound slope and compute-bound ceiling. Points show different training operations: MatMul above ridge point, LayerNorm and Softmax below. Arrow shows FlashAttention improvement."}
|
||||
```{python}
|
||||
@@ -1729,21 +1729,21 @@ plt.show()
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AttentionIntensity:
|
||||
"""
|
||||
Namespace for Attention Intensity.
|
||||
Scenario: H/8 intensity formula application.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
h_small = 768 # GPT-2 Small
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Intensity = H / 8
|
||||
intensity = h_small / 8
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
attn_intensity_str = fmt(intensity, precision=0, commas=False)
|
||||
H_small_str = f"{h_small}"
|
||||
|
||||
@@ -1757,7 +1757,7 @@ GPUs have characteristic hardware ridge points where operations transition from
|
||||
|
||||
::: {.callout-perspective title="Peak FLOPS vs. Sustained Performance"}
|
||||
|
||||
Hardware vendors often market "Peak TFLOPS," but for a systems engineer, this number is often a theoretical limit that is rarely reached. The intensity gap reveals that most neural network operations—especially in the backward pass—have arithmetic intensities well below the hardware's ridge point. When an operation is memory-bound (like LayerNorm or Softmax), doubling the hardware's peak TFLOPS does *nothing* for performance. This is why **Mixed-Precision (FP16/BF16)** is so effective: it doesn't just enable faster arithmetic; it halves the bytes moved per operation, effectively doubling the "Data Supply Rate" and allowing the system to reach a much higher percentage of its peak computational capability. Successful optimization is the art of increasing arithmetic intensity through kernel fusion and reducing data movement through precision management.
|
||||
Hardware vendors often market "Peak TFLOPS," but for a systems engineer, this number is often a theoretical limit that is rarely reached. The intensity gap reveals that most neural network operations—especially in the backward pass—have arithmetic intensities well below the hardware's ridge point. When an operation is memory-bound (like LayerNorm or Softmax), doubling the hardware's peak TFLOPS does *nothing* for performance. This is why **Mixed-Precision (FP16/BF16)** is so effective: it does not just enable faster arithmetic; it halves the bytes moved per operation, effectively doubling the "Data Supply Rate" and allowing the system to reach a much higher percentage of its peak computational capability. Successful optimization is the art of increasing arithmetic intensity through kernel fusion and reducing data movement through precision management.
|
||||
:::
|
||||
|
||||
\index{Batch Size!arithmetic intensity effect}
|
||||
@@ -2071,14 +2071,14 @@ Applying this throughput analysis to our GPT-2 Lighthouse Model reveals where th
|
||||
from mlsys.constants import PCIE_GEN3_BW, GB, second
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2DataPipeline:
|
||||
"""
|
||||
Namespace for Data Pipeline Bottleneck Analysis.
|
||||
Scenario: Tokenization vs PCIe Transfer speed.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
batch_size = 32
|
||||
seq_len = 1024
|
||||
token_rate = 500_000 # tokens/sec/core
|
||||
@@ -2086,7 +2086,7 @@ class GPT2DataPipeline:
|
||||
|
||||
pcie_bw = PCIE_GEN3_BW.m_as(GB/second)
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
tokens_per_batch = batch_size * seq_len
|
||||
tokenization_ms = (tokens_per_batch / token_rate) * 1000
|
||||
|
||||
@@ -2097,7 +2097,7 @@ class GPT2DataPipeline:
|
||||
|
||||
parallel_token_ms = tokenization_ms / workers
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
tokens_per_batch_str = f"{tokens_per_batch // 1000}K"
|
||||
tokenization_ms_str = fmt(tokenization_ms, precision=0, commas=False)
|
||||
batch_kb_str = fmt(batch_kb, precision=0, commas=False)
|
||||
@@ -2116,7 +2116,7 @@ Training language models like GPT-2 requires a specialized data pipeline optimiz
|
||||
**Pipeline Stages**
|
||||
|
||||
1. Raw Text Storage (Storage Zone)
|
||||
- OpenWebText dataset: ~40GB raw text files
|
||||
- OpenWebText dataset: ~40 GB raw text files
|
||||
- Stored on NVMe SSD: `{python} TrainingHardware.nvme_bw_str` GB/s sequential read bandwidth
|
||||
- Random access to different documents: ~0.35 GB/s effective (F_access ≈ 0.1)
|
||||
2. Tokenization (CPU Preprocessing Zone)
|
||||
@@ -2126,11 +2126,11 @@ Training language models like GPT-2 requires a specialized data pipeline optimiz
|
||||
- Processing rate: ~500K tokens/second per CPU core
|
||||
- For batch_size=32, seq_len=1024: need `{python} GPT2DataPipeline.tokens_per_batch_str` tokens/batch
|
||||
- Single core: `{python} GPT2DataPipeline.tokens_per_batch_str` tokens ÷ 500K tokens/s = `{python} GPT2DataPipeline.tokenization_ms_str` ms per batch
|
||||
- Bottleneck: GPU forward pass only takes 80ms
|
||||
- Bottleneck: GPU forward pass only takes 80 ms
|
||||
3. Batching & Padding (CPU)
|
||||
- Pad sequences to uniform length (1024 tokens)
|
||||
- Pack into tensors: [32, 1024] int64 = `{python} GPT2DataPipeline.batch_kb_str` KB per batch
|
||||
- Trivial time: <5ms
|
||||
- Trivial time: <5 ms
|
||||
4. GPU Transfer (PCIe)
|
||||
- PCIe Gen3 x16: `{python} GPT2DataPipeline.pcie_gen3_str` GB/s theoretical
|
||||
- `{python} GPT2DataPipeline.batch_kb_str` KB per batch ÷ `{python} GPT2DataPipeline.pcie_gen3_str` GB/s = `{python} GPT2DataPipeline.transfer_ms_str` ms (negligible)
|
||||
@@ -2138,10 +2138,10 @@ Training language models like GPT-2 requires a specialized data pipeline optimiz
|
||||
**Bottleneck Analysis**
|
||||
|
||||
- Tokenization: `{python} GPT2DataPipeline.tokenization_ms_str` ms
|
||||
- GPU compute: 80ms
|
||||
- Transfer: <1ms
|
||||
- GPU compute: 80 ms
|
||||
- Transfer: <1 ms
|
||||
|
||||
System is balanced (tokenization ≈ GPU compute), but tokenization becomes bottleneck with faster GPUs (A100: 45ms compute means tokenization limits throughput).
|
||||
System is balanced (tokenization ≈ GPU compute), but tokenization becomes bottleneck with faster GPUs (A100: 45 ms compute means tokenization limits throughput).
|
||||
|
||||
**Optimization Applied**
|
||||
|
||||
@@ -2176,19 +2176,19 @@ While data pipeline throughput determines how fast training data reaches the GPU
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.constants import BYTES_FP16, ALLREDUCE_FACTOR
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class NetworkWall:
|
||||
"""
|
||||
Namespace for Network Wall Calculation.
|
||||
Scenario: Gradient synchronization bottleneck.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
params_b = 7
|
||||
bytes_per_param = BYTES_FP16.m_as(byte) # 2
|
||||
network_bw_gbs = 12.5 # 100 Gbps
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Gradient Size = 7B * 2 bytes = 14 GB
|
||||
gradient_size_gb = params_b * bytes_per_param
|
||||
|
||||
@@ -2198,7 +2198,7 @@ class NetworkWall:
|
||||
# Time = Data / Bandwidth
|
||||
time_s = allreduce_gb / network_bw_gbs
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
model_params_b_str = fmt(params_b, precision=0, commas=False)
|
||||
gradient_size_str = fmt(gradient_size_gb, precision=0, commas=False)
|
||||
allreduce_str = fmt(allreduce_gb, precision=0, commas=False)
|
||||
@@ -2343,14 +2343,14 @@ These hardware utilization patterns reinforce the batch-size--utilization relati
|
||||
from mlsys.constants import BYTES_FP16, BYTES_FP32, BYTES_ADAM_STATE, byte
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class VRAMRequirements:
|
||||
"""
|
||||
Namespace for VRAM Requirements Calculation.
|
||||
Scenario: Can we train a 7B model on a 24GB GPU?
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
params_b = 7
|
||||
gpu_capacity_gb = 24
|
||||
|
||||
@@ -2365,13 +2365,13 @@ class VRAMRequirements:
|
||||
layers = 32
|
||||
activations_gb = 2
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
weights_gb = params_b * bytes_fp16
|
||||
gradients_gb = params_b * bytes_fp16
|
||||
optimizer_gb = params_b * bytes_adam
|
||||
subtotal_gb = weights_gb + gradients_gb + optimizer_gb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
vram_params_b_str = fmt(params_b, precision=0, commas=False)
|
||||
vram_gpu_capacity_str = fmt(gpu_capacity_gb, precision=0, commas=False)
|
||||
vram_fp16_bytes_str = f"{int(bytes_fp16)}"
|
||||
@@ -2436,14 +2436,14 @@ from mlsys.constants import BYTES_FP32, BYTES_FP16, GB, MB, Mparam, Bparam
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class ResNetMemoryScaling:
|
||||
"""
|
||||
Namespace for ResNet-50 Memory Scaling.
|
||||
Scenario: Impact of batch size on total memory footprint.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# ResNet-50 Conv1 output: 112 × 112 × 64
|
||||
first_conv_h, first_conv_w, first_conv_c = 112, 112, 64
|
||||
|
||||
@@ -2457,7 +2457,7 @@ class ResNetMemoryScaling:
|
||||
# GPT-3
|
||||
gpt3_params = 175 * BILLION
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Conv1 size
|
||||
first_conv_mb = (first_conv_h * first_conv_w * first_conv_c * 4) / MILLION # FP32=4 bytes
|
||||
|
||||
@@ -2474,7 +2474,7 @@ class ResNetMemoryScaling:
|
||||
gpt3_fp32_gb = model_memory(gpt3_params, BYTES_FP32, GB)
|
||||
gpt3_fp16_gb = model_memory(gpt3_params, BYTES_FP16, GB)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
first_conv_mb_str = fmt(first_conv_mb, precision=0, commas=False)
|
||||
|
||||
total_gb_b32_str = fmt(total_gb_b32, precision=1, commas=False)
|
||||
@@ -2676,14 +2676,14 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class LlamaTraining:
|
||||
"""
|
||||
Namespace for "The Utility Bill" callout.
|
||||
Scenario: Training Llama-2-70B on 1000 H100s.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
# Model: Llama-2-70B
|
||||
params = 70 * BILLION
|
||||
tokens = 2 * TRILLION
|
||||
@@ -2698,7 +2698,7 @@ class LlamaTraining:
|
||||
rental_rate = 3 # $/hr
|
||||
purchase_price = 30_000 # $
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# Compute Logic
|
||||
effective_tflops = peak_tflops * utilization
|
||||
total_flops = scaling_factor * params * tokens
|
||||
@@ -2714,13 +2714,13 @@ class LlamaTraining:
|
||||
purchase_cost = num_gpus * purchase_price
|
||||
breakeven_runs = purchase_cost / rental_cost
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(rental_cost < purchase_cost,
|
||||
f"Renting (${rental_cost:,.0f}) is more expensive than buying (${purchase_cost:,.0f}) for 1 run!")
|
||||
check(breakeven_runs >= 3,
|
||||
f"Breakeven ({breakeven_runs:.1f}) is too low, weakens 'Cloud for bursty' argument.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
# Helper for scientific notation parts
|
||||
_flops_str = f"{total_flops:.1e}"
|
||||
flops_mantissa = _flops_str.split("e+")[0]
|
||||
@@ -2858,7 +2858,7 @@ These bottlenecks interact in complex ways, illustrating the Conservation of Com
|
||||
|
||||
The pipeline architecture established above creates opportunities for targeted optimizations. Effective optimization follows a systematic methodology that applies regardless of system scale or model architecture. This three-phase framework provides the foundation for all optimization work: profile to identify bottlenecks, select appropriate techniques for the identified constraints, and compose solutions that address multiple bottlenecks simultaneously without creating conflicts.
|
||||
|
||||
The profiling phase employs tools like PyTorch Profiler, TensorFlow Profiler, or NVIDIA Nsight Systems to reveal where time is spent during training iterations. These are the same profiling approaches introduced in the overview, now applied systematically to quantify which bottleneck dominates. A profile might show `{python} TrainingScenarios.profile_data_pct_str`% of time in data loading, `{python} TrainingScenarios.profile_compute_pct_str`% in computation, and `{python} TrainingScenarios.profile_mem_pct_str`% in memory operations, clearly indicating data loading as the primary target for optimization.
|
||||
The profiling phase employs tools like PyTorch Profiler, TensorFlow Profiler, or NVIDIA Nsight Systems to reveal where time is spent during training iterations. These are the same profiling approaches introduced in the overview, now applied systematically to quantify which bottleneck dominates. A profile might show `{python} TrainingScenarios.profile_data_pct_str`% of time in data loading, `{python} TrainingScenarios.profile_compute_pct_str`% in computation, and `{python} TrainingScenarios.profile_mem_pct_str`% in memory operations, indicating data loading as the primary target for optimization.
|
||||
|
||||
The selection phase matches optimization techniques to identified bottlenecks. Each technique we examine targets specific constraints: prefetching addresses data movement latency, mixed-precision training tackles both computational throughput and memory constraints, and gradient accumulation manages memory limitations. Selection requires understanding not just which bottleneck exists, but the characteristics of the hardware, model architecture, and training configuration that influence technique effectiveness.
|
||||
|
||||
@@ -3278,14 +3278,14 @@ from mlsys.constants import GPT2_PARAMS, Mparam, Bparam, BYTES_FP32, BYTES_FP16,
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MixedPrecisionMemory:
|
||||
"""
|
||||
Namespace for Mixed Precision Memory Savings.
|
||||
Scenario: FP32 vs Mixed Precision vs Checkpointing for GPT-2.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
batch_size = 32
|
||||
|
||||
# Pre-calculated activation sizes (GB)
|
||||
@@ -3293,7 +3293,7 @@ class MixedPrecisionMemory:
|
||||
act_fp16_gb = 32.6
|
||||
act_ckpt_gb = 8.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
# A. FP32 Baseline
|
||||
# Params (4 bytes), Grads (4 bytes), Optimizer (8 bytes: m, v)
|
||||
p_fp32 = model_memory(GPT2_PARAMS, BYTES_FP32, GB)
|
||||
@@ -3318,11 +3318,11 @@ class MixedPrecisionMemory:
|
||||
# Savings
|
||||
savings_pct = ((total_fp32 - total_mp) / total_fp32) * 100
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(total_mp < total_fp32, f"Mixed Precision ({total_mp:.1f}G) didn't save memory vs FP32 ({total_fp32:.1f}G).")
|
||||
check(total_ckpt < total_mp, "Checkpointing should further reduce memory.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
gpt2_b_str = fmt(GPT2_PARAMS.m_as(Bparam), precision=1, commas=False)
|
||||
mp_batch_size_str = fmt(batch_size, precision=0, commas=False)
|
||||
|
||||
@@ -3390,24 +3390,24 @@ model_1b_fp16_gb_str = MixedPrecisionMemory.model_1b_fp16_gb_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class MixedPrecisionSpeedup:
|
||||
"""
|
||||
Namespace for Mixed Precision Speedup.
|
||||
Scenario: V100 throughput (samples/sec) FP32 vs FP16.
|
||||
"""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
throughput_fp32 = 90.0
|
||||
throughput_fp16 = 220.0
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
speedup = throughput_fp16 / throughput_fp32
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(speedup >= 2.0, f"Speedup ({speedup:.1f}x) is too small to justify mixed precision complexity.")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
v100_mp_speedup_str = fmt(speedup, precision=1, commas=False)
|
||||
throughput_fp32_str = fmt(throughput_fp32, precision=0, commas=False)
|
||||
throughput_fp16_str = fmt(throughput_fp16, precision=0, commas=False)
|
||||
@@ -3604,22 +3604,22 @@ Optimal mixed-precision training requires matching the precision format to hardw
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class CrossGenPrecisionCalc:
|
||||
"""Cross-generation GPU throughput speedup: V100 FP32 baseline."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
v100_fp32_sps = 18 # samples/sec
|
||||
v100_fp16_sps = 45 # samples/sec
|
||||
a100_bf16_sps = 165 # samples/sec
|
||||
h100_fp8_sps = 380 # samples/sec
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
v100_fp16_speedup = v100_fp16_sps / v100_fp32_sps
|
||||
a100_over_v100 = a100_bf16_sps / v100_fp32_sps
|
||||
h100_over_v100 = h100_fp8_sps / v100_fp32_sps
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
v100_fp16_speedup_str = fmt(v100_fp16_speedup, precision=1, commas=False)
|
||||
a100_over_v100_str = fmt(a100_over_v100, precision=1, commas=False)
|
||||
h100_over_v100_str = fmt(h100_over_v100, precision=0, commas=False)
|
||||
@@ -3634,7 +3634,7 @@ a100_over_v100_str = CrossGenPrecisionCalc.a100_over_v100_str
|
||||
h100_over_v100_str = CrossGenPrecisionCalc.h100_over_v100_str
|
||||
```
|
||||
|
||||
The performance impact across generations is substantial. Training our lighthouse GPT-2 model (`{python} TrainingModels.gpt2_params_b_str` B parameters) on a single GPU illustrates how hardware and precision co-evolve: V100 achieves `{python} v100_fp32_sps` samples/sec in FP32 and `{python} v100_fp16_sps` samples/sec in FP16 (`{python} v100_fp16_speedup_str`$\times$ speedup), A100 reaches `{python} a100_bf16_sps` samples/sec in BF16 (`{python} a100_over_v100_str`$\times$ over V100 FP32), and H100 delivers `{python} h100_fp8_sps` samples/sec in FP8 (`{python} h100_over_v100_str`$\times$ over V100 FP32). These speedups compound with the memory savings discussed earlier, enabling both faster iteration and larger models. The hardware-software co-design principle emerges clearly: algorithmic techniques like mixed precision unlock specialized hardware capabilities, while hardware features like Tensor Cores make certain algorithms practical.
|
||||
The performance impact across generations is substantial. Training our lighthouse GPT-2 model (`{python} TrainingModels.gpt2_params_b_str` B parameters) on a single GPU illustrates how hardware and precision co-evolve: V100 achieves `{python} v100_fp32_sps` samples/sec in FP32 and `{python} v100_fp16_sps` samples/sec in FP16 (`{python} v100_fp16_speedup_str`$\times$ speedup), A100 reaches `{python} a100_bf16_sps` samples/sec in BF16 (`{python} a100_over_v100_str`$\times$ over V100 FP32), and H100 delivers `{python} h100_fp8_sps` samples/sec in FP8 (`{python} h100_over_v100_str`$\times$ over V100 FP32). These speedups compound with the memory savings discussed earlier, enabling both faster iteration and larger models. The hardware-software co-design principle is evident: algorithmic techniques like mixed precision unlock specialized hardware capabilities, while hardware features like Tensor Cores make certain algorithms practical.
|
||||
|
||||
### Flash Attention: IO-Aware Attention Optimization {#sec-model-training-flash-attention-ioaware-attention-optimization-3da0}
|
||||
|
||||
@@ -3673,21 +3673,21 @@ $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right
|
||||
from mlsys.constants import BYTES_FP32, MB, GB, byte
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class AttentionMemoryCalc:
|
||||
"""Quadratic memory cost of materializing the N×N attention matrix."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
seq_len = 4096
|
||||
embed_dim = 64 # per head
|
||||
n_heads = 16
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
attn_matrix_mb = (seq_len ** 2 * BYTES_FP32).m_as(MB)
|
||||
total_attn_mb = attn_matrix_mb * n_heads
|
||||
total_attn_gb = (total_attn_mb * MB).m_as(GB)
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fa_seq_len_str = f"{seq_len:,}"
|
||||
embed_dim_str = f"{embed_dim}"
|
||||
fa_n_heads_str = f"{n_heads}"
|
||||
@@ -3771,15 +3771,15 @@ Flash Attention achieves asymptotic improvements in both memory footprint and me
|
||||
from mlsys.constants import BYTES_FP32, MB
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FlashAttentionCalc:
|
||||
"""Standard vs FlashAttention per-head memory comparison."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
fa_n = 4096 # sequence length
|
||||
fa_d = 64 # head dimension
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# Standard attention: n^2 attention matrix
|
||||
fa_standard_mb = (fa_n**2 * BYTES_FP32).m_as(MB)
|
||||
|
||||
@@ -3789,7 +3789,7 @@ class FlashAttentionCalc:
|
||||
# Reduction factor
|
||||
fa_reduction = fa_standard_mb / fa_flash_mb
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
fa_standard_mb_str = fmt(fa_standard_mb, precision=0, commas=False)
|
||||
fa_flash_mb_str = fmt(fa_flash_mb, precision=0, commas=False)
|
||||
fa_reduction_str = fmt(fa_reduction, precision=0, commas=False)
|
||||
@@ -3889,7 +3889,7 @@ The benefits of Flash Attention become concrete when measured on real hardware.
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class FlashAttentionSpeedup:
|
||||
"""
|
||||
Namespace for Flash Attention Speedup Calculation.
|
||||
@@ -3925,18 +3925,7 @@ Subsequent versions have continued improving performance: Flash Attention 2 (202
|
||||
|
||||
#### When to Use Flash Attention {#sec-model-training-use-flash-attention-375d}
|
||||
|
||||
Flash Attention should be considered the default attention implementation for transformer training with clear decision criteria:
|
||||
|
||||
**Always use Flash Attention when:**
|
||||
- Training any transformer model with sequence length > 512 tokens
|
||||
- Sequence length > 2048 tokens (essential, standard attention likely OOMs)
|
||||
- Using modern GPUs (A100, H100) with hardware support
|
||||
- Memory is constrained and larger batches are desired
|
||||
|
||||
**Flash Attention provides diminishing returns when:**
|
||||
- Sequence length < 512 tokens (overhead of tiling not worthwhile)
|
||||
- Using very old GPU architectures without fast SRAM
|
||||
- Non-attention architectures (CNNs, MLPs)
|
||||
Flash Attention should be considered the default attention implementation for transformer training. It is essential for any model with sequence lengths exceeding 512 tokens, and mandatory above 2,048 tokens where standard attention likely exhausts memory. Modern GPUs (A100, H100) with hardware support for fast SRAM benefit most. The returns diminish for sequence lengths below 512 tokens (where tiling overhead is not worthwhile), on pre-Volta GPU architectures without fast SRAM, and for non-attention architectures (CNNs, MLPs).
|
||||
|
||||
In practice, deep learning frameworks handle Flash Attention integration transparently. PyTorch 2.0+ automatically selects Flash Attention when available and appropriate. For optimal performance:
|
||||
|
||||
@@ -3951,13 +3940,9 @@ The integration is typically a single-line change---swapping a manual attention
|
||||
\index{IO-Aware Design!algorithm optimization}\index{Memory Bandwidth!bottleneck mitigation}\index{Tiling Algorithms!matrix computation}
|
||||
Flash Attention exemplifies a fundamental systems engineering principle: **IO-aware algorithm design**. The core insight recognizes that modern accelerators are increasingly compute-abundant but bandwidth-constrained. An algorithm's runtime is determined not by FLOP count but by memory traffic.
|
||||
|
||||
This principle extends beyond attention:
|
||||
This principle extends beyond attention. In IO-aware matrix multiplication, tiling algorithms like those in CUTLASS minimize DRAM traffic by maximizing data reuse in fast caches. A naive $n \times n$ matrix multiply performs $O(n^3)$ FLOPs with $O(n^2)$ memory traffic, while blocked algorithms maintain $O(n^3)$ FLOPs but reduce cache misses through locality optimization.
|
||||
|
||||
**IO-aware matrix multiplication.** Tiling algorithms like those in CUTLASS minimize DRAM traffic by maximizing data reuse in fast caches. A naive $n \times n$ matrix multiply performs $O(n^3)$ FLOPs with $O(n^2)$ memory traffic, while blocked algorithms maintain $O(n^3)$ FLOPs but reduce cache misses through locality optimization.
|
||||
|
||||
**Communication-efficient distributed training.** Gradient compression techniques apply similar principles, trading extra computation (compression/decompression) for reduced network bandwidth consumption.
|
||||
|
||||
**Edge deployment.** Low-power edge devices with limited memory bandwidth benefit even more from IO-aware algorithms, where a 10% increase in FLOPs that halves memory traffic yields 3--5$\times$ energy savings.
|
||||
The same logic applies to communication-efficient distributed training: gradient compression techniques trade extra computation (compression/decompression) for reduced network bandwidth consumption. Low-power edge devices with limited memory bandwidth benefit even more from IO-aware algorithms, where a 10% increase in FLOPs that halves memory traffic yields 3--5$\times$ energy savings.
|
||||
|
||||
Flash Attention's impact on practical model training capabilities is substantial. By eliminating the $O(n^2)$ memory bottleneck, it enables:
|
||||
|
||||
@@ -4143,7 +4128,7 @@ With checkpointing, only a subset of the activations is retained during the forw
|
||||
|
||||
The implementation involves three steps. First, split the model into segments. Second, retain activations only at the boundaries of these segments during the forward pass. Third, recompute activations for intermediate layers during the backward pass when needed.
|
||||
|
||||
Frameworks like PyTorch provide tools such as `torch.utils.checkpoint` to simplify this process. Checkpointing is particularly effective for very deep architectures, such as transformers or large convolutional networks, where the memory required for storing activations can exceed the GPU's capacity.
|
||||
Frameworks like PyTorch provide tools such as `torch.utils.checkpoint` to simplify this process. Checkpointing is particularly effective for deep architectures with dozens or hundreds of layers, such as transformers or large convolutional networks, where the memory required for storing activations can exceed the GPU's capacity.
|
||||
|
||||
The synergy between gradient accumulation and checkpointing enables training of larger, more complex models. Gradient accumulation manages memory constraints related to batch size, while checkpointing optimizes memory usage for intermediate activations. Together, these techniques expand the range of models that can be trained on available hardware.
|
||||
|
||||
@@ -4200,7 +4185,7 @@ Returning to our GPT-2 Lighthouse Model, *gradient accumulation* is essential fo
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GradientAccumulation:
|
||||
"""
|
||||
Namespace for Gradient Accumulation Calculation.
|
||||
@@ -4355,7 +4340,7 @@ Both techniques introduce explicit trade-offs. Activation checkpointing adds app
|
||||
| **Implementation Complexity** | Moderate (requires tuning of prefetch parameters) | Low to moderate (with framework support) | Moderate (requires careful segmentation and accumulation) |
|
||||
| **Main Benefits** | Reduces training time, improves hardware utilization | Faster training, larger models, reduced memory usage | Enables larger batch sizes and deeper models |
|
||||
| **Primary Challenges** | Tuning buffer sizes, increased memory usage | Potential numerical instability, loss scaling needed | Increased computational overhead, slower parameter updates |
|
||||
| **Ideal Use Cases** | Large datasets, complex preprocessing | Large-scale models, especially in NLP and computer vision | Very deep networks, memory-constrained environments |
|
||||
| **Ideal Use Cases** | Large datasets, complex preprocessing | Large-scale models, especially in NLP and computer vision | Deep networks (50+ layers), memory-constrained environments |
|
||||
|
||||
: **Optimization Strategies.** Prefetching, mixed-precision training, and gradient accumulation address distinct bottlenecks in AI training pipelines: data transfer, memory consumption, and backpropagation. Selecting an appropriate strategy balances implementation complexity against gains in speed and resource utilization, depending on hardware and workload characteristics. {#tbl-optimization}
|
||||
|
||||
@@ -4392,18 +4377,18 @@ from mlsys.constants import GPT2_PARAMS, GPT2_LAYERS, GPT2_HIDDEN_DIM, V100_MEM_
|
||||
from mlsys.formatting import fmt, check
|
||||
from mlsys.formulas import model_memory
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2WalkthroughCalc:
|
||||
"""Three-step memory reduction: FP32 → AMP → gradient checkpointing."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
batch_size = 32
|
||||
seq_len = 1024
|
||||
act_fp32_gb = 65.0 # empirical activations for GPT-2 XL, batch=32, seq=1024
|
||||
checkpoint_factor = 4 # checkpoint every 4 layers → 4× reduction
|
||||
recompute_overhead_pct = 33 # empirical: ~33% more compute
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
# Step 1: FP32 Baseline
|
||||
params_fp32_gb = model_memory(GPT2_PARAMS, BYTES_FP32, GB)
|
||||
grads_fp32_gb = params_fp32_gb
|
||||
@@ -4429,7 +4414,7 @@ class GPT2WalkthroughCalc:
|
||||
# Improvement calculations
|
||||
amp_reduction_pct = (1 - total_amp_gb / total_fp32_gb) * 100
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
params_fp32_str = fmt(params_fp32_gb, precision=1, commas=False)
|
||||
grads_fp32_str = fmt(grads_fp32_gb, precision=1, commas=False)
|
||||
adam_fp32_str = fmt(adam_fp32_gb, precision=1, commas=False)
|
||||
@@ -4528,7 +4513,7 @@ Remaining bottleneck: compute-bound---an *Algorithm* constraint in D·A·M terms
|
||||
Three key principles emerge from this analysis:
|
||||
|
||||
1. **Profile before optimizing**: Each optimization targeted a specific bottleneck revealed by profiling
|
||||
2. **Techniques compose**: Mixed precision alone wasn't enough; combining it with checkpointing and prefetching achieved the goal
|
||||
2. **Techniques compose**: Mixed precision alone was not enough; combining it with checkpointing and prefetching achieved the goal
|
||||
3. **Trade-offs are explicit**: We accepted `{python} recompute_overhead_str`% more compute (checkpointing) to gain ~3$\times$ memory reduction
|
||||
|
||||
The systematic framework—profile, identify bottleneck, apply targeted technique, re-profile—transforms optimization from trial-and-error into engineering practice.
|
||||
@@ -4556,11 +4541,11 @@ The GPT-2 case study demonstrates how the optimization techniques examined in th
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class GPT2SummaryCalc:
|
||||
"""GPT-2 optimization summary: FP32 baseline vs AMP + checkpointing."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
# Baseline (FP32)
|
||||
b_param = 6.0 # GB
|
||||
b_grad = 6.0 # GB
|
||||
@@ -4581,14 +4566,14 @@ class GPT2SummaryCalc:
|
||||
o_energy = 115000 # kWh
|
||||
o_carbon = 52.0 # tons CO₂
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
b_total_mem = b_param + b_grad + b_master + b_opt + b_act
|
||||
b_cost = b_energy * 0.10
|
||||
|
||||
o_total_mem = o_param + o_grad + o_master + o_opt + o_act
|
||||
o_cost = o_energy * 0.10
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
# Note: Units embedded in _str vars because these populate a summary table
|
||||
b_param_str = f"{fmt(b_param, precision=1, commas=False)} GB"
|
||||
b_grad_str = f"{fmt(b_grad, precision=1, commas=False)} GB"
|
||||
@@ -4670,11 +4655,11 @@ o_carbon_str = GPT2SummaryCalc.o_carbon_str
|
||||
# └─────────────────────────────────────────────────────────────────────────────
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class OptimizationSummaryCalc:
|
||||
"""Headline improvement ratios from GPT-2 optimization walkthrough."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
naive_mem_gb = 89.0
|
||||
optimized_mem_gb = 32.0
|
||||
naive_energy_kwh = 275_000
|
||||
@@ -4682,12 +4667,12 @@ class OptimizationSummaryCalc:
|
||||
naive_days = 14
|
||||
optimized_days = 8.4
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
|
||||
mem_reduction = naive_mem_gb / optimized_mem_gb
|
||||
energy_reduction_pct = (1 - optimized_energy_kwh / naive_energy_kwh) * 100
|
||||
time_speedup = naive_days / optimized_days
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ─────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
|
||||
mem_reduction_str = fmt(mem_reduction, precision=1, commas=False)
|
||||
energy_reduction_pct_str = fmt(energy_reduction_pct, precision=0, commas=False)
|
||||
time_speedup_str = fmt(time_speedup, precision=1, commas=False)
|
||||
@@ -4775,9 +4760,9 @@ plt.show()
|
||||
### Single-Node Multi-GPU Training {#sec-model-training-singlenode-multigpu-training-c87f}
|
||||
|
||||
\index{Multi-GPU Training!single node}\index{Training!multi-GPU configuration}
|
||||
Multi-GPU training within a single node, the scope of this book, predates large-scale distributed systems. AlexNet[^fn-training-alexnet] (2012) famously split its model across two GTX 580 GPUs---not because the model was too large, but because the 3GB memory per GPU couldn't hold both the model and the batch activations. This single-node, multi-GPU configuration remains common today and introduces the core parallelism strategies without the complexity of network communication.
|
||||
Multi-GPU training within a single node, the scope of this book, predates large-scale distributed systems. AlexNet[^fn-training-alexnet] (2012) famously split its model across two GTX 580 GPUs---not because the model was too large, but because the 3 GB memory per GPU could not hold both the model and the batch activations. This single-node, multi-GPU configuration remains common today and introduces the core parallelism strategies without the complexity of network communication.
|
||||
|
||||
[^fn-training-alexnet]: **AlexNet (2012)**: The model's 60M parameters (~240MB) fit on one GPU, but the large intermediate feature maps (*activations*) produced during training did not. This forced a model-parallel design where specific layers communicated across GPUs, a workaround dictated entirely by the memory ceiling of a single 3GB GTX 580. \index{AlexNet!multi-GPU origin}
|
||||
[^fn-training-alexnet]: **AlexNet (2012)**: The model's 60M parameters (~240 MB) fit on one GPU, but the large intermediate feature maps (*activations*) produced during training did not. This forced a model-parallel design where specific layers communicated across GPUs, a workaround dictated entirely by the memory ceiling of a single 3 GB GTX 580. \index{AlexNet!multi-GPU origin}
|
||||
|
||||
The two foundational strategies---data parallelism and model parallelism---represent fundamentally different answers to the question: *what do we replicate, and what do we partition?* This distinction determines memory requirements, communication patterns, and scaling behavior.
|
||||
|
||||
@@ -4831,7 +4816,7 @@ Data parallelism replicates the entire model on each GPU, with each processing d
|
||||
|
||||
Data parallelism's appeal lies in its simplicity and efficiency. Each GPU runs the identical forward-backward computation, just on different data. The only coordination required is averaging gradients at the end of each step---a single synchronization point per iteration. This makes data parallelism the default choice when models fit in GPU memory. Frameworks like PyTorch's `DistributedDataParallel` and TensorFlow's `MirroredStrategy` automate the gradient synchronization, making multi-GPU data parallelism nearly as simple as single-GPU training.
|
||||
|
||||
However, data parallelism has a hard constraint: every GPU must hold a complete copy of the model. For a 7B parameter model in FP16, that's 14 GB just for weights---before gradients, optimizer states, or activations. When models exceed available GPU memory, a different strategy becomes necessary.
|
||||
However, data parallelism has a hard constraint: every GPU must hold a complete copy of the model. For a 7B parameter model in FP16, that amounts to 14 GB just for weights---before gradients, optimizer states, or activations. When models exceed available GPU memory, a different strategy becomes necessary.
|
||||
|
||||
#### Model Parallelism {#sec-model-training-model-parallelism-c97e}
|
||||
|
||||
@@ -4949,7 +4934,7 @@ When single-node multi-GPU training remains insufficient, distributed training e
|
||||
|
||||
Recall the **Energy-Movement Invariant** from @sec-data-engineering: moving data is 100--1,000$\times$ more expensive than computing on it. In distributed training, this physical law manifests as the **Communication Tax**.
|
||||
|
||||
When you synchronize gradients across a fleet of GPUs, you are moving megabytes of data across a network or PCIe bus for every few milliseconds of computation. If the energy required for communication ($E_{net}$) exceeds the energy for computation ($E_{compute}$), your system efficiency ($\eta$) collapses. This is why techniques like **Mixed Precision** (@sec-model-training-mixedprecision-training-9218) and **Gradient Compression** are essential: they aren't just "speedups"; they are essential tools for managing the physical limits of distributed scaling.
|
||||
When you synchronize gradients across a fleet of GPUs, you are moving megabytes of data across a network or PCIe bus for every few milliseconds of computation. If the energy required for communication ($E_{net}$) exceeds the energy for computation ($E_{compute}$), your system efficiency ($\eta$) collapses. This is why techniques like **Mixed Precision** (@sec-model-training-mixedprecision-training-9218) and **Gradient Compression** are essential: they are not just "speedups"; they are essential tools for managing the physical limits of distributed scaling.
|
||||
:::
|
||||
|
||||
Beyond data and model parallelism, three additional strategies address the specific challenges of distributed training:
|
||||
@@ -5075,13 +5060,13 @@ from mlsys.constants import (
|
||||
)
|
||||
from mlsys.formatting import fmt, check
|
||||
|
||||
# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
|
||||
# ┌── LEGO ───────────────────────────────────────────────
|
||||
class TrainingCarbonFootprint:
|
||||
"""
|
||||
Namespace for Carbon Footprint Calculation.
|
||||
Scenario: Energy and CO2 analysis for training a 7B model.
|
||||
"""
|
||||
# ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||||
cf_params = 7 * BILLION
|
||||
cf_tokens = 1 * TRILLION
|
||||
cf_scaling_factor = 6 # Chinchilla scaling constant
|
||||
@@ -5090,7 +5075,7 @@ class TrainingCarbonFootprint:
|
||||
cf_gpu_tdp_w = A100_TDP.m_as(watt)
|
||||
cf_cpu_tdp_per_host_w = 200 # CPU power per host
|
||||
|
||||
# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
|
||||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||||
cf_hosts = cf_num_gpus // GPUS_PER_HOST
|
||||
|
||||
# Compute time
|
||||
@@ -5117,11 +5102,11 @@ class TrainingCarbonFootprint:
|
||||
cf_flops_mantissa_str = f"{cf_total_flops:.1e}".split("e+")[0]
|
||||
cf_flops_exp_str = f"{int(f'{cf_total_flops:.1e}'.split('e+')[1])}"
|
||||
|
||||
# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(cf_time_days > 0, "Training time must be positive")
|
||||
check(cf_energy_kwh > 0, "Energy consumption must be positive")
|
||||
|
||||
# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
cf_num_gpus_str = f"{cf_num_gpus}"
|
||||
cf_hosts_str = f"{cf_hosts}"
|
||||
cf_cpu_tdp_per_host_w_str = f"{cf_cpu_tdp_per_host_w}"
|
||||
@@ -5151,7 +5136,7 @@ cf_flops_exp_str = TrainingCarbonFootprint.cf_flops_exp_str
|
||||
::: {.callout-notebook title="The Carbon Footprint of Training"}
|
||||
|
||||
**Scaling the Utility Bill**:
|
||||
Training large models is not just a compute challenge; it's a massive energy sink. We can quantify the environmental impact of scaling training using the **Energy Corollary** to the Iron Law:
|
||||
Training large models is not just a compute challenge; it is a massive energy sink. We can quantify the environmental impact of scaling training using the **Energy Corollary** to the Iron Law:
|
||||
|
||||
1. **Workload**: Training a 7B parameter model for 1 trillion tokens.
|
||||
2. **Compute**: ≈ `{python} cf_flops_mantissa_str` $\times$ $10^{`{python} cf_flops_exp_str`}$ FLOPs.
|
||||
@@ -5179,10 +5164,10 @@ With this vocabulary of parallelism strategies (data, model, pipeline, tensor, a
|
||||
|
||||
| **Scale** | **Typical Approach** | **Rationale** |
|
||||
|:-----------------------|:-----------------------|:------------------------------------------------|
|
||||
| **<1B params, <100GB** | Single GPU | All optimizations fit; fastest iteration |
|
||||
| **1-10B params, <1TB** | Single node (1-8 GPUs) | Model parallelism within node avoids network |
|
||||
| **<1B params, <100 GB** | Single GPU | All optimizations fit; fastest iteration |
|
||||
| **1-10B params, <1 TB** | Single node (1-8 GPUs) | Model parallelism within node avoids network |
|
||||
| **10B+ params** | Multi-node cluster | Memory requirements exceed single-node capacity |
|
||||
| **>10TB dataset** | Multi-node + streaming | I/O bandwidth requires distributed storage |
|
||||
| **>10 TB dataset** | Multi-node + streaming | I/O bandwidth requires distributed storage |
|
||||
|
||||
: **Scaling Decision Guidelines.** Model size, dataset scale, and available hardware determine when distributed training complexity is justified. Single-machine optimization provides better cost-efficiency below these thresholds. {#tbl-scaling-decision}
|
||||
|
||||
@@ -5234,7 +5219,7 @@ from mlsys.formatting import fmt, check
|
||||
class FallaciesPitfallsSetup:
|
||||
"""Quantitative values for all Fallacies and Pitfalls examples."""
|
||||
|
||||
# ┌── 1. PARAMETERS (Inputs) ──────────────────────────────────────────────
|
||||
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
|
||||
|
||||
# Fallacy 1: Model scaling without data
|
||||
fp_model_20b_params = 20 # billion parameters
|
||||
@@ -5291,12 +5276,12 @@ class FallaciesPitfallsSetup:
|
||||
fp_prefetch_time_after = 55
|
||||
fp_prefetch_reduction = int(100 * (fp_prefetch_time_before - fp_prefetch_time_after) / fp_prefetch_time_before)
|
||||
|
||||
# ┌── 2. INVARIANTS (Guardrails) ───────────────────────────────────────────
|
||||
# ┌── 2. GUARD (Invariants) ───────────────────────────────────────────
|
||||
check(fp_model_20b_total_gb > fp_model_20b_fp16_gb, "Total memory must exceed weights alone")
|
||||
check(fp_actual_speedup_max < fp_gpu_count, "Actual speedup must be less than GPU count (Amdahl)")
|
||||
check(fp_prefetch_reduction > 0, "Prefetch must reduce training time")
|
||||
|
||||
# ┌── 3. OUTPUTS (Formatting) ──────────────────────────────────────────────
|
||||
# ┌── 3. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||||
fp_model_20b_params_str = fmt(fp_model_20b_params, precision=0, commas=False)
|
||||
fp_model_7b_params_str = fmt(fp_model_7b_params, precision=0, commas=False)
|
||||
fp_model_20b_total_gb_str = fmt(fp_model_20b_total_gb, precision=0, commas=False)
|
||||
@@ -5375,7 +5360,7 @@ The systematic approach developed throughout this chapter---quantifying costs th
|
||||
|
||||
**Fallacy:** *Larger models always yield better performance.*
|
||||
|
||||
The allure of scale is powerful: if a 7B model works well, surely a 20B model works better. In practice, scaling without proportionally increasing data causes severe overfitting. A `{python} fp_model_20b_params_str`B parameter model requires approximately `{python} fp_model_20b_total_gb_str` GB memory (`{python} fp_model_20b_fp16_gb_str` GB parameters in FP16 + `{python} fp_model_20b_optim_gb_str` GB optimizer states) yet delivers *worse* accuracy than a `{python} fp_model_7b_params_str`B model when trained on datasets under `{python} fp_data_threshold_m_str`M examples. Beyond critical thresholds, doubling model size while holding data constant typically degrades validation accuracy by `{python} fp_overfit_degrade_min`--`{python} fp_overfit_degrade_max`% due to overfitting. Model capacity must match dataset size, as established in @sec-model-training-mathematical-foundations-d894. Teams that pursue scale without commensurate data budgets waste months of compute on models that underperform smaller variants.
|
||||
The allure of scale is seductive: if a 7B model works well, surely a 20B model works better. In practice, scaling without proportionally increasing data causes severe overfitting. A `{python} fp_model_20b_params_str`B parameter model requires approximately `{python} fp_model_20b_total_gb_str` GB memory (`{python} fp_model_20b_fp16_gb_str` GB parameters in FP16 + `{python} fp_model_20b_optim_gb_str` GB optimizer states) yet delivers *worse* accuracy than a `{python} fp_model_7b_params_str`B model when trained on datasets under `{python} fp_data_threshold_m_str`M examples. Beyond critical thresholds, doubling model size while holding data constant typically degrades validation accuracy by `{python} fp_overfit_degrade_min`--`{python} fp_overfit_degrade_max`% due to overfitting. Model capacity must match dataset size, as established in @sec-model-training-mathematical-foundations-d894. Teams that pursue scale without commensurate data budgets waste months of compute on models that underperform smaller variants.
|
||||
|
||||
**Pitfall:** *Assuming distributed training automatically accelerates development.*
|
||||
|
||||
@@ -5435,3 +5420,10 @@ Training produces the model artifact---a collection of billions of learned param
|
||||
```{=latex}
|
||||
\part{key:vol1_optimize}
|
||||
```
|
||||
|
||||
```{python}
|
||||
#| echo: false
|
||||
#| label: chapter-end
|
||||
from mlsys.registry import end_chapter
|
||||
end_chapter("vol1:training")
|
||||
```
|
||||
|
||||
@@ -7,4 +7,4 @@ from .deployment import Tiers
|
||||
|
||||
# Export constants and registry for legacy support
|
||||
from .constants import ureg, Q_
|
||||
from .registry import start_chapter
|
||||
from .registry import start_chapter, end_chapter
|
||||
|
||||
@@ -12,6 +12,11 @@ def start_chapter(chapter_id):
|
||||
TAPE.append({"type": "chapter_start", "chapter": chapter_id})
|
||||
|
||||
|
||||
def end_chapter(chapter_id):
|
||||
"""Close the tape for a chapter. Call from the last Python cell."""
|
||||
TAPE.append({"type": "chapter_end", "chapter": chapter_id})
|
||||
|
||||
|
||||
def record(name, value, units=None, context=None):
|
||||
"""Record a calculation step and return the value."""
|
||||
entry = {
|
||||
|
||||
Reference in New Issue
Block a user