Files
cs249r_book/mlsysim/docs/laws-explained.md
Vijay Janapa Reddi 5edb1df253 fix(mlsysim): round 10 fixes + laws-explained doc
- Fix goodput formula: use steady-state overhead model (checkpoint/interval +
  recovery/MTBF) instead of prob_fail-based formula that approaches zero at scale
- Fix speculative decode: draft model uses its own KV cache, not target's
- Clarify hierarchical AllReduce: document NCCL reduce-scatter design choice
- Add docs/laws-explained.md: plain-English explanation of all 22 walls + Iron Law
2026-04-01 18:15:49 -04:00

301 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# The 22 Laws of ML Systems — Plain English
> Every equation in mlsysim answers one question. Here is what each one says,
> why it matters, and how to explain it to anyone in 30 seconds.
---
## Domain 1: Node — What Can One Chip Do?
### Wall 1: The Compute Wall
**Equation:** `Time = Operations / (Peak_FLOPS × Efficiency)`
**Plain English:** How long does it take to do all the math? Take the total number
of multiply-adds your model needs, divide by how fast your chip can do math
(adjusted for real-world inefficiency), and that's your compute time.
**30-second version:** "If your model needs 16 billion multiplications and your GPU
can do 989 trillion per second at 50% efficiency, the math takes 0.03 milliseconds."
**Why it matters:** This is the *speed limit* for any workload. No software optimization
can make your model run faster than the hardware can crunch numbers.
---
### Wall 2: The Memory Wall
**Equation:** `Time = Weight_Bytes / Memory_Bandwidth`
**Plain English:** How long does it take to load the model from memory? Your model's
weights sit in HBM (the GPU's memory). To use them, the GPU must read them through
a pipe with a fixed width (bandwidth). Bigger model = longer loading time.
**30-second version:** "A 16 GB model on a GPU with 3.35 TB/s bandwidth takes 4.8 ms
just to load the weights — before any computation even starts."
**Why it matters:** For most LLM inference at batch size 1, this is THE bottleneck.
The GPU finishes the math long before it finishes reading the weights. This is why
memory bandwidth matters more than FLOPS for serving chatbots.
---
### Wall 3: The Software Wall
**Equation:** `MFU = Achieved_FLOPS / Peak_FLOPS`
**Plain English:** How much of the chip's power are you actually using? MFU (Model
FLOPs Utilization) measures the gap between what the hardware *could* do and what
your software *actually* achieves. Low MFU means wasted silicon.
**30-second version:** "If your GPU can do 989 TFLOPS but your training only achieves
450 TFLOPS, your MFU is 45%. The other 55% is lost to kernel launches, memory stalls,
and framework overhead."
**Why it matters:** A 50% MFU means you're paying for twice the hardware you're using.
Improving MFU is often cheaper than buying more GPUs.
---
### Wall 4: The Serving Wall
**Equation:** `TTFT = Prefill_FLOPs / Peak; ITL = Weights / Bandwidth`
**Plain English:** LLM serving has two phases with completely different bottlenecks.
*Prefill* (processing your prompt) is compute-bound — it's doing tons of math.
*Decode* (generating each token) is memory-bound — it's loading weights for every
single token. These two phases need different optimizations.
**30-second version:** "Prefill is like reading a book fast (CPU-intensive). Decode is
like looking up one word at a time in a dictionary (memory-intensive). That's why your
first token takes 50ms but each subsequent token takes 5ms."
---
### Wall 5: The Batching Wall
**Equation:** `KV_cache = 2 × Layers × Heads × Dim × SeqLen × Batch × BytesPerElem`
**Plain English:** Each active request in an LLM server needs its own "memory" of the
conversation (the KV cache). This memory grows with sequence length and eats into the
GPU's finite HBM. More concurrent requests = more KV cache = less room for new requests.
**30-second version:** "Each Llama-3-8B request at 4K context needs ~500 MB of KV cache.
An 80 GB GPU can only serve ~100 concurrent requests before running out of memory."
---
### Wall 6: The Streaming Wall
**Equation:** `Time_per_layer = max(Weight_Injection_Time, Compute_Time)`
**Plain English:** Wafer-scale chips (like Cerebras) flip the normal GPU bottleneck.
Instead of weights sitting in memory and being read by compute, weights are *streamed*
from external memory nodes while activations live on-chip in SRAM. The bottleneck shifts
from memory bandwidth to the injection interconnect speed.
---
### Wall 7: The Tail Latency Wall
**Equation:** Erlang C queueing model (M/M/c)
**Plain English:** When your servers are busy, new requests have to wait in line. The
waiting time grows *non-linearly* — at 80% utilization, the P99 latency might be 2x
the average. At 95% utilization, it can be 10x. This is why you can't run servers at
100% — the tail latency destroys your SLA.
**30-second version:** "It's like a highway. At 70% capacity, traffic flows. At 95%
capacity, everyone's stuck. Servers work the same way."
---
## Domain 2: Data — Can You Feed the Beast?
### Wall 8: The Ingestion Wall
**Equation:** `Utilization = Data_Demand / Storage_Supply`
**Plain English:** Your GPU is hungry. Can your storage feed it fast enough? If the
GPU consumes data faster than the disk can supply it, the GPU sits idle waiting for
food. This ratio tells you if your data pipeline is the bottleneck.
---
### Wall 9: The Transformation Wall
**Equation:** `Time = Batch × Sequence / CPU_Throughput`
**Plain English:** Before data reaches the GPU, CPUs must decode images, tokenize text,
and apply augmentations. If these CPU operations can't keep up with GPU consumption,
the GPU starves. This is why training pipelines often need 10-100x more CPU cores
than you'd expect.
---
### Wall 10: The Locality Wall
**Equation:** `Effective_BW = Link_BW × Bisection_Fraction / Oversubscription`
**Plain English:** Not all network bandwidth is created equal. A fat-tree network
gives full bisection bandwidth (any node can talk to any other at full speed). A ring
or torus does not — distant nodes communicate through intermediate hops, reducing
effective bandwidth. The topology determines how much of the raw link speed you
actually get for collective communication.
---
## Domain 3: Algorithm — How Much Compute Do You Need?
### Wall 11: The Complexity Wall (Chinchilla Scaling)
**Equation:** `Compute = 6 × Parameters × Tokens; Optimal_Params = √(Budget/120)`
**Plain English:** There's a sweet spot between model size and training data. The
Chinchilla scaling law says: for a fixed compute budget, the optimal model size is
proportional to the square root of your budget, and you should train on 20 tokens
per parameter. Train a too-big model on too-little data = waste. Train a too-small
model on too-much data = also waste.
**30-second version:** "With $10M of compute, you should train a ~90B parameter model
on ~1.8T tokens. Going bigger without more data just wastes money."
---
### Wall 12: The Reasoning Wall [Emerging]
**Equation:** `Time = Steps × Time_per_Step`
**Plain English:** Chain-of-thought reasoning makes models think longer before answering.
Each reasoning step costs compute. More steps = better answers but higher cost. This is
inference-time scaling — spending more compute at serving time instead of training time.
---
### Wall 13: The Fidelity Wall (Compression)
**Equation:** `Compression = 32/bits (quantization); Compression = 1/(1-sparsity) (pruning)`
**Plain English:** You can shrink a model by using fewer bits (quantization: FP32→INT8 = 4x
smaller) or by removing weights (pruning: 50% sparse = 2x smaller). But shrinking costs
accuracy. The question is always: how much accuracy can you afford to lose?
**Key insight:** Storage shrinks for all compression, but *inference speedup* depends on
the method. Unstructured pruning saves storage but gives zero speedup on GPU GEMM. Only
structured pruning and N:M sparsity accelerate actual hardware execution.
---
## Domain 4: Fleet — What Happens at Scale?
### Wall 14: The Communication Wall (AllReduce)
**Equation:** `Time = 2(N-1)/N × Message/Bandwidth + 2(N-1) × Latency`
**Plain English:** Distributed training means every GPU must share its gradients with
every other GPU after each training step. This "AllReduce" operation moves data through
the network. The time depends on how much data (gradient size) and how fast the network
(bandwidth). As you add more GPUs, the communication cost grows, eventually dominating
over the compute savings.
**30-second version:** "1 GB of gradients across 8 GPUs at 50 GB/s NVLink takes 35ms.
That's time your GPUs are waiting instead of computing."
---
### Wall 15: The Fragility Wall (Reliability)
**Equation:** `Cluster_MTBF = Component_MTBF / N_components`
**Plain English:** If one GPU fails every 50,000 hours, a 1000-GPU cluster has a failure
every 50 hours. At 10,000 GPUs, you get a failure every 5 hours. This is why large training
runs need checkpointing — without it, a single GPU failure wastes days of compute.
**30-second version:** "At 1000 GPUs, something breaks every 2 days. At 100,000 GPUs,
something breaks every 30 minutes."
---
### Wall 16: The Multi-tenant Wall (Queueing)
**Equation:** `Wait_Time = Utilization / [2 × Service_Rate × (1 - Utilization)]`
**Plain English:** When multiple teams share a cluster, jobs wait in line. The wait time
grows non-linearly with utilization — at 80% cluster utilization, the average wait is
2x the job duration. This is why shared research clusters feel slow even when they're
"only" 80% busy.
---
## Domain 5: Operations — Is It Worth It?
### Wall 17: The Capital Wall (TCO)
**Equation:** `TCO = CapEx + OpEx_energy + OpEx_maintenance`
**Plain English:** The total cost of running ML infrastructure is hardware purchase
(CapEx, amortized over 3-5 years) plus electricity plus maintenance. A 1024-GPU H100
cluster costs ~$30M in hardware. Electricity is surprisingly small (~10% of total).
The dominant cost lever is utilization — idle GPUs cost the same as busy ones.
---
### Wall 18: The Sustainability Wall
**Equation:** `Carbon = Energy × PUE × Carbon_Intensity`
**Plain English:** Every GPU-hour converts electricity into carbon emissions. How much
depends on three things: how much energy (determined by power and time), how efficiently
the datacenter uses it (PUE — a 1.4 PUE wastes 40% on cooling), and how dirty the grid
is (Quebec hydro = 20 gCO₂/kWh vs Poland coal = 820 gCO₂/kWh).
**30-second version:** "Training the same model in Quebec produces 41x less carbon than
training it in Poland. Geography is the biggest lever for sustainable AI."
---
### Wall 19: The Checkpoint Wall
**Equation:** `MFU_penalty = Write_Time / Checkpoint_Interval`
**Plain English:** Long training runs must periodically save their state to disk in case
of failure. This I/O burst interrupts training and reduces effective utilization. The
optimal checkpoint frequency balances the cost of saving too often (I/O overhead) against
the cost of saving too rarely (wasted compute when failures hit).
---
### Wall 20: The Safety Wall (Privacy)
**Equation:** `Slowdown ∝ 1/ε (DP-SGD)`
**Plain English:** Training with differential privacy (adding noise to protect individual
data points) makes training slower. Stronger privacy guarantees (smaller ε) mean more
noise, which means more training steps to converge. Privacy is not free — it's a compute
tax.
---
## Domain 6: Meta-Analysis — Cross-Cutting Tools
### Wall 21: Sensitivity Analysis
**Equation:** `∂Time/∂parameter` (partial derivatives)
**Plain English:** Which constraint is actually binding? Sensitivity analysis perturbs
each parameter (bandwidth, FLOPS, network speed) by a small amount and measures how
much the total time changes. The parameter with the largest derivative is your
bottleneck — that's where to invest next.
---
### Wall 22: Synthesis (Inverse Roofline)
**Equation:** `Required_BW = Weights / Target_Latency`
**Plain English:** Given a latency target (e.g., "50ms per token"), what hardware specs
do you need? This flips the Roofline model backwards — instead of predicting performance
from hardware, it derives the minimum hardware specs from a performance requirement.
**30-second version:** "If you need 50ms decode latency for a 140 GB model, you need
at least 2.8 TB/s of memory bandwidth. That's more than one A100 — you need tensor
parallelism across two H100s."
---
## The Iron Law (The Master Equation)
**Equation:** `Time = FLOPs / (N × Peak × MFU × η_scaling × Goodput)`
**Plain English:** This one equation governs all of ML systems. Training time equals the
total compute divided by how much useful compute you can actually deliver. Each term in
the denominator represents a different loss:
| Term | What it means | What reduces it |
|------|--------------|----------------|
| N | Number of devices | Budget |
| Peak | Raw hardware speed | GPU generation |
| MFU | Software efficiency | Kernel optimization, FlashAttention |
| η_scaling | Communication overhead | Network bandwidth, gradient compression |
| Goodput | Time lost to failures | Checkpointing, fault tolerance |
**Every wall in the taxonomy maps to one of these five terms.** That's the whole framework.