---
title: "Glossary"
subtitle: "Definitions for every term used in the MLSYSIM documentation."
---
This page defines every technical term used across the MLSYSIM documentation.
When a term is first used on any page, it either links here or is defined inline.
Terms marked with slide links point to the relevant lecture deck for deeper coverage.
::: {.callout-tip collapse="true"}
## Slide deck key
All slide links point to the [Machine Learning Systems](https://mlsysbook.ai/slides/) lecture decks.
**Vol I** covers single-machine foundations; **Vol II** covers distributed and at-scale systems.
:::
---
## A
**AllReduce**
: A collective communication primitive in which every device contributes a local tensor and
receives the globally reduced (typically summed) result. The dominant synchronization
pattern in data-parallel training. Ring-AllReduce and tree-AllReduce are common algorithms;
performance is modeled by the *Alpha-Beta Model*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf),
[Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)
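
The semantics of AllReduce can be illustrated with a minimal pure-Python sketch. It simulates devices as plain lists and is not how any real communication library implements the primitive; the function name `all_reduce_sum` is hypothetical.

```python
def all_reduce_sum(local_tensors):
    """Simulate a sum AllReduce over a list of per-device tensors (plain lists)."""
    # Reduce: elementwise sum across all devices.
    reduced = [sum(values) for values in zip(*local_tensors)]
    # Broadcast: every device ends up with its own copy of the reduced result.
    return [list(reduced) for _ in local_tensors]


# Three hypothetical devices, each holding a 2-element local gradient.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = all_reduce_sum(local_grads)
print(synced)  # every device now holds [9.0, 12.0]
```
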
**Alpha-Beta Model** ($\alpha$-$\beta$)
: An analytical model for communication cost: $T_\text{comm} = \alpha + n\beta$,
where $\alpha$ is the per-message latency (seconds), $n$ is the message size (bytes),
and $\beta$ is the inverse bandwidth (seconds/byte). Used throughout MLSYSIM to
estimate collective communication overhead in distributed training.
*Slides:*
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf),
[Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)
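
As a rough sketch, the model can be evaluated directly. The link parameters below (5 microseconds of latency, 50 GB/s of bandwidth) are made-up illustration values, and `comm_time` is a hypothetical helper, not an MLSYSIM function. Small messages are dominated by $\alpha$, large ones by $n\beta$.

```python
def comm_time(alpha_s, n_bytes, beta_s_per_byte):
    """Alpha-beta cost: fixed per-message latency plus size times inverse bandwidth."""
    return alpha_s + n_bytes * beta_s_per_byte


# Hypothetical link: 5 microseconds of latency and 50 GB/s of bandwidth.
ALPHA_S = 5e-6
BETA_S_PER_BYTE = 1.0 / 50e9  # beta is the inverse of bandwidth

small = comm_time(ALPHA_S, 1_000, BETA_S_PER_BYTE)          # latency-dominated
large = comm_time(ALPHA_S, 1_000_000_000, BETA_S_PER_BYTE)  # bandwidth-dominated
print(f"1 KB: {small * 1e6:.2f} us, 1 GB: {large * 1e3:.3f} ms")
```
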
**Arithmetic Intensity** (AI)
: The ratio of floating-point operations to bytes of memory accessed: $I = \text{FLOPs} / \text{Bytes}$.
High arithmetic intensity means the workload reuses data extensively (compute-bound);
low arithmetic intensity means it streams data without reuse (memory-bound).
Units: FLOP/byte. Determines which side of the *Ridge Point* a workload falls on
in the *Roofline Model*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
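
A small worked example, assuming an elementwise add of two fp32 vectors with no cache reuse (an assumption chosen for illustration): each element costs one FLOP and 12 bytes of traffic, so the intensity is far below typical ridge points. The helper name is hypothetical.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """Arithmetic intensity in FLOP/byte."""
    return flops / bytes_accessed


# Elementwise add of two fp32 vectors, c[i] = a[i] + b[i], assuming no cache reuse:
# one FLOP per element, 12 bytes of traffic (two 4-byte reads plus one 4-byte write).
n = 1_000_000
vector_add_ai = arithmetic_intensity(flops=n, bytes_accessed=12 * n)
print(f"{vector_add_ai:.3f} FLOP/byte")
```
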
---
## B
**Bandwidth** (Memory Bandwidth)
: The rate at which data can be transferred between memory (DRAM/HBM) and compute units.
Measured in GB/s or TB/s. The A100, for example, provides 2 TB/s of HBM bandwidth.
Not to be confused with *network bandwidth* (inter-node communication rate) or
*bisection bandwidth* (aggregate cross-section throughput of a network fabric).
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)
**Batch Size**
: The number of inputs processed simultaneously in one forward pass.
Larger batch sizes increase *Arithmetic Intensity*, shifting workloads from
memory-bound toward compute-bound. In distributed training, the *global* batch
size equals the per-device batch size multiplied by the number of data-parallel replicas.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)
**Bottleneck**
: The hardware resource that limits performance. For a given workload-hardware pair,
either compute or memory bandwidth is the bottleneck, determined by comparing the
workload's *Arithmetic Intensity* to the hardware's *Ridge Point*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
---
## C
**CapEx** (Capital Expenditure)
: The upfront cost of purchasing hardware. In *TCO* analysis, CapEx is amortized over
the hardware's useful lifetime (typically 3--5 years).
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
**Carbon Intensity**
: The mass of CO~2~-equivalent emissions per unit of electricity consumed, measured in
gCO~2~e/kWh. Varies dramatically by region: ~20 gCO~2~e/kWh (Quebec hydro) to
~820 gCO~2~e/kWh (Poland coal). MLSYSIM uses per-region carbon intensity values
from the sustainability registry to estimate training and inference emissions.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
**Compute-Bound**
: A workload whose performance is limited by the hardware's peak FLOP/s rate rather
than memory bandwidth. Occurs when *Arithmetic Intensity* exceeds the *Ridge Point*.
Remedies include using tensor cores, upgrading to a faster accelerator, or reducing
precision. Contrast with *Memory-Bound*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
**Continuous Batching**
: A serving optimization that dynamically inserts and retires requests from a running
batch, rather than waiting for all sequences in a static batch to finish before
starting new ones. Dramatically improves GPU utilization for LLM inference, where
sequence lengths vary widely. Also called *iteration-level batching*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
**CUDA** (Compute Unified Device Architecture)
: NVIDIA's programming platform for writing GPU-accelerated programs. A "CUDA kernel"
is a function that runs in parallel across thousands of GPU threads. *Dispatch Tax*
is the per-kernel launch overhead inherent to this model.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
---
## D
**Data Parallelism** (DP)
: A distributed training strategy where the full model is replicated across $N$ devices,
each processing a different shard of the batch. Requires an *AllReduce* synchronization
step after each backward pass to average gradients. Scales well for models that fit
in a single device's memory. See also *Tensor Parallelism*, *Pipeline Parallelism*,
and *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
**Dispatch Tax**
: The constant per-operation overhead of launching a GPU kernel (typically 0.01--0.1 ms
for CUDA kernel launch). Becomes significant at small batch sizes where kernel launch
time dominates actual compute time. Captured as the additive term in the *Iron Law*.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)
---
## F
**FLOP/s** (Floating-Point Operations per Second)
: The rate at which a device can perform floating-point arithmetic. The A100 peaks at
312 TFLOP/s at FP16 via its *Tensor Cores*. Usually quoted with an SI prefix, as TFLOP/s
(tera-) or PFLOP/s (peta-). Not to be confused with *FLOPs* (a count, not a rate).
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)
**FLOPs** (Floating-Point Operations)
: A count of arithmetic operations (multiplies, adds, etc.) required to execute a single
inference or training step. A ResNet-50 inference requires ~8 GFLOPs; a GPT-3 (175B-parameter)
forward pass requires ~350 GFLOPs per token (about 2 FLOPs per parameter). Not the same as *FLOP/s* (the rate).
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf)
**Forward Pass / Backward Pass**
: In neural network training, the *forward pass* runs input data through the model to produce
a prediction. The *backward pass* (backpropagation) computes gradients---the direction
and magnitude of change needed for each parameter to reduce error. In distributed systems,
gradients must be synchronized across all devices after each backward pass via *AllReduce*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)
---
## G
**GQA** (Grouped Query Attention)
: A transformer attention variant where multiple query heads share a single key-value head,
reducing *KV-Cache* memory by a factor equal to the group size without significantly
affecting model quality. Used in Llama 3 and other modern LLMs. See also *KV-Cache*.
*Slides:*
[Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
---
## H
**HBM** (High-Bandwidth Memory)
: Stacked DRAM technology used in modern AI accelerators. Provides far higher bandwidth
than GDDR (e.g., 2 TB/s on A100, 3.35 TB/s on H100) at the cost of limited capacity
(40--80 GB per device). The bandwidth ceiling in the *Roofline Model* is set by HBM.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)
---
## I
**InfiniBand**
: A high-throughput, low-latency network fabric commonly used in GPU clusters for
distributed training. Supports *RDMA* (Remote Direct Memory Access) for zero-copy
data transfer that bypasses the CPU. NDR InfiniBand provides 400 Gb/s per port.
Contrast with *NVLink*, which carries intra-node traffic; InfiniBand carries inter-node traffic.
*Slides:*
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)
**Iron Law of ML Systems**
: The fundamental performance equation:
$$T = \max\!\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\; \frac{\text{Bytes}}{\text{BW}}\right) + \text{Dispatch\_Tax}$$
The $\max$ captures the *Roofline Model* insight that performance is limited by whichever
resource---compute or memory bandwidth---is the bottleneck. Named by analogy with the
Iron Law of processor performance in computer architecture.
*Slides:*
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
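
The equation can be sketched as a plain function. The hardware figures below are the A100 numbers quoted elsewhere in this glossary; the workload size, $\eta$, and dispatch tax are made up for illustration, and `iron_law_latency` is a hypothetical name rather than the MLSYSIM API.

```python
def iron_law_latency(flops, bytes_moved, peak_flops, bandwidth, eta=1.0, dispatch_tax=0.0):
    """Evaluate T = max(FLOPs / (Peak * eta), Bytes / BW) + Dispatch_Tax, in seconds."""
    compute_time = flops / (peak_flops * eta)
    memory_time = bytes_moved / bandwidth
    return max(compute_time, memory_time) + dispatch_tax


# Hardware figures are the A100 numbers quoted in this glossary; the workload,
# eta, and dispatch tax are illustrative assumptions.
PEAK = 312e12  # FLOP/s at FP16
BW = 2e12      # bytes/s of HBM bandwidth

t = iron_law_latency(flops=1e12, bytes_moved=1e10, peak_flops=PEAK,
                     bandwidth=BW, eta=0.5, dispatch_tax=1e-5)
print(f"{t * 1e3:.3f} ms")
```
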
**ITL** (Inter-Token Latency)
: The time to generate each successive token after the first during LLM autoregressive
decoding. Almost always *Memory-Bound*---each decode step loads the full model
weights plus the *KV-Cache*. Measured in ms/token. See also *TTFT*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
---
## K
**Knowledge Distillation**
: A model compression technique where a smaller "student" model is trained to match the
output distribution of a larger "teacher" model. Reduces model size and inference cost
while retaining much of the teacher's accuracy. See also *Quantization* and *Pruning*.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)
**KV-Cache**
: The cached Key and Value matrices from the transformer attention mechanism, retained
across decoding steps to avoid recomputation. Memory footprint grows linearly with
sequence length and batch size:
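$\text{Bytes} = 2 \times L \times B \times d \times \text{layers} \times \text{bytes\_per\_param}$,
where the factor of 2 covers keys and values, $L$ is the sequence length, $B$ is the batch size,
$d$ is the per-layer key/value width (the model hidden dimension under standard multi-head attention),
and $\text{bytes\_per\_param}$ is the storage size of one cached element.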
*GQA* reduces KV-Cache size; *PagedAttention* manages it more efficiently.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
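
A quick sizing sketch using the formula above. The model shape (32 layers, $d = 4096$, fp16 cache) is a hypothetical configuration chosen for round numbers, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(seq_len, batch, d, layers, bytes_per_elem):
    """KV-cache size: 2 (keys and values) x L x B x d x layers x bytes per element."""
    return 2 * seq_len * batch * d * layers * bytes_per_elem


# Hypothetical decoder: 32 layers, d = 4096, fp16 cache (2 bytes per element).
size = kv_cache_bytes(seq_len=4096, batch=1, d=4096, layers=32, bytes_per_elem=2)
print(size / 2**30, "GiB")  # 2.0 GiB
```
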
---
## L
**Latency**
: The wall-clock time to complete one inference or training step. In MLSYSIM, latency
is the primary output of the *Iron Law* equation. Measured in ms or $\mu$s.
Maximizing *Throughput* often conflicts with minimizing latency.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)
**LLM** (Large Language Model)
: A transformer-based model trained on large text corpora, typically with billions of
parameters. Examples: GPT-4, Llama 3, Gemini. Key serving metrics: *TTFT* and *ITL*.
Key memory bottleneck: *KV-Cache*.
*Slides:*
[Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf)
---
## M
**Memory-Bound**
: A workload whose performance is limited by the hardware's memory *Bandwidth*, not its
peak FLOP/s. Occurs when *Arithmetic Intensity* falls below the *Ridge Point*.
Remedies include lower *Precision*, *Operator Fusion*, or faster memory (e.g., HBM3).
Contrast with *Compute-Bound*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
**MFU** (Model FLOP Utilization)
: The fraction of theoretical peak FLOP/s actually achieved:
$\text{MFU} = \text{Achieved FLOP/s} / \text{Peak FLOP/s}$.
Well-optimized training achieves 30--50% MFU; poorly optimized code may fall below 10%.
MFU is the single most important efficiency metric for large-scale training runs.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
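
One way to estimate MFU from a measured step time is sketched below. The per-step FLOP count and the step time are invented for illustration; the peak is the A100 FP16 figure quoted in this glossary, and `mfu` is a hypothetical helper.

```python
def mfu(model_flops_per_step, step_time_s, peak_flops_per_s):
    """Model FLOP Utilization: achieved FLOP/s divided by peak FLOP/s."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s


# Illustrative numbers: 156 TFLOPs of model work per step, measured at 1 s per step,
# on a device with a 312 TFLOP/s peak.
value = mfu(model_flops_per_step=1.56e14, step_time_s=1.0, peak_flops_per_s=312e12)
print(f"MFU = {value:.0%}")
```
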
**Microbatch**
: A subdivision of the training batch used in *Pipeline Parallelism*. Increasing the
number of microbatches $M$ reduces the *Pipeline Bubble* fraction:
$\text{Bubble} = (P{-}1) / (P{-}1{+}M)$, where $P$ is the pipeline depth.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
**MTBF** (Mean Time Between Failures)
: The average time a component operates before failing. For a fleet of $N$ identical nodes,
$\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N$. A 1,024-node cluster with
100,000-hour node MTBF has a fleet MTBF of ~98 hours. Input to the *Young-Daly Formula*.
*Slides:*
[Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
---
## N
**NVLink**
: NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication within a server.
Provides 900 GB/s bidirectional bandwidth per GPU in DGX H100 systems. Required for
*Tensor Parallelism*, where low-latency intra-node communication is critical.
Contrast with *InfiniBand* for inter-node communication.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)
---
## O
**OpEx** (Operational Expenditure)
: The ongoing costs of running hardware: electricity, networking, cooling, labor.
Under cloud pricing, cumulative OpEx over a 3-year period can exceed *CapEx* by 2--5x.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
**Operator Fusion**
: Combining multiple small GPU kernels into a single larger one to reduce
memory transfers between operations. For example, fusing a matrix multiply followed
by an activation function avoids writing and re-reading the intermediate result
from *HBM*. A key optimization for reducing *Memory-Bound* overhead.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
---
## P
**Pipeline Bubble**
: The fraction of time a pipeline-parallel system spends idle while the pipeline fills
and drains, that is, while later stages wait for the first *Microbatch* to arrive and
earlier stages wait for the last one to finish:
$\text{Bubble} = (P{-}1) / (P{-}1{+}M)$,
where $P$ is pipeline depth and $M$ is microbatch count.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
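
The effect of the microbatch count is easy to tabulate. This sketch fixes a hypothetical pipeline depth of $P = 4$ and sweeps $M$; the bubble shrinks as $M$ grows, and a single-stage pipeline has no bubble at all.

```python
def pipeline_bubble(p, m):
    """Bubble fraction (P - 1) / (P - 1 + M) for P stages and M microbatches."""
    return (p - 1) / (p - 1 + m)


# Hypothetical 4-stage pipeline with an increasing number of microbatches.
for m in (1, 4, 16, 64):
    print(f"P=4, M={m:>2}: bubble = {pipeline_bubble(4, m):.3f}")
```
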
**Pipeline Parallelism** (PP)
: A distributed training strategy that splits the model's layers across devices,
each device processing a different "stage." Introduces a *Pipeline Bubble* of idle
time. Complementary to *Data Parallelism* and *Tensor Parallelism* in *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
**Precision**
: The numerical format used to represent weights and activations. `fp32` (32-bit float)
is the most accurate of the formats listed here; `fp16`/`bf16` (16-bit) halve memory
relative to `fp32` and can roughly double throughput on *Tensor Cores*; `int8` and `int4`
further reduce memory at the cost of accuracy.
Lower precision increases *Arithmetic Intensity* by reducing bytes per operation.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)
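
The memory effect of each format follows from its bytes per parameter. The sketch below applies that to a hypothetical 7-billion-parameter model, counting weights only (no activations, optimizer state, or *KV-Cache*).

```python
# Storage cost of one parameter in each format, in bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params, precision):
    """Memory needed to hold the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


NUM_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_bytes(NUM_PARAMS, precision) / 1e9:.1f} GB")
```
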
**Progressive Lowering**
: MLSYSIM's architectural principle: workload specifications (demand) are progressively
mapped onto hardware specifications (supply) through a chain of analytical transformations.
This is the reverse of how hardware is typically specified: it starts from the algorithm, not the chip.
**Pruning**
: A model compression technique that removes redundant weights or entire structures
(channels, attention heads) from a trained model. *Unstructured* pruning zeros out
individual weights; *structured* pruning removes whole rows/columns for hardware-friendly
speedups. See also *Quantization* and *Knowledge Distillation*.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)
**PUE** (Power Usage Effectiveness)
: $\text{PUE} = \text{Total Facility Power} / \text{IT Equipment Power}$.
A PUE of 1.0 is theoretical perfection; hyperscale datacenters achieve 1.1--1.4.
Higher PUE means more energy wasted on cooling and facility overhead. Used in MLSYSIM's
sustainability solver alongside *Carbon Intensity* and *WUE*.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
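
A common back-of-the-envelope chain scales IT energy by PUE to get facility energy, then multiplies by *Carbon Intensity* to get emissions. The sketch below assumes that chain; it is not necessarily how the MLSYSIM sustainability solver is implemented, and the 1,000 kWh job and PUE of 1.2 are made-up inputs. The two carbon-intensity values are the regional extremes quoted under *Carbon Intensity*.

```python
def facility_energy_kwh(it_energy_kwh, pue):
    """Total facility energy implied by the IT energy and the PUE."""
    return it_energy_kwh * pue


def emissions_kg(it_energy_kwh, pue, carbon_intensity_g_per_kwh):
    """CO2-equivalent emissions in kilograms (carbon intensity given in gCO2e/kWh)."""
    grams = facility_energy_kwh(it_energy_kwh, pue) * carbon_intensity_g_per_kwh
    return grams / 1000.0


# Hypothetical 1,000 kWh job in a PUE-1.2 facility, on a low- and a high-carbon grid.
for label, intensity in (("~20 gCO2e/kWh", 20.0), ("~820 gCO2e/kWh", 820.0)):
    print(f"{label}: {emissions_kg(1000.0, 1.2, intensity):.0f} kg CO2e")
```
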
---
## Q
**Quantization**
: Reducing the numerical *Precision* of model weights and/or activations (e.g., FP32 to
INT8 or INT4) to shrink memory footprint and increase throughput. *Post-Training
Quantization* (PTQ) converts a pre-trained model without retraining; *Quantization-Aware
Training* (QAT) simulates low-precision during training for higher accuracy.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)
---
## R
**Ridge Point**
: The *Arithmetic Intensity* at which a workload transitions from *Memory-Bound* to
*Compute-Bound* on a given hardware platform:
$I^* = \text{Peak FLOP/s} / \text{Memory BW}$.
For the A100 at FP16: $I^* = 312 \text{ TFLOP/s} / 2 \text{ TB/s} = 156 \text{ FLOP/byte}$.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
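
The A100 calculation above, plus the resulting classification rule, can be sketched as follows. The helper names are hypothetical, and treating a workload exactly at the ridge as memory-bound is an arbitrary tie-break.

```python
def ridge_point(peak_flops_per_s, memory_bw_bytes_per_s):
    """Ridge point I* = Peak FLOP/s / Memory BW, in FLOP/byte."""
    return peak_flops_per_s / memory_bw_bytes_per_s


def classify(arithmetic_intensity, ridge):
    """Compute-bound above the ridge point, memory-bound at or below it."""
    return "compute-bound" if arithmetic_intensity > ridge else "memory-bound"


a100_ridge = ridge_point(312e12, 2e12)
print(a100_ridge)                   # 156.0
print(classify(10.0, a100_ridge))   # memory-bound
print(classify(300.0, a100_ridge))  # compute-bound
```
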
**Roofline Model**
: A visual and analytical tool that plots hardware performance ceilings (the "roofline")
and shows where workloads sit relative to them. The sloped region is *Memory-Bound*;
the flat region is *Compute-Bound*; the inflection point is the *Ridge Point*.
Introduced by Williams, Waterman, and Patterson (2009). MLSYSIM implements a
generalized roofline via the *Iron Law*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
---
## S
**SLA** (Service Level Agreement)
: A target performance guarantee, typically specifying maximum acceptable latency and minimum
throughput. For LLM serving, common SLAs target *TTFT* < 200 ms and *ITL* < 50 ms/token.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)
**Speculative Decoding**
: An inference optimization where a small, fast "draft" model generates candidate tokens
that are then verified in parallel by the full model. Reduces *ITL* by converting
sequential autoregressive steps into a single parallel verification pass, at the cost
of occasional rejected tokens.
*Slides:*
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
**SSoT** (Single Source of Truth)
: The principle that each specification (chip peak FLOP/s, grid carbon intensity, etc.)
has exactly one authoritative location---the MLSys Zoo. All computations derive from
the Zoo, eliminating inconsistencies from stale copied values.
**Systolic Array**
: A grid of processing elements that rhythmically pass data to their neighbors, performing
a multiply-accumulate at each step. The dominant dataflow architecture in ML accelerators:
Google TPUs use systolic arrays for matrix multiplication, and NVIDIA *Tensor Cores*
implement a systolic-like pattern.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
---
## T
**TCO** (Total Cost of Ownership)
: The full cost of a system over its lifetime:
$\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}$.
Includes hardware purchase, electricity, cooling, networking, and labor. MLSYSIM's
TCO solver computes this from hardware registry specs and regional energy costs.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
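
A minimal per-year sketch of the formula, assuming straight-line amortization of CapEx over the hardware lifetime. That amortization choice, the dollar figures, and the name `annual_tco` are all illustrative assumptions rather than the MLSYSIM TCO solver.

```python
def annual_tco(capex, lifetime_years, annual_opex):
    """Yearly TCO: straight-line amortized CapEx plus one year of OpEx."""
    amortized_capex = capex / lifetime_years
    return amortized_capex + annual_opex


# Hypothetical accelerator: $30,000 purchase price, 3-year lifetime, $4,000/year to run.
print(annual_tco(capex=30_000, lifetime_years=3, annual_opex=4_000))  # 14000.0
```
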
**TDP** (Thermal Design Power)
: The maximum sustained power a chip is designed to dissipate under load, in Watts.
Relevant for datacenter cooling capacity planning. An H100 SXM5 has a TDP of 700 W.
Used in MLSYSIM to compute energy consumption and *TCO*.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)
**Tensor Core**
: A specialized hardware unit in NVIDIA GPUs designed for matrix-multiply-accumulate
operations. Achieves much higher throughput than standard CUDA cores for ML workloads.
The A100's 312 TFLOP/s peak (FP16) comes from its tensor cores, not its CUDA cores.
Functionally similar to a *Systolic Array*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
**Tensor Parallelism** (TP)
: A distributed training strategy that splits individual matrix multiplications across
devices within a node. Requires high-bandwidth intra-node connectivity (*NVLink*).
Combined with *Data Parallelism* and *Pipeline Parallelism* in *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
**3D Parallelism**
: The combination of *Data Parallelism*, *Tensor Parallelism*, and *Pipeline Parallelism*
to scale training across hundreds or thousands of GPUs. TP operates within a node
(over *NVLink*), PP across a small group of nodes, and DP across the remaining replicas.
The standard recipe for training frontier LLMs.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
**Throughput**
: The number of samples (or tokens) processed per unit time:
$\text{Throughput} = \text{Batch Size} / \text{Latency}$.
Maximizing throughput often conflicts with minimizing *Latency*---larger batches
increase throughput but also increase per-request latency.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)
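
The trade-off can be seen with two made-up operating points: the larger batch doubles throughput, but each request in it waits four times as long.

```python
def throughput(batch_size, latency_s):
    """Samples (or tokens) processed per second."""
    return batch_size / latency_s


# Illustrative operating points, not measurements.
print(throughput(8, 0.25))   # 32.0 samples/s at 250 ms per batch
print(throughput(64, 1.0))   # 64.0 samples/s at 1 s per batch
```
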
**TTFT** (Time to First Token)
: The latency from receiving a user query to generating the first output token in an LLM
serving system. Determined primarily by the *prefill* phase, which is *Compute-Bound*.
Target: <200 ms for interactive applications. See also *ITL*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
---
## U
**Utilization** ($\eta$)
: The fraction of theoretical peak FLOP/s actually achieved in practice. Typical values:
30--50% for well-optimized training, 10--30% for inference. MLSYSIM uses $\eta$ as a
parameter in the *Iron Law*; see the hardware registry for per-device defaults.
Closely related to *MFU*.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
---
## W
**WUE** (Water Usage Effectiveness)
: Liters of water consumed per kilowatt-hour of energy. Relevant for datacenters using
evaporative cooling. MLSYSIM estimates water usage as:
$\text{Water (L)} = \text{Energy (kWh)} \times \text{WUE}$.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
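
The estimate is a single multiplication. In this sketch the WUE of 1.8 L/kWh and the 1,000 kWh energy figure are placeholder inputs, not values taken from the MLSYSIM registry.

```python
def water_liters(energy_kwh, wue_l_per_kwh):
    """Water (L) = Energy (kWh) x WUE (L/kWh)."""
    return energy_kwh * wue_l_per_kwh


# Placeholder inputs: a 1,000 kWh job in a facility with a WUE of 1.8 L/kWh.
print(f"{water_liters(1000.0, 1.8):.0f} L")
```
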
---
## Y
**Young-Daly Formula**
: The optimal checkpoint interval for fault-tolerant distributed training:
$\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}$,
where $\delta$ is the time to save one checkpoint and $\text{MTBF}_\text{fleet}$ is the
fleet-level *MTBF* (mean time between failures). Named after Young (1974) and Daly (2006).
*Slides:*
[Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
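
A worked sketch that reuses the 1,024-node cluster from the *MTBF* entry and assumes a hypothetical 5-minute checkpoint save time. Under those assumptions the formula suggests checkpointing roughly every four hours. The helper names are illustrative.

```python
import math


def fleet_mtbf(node_mtbf_hours, num_nodes):
    """Fleet MTBF for N identical nodes: node MTBF divided by N."""
    return node_mtbf_hours / num_nodes


def young_daly_interval(checkpoint_time_hours, fleet_mtbf_hours):
    """Optimal checkpoint interval: sqrt(2 * delta * MTBF_fleet)."""
    return math.sqrt(2.0 * checkpoint_time_hours * fleet_mtbf_hours)


mtbf = fleet_mtbf(100_000, 1_024)        # cluster from the MTBF entry
tau = young_daly_interval(5 / 60, mtbf)  # hypothetical 5-minute checkpoint save
print(f"fleet MTBF = {mtbf:.1f} h, checkpoint every {tau:.2f} h")
```
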
---
*This glossary is updated with each MLSYSIM release. If a term is missing, please
[open an issue](https://github.com/harvard-edge/cs249r_book/issues).*