---
title: "Glossary"
subtitle: "Definitions for every term used in the MLSYSIM documentation."
---

This page defines every technical term used across the MLSYSIM documentation.
When a term is first used on any page, it either links here or is defined inline.
Where an entry lists *Slides*, the links point to the relevant lecture decks for deeper coverage.

::: {.callout-tip collapse="true"}
## Slide deck key

All slide links point to the [Machine Learning Systems](https://mlsysbook.ai/slides/) lecture decks.
**Vol I** covers single-machine foundations; **Vol II** covers distributed and at-scale systems.
:::

---

## A

**AllReduce**
: A collective communication primitive in which every device contributes a local tensor and
  receives the globally reduced (typically summed) result. The dominant synchronization
  pattern in data-parallel training. Ring-AllReduce and tree-AllReduce are common algorithms;
  performance is modeled by the *Alpha-Beta Model*.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf),
  [Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)

**Alpha-Beta Model** ($\alpha$-$\beta$)
: An analytical model for communication cost: $T_\text{comm} = \alpha + n\beta$,
  where $\alpha$ is the per-message latency (seconds), $n$ is the message size (bytes),
  and $\beta$ is the inverse bandwidth (seconds/byte). Used throughout MLSYSIM to
  estimate collective communication overhead in distributed training.
  *Slides:*
  [Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf),
  [Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)

**Arithmetic Intensity** (AI)
: The ratio of floating-point operations to bytes of memory accessed: $I = \text{FLOPs} / \text{Bytes}$.
  High arithmetic intensity means the workload reuses data extensively (compute-bound);
  low arithmetic intensity means it streams data without reuse (memory-bound).
  Units: FLOP/byte. Determines which side of the *Ridge Point* a workload falls on
  in the *Roofline Model*.
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
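
As a rough illustration of the *Alpha-Beta Model* entry above, the sketch below evaluates $T_\text{comm} = \alpha + n\beta$ for a small and a large message. The function name `comm_time` and the link parameters (5 µs latency, 50 GB/s bandwidth) are hypothetical values chosen for illustration, not MLSYSIM defaults.

```python
def comm_time(message_bytes: float, alpha_s: float, beta_s_per_byte: float) -> float:
    """Alpha-beta cost of sending one message: T = alpha + n * beta."""
    return alpha_s + message_bytes * beta_s_per_byte


# Hypothetical link (assumed values): 5 microseconds latency, 50 GB/s bandwidth.
alpha = 5e-6
beta = 1.0 / 50e9  # inverse bandwidth, in seconds per byte

small = comm_time(1e3, alpha, beta)  # 1 KB message: the latency term dominates
large = comm_time(1e9, alpha, beta)  # 1 GB message: the bandwidth term dominates

print(f"1 KB: {small * 1e6:.3f} us")
print(f"1 GB: {large * 1e3:.3f} ms")
```

For the small message nearly all of the time is $\alpha$; for the large one nearly all of it is $n\beta$, which is why collective algorithms are usually tuned differently for small and large tensors.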

---

## B

**Bandwidth** (Memory Bandwidth)
: The rate at which data can be transferred between memory (DRAM/HBM) and compute units.
  Measured in GB/s or TB/s. The A100 (80 GB), for example, provides about 2 TB/s of HBM bandwidth.
  Not to be confused with *network bandwidth* (inter-node communication rate) or
  *bisection bandwidth* (aggregate cross-section throughput of a network fabric).
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
  [Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

**Batch Size**
: The number of inputs processed simultaneously in one forward pass.
  Larger batch sizes increase *Arithmetic Intensity*, shifting workloads from
  memory-bound toward compute-bound. In distributed training, the *global* batch
  size equals the per-device batch size multiplied by the number of data-parallel replicas.
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
  [Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)

**Bottleneck**
: The hardware resource that limits performance. For a given workload-hardware pair,
  either compute or memory bandwidth is the bottleneck, determined by comparing the
  workload's *Arithmetic Intensity* to the hardware's *Ridge Point*.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
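
The claim in the *Batch Size* entry, that larger batches raise *Arithmetic Intensity*, can be checked with a back-of-the-envelope calculation for a single dense layer $y = xW$. This is a sketch under simplifying assumptions: each tensor is read or written exactly once, elements are 2 bytes (FP16), and the helper name `linear_layer_intensity` and the 4096 x 4096 layer shape are hypothetical.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """FLOP/byte for y = x @ W, assuming one read of x and W and one write of y."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per input
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


# Hypothetical 4096 x 4096 layer at increasing batch sizes.
for b in (1, 16, 256):
    print(f"batch={b:4d}  intensity={linear_layer_intensity(b, 4096, 4096):7.2f} FLOP/byte")
```

At batch size 1 the weight matrix is loaded once and used once, so intensity stays near 1 FLOP/byte; at larger batch sizes the same weights are reused across inputs and intensity grows roughly in proportion to the batch size.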

---

## C

**CapEx** (Capital Expenditure)
: The upfront cost of purchasing hardware. In *TCO* analysis, CapEx is amortized over
  the hardware's useful lifetime (typically 3--5 years).
  *Slides:*
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Carbon Intensity**
: The mass of CO~2~-equivalent emissions per unit of electricity consumed, measured in
  gCO~2~e/kWh. Varies widely by region, from roughly 20 gCO~2~e/kWh (Quebec hydro) to
  roughly 820 gCO~2~e/kWh (Poland coal). MLSYSIM uses per-region carbon intensity values
  from the sustainability registry to estimate training and inference emissions.
  *Slides:*
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Compute-Bound**
: A workload whose performance is limited by the hardware's peak FLOP/s rate rather
  than memory bandwidth. Occurs when *Arithmetic Intensity* exceeds the *Ridge Point*.
  Remedies include using tensor cores, upgrading to a faster accelerator, or reducing
  precision. Contrast with *Memory-Bound*.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**Continuous Batching**
: A serving optimization that dynamically inserts and retires requests from a running
  batch, rather than waiting for all sequences in a static batch to finish before
  starting new ones. Dramatically improves GPU utilization for LLM inference, where
  sequence lengths vary widely. Also called *iteration-level batching*.
  *Slides:*
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

**CUDA** (Compute Unified Device Architecture)
: NVIDIA's programming platform for writing GPU-accelerated programs. A "CUDA kernel"
  is a function that runs in parallel across thousands of GPU threads. *Dispatch Tax*
  is the per-kernel launch overhead inherent to this model.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

---

## D

**Data Parallelism** (DP)
: A distributed training strategy where the full model is replicated across $N$ devices,
  each processing a different shard of the batch. Requires an *AllReduce* synchronization
  step after each backward pass to average gradients. Scales well for models that fit
  in a single device's memory. See also *Tensor Parallelism*, *Pipeline Parallelism*,
  and *3D Parallelism*.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Dispatch Tax**
: The constant per-operation overhead of launching a GPU kernel (typically 0.01--0.1 ms
  for CUDA kernel launch). Becomes significant at small batch sizes where kernel launch
  time dominates actual compute time. Captured as the additive term in the *Iron Law*.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

---

## F

**FLOP/s** (Floating-Point Operations per Second)
: The rate at which a device can perform floating-point arithmetic. The A100 achieves
  312 TFLOP/s at FP16 via its *Tensor Cores*. Also written as TFLOP/s (tera-) or
  PFLOP/s (peta-). Not to be confused with *FLOPs* (a count, not a rate).
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

**FLOPs** (Floating-Point Operations)
: A count of arithmetic operations (multiplies, adds, etc.) required to execute a single
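  inference or training step. A ResNet-50 inference requires ~8 GFLOPs; a GPT-3 (175B-parameter)
  forward pass requires ~350 GFLOPs per token (about 2 FLOPs per parameter), or ~350 TFLOPs for a
  1,000-token sequence. Not the same as *FLOP/s* (the rate).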
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf)

**Forward Pass / Backward Pass**
: In neural network training, the *forward pass* runs input data through the model to produce
  a prediction. The *backward pass* (backpropagation) computes gradients---the direction
  and magnitude of change needed for each parameter to reduce error. In distributed systems,
  gradients must be synchronized across all devices after each backward pass via *AllReduce*.
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
  [Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)

---

## G

**GQA** (Grouped Query Attention)
: A transformer attention variant where multiple query heads share a single key-value head,
  reducing *KV-Cache* memory by a factor equal to the group size without significantly
  affecting model quality. Used in Llama 3 and other modern LLMs. See also *KV-Cache*.
  *Slides:*
  [Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf),
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

---

## H

**HBM** (High-Bandwidth Memory)
: Stacked DRAM technology used in modern AI accelerators. Provides far higher bandwidth
  than GDDR (e.g., 2 TB/s on A100, 3.35 TB/s on H100) at the cost of limited capacity
  (40--80 GB per device). The bandwidth ceiling in the *Roofline Model* is set by HBM.
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)

---

## I

**InfiniBand**
: A high-throughput, low-latency network fabric commonly used in GPU clusters for
  distributed training. Supports *RDMA* (Remote Direct Memory Access) for zero-copy
  data transfer that bypasses the CPU. NDR InfiniBand provides 400 Gb/s per port.
  See also *NVLink* (intra-node) vs. InfiniBand (inter-node).
  *Slides:*
  [Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

**Iron Law of ML Systems**
: The fundamental performance equation:
  $$T = \max\!\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\; \frac{\text{Bytes}}{\text{BW}}\right) + \text{Dispatch\_Tax}$$
  The $\max$ captures the *Roofline Model* insight that performance is limited by whichever
  resource---compute or memory bandwidth---is the bottleneck. Named by analogy with the
  Iron Law of processor performance in computer architecture.
  *Slides:*
  [Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf),
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

**ITL** (Inter-Token Latency)
: The time to generate each successive token after the first during LLM autoregressive
  decoding. Almost always *Memory-Bound*---each decode step loads the full model
  weights plus the *KV-Cache*. Measured in ms/token. See also *TTFT*.
  *Slides:*
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
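
The *Iron Law* equation above translates directly into a few lines of code. The following is a minimal sketch, not MLSYSIM's actual solver: the function name `iron_law_latency`, the workload sizes, and the values of $\eta$ and the dispatch tax are assumptions for illustration. The hardware figures (312 TFLOP/s, 2 TB/s) are the A100 numbers quoted elsewhere in this glossary.

```python
def iron_law_latency(flops: float, bytes_moved: float,
                     peak_flops: float, bandwidth: float,
                     eta: float = 0.5, dispatch_tax_s: float = 0.0) -> float:
    """T = max(FLOPs / (Peak * eta), Bytes / BW) + Dispatch_Tax, in seconds."""
    compute_s = flops / (peak_flops * eta)  # time if limited by compute
    memory_s = bytes_moved / bandwidth      # time if limited by memory bandwidth
    return max(compute_s, memory_s) + dispatch_tax_s


PEAK = 312e12  # A100 FP16 peak, FLOP/s
BW = 2e12      # A100 HBM bandwidth, bytes/s

# Hypothetical compute-heavy step: 8 GFLOPs over 50 MB of memory traffic.
t_compute = iron_law_latency(8e9, 50e6, PEAK, BW, eta=0.5, dispatch_tax_s=1e-5)

# Hypothetical memory-heavy step: 1 GFLOP over 1 GB of memory traffic.
t_memory = iron_law_latency(1e9, 1e9, PEAK, BW, eta=0.5, dispatch_tax_s=1e-5)

print(f"compute-heavy step: {t_compute * 1e6:.1f} us")
print(f"memory-heavy step:  {t_memory * 1e6:.1f} us")
```

In the first call the compute term wins the `max`; in the second the memory term does. The dispatch tax is added in both cases regardless of which resource is the bottleneck.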

---

## K

**Knowledge Distillation**
: A model compression technique where a smaller "student" model is trained to match the
  output distribution of a larger "teacher" model. Reduces model size and inference cost
  while retaining much of the teacher's accuracy. See also *Quantization* and *Pruning*.
  *Slides:*
  [Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**KV-Cache**
: The cached Key and Value matrices from the transformer attention mechanism, retained
  across decoding steps to avoid recomputation. Memory footprint grows linearly with
  sequence length and batch size:
  $\text{Bytes} = 2 \times L \times B \times d \times \text{layers} \times \text{bytes\_per\_param}$,
  where the factor of 2 accounts for keys and values, $L$ is the sequence length, $B$ is the
  batch size, and $d$ is the model's hidden dimension.
  *GQA* reduces KV-Cache size; *PagedAttention* manages it more efficiently.
  *Slides:*
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
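
To make the *KV-Cache* formula concrete, the sketch below evaluates it for a hypothetical model (hidden dimension 4096, 32 layers, FP16) and then applies the *GQA* reduction described under G. The function name and the `gqa_group_size` parameter are illustrative assumptions, and $d$ is taken to be the model's hidden dimension.

```python
def kv_cache_bytes(seq_len: int, batch: int, d_model: int, layers: int,
                   bytes_per_param: int = 2, gqa_group_size: int = 1) -> int:
    """Bytes = 2 * L * B * d * layers * bytes_per_param, divided by the GQA group size."""
    full = 2 * seq_len * batch * d_model * layers * bytes_per_param  # 2 = keys + values
    return full // gqa_group_size  # query heads in a group share one KV head


GIB = 2 ** 30

# Hypothetical serving scenario: 4096-token context, batch of 8 requests.
mha = kv_cache_bytes(seq_len=4096, batch=8, d_model=4096, layers=32)
gqa = kv_cache_bytes(seq_len=4096, batch=8, d_model=4096, layers=32, gqa_group_size=4)

print(f"multi-head attention: {mha / GIB:.1f} GiB")
print(f"GQA, group size 4:    {gqa / GIB:.1f} GiB")
```

Doubling either the sequence length or the batch size doubles the result, which is the linear growth the entry describes.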

---

## L

**Latency**
: The wall-clock time to complete one inference or training step. In MLSYSIM, latency
  is the primary output of the *Iron Law* equation. Measured in ms or $\mu$s.
  Maximizing *Throughput* often conflicts with minimizing latency.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)

**LLM** (Large Language Model)
: A transformer-based model trained on large text corpora, typically with billions of
  parameters. Examples: GPT-4, Llama 3, Gemini. Key serving metrics: *TTFT* and *ITL*.
  Key memory bottleneck: *KV-Cache*.
  *Slides:*
  [Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf)

---

## M

**Memory-Bound**
: A workload whose performance is limited by the hardware's memory *Bandwidth*, not its
  peak FLOP/s. Occurs when *Arithmetic Intensity* falls below the *Ridge Point*.
  Remedies include lower *Precision*, *Operator Fusion*, or faster memory (e.g., HBM3).
  Contrast with *Compute-Bound*.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**MFU** (Model FLOP Utilization)
: The fraction of theoretical peak FLOP/s actually achieved:
  $\text{MFU} = \text{Achieved FLOP/s} / \text{Peak FLOP/s}$.
  Well-optimized training achieves 30--50% MFU; poorly optimized code may fall below 10%.
  MFU is the single most important efficiency metric for large-scale training runs.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**Microbatch**
: A subdivision of the training batch used in *Pipeline Parallelism*. Increasing the
  number of microbatches $M$ reduces the *Pipeline Bubble* fraction:
  $\text{Bubble} = (P{-}1) / (P{-}1{+}M)$, where $P$ is the pipeline depth.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**MTBF** (Mean Time Between Failures)
: The average time a component operates before failing. For a fleet of $N$ identical nodes,
  $\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N$. A 1,024-node cluster with
  100,000-hour node MTBF has a fleet MTBF of ~98 hours. Input to the *Young-Daly Formula*.
  *Slides:*
  [Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
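
The bubble formula in the *Microbatch* entry is easy to tabulate. The sketch below does so for a hypothetical 4-stage pipeline; the function name `bubble_fraction` is an assumption, not an MLSYSIM API.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Pipeline bubble fraction: (P - 1) / (P - 1 + M)."""
    return (stages - 1) / (stages - 1 + microbatches)


# Hypothetical 4-stage pipeline with an increasing number of microbatches.
for m in (1, 4, 32):
    print(f"P=4, M={m:2d}: bubble = {bubble_fraction(4, m):.1%}")
```

With a single microbatch three quarters of the time is idle; with 32 microbatches the idle fraction drops below 10%.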

---

## N

**NVLink**
: NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication within a server.
  Provides 900 GB/s bidirectional bandwidth per GPU in DGX H100 systems. Required for
  *Tensor Parallelism*, where low-latency intra-node communication is critical.
  Contrast with *InfiniBand* for inter-node communication.
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
  [Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

---

## O

**OpEx** (Operational Expenditure)
: The ongoing costs of running hardware: electricity, networking, cooling, labor.
  In cloud pricing, cumulative OpEx over a 3-year period typically exceeds *CapEx* by 2--5x.
  *Slides:*
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Operator Fusion**
: Combining multiple small GPU kernels into a single larger one to reduce
  memory transfers between operations. For example, fusing a matrix multiply followed
  by an activation function avoids writing and re-reading the intermediate result
  from *HBM*. A key optimization for reducing *Memory-Bound* overhead.
  *Slides:*
  [Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## P

**Pipeline Bubble**
: The fraction of time a pipeline-parallel system spends idle while the pipeline fills
  and drains, that is, while stages wait for the first *Microbatch* to reach them or
  have already finished the last one:
  $\text{Bubble} = (P{-}1) / (P{-}1{+}M)$,
  where $P$ is pipeline depth and $M$ is microbatch count.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Pipeline Parallelism** (PP)
: A distributed training strategy that splits the model's layers across devices,
  each device processing a different "stage." Introduces a *Pipeline Bubble* of idle
  time. Complementary to *Data Parallelism* and *Tensor Parallelism* in *3D Parallelism*.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Precision**
: The numerical format used to represent weights and activations. `fp32` (32-bit float)
  is most accurate; `fp16`/`bf16` (16-bit) halves memory and doubles throughput
  on *Tensor Cores*; `int8` and `int4` further reduce memory at the cost of accuracy.
  Lower precision increases *Arithmetic Intensity* by reducing bytes per operation.
  *Slides:*
  [Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**Progressive Lowering**
: MLSYSIM's architectural principle: workload specifications (demand) are progressively
  mapped onto hardware specifications (supply) through a chain of analytical transformations.
  The reverse of how hardware is typically specified---starting from the algorithm, not the chip.

**Pruning**
: A model compression technique that removes redundant weights or entire structures
  (channels, attention heads) from a trained model. *Unstructured* pruning zeros out
  individual weights; *structured* pruning removes whole rows/columns for hardware-friendly
  speedups. See also *Quantization* and *Knowledge Distillation*.
  *Slides:*
  [Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**PUE** (Power Usage Effectiveness)
: $\text{PUE} = \text{Total Facility Power} / \text{IT Equipment Power}$.
  A PUE of 1.0 is theoretical perfection; hyperscale datacenters achieve 1.1--1.4.
  Higher PUE means more energy wasted on cooling and facility overhead. Used in MLSYSIM's
  sustainability solver alongside *Carbon Intensity* and *WUE*.
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
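
The memory effect described in the *Precision* entry can be estimated by multiplying the parameter count by the bytes per element. The sketch below does this for a hypothetical 7-billion-parameter model. It counts weights only (no activations, optimizer state, or *KV-Cache*), and the table and function names are illustrative assumptions.

```python
# Storage size of one element in each format, in bytes.
BYTES_PER_ELEMENT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_ELEMENT[precision] / 1e9


params = 7e9  # hypothetical 7-billion-parameter model
for fmt in BYTES_PER_ELEMENT:
    print(f"{fmt}: {weight_footprint_gb(params, fmt):.1f} GB")
```

Because a memory-bound workload's runtime scales with bytes moved, halving the bytes per element also roughly halves the time spent streaming weights.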

---

## Q

**Quantization**
: Reducing the numerical *Precision* of model weights and/or activations (e.g., FP32 to
  INT8 or INT4) to shrink memory footprint and increase throughput. *Post-Training
  Quantization* (PTQ) converts a pre-trained model without retraining; *Quantization-Aware
  Training* (QAT) simulates low-precision during training for higher accuracy.
  *Slides:*
  [Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

---

## R

**Ridge Point**
: The *Arithmetic Intensity* at which a workload transitions from *Memory-Bound* to
  *Compute-Bound* on a given hardware platform:
  $I^* = \text{Peak FLOP/s} / \text{Memory BW}$.
  For the A100 at FP16: $I^* = 312 \text{ TFLOP/s} / 2 \text{ TB/s} = 156 \text{ FLOP/byte}$.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**Roofline Model**
: A visual and analytical tool that plots hardware performance ceilings (the "roofline")
  and shows where workloads sit relative to them. The sloped region is *Memory-Bound*;
  the flat region is *Compute-Bound*; the inflection point is the *Ridge Point*.
  Introduced by Williams, Waterman, and Patterson (2009). MLSYSIM implements a
  generalized roofline via the *Iron Law*.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
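
The *Ridge Point* and *Roofline Model* entries combine into a short classification routine. The sketch below uses the A100 FP16 figures quoted above (312 TFLOP/s, 2 TB/s); the function names are hypothetical, and treating exact equality with the ridge point as "balanced" is a choice made here for illustration.

```python
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """I* = Peak FLOP/s / Memory BW, in FLOP/byte."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * mem_bw)


def classify(intensity: float, peak_flops: float, mem_bw: float) -> str:
    """Label a workload by which side of the ridge point it falls on."""
    ridge = ridge_point(peak_flops, mem_bw)
    if intensity > ridge:
        return "compute-bound"
    if intensity < ridge:
        return "memory-bound"
    return "balanced"


A100_PEAK = 312e12  # FLOP/s at FP16
A100_BW = 2e12      # bytes/s of HBM bandwidth

print(ridge_point(A100_PEAK, A100_BW))     # ridge point in FLOP/byte
print(classify(10.0, A100_PEAK, A100_BW))  # low-intensity workload
print(classify(300.0, A100_PEAK, A100_BW)) # high-intensity workload
```

A workload at 10 FLOP/byte can reach at most `10 * 2e12 = 20` TFLOP/s on this device, far below the 312 TFLOP/s compute roof.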

---

## S

**SLA** (Service Level Agreement)
: A target performance guarantee, typically specifying maximum acceptable latency and minimum
  throughput. For LLM serving, common SLAs target *TTFT* < 200 ms and *ITL* < 50 ms/token.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)

**Speculative Decoding**
: An inference optimization where a small, fast "draft" model generates candidate tokens
  that are then verified in parallel by the full model. Reduces *ITL* by converting
  sequential autoregressive steps into a single parallel verification pass, at the cost
  of occasional rejected tokens.
  *Slides:*
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**SSoT** (Single Source of Truth)
: The principle that each specification (chip peak FLOP/s, grid carbon intensity, etc.)
  has exactly one authoritative location---the MLSys Zoo. All computations derive from
  the Zoo, eliminating inconsistencies from stale copied values.

**Systolic Array**
: A grid of processing elements that rhythmically pass data to their neighbors, performing
  a multiply-accumulate at each step. The dominant dataflow architecture in ML accelerators:
  Google TPUs use systolic arrays for matrix multiplication, and NVIDIA *Tensor Cores*
  implement a similar pattern.
  *Slides:*
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

---

## T

**TCO** (Total Cost of Ownership)
: The full cost of a system over its lifetime:
  $\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}$.
  Includes hardware purchase, electricity, cooling, networking, and labor. MLSYSIM's
  TCO solver computes this from hardware registry specs and regional energy costs.
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**TDP** (Thermal Design Power)
: The maximum sustained power a chip is designed to dissipate under load, in Watts.
  Relevant for datacenter cooling capacity planning. An H100 SXM5 has a TDP of 700 W.
  Used in MLSYSIM to compute energy consumption and *TCO*.
  *Slides:*
  [Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)

**Tensor Core**
: A specialized hardware unit in NVIDIA GPUs designed for matrix-multiply-accumulate
  operations. Achieves much higher throughput than standard CUDA cores for ML workloads.
  The A100's 312 TFLOP/s peak (FP16) comes from its tensor cores, not its CUDA cores.
  Functionally similar to a *Systolic Array*.
  *Slides:*
  [Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
  [Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

**Tensor Parallelism** (TP)
: A distributed training strategy that splits individual matrix multiplications across
  devices within a node. Requires high-bandwidth intra-node connectivity (*NVLink*).
  Combined with *Data Parallelism* and *Pipeline Parallelism* in *3D Parallelism*.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**3D Parallelism**
: The combination of *Data Parallelism*, *Tensor Parallelism*, and *Pipeline Parallelism*
  to scale training across hundreds or thousands of GPUs. TP operates within a node
  (over *NVLink*), PP across a small group of nodes, and DP across the remaining replicas.
  The standard recipe for training frontier LLMs.
  *Slides:*
  [Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Throughput**
: The number of samples (or tokens) processed per unit time:
  $\text{Throughput} = \text{Batch Size} / \text{Latency}$.
  Maximizing throughput often conflicts with minimizing *Latency*---larger batches
  increase throughput but also increase per-request latency.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

**TTFT** (Time to First Token)
: The latency from receiving a user query to generating the first output token in an LLM
  serving system. Determined primarily by the *prefill* phase, which is *Compute-Bound*.
  Target: <200 ms for interactive applications. See also *ITL*.
  *Slides:*
  [Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
  [Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
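
As a sketch of the *TCO* decomposition above, the code below computes a per-year cost for a single accelerator. It is not MLSYSIM's TCO solver. It assumes straight-line amortization, a chip drawing its full *TDP* around the clock, and facility overhead captured by *PUE*. The purchase price, electricity price, and PUE are hypothetical; the 700 W TDP is the H100 SXM5 figure from the *TDP* entry.

```python
HOURS_PER_YEAR = 24 * 365


def annual_tco_usd(capex_usd: float, lifetime_years: float, tdp_watts: float,
                   pue: float, usd_per_kwh: float, other_opex_usd: float = 0.0) -> float:
    """Annual TCO = amortized CapEx + OpEx (electricity plus any other running costs)."""
    amortized_capex = capex_usd / lifetime_years               # straight-line amortization
    energy_kwh = tdp_watts / 1000.0 * HOURS_PER_YEAR * pue     # assumes constant full load
    opex = energy_kwh * usd_per_kwh + other_opex_usd
    return amortized_capex + opex


# Hypothetical inputs: $30,000 device, 4-year lifetime, PUE 1.2, $0.10 per kWh.
cost = annual_tco_usd(capex_usd=30_000, lifetime_years=4, tdp_watts=700,
                      pue=1.2, usd_per_kwh=0.10)
print(f"annual TCO: ${cost:,.2f}")
```

Under these assumed inputs the amortized hardware cost dominates; cheaper hardware, longer lifetimes, or higher energy prices shift the balance toward *OpEx*.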

---

## U

**Utilization** ($\eta$)
: The fraction of theoretical peak FLOP/s actually achieved in practice. Typical values:
  30--50% for well-optimized training, 10--30% for inference. MLSYSIM uses $\eta$ as a
  parameter in the *Iron Law*; see the hardware registry for per-device defaults.
  Closely related to *MFU*.
  *Slides:*
  [Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
  [Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## W

**WUE** (Water Usage Effectiveness)
: Liters of water consumed per kilowatt-hour of energy. Relevant for datacenters using
  evaporative cooling. MLSYSIM estimates water usage as:
  $\text{Water (L)} = \text{Energy (kWh)} \times \text{WUE}$.
  *Slides:*
  [Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
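
The three sustainability terms in this glossary (*PUE*, *Carbon Intensity*, and *WUE*) can be chained into a simple estimate. The sketch below is one plausible way to combine them, not a description of MLSYSIM's sustainability solver: it scales IT energy by PUE before applying carbon intensity, and applies the water formula from the *WUE* entry to whatever energy figure is passed in. The WUE value of 1.8 L/kWh and the 1,000 kWh job size are hypothetical.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy implied by PUE = total facility power / IT power."""
    return it_energy_kwh * pue


def carbon_kg(it_energy_kwh: float, pue: float, carbon_intensity_g_per_kwh: float) -> float:
    """Emissions in kg CO2e: facility energy times grid carbon intensity."""
    return facility_energy_kwh(it_energy_kwh, pue) * carbon_intensity_g_per_kwh / 1000.0


def water_liters(energy_kwh: float, wue_l_per_kwh: float) -> float:
    """Water (L) = Energy (kWh) * WUE."""
    return energy_kwh * wue_l_per_kwh


it_energy = 1000.0  # hypothetical job consuming 1,000 kWh of IT energy
low = carbon_kg(it_energy, pue=1.2, carbon_intensity_g_per_kwh=20.0)    # hydro-heavy grid
high = carbon_kg(it_energy, pue=1.2, carbon_intensity_g_per_kwh=820.0)  # coal-heavy grid
water = water_liters(it_energy, wue_l_per_kwh=1.8)                      # assumed WUE

print(f"emissions: {low:.0f} kg to {high:.0f} kg CO2e, water: {water:.0f} L")
```

The same job differs by a factor of 41 in emissions between the two grids, which is why region selection appears as a first-order lever in sustainability analysis.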

---

## Y

**Young-Daly Formula**
: The optimal checkpoint interval for fault-tolerant distributed training:
  $\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}$,
  where $\delta$ is the time to save one checkpoint and $\text{MTBF}_\text{fleet}$ is the
  fleet-level *MTBF*. Named after Young (1974) and Daly (2006).
  *Slides:*
  [Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
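
The *Young-Daly Formula* takes the fleet-level *MTBF* as input, so the two calculations chain naturally. The sketch below reuses the 1,024-node, 100,000-hour example from the *MTBF* entry and assumes a hypothetical 5-minute checkpoint save time; the function names are illustrative.

```python
import math


def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF_fleet = MTBF_node / N for N identical nodes."""
    return node_mtbf_hours / num_nodes


def young_daly_interval_hours(checkpoint_hours: float, mtbf_hours: float) -> float:
    """tau_opt = sqrt(2 * delta * MTBF_fleet)."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


mtbf = fleet_mtbf_hours(100_000, 1024)  # example from the MTBF entry
delta = 5 / 60                          # hypothetical 5-minute checkpoint, in hours
tau = young_daly_interval_hours(delta, mtbf)

print(f"fleet MTBF: {mtbf:.1f} h, checkpoint every {tau:.2f} h")
```

Because the interval grows only with the square root of the fleet MTBF, quadrupling the cluster size halves the recommended time between checkpoints.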

---

*This glossary is updated with each MLSYSIM release. If a term is missing, please
[open an issue](https://github.com/harvard-edge/cs249r_book/issues).*