---
title: "Glossary"
subtitle: "Definitions for every term used in the MLSYSIM documentation."
---

This page defines every technical term used across the MLSYSIM documentation.
When a term is first used on any page, it either links here or is defined inline.
Terms marked with slide links point to the relevant lecture deck for deeper coverage.

::: {.callout-tip collapse="true"}
## Slide deck key

All slide links point to the [Machine Learning Systems](https://mlsysbook.ai/slides/) lecture decks.
**Vol I** covers single-machine foundations; **Vol II** covers distributed and at-scale systems.
:::

---

## A

**AllReduce**
: A collective communication primitive in which every device contributes a local tensor and
receives the globally reduced (typically summed) result. The dominant synchronization
pattern in data-parallel training. Ring-AllReduce and tree-AllReduce are common algorithms;
performance is modeled by the *Alpha-Beta Model*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf),
[Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)

**Alpha-Beta Model** ($\alpha$-$\beta$)
: An analytical model for communication cost: $T_\text{comm} = \alpha + n\beta$,
where $\alpha$ is the per-message latency (seconds), $n$ is the message size (bytes),
and $\beta$ is the inverse bandwidth (seconds/byte). Used throughout MLSYSIM to
estimate collective communication overhead in distributed training.
*Slides:*
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf),
[Vol II Ch 6 -- Collective Communication](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf)
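
The Alpha-Beta cost formula can be evaluated directly. The following is a minimal sketch, not MLSYSIM's implementation: the function name `alpha_beta_time` and the link parameters (5 microseconds of latency, 50 GB/s of bandwidth) are illustrative assumptions.

```python
def alpha_beta_time(alpha_s: float, n_bytes: float, beta_s_per_byte: float) -> float:
    """Estimated time to send one message of n_bytes: T = alpha + n * beta."""
    return alpha_s + n_bytes * beta_s_per_byte


# Assumed link parameters, chosen only for illustration.
alpha = 5e-6          # per-message latency in seconds
beta = 1 / 50e9       # inverse bandwidth in seconds per byte (50 GB/s link)

small = alpha_beta_time(alpha, 1_000, beta)            # 1 KB message
large = alpha_beta_time(alpha, 1_000_000_000, beta)    # 1 GB message
print(f"1 KB: {small * 1e6:.2f} us, 1 GB: {large * 1e3:.2f} ms")
```

For the small message the $\alpha$ term dominates; for the large message the $n\beta$ term does.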

**Arithmetic Intensity** (AI)
: The ratio of floating-point operations to bytes of memory accessed: $I = \text{FLOPs} / \text{Bytes}$.
High arithmetic intensity means the workload reuses data extensively (compute-bound);
low arithmetic intensity means it streams data without reuse (memory-bound).
Units: FLOP/byte. Determines which side of the *Ridge Point* a workload falls on
in the *Roofline Model*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
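
As a rough illustration, the arithmetic intensity of a dense matrix multiply can be estimated under the simplifying assumption that each operand and the result cross the memory interface exactly once. The helper name `matmul_arithmetic_intensity` and the 2-byte (FP16) element size are assumptions made for this sketch.

```python
def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n], assuming ideal data reuse."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved


# Batch of 1 (matrix-vector product): almost no reuse, so intensity is low.
gemv_intensity = matmul_arithmetic_intensity(1, 4096, 4096)
# Large square matmul: each loaded value is reused many times.
gemm_intensity = matmul_arithmetic_intensity(4096, 4096, 4096)
print(f"GEMV: {gemv_intensity:.2f} FLOP/byte, GEMM: {gemm_intensity:.2f} FLOP/byte")
```

Growing the first dimension from 1 to 4096 plays the role of a larger batch: the weights are loaded once but used for many more operations.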

---

## B

**Bandwidth** (Memory Bandwidth)
: The rate at which data can be transferred between memory (DRAM/HBM) and compute units.
Measured in GB/s or TB/s. The A100, for example, provides 2 TB/s of HBM bandwidth.
Not to be confused with *network bandwidth* (inter-node communication rate) or
*bisection bandwidth* (aggregate cross-section throughput of a network fabric).
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

**Batch Size**
: The number of inputs processed simultaneously in one forward pass.
Larger batch sizes increase *Arithmetic Intensity*, shifting workloads from
memory-bound toward compute-bound. In distributed training, the *global* batch
size equals the per-device batch size multiplied by the number of data-parallel replicas.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)

**Bottleneck**
: The hardware resource that limits performance. For a given workload-hardware pair,
either compute or memory bandwidth is the bottleneck, determined by comparing the
workload's *Arithmetic Intensity* to the hardware's *Ridge Point*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## C

**CapEx** (Capital Expenditure)
: The upfront cost of purchasing hardware. In *TCO* analysis, CapEx is amortized over
the hardware's useful lifetime (typically 3--5 years).
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Carbon Intensity**
: The mass of CO~2~-equivalent emissions per unit of electricity consumed, measured in
gCO~2~e/kWh. Varies dramatically by region, from about 20 gCO~2~e/kWh (Quebec hydro) to
about 820 gCO~2~e/kWh (Poland coal). MLSYSIM uses per-region carbon intensity values
from the sustainability registry to estimate training and inference emissions.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Compute-Bound**
: A workload whose performance is limited by the hardware's peak FLOP/s rate rather
than memory bandwidth. Occurs when *Arithmetic Intensity* exceeds the *Ridge Point*.
Remedies include using tensor cores, upgrading to a faster accelerator, or reducing
precision. Contrast with *Memory-Bound*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**Continuous Batching**
: A serving optimization that dynamically inserts and retires requests from a running
batch, rather than waiting for all sequences in a static batch to finish before
starting new ones. Dramatically improves GPU utilization for LLM inference, where
sequence lengths vary widely. Also called *iteration-level batching*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

**CUDA** (Compute Unified Device Architecture)
: NVIDIA's programming platform for writing GPU-accelerated programs. A "CUDA kernel"
is a function that runs in parallel across thousands of GPU threads. *Dispatch Tax*
is the per-kernel launch overhead inherent to this model.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

---

## D

**Data Parallelism** (DP)
: A distributed training strategy where the full model is replicated across $N$ devices,
each processing a different shard of the batch. Requires an *AllReduce* synchronization
step after each backward pass to average gradients. Scales well for models that fit
in a single device's memory. See also *Tensor Parallelism*, *Pipeline Parallelism*,
and *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Dispatch Tax**
: The constant per-operation overhead of launching a GPU kernel (typically 0.01--0.1 ms
for CUDA kernel launch). Becomes significant at small batch sizes where kernel launch
time dominates actual compute time. Captured as the additive term in the *Iron Law*.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

---

## F

**FLOP/s** (Floating-Point Operations per Second)
: The rate at which a device can perform floating-point arithmetic. The A100 achieves
312 TFLOP/s at FP16 via its *Tensor Cores*. Also written as TFLOP/s (tera-) or
PFLOP/s (peta-). Not to be confused with *FLOPs* (a count, not a rate).
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

**FLOPs** (Floating-Point Operations)
: A count of arithmetic operations (multiplies, adds, etc.) required to execute a single
inference or training step. A ResNet-50 inference requires ~8 GFLOPs; a GPT-3 (175B-parameter)
forward pass requires ~350 GFLOPs per token (about 2 FLOPs per parameter), or ~350 TFLOPs
for a 1,000-token sequence. Not the same as *FLOP/s* (the rate).
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf)

**Forward Pass / Backward Pass**
: In neural network training, the *forward pass* runs input data through the model to produce
a prediction. The *backward pass* (backpropagation) computes gradients---the direction
and magnitude of change needed for each parameter to reduce error. In distributed systems,
gradients must be synchronized across all devices after each backward pass via *AllReduce*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf)

---

## G

**GQA** (Grouped Query Attention)
: A transformer attention variant where multiple query heads share a single key-value head,
reducing *KV-Cache* memory by a factor equal to the group size without significantly
affecting model quality. Used in Llama 3 and other modern LLMs. See also *KV-Cache*.
*Slides:*
[Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

---

## H

**HBM** (High-Bandwidth Memory)
: Stacked DRAM technology used in modern AI accelerators. Provides far higher bandwidth
than GDDR (e.g., 2 TB/s on A100, 3.35 TB/s on H100) at the cost of limited capacity
(40--80 GB per device). The bandwidth ceiling in the *Roofline Model* is set by HBM.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)

---

## I

**InfiniBand**
: A high-throughput, low-latency network fabric commonly used in GPU clusters for
distributed training. Supports *RDMA* (Remote Direct Memory Access) for zero-copy
data transfer that bypasses the CPU. NDR InfiniBand provides 400 Gb/s per port.
Contrast with *NVLink*, which handles intra-node communication; InfiniBand handles
inter-node communication.
*Slides:*
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

**Iron Law of ML Systems**
: The fundamental performance equation:
$$T = \max\!\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\; \frac{\text{Bytes}}{\text{BW}}\right) + \text{Dispatch\_Tax}$$
Here Peak is the device's peak FLOP/s, $\eta$ is the achieved *Utilization*, BW is the
memory *Bandwidth*, and the additive term is the *Dispatch Tax*.
The $\max$ captures the *Roofline Model* insight that performance is limited by whichever
resource---compute or memory bandwidth---is the bottleneck. Named by analogy with the
Iron Law of processor performance in computer architecture.
*Slides:*
[Vol I Ch 8 -- Model Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)
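
The Iron Law translates almost line for line into code. This is a minimal sketch rather than MLSYSIM's solver: the function name `iron_law_latency` and its parameter names are hypothetical, the hardware figures are the A100 numbers quoted in this glossary, and the workload size, $\eta = 0.5$, and the 20-microsecond dispatch tax are made-up illustrative values.

```python
def iron_law_latency(
    flops: float,
    bytes_moved: float,
    peak_flops_per_s: float,
    mem_bw_bytes_per_s: float,
    eta: float,
    dispatch_tax_s: float = 0.0,
) -> float:
    """T = max(FLOPs / (Peak * eta), Bytes / BW) + Dispatch_Tax, in seconds."""
    compute_time = flops / (peak_flops_per_s * eta)   # time if compute is the limit
    memory_time = bytes_moved / mem_bw_bytes_per_s    # time if memory bandwidth is the limit
    return max(compute_time, memory_time) + dispatch_tax_s


latency = iron_law_latency(
    flops=1e12,                 # assumed workload size
    bytes_moved=1e9,            # assumed memory traffic
    peak_flops_per_s=312e12,    # A100 FP16 peak quoted in this glossary
    mem_bw_bytes_per_s=2e12,    # A100 HBM bandwidth quoted in this glossary
    eta=0.5,                    # assumed utilization
    dispatch_tax_s=20e-6,       # assumed launch overhead
)
print(f"estimated latency: {latency * 1e3:.3f} ms")
```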

**ITL** (Inter-Token Latency)
: The time to generate each successive token after the first during LLM autoregressive
decoding. Almost always *Memory-Bound*---each decode step loads the full model
weights plus the *KV-Cache*. Measured in ms/token. See also *TTFT*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

---

## K

**Knowledge Distillation**
: A model compression technique where a smaller "student" model is trained to match the
output distribution of a larger "teacher" model. Reduces model size and inference cost
while retaining much of the teacher's accuracy. See also *Quantization* and *Pruning*.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**KV-Cache**
: The cached Key and Value matrices from the transformer attention mechanism, retained
across decoding steps to avoid recomputation. Memory footprint grows linearly with
sequence length and batch size:
$\text{Bytes} = 2 \times L \times B \times d \times \text{layers} \times \text{bytes\_per\_param}$,
where $L$ is the sequence length, $B$ is the batch size, $d$ is the key/value width per layer
(equal to the hidden size under standard multi-head attention), and the leading 2 counts
keys and values.
*GQA* reduces KV-Cache size; *PagedAttention* manages it more efficiently.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)
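
A short sketch of the KV-Cache footprint formula follows. The function name `kv_cache_bytes` and the model shape (32 layers, key/value width 4096, FP16 elements) are hypothetical, chosen only to give round numbers; the grouped-query variant assumes a group size of 4, which shrinks the key/value width by the same factor.

```python
def kv_cache_bytes(
    seq_len: int,
    batch_size: int,
    kv_width: int,
    num_layers: int,
    bytes_per_element: int = 2,
) -> int:
    """Bytes = 2 * L * B * d * layers * bytes_per_element (the 2 counts keys and values)."""
    return 2 * seq_len * batch_size * kv_width * num_layers * bytes_per_element


GIB = 1024 ** 3

# Hypothetical model: 32 layers, key/value width 4096, FP16, one 4096-token sequence.
mha_bytes = kv_cache_bytes(seq_len=4096, batch_size=1, kv_width=4096, num_layers=32)
# Same model with grouped-query attention, group size 4 (assumed): width drops to 1024.
gqa_bytes = kv_cache_bytes(seq_len=4096, batch_size=1, kv_width=1024, num_layers=32)
print(f"MHA: {mha_bytes / GIB:.2f} GiB, GQA: {gqa_bytes / GIB:.2f} GiB")
```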

---

## L

**Latency**
: The wall-clock time to complete one inference or training step. In MLSYSIM, latency
is the primary output of the *Iron Law* equation. Measured in ms or $\mu$s.
Maximizing *Throughput* often conflicts with minimizing latency.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)

**LLM** (Large Language Model)
: A transformer-based model trained on large text corpora, typically with billions of
parameters. Examples: GPT-4, Llama 3, Gemini. Key serving metrics: *TTFT* and *ITL*.
Key memory bottleneck: *KV-Cache*.
*Slides:*
[Vol I Ch 6 -- Network Architectures](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_06_nn_architectures.pdf)

---

## M

**Memory-Bound**
: A workload whose performance is limited by the hardware's memory *Bandwidth*, not its
peak FLOP/s. Occurs when *Arithmetic Intensity* falls below the *Ridge Point*.
Remedies include lower *Precision*, *Operator Fusion*, or faster memory (e.g., HBM3).
Contrast with *Compute-Bound*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**MFU** (Model FLOP Utilization)
: The fraction of theoretical peak FLOP/s actually achieved:
$\text{MFU} = \text{Achieved FLOP/s} / \text{Peak FLOP/s}$.
Well-optimized training achieves 30--50% MFU; poorly optimized code may fall below 10%.
MFU is the single most important efficiency metric for large-scale training runs.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
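
In practice the achieved FLOP/s is usually derived from a measured step time. The sketch below assumes that approach; the function name `model_flop_utilization`, the per-step FLOP count, and the 10-second step time are illustrative assumptions, while the 312 TFLOP/s peak is the A100 FP16 figure quoted in this glossary. For a multi-device run, the peak would be the aggregate across all devices.

```python
def model_flop_utilization(
    flops_per_step: float,
    step_time_s: float,
    peak_flops_per_s: float,
) -> float:
    """MFU = achieved FLOP/s / peak FLOP/s, with achieved = FLOPs per step / step time."""
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s


mfu = model_flop_utilization(
    flops_per_step=1.248e15,   # assumed model FLOPs per training step
    step_time_s=10.0,          # assumed measured step time
    peak_flops_per_s=312e12,   # single A100 at FP16
)
print(f"MFU: {mfu:.0%}")
```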

**Microbatch**
: A subdivision of the training batch used in *Pipeline Parallelism*. Increasing the
number of microbatches $M$ reduces the *Pipeline Bubble* fraction:
$\text{Bubble} = (P{-}1) / (P{-}1{+}M)$, where $P$ is the pipeline depth.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**MTBF** (Mean Time Between Failures)
: The average time a component operates before failing. For a fleet of $N$ identical nodes,
$\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N$. A 1,024-node cluster with
100,000-hour node MTBF has a fleet MTBF of ~98 hours. Input to the *Young-Daly Formula*.
*Slides:*
[Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
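
The fleet MTBF example can be reproduced in a few lines. The function name `fleet_mtbf_hours` is hypothetical; the inputs are the 1,024-node and 100,000-hour figures from the MTBF entry.

```python
def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a fleet of identical nodes: MTBF_node / N."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


fleet = fleet_mtbf_hours(node_mtbf_hours=100_000, num_nodes=1024)
print(f"fleet MTBF: {fleet:.1f} hours")
```

Doubling the node count halves the fleet MTBF, which is why failures that are rare on one machine become routine at cluster scale.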

---

## N

**NVLink**
: NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication within a server.
Provides 900 GB/s bidirectional bandwidth per GPU in DGX H100 systems. Required for
*Tensor Parallelism*, where low-latency intra-node communication is critical.
Contrast with *InfiniBand* for inter-node communication.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 3 -- Network Fabrics](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf)

---

## O

**OpEx** (Operational Expenditure)
: The ongoing costs of running hardware: electricity, networking, cooling, labor.
In cloud pricing, OpEx over a 3-year period typically exceeds *CapEx* by 2--5x.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**Operator Fusion**
: Combining multiple small GPU kernels into a single larger one to reduce
memory transfers between operations. For example, fusing a matrix multiply followed
by an activation function avoids writing and re-reading the intermediate result
from *HBM*. A key optimization for reducing *Memory-Bound* overhead.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## P

**Pipeline Bubble**
: The fraction of time a pipeline-parallel system spends idle while the pipeline fills
and drains, that is, while later stages wait for the first *Microbatch* to arrive and
earlier stages have already finished their last one:
$\text{Bubble} = (P{-}1) / (P{-}1{+}M)$,
where $P$ is pipeline depth and $M$ is microbatch count.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)
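
To see how quickly extra microbatches shrink the bubble, the formula can be tabulated for a fixed pipeline depth. The function name `pipeline_bubble_fraction` and the chosen depth of 4 stages are assumptions for illustration.

```python
def pipeline_bubble_fraction(pipeline_depth: int, num_microbatches: int) -> float:
    """Bubble = (P - 1) / (P - 1 + M) for P pipeline stages and M microbatches."""
    if pipeline_depth < 1 or num_microbatches < 1:
        raise ValueError("pipeline_depth and num_microbatches must be at least 1")
    return (pipeline_depth - 1) / (pipeline_depth - 1 + num_microbatches)


# Assumed 4-stage pipeline; sweep the microbatch count.
for m in (1, 4, 16, 64):
    print(f"P=4, M={m:>2}: bubble = {pipeline_bubble_fraction(4, m):.1%}")
```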

**Pipeline Parallelism** (PP)
: A distributed training strategy that splits the model's layers across devices,
each device processing a different "stage." Introduces a *Pipeline Bubble* of idle
time. Complementary to *Data Parallelism* and *Tensor Parallelism* in *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Precision**
: The numerical format used to represent weights and activations. `fp32` (32-bit float)
is most accurate; `fp16` and `bf16` (16-bit) halve memory and double throughput
on *Tensor Cores*; `int8` and `int4` further reduce memory at the cost of accuracy.
Lower precision increases *Arithmetic Intensity* by reducing bytes per operation.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**Progressive Lowering**
: MLSYSIM's architectural principle: workload specifications (demand) are progressively
mapped onto hardware specifications (supply) through a chain of analytical transformations.
The reverse of how hardware is typically specified---starting from the algorithm, not the chip.

**Pruning**
: A model compression technique that removes redundant weights or entire structures
(channels, attention heads) from a trained model. *Unstructured* pruning zeros out
individual weights; *structured* pruning removes whole rows/columns for hardware-friendly
speedups. See also *Quantization* and *Knowledge Distillation*.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

**PUE** (Power Usage Effectiveness)
: $\text{PUE} = \text{Total Facility Power} / \text{IT Equipment Power}$.
A PUE of 1.0 is theoretical perfection; hyperscale datacenters achieve 1.1--1.4.
Higher PUE means more energy wasted on cooling and facility overhead. Used in MLSYSIM's
sustainability solver alongside *Carbon Intensity* and *WUE*.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)
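
One plausible way to combine PUE with *Carbon Intensity* is to scale IT energy up to facility energy and then multiply by the grid's carbon intensity. This glossary does not show how MLSYSIM's sustainability solver chains these quantities, so the sketch below is an assumption; the function names, the 1,000 kWh job, and the PUE of 1.2 are illustrative, while the two carbon-intensity values are the regional figures quoted in the Carbon Intensity entry.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy implied by PUE = facility power / IT power."""
    return it_energy_kwh * pue


def emissions_kg_co2e(facility_kwh: float, carbon_intensity_g_per_kwh: float) -> float:
    """Emissions in kg CO2e from facility energy and grid carbon intensity (gCO2e/kWh)."""
    return facility_kwh * carbon_intensity_g_per_kwh / 1000.0


# Assumed job: 1,000 kWh of IT energy in a datacenter with PUE 1.2.
facility_kwh = facility_energy_kwh(it_energy_kwh=1000.0, pue=1.2)
low_carbon = emissions_kg_co2e(facility_kwh, 20.0)     # hydro-heavy grid
high_carbon = emissions_kg_co2e(facility_kwh, 820.0)   # coal-heavy grid
print(f"{facility_kwh:.0f} kWh -> {low_carbon:.0f} kg vs {high_carbon:.0f} kg CO2e")
```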

---

## Q

**Quantization**
: Reducing the numerical *Precision* of model weights and/or activations (e.g., FP32 to
INT8 or INT4) to shrink memory footprint and increase throughput. *Post-Training
Quantization* (PTQ) converts a pre-trained model without retraining; *Quantization-Aware
Training* (QAT) simulates low-precision during training for higher accuracy.
*Slides:*
[Vol I Ch 10 -- Model Compression](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf)

---

## R

**Ridge Point**
: The *Arithmetic Intensity* at which a workload transitions from *Memory-Bound* to
*Compute-Bound* on a given hardware platform:
$I^* = \text{Peak FLOP/s} / \text{Memory BW}$.
For the A100 at FP16: $I^* = 312 \text{ TFLOP/s} / 2 \text{ TB/s} = 156 \text{ FLOP/byte}$.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)
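
The ridge point gives a simple test for which regime a workload is in. The following sketch uses hypothetical function names (`ridge_point`, `classify_workload`) and the A100 figures from the Ridge Point entry; assigning the exact tie to memory-bound is an arbitrary choice made here.

```python
def ridge_point(peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """I* = peak FLOP/s / memory bandwidth, in FLOP/byte."""
    return peak_flops_per_s / mem_bw_bytes_per_s


def classify_workload(
    arithmetic_intensity: float,
    peak_flops_per_s: float,
    mem_bw_bytes_per_s: float,
) -> str:
    """Compare a workload's arithmetic intensity against the hardware's ridge point."""
    # Strictly above the ridge is compute-bound; the tie goes to memory-bound (arbitrary).
    if arithmetic_intensity > ridge_point(peak_flops_per_s, mem_bw_bytes_per_s):
        return "compute-bound"
    return "memory-bound"


A100_PEAK = 312e12   # FLOP/s at FP16
A100_BW = 2e12       # bytes/s of HBM bandwidth

a100_ridge = ridge_point(A100_PEAK, A100_BW)
print(f"A100 ridge point: {a100_ridge:.0f} FLOP/byte")
print(classify_workload(1.0, A100_PEAK, A100_BW))      # low-reuse workload
print(classify_workload(1000.0, A100_PEAK, A100_BW))   # high-reuse workload
```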

**Roofline Model**
: A visual and analytical tool that plots hardware performance ceilings (the "roofline")
and shows where workloads sit relative to them. The sloped region is *Memory-Bound*;
the flat region is *Compute-Bound*; the inflection point is the *Ridge Point*.
Introduced by Williams, Waterman, and Patterson (2009). MLSYSIM implements a
generalized roofline via the *Iron Law*.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## S

**SLA** (Service Level Agreement)
: A target performance guarantee, typically specifying maximum acceptable latency and minimum
throughput. For LLM serving, common SLAs target *TTFT* < 200 ms and *ITL* < 50 ms/token.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf)

**Speculative Decoding**
: An inference optimization where a small, fast "draft" model generates candidate tokens
that are then verified in parallel by the full model. Reduces *ITL* by converting
sequential autoregressive steps into a single parallel verification pass, at the cost
of occasional rejected tokens.
*Slides:*
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

**SSoT** (Single Source of Truth)
: The principle that each specification (chip peak FLOP/s, grid carbon intensity, etc.)
has exactly one authoritative location---the MLSys Zoo. All computations derive from
the Zoo, eliminating inconsistencies from stale copied values.

**Systolic Array**
: A grid of processing elements that rhythmically pass data to their neighbors, performing
a multiply-accumulate at each step. The dominant dataflow architecture in ML accelerators:
Google TPUs use systolic arrays for matrix multiplication, and NVIDIA *Tensor Cores*
implement a similar systolic-like pattern.
*Slides:*
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

---

## T

**TCO** (Total Cost of Ownership)
: The full cost of a system over its lifetime:
$\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}$.
Includes hardware purchase, electricity, cooling, networking, and labor. MLSYSIM's
TCO solver computes this from hardware registry specs and regional energy costs.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf),
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

**TDP** (Thermal Design Power)
: The maximum sustained power a chip is designed to dissipate under load, in Watts.
Relevant for datacenter cooling capacity planning. An H100 SXM5 has a TDP of 700 W.
Used in MLSYSIM to compute energy consumption and *TCO*.
*Slides:*
[Vol II Ch 2 -- Compute Infrastructure](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf)

**Tensor Core**
: A specialized hardware unit in NVIDIA GPUs designed for matrix-multiply-accumulate
operations. Achieves much higher throughput than standard CUDA cores for ML workloads.
The A100's 312 TFLOP/s peak (FP16) comes from its tensor cores, not its CUDA cores.
Functionally similar to a *Systolic Array*.
*Slides:*
[Vol I Ch 5 -- Neural Network Computation](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf),
[Vol I Ch 11 -- Hardware Acceleration](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf)

**Tensor Parallelism** (TP)
: A distributed training strategy that splits individual matrix multiplications across
devices within a node. Requires high-bandwidth intra-node connectivity (*NVLink*).
Combined with *Data Parallelism* and *Pipeline Parallelism* in *3D Parallelism*.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**3D Parallelism**
: The combination of *Data Parallelism*, *Tensor Parallelism*, and *Pipeline Parallelism*
to scale training across hundreds or thousands of GPUs. TP operates within a node
(over *NVLink*), PP across a small group of nodes, and DP across the remaining replicas.
The standard recipe for training frontier LLMs.
*Slides:*
[Vol II Ch 5 -- Distributed Training](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf)

**Throughput**
: The number of samples (or tokens) processed per unit time:
$\text{Throughput} = \text{Batch Size} / \text{Latency}$.
Maximizing throughput often conflicts with minimizing *Latency*---larger batches
increase throughput but also increase per-request latency.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf)

**TTFT** (Time to First Token)
: The latency from receiving a user query to generating the first output token in an LLM
serving system. Determined primarily by the *prefill* phase, which is *Compute-Bound*.
Target: <200 ms for interactive applications. See also *ITL*.
*Slides:*
[Vol I Ch 13 -- Model Serving](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf),
[Vol II Ch 9 -- Inference at Scale](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf)

---

## U

**Utilization** ($\eta$)
: The fraction of theoretical peak FLOP/s actually achieved in practice. Typical values:
30--50% for well-optimized training, 10--30% for inference. MLSYSIM uses $\eta$ as a
parameter in the *Iron Law*; see the hardware registry for per-device defaults.
Closely related to *MFU*.
*Slides:*
[Vol I Ch 12 -- Benchmarking](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf),
[Vol II Ch 10 -- Performance Engineering](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf)

---

## W

**WUE** (Water Usage Effectiveness)
: Liters of water consumed per kilowatt-hour of energy. Relevant for datacenters using
evaporative cooling. MLSYSIM estimates water usage as:
$\text{Water (L)} = \text{Energy (kWh)} \times \text{WUE}$.
*Slides:*
[Vol II Ch 15 -- Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf)

---

## Y

**Young-Daly Formula**
: The optimal checkpoint interval for fault-tolerant distributed training:
$\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}$,
where $\delta$ is the time to save one checkpoint and *MTBF* is the mean time between
failures of the fleet. Named after Young (1974) and Daly (2006).
*Slides:*
[Vol II Ch 7 -- Fault Tolerance](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf)
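
The Young-Daly interval is a one-line computation once the checkpoint time and fleet MTBF are in the same unit. In this sketch the function name `young_daly_interval` and the 5-minute checkpoint time are assumptions; the fleet MTBF reuses the 1,024-node, 100,000-hour example from the MTBF entry.

```python
import math


def young_daly_interval(checkpoint_time: float, fleet_mtbf: float) -> float:
    """tau_opt = sqrt(2 * delta * MTBF_fleet); both inputs must share one time unit."""
    return math.sqrt(2.0 * checkpoint_time * fleet_mtbf)


checkpoint_hours = 5 / 60            # assumed: one checkpoint takes 5 minutes to save
fleet_mtbf_h = 100_000 / 1024        # fleet MTBF in hours, from the MTBF entry
interval_h = young_daly_interval(checkpoint_hours, fleet_mtbf_h)
print(f"checkpoint roughly every {interval_h:.2f} hours")
```

Because the interval grows with the square root of the fleet MTBF, quadrupling the cluster size only halves the recommended time between checkpoints.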

---

*This glossary is updated with each MLSYSIM release. If a term is missing, please
[open an issue](https://github.com/harvard-edge/cs249r_book/issues).*