Wave 4 editorial content across 20 modules + new glossary back matter:
1. Module opener hooks (20 new 2-3 sentence paragraphs between the
chapter heading and Module Info callout). Every hook LEADS with
the systems angle (memory, bandwidth, arithmetic intensity,
cache, HBM, roofline, KV cache, hardware utilization, etc.) and
connects back to the ML story. Reinforces that this is a lab
guide for ML systems, not an ML-theory textbook.
2. Code-listing captions on substantive code blocks (roughly >10
lines, defines a class/function/algorithm). Populates Quarto's
List of Listings front matter. Combined across F1/F2/L/O
subagent waves: roughly 60 listings now carry
'**Listing N.M — Brief description**' captions.
3. Figure alt-text audit across 20 module diagrams. Most already
carried objective specific alt-text; a handful were rewritten
for precision.
4. Glossary in back matter (tinytorch/quarto/glossary.qmd + registered
in pdf/_quarto.yml). 90 alphabetical entries spanning tensor /
memory / autograd / training / architecture / optimization
terms. One-sentence definitions. Module cross-references where
the term is central. Lab-guide voice, not dictionary.
5. Style discipline: no em-dashes in prose (caption templates
'— Description' are the only exception, required by parser).
All agent outputs and the hand-revised hooks audited for em-dash
use.
6. SVG trailing-newline hygiene: 8 SVGs touched by the Gemini style
audit had lost their trailing newline. Restored per the SVG
file-hygiene rule.
---
title: "Glossary"
---

Alphabetical reference for the key technical terms introduced across the TinyTorch Lab Guide, framed in the voice the modules themselves use. Where a term is central to a specific module, that module is cited as the place to return to for depth.
## A
**Acceleration.** The practice of replacing slow Python loops with vectorized NumPy, cache-aware memory layouts, and BLAS-backed kernels so a computation runs closer to the hardware's peak throughput (Module 17).
**Activation Function.** An element-wise nonlinearity such as ReLU, Sigmoid, Tanh, or GELU applied between linear layers; without it, a stack of layers collapses algebraically into a single linear transform (Module 02).
**Activation Pinning.** The systems consequence of reverse-mode autograd: every forward activation must stay resident in memory until the backward pass consumes it, which sets a hard floor on training VRAM (Module 06).
**Adam.** An adaptive optimizer that maintains per-parameter first and second moment estimates of the gradient, trading extra optimizer state for faster, more robust convergence (Module 07).
**AdamW.** Adam with decoupled weight decay, where the L2 penalty is applied directly to the parameter rather than folded into the gradient; the default optimizer for modern transformer training (Module 07).
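
To make the decoupling concrete, here is a minimal NumPy sketch of one update step; the function and hyperparameter names are illustrative, not TinyTorch's API.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad**2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                 # bias correction, t >= 1
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # the Adam step
    param = param - lr * wd * param            # decoupled decay: the W in AdamW
    return param, m, v
```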
**AllReduce.** A collective communication primitive that sums tensors across devices and returns the result to all participants; required by synchronous data-parallel training (and by global-norm gradient clipping across shards), and often the step that bottlenecks distributed pipelines.
**Arithmetic Intensity.** The ratio of floating-point operations performed per byte of memory moved; the coordinate along the roofline model's x-axis that determines whether a kernel is memory-bound or compute-bound (Module 17).
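
A back-of-the-envelope comparison, assuming ideal reuse so each matrix element crosses the memory bus exactly once:

```python
n = 4096                      # square FP32 matmul: C = A @ B
flops = 2 * n**3              # one multiply plus one add per inner-loop step
bytes_moved = 3 * n**2 * 4    # read A, read B, write C, 4 bytes per element
print(flops / bytes_moved)    # n/6 ~ 683 FLOPs/byte: firmly compute-bound

# A vector add z = x + y does 1 FLOP per 12 bytes moved:
print(1 / 12)                 # ~0.08 FLOPs/byte: hopelessly memory-bound
```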
**Attention.** A mechanism that computes a weighted combination of value vectors using similarities between a query and a set of keys, enabling a model to look at every position in a sequence in parallel (Module 12).
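
A minimal single-head NumPy sketch, with `(seq, d)` inputs and no batch dimension or mask:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V                               # weighted blend of values
```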
**Autograd.** The automatic differentiation engine that records a computational graph during the forward pass and traverses it in reverse to produce gradients with respect to every parameter (Module 06).
**Autoregressive Generation.** Producing one token at a time, feeding each output back in as the next input; the decoding pattern that makes the KV cache a necessity rather than a nicety (Module 13).
## B
**Backward Pass.** The traversal of the computational graph from loss to inputs that applies the chain rule to accumulate gradients into every parameter tensor (Module 06).
**Batch.** A group of examples processed together so a single matrix multiply can amortize memory-bandwidth and kernel-launch overhead across many samples (Module 05).
**Batch Normalization.** A layer that normalizes activations using batch statistics during training and running statistics during inference, stabilizing deep networks at the cost of a train/eval mode distinction (Module 09).
**Benchmarking.** The disciplined measurement of latency, throughput, and memory under controlled conditions, with warmup, multiple runs, and variance reporting, so that a reported speedup is reproducible (Module 19).
**BF16.** Brain Floating Point 16, a 16-bit format that keeps FP32's 8-bit exponent range but truncates the mantissa to 7 bits, trading precision for memory bandwidth during training (Module 15).
**BLAS.** Basic Linear Algebra Subprograms, the decades-old library interface that underlies every production matmul; its Level 3 routines are tiled and cache-aware so dense linear algebra approaches peak FLOPS (Module 17).
**Broadcasting.** The right-to-left shape-alignment rule that lets a `(768,)` bias add into a `(32, 512, 768)` activation tensor without materializing the tiled copy, saving memory proportional to the broadcast dimensions (Module 01).
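
The entry's shapes, runnable:

```python
import numpy as np

x = np.zeros((32, 512, 768), dtype=np.float32)  # (batch, seq, features)
b = np.ones(768, dtype=np.float32)              # one bias per feature
y = x + b   # (32, 512, 768) + (768,) aligns right-to-left
# No (32, 512, 768) copy of b is materialized; NumPy strides over the
# same 768 values for every (batch, seq) position.
```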
**Byte Pair Encoding (BPE).** A subword tokenization algorithm that greedily merges the most frequent adjacent pairs, compressing text into a fixed vocabulary that stays robust to rare words (Module 10).
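
A sketch of a single merge step on raw characters; real tokenizers also record the merge table, work per word, and handle byte-level fallbacks.

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the most frequent adjacent pair into one symbol."""
    pairs = Counter(zip(tokens, tokens[1:]))
    a, b = max(pairs, key=pairs.get)           # greediest pair wins
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            out.append(a + b)                  # the new merged symbol
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(bpe_merge_step(list("low lower lowest")))  # ('l', 'o') merges into 'lo'
```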
## C
**Cache Tiling.** Breaking a matmul into blocks sized to fit in L1 or L2 cache so each loaded tile is reused many times before eviction, turning an O(n³) operation from memory-bound back into compute-bound (Module 01).
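
A blocked-loop sketch; it assumes square matrices with `n` divisible by `tile`, and a production kernel would tune `tile` to the actual cache sizes.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matmul: each loaded tile is reused across a whole
    block of C before it is evicted from cache."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```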
**Calibration.** Passing a small representative dataset through a model to collect the activation statistics that post-training quantization needs to pick scale and zero-point parameters (Module 15).
**Causal Mask.** An upper-triangular mask applied to attention scores that sets future positions to negative infinity before the softmax, so their attention weights come out zero; this enforces the autoregressive constraint that token $t$ cannot attend to token $t+1$ (Module 12).
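
In NumPy the mask is one `triu` call:

```python
import numpy as np

seq = 4
scores = np.random.randn(seq, seq)                      # raw attention scores
future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above diagonal
scores[future] = -np.inf   # softmax maps -inf to exactly zero weight
```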
**Chain Rule.** The calculus identity that lets autograd compose per-operation local gradients into a full end-to-end gradient, one multiplication per edge of the computational graph (Module 06).
**Checkpointing.** Writing optimizer state, model weights, and training metadata to disk at regular intervals so a multi-day run can survive a preemption or crash (Module 08).
**Compression Ratio.** The size of the original model divided by the size after pruning, distillation, or low-rank approximation; the headline number every deployment constraint is measured against (Module 16).
**Computational Graph.** The directed acyclic graph of tensor operations autograd records on the forward pass; its topology determines the order and memory cost of the backward pass (Module 06).
**Compute-Bound.** A kernel whose runtime is limited by the rate at which the ALUs can issue floating-point operations, not by the rate at which memory can feed them (Module 14).
**Contiguous Layout.** A tensor whose elements sit in memory in the order implied by its shape and row-major strides; iteration is sequential, cache lines are reused, and SIMD loads stay aligned (Module 01).
**Convergence.** The state in which additional optimizer steps no longer meaningfully reduce training loss; the signal that a run is either done or stuck (Module 07).
**Convolution.** A weight-sharing layer that slides a small kernel across spatial input, exploiting the locality and translation-equivariance of image data (Module 09).
**Cross-Entropy Loss.** The negative log-likelihood of the true class under the model's predicted distribution, the standard training signal for classification and language modeling (Module 04).
## D
**DataLoader.** An iterator that streams minibatches from a dataset, optionally shuffling, augmenting, and prefetching so the accelerator is never idle waiting for input (Module 05).
**Dropout.** A regularizer that zeroes a random subset of activations during training and rescales the remainder, preventing co-adaptation; active only in train mode (Module 03).
**Dtype.** The numerical type of a tensor's elements (float32, float16, bfloat16, int8); the single knob that most directly trades precision for memory and bandwidth (Module 01).
## E
**Embedding Table.** A learnable matrix of shape `(vocab_size, d_model)` whose rows map discrete token indices to dense vectors, the input side of every modern language model (Module 11).
**Epoch.** One full pass of the training loop over every example in the training set; the unit most schedules (learning rate, weight decay) are expressed in (Module 08).
## F
**Forward Pass.** The left-to-right evaluation of a model that turns an input batch into predictions and, during training, records the computational graph for the backward pass (Module 01).
**FP16.** Half-precision floating point, a 16-bit format with reduced exponent range; halves memory and doubles bandwidth relative to FP32, but requires loss scaling to avoid gradient underflow (Module 15).
## G
**GEMM.** General Matrix-Matrix Multiplication, the BLAS Level 3 routine that every linear layer's forward and backward passes ultimately dispatch to (Module 03).
**Gradient.** The partial derivative of the loss with respect to a parameter; the signal the optimizer consumes to decide which way and how far to step (Module 06).
**Gradient Accumulation.** Summing gradients from several micro-batches into a parameter's `.grad` buffer before calling `optimizer.step()`, simulating a larger batch on memory-constrained hardware (Module 06).
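
A runnable sketch against a PyTorch-style API; dividing each loss by `accum_steps` keeps the summed gradient equal to the large-batch average.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

accum_steps = 4                    # effective batch = 4 micro-batches of 8
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                # sums into .grad; no parameter update yet
    if (step + 1) % accum_steps == 0:
        optimizer.step()           # one update per accum_steps micro-batches
        optimizer.zero_grad()
```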
**Gradient Checkpointing.** Discarding selected forward activations and recomputing them during the backward pass, trading compute for peak memory during training (Module 18).
**Gradient Clipping.** Rescaling the gradient vector so its global norm stays below a threshold, preventing occasional outlier batches from blowing up the parameter update (Module 08).
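
A NumPy sketch of global-norm clipping; the epsilon guards against a zero-norm gradient.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their global L2 norm
    is at most max_norm; directions are preserved."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]
```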
## H
**HBM.** High Bandwidth Memory, the stacked DRAM directly attached to a GPU package; far faster than host DRAM but far slower than on-chip SRAM, which is why FlashAttention exists.
## I
**INT8.** An 8-bit integer representation used for quantized weights and activations; reduces model size 4x versus FP32 and lets inference hardware use integer ALUs that are cheaper and faster than floating-point units (Module 15).
**In-place Operation.** An operator that overwrites its input buffer rather than allocating a new one, saving memory bandwidth at the cost of breaking autograd whenever the input is still needed for the backward pass (Module 02).
## K
**Kernel Fusion.** Combining several element-wise operations into a single pass over memory so the activation is read and written once instead of once per op; one of the largest wins available to a framework (Module 17).
**Knowledge Distillation.** Training a smaller student network to match the output distribution of a larger teacher, compressing the teacher's learned behavior into a model with fewer parameters (Module 16).
**KV Cache.** The memoization of key and value projections across autoregressive steps, converting attention's per-step cost from O(N²) recomputation to O(N) reuse, at the price of cache memory that grows linearly with sequence length (Module 18).
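
A minimal append-only sketch; a real cache preallocates its buffers instead of concatenating, and keeps one K/V pair per layer and head.

```python
import numpy as np

class KVCache:
    """Grow-by-one cache: each decode step appends the new token's
    key and value instead of re-projecting the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d), dtype=np.float32)
        self.V = np.empty((0, d), dtype=np.float32)

    def append(self, k, v):         # k, v: (1, d) for the newest token
        self.K = np.concatenate([self.K, k])
        self.V = np.concatenate([self.V, v])
        return self.K, self.V       # attention reads the full history
```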
## L
**Latency.** The time to complete a single request end to end; the metric that governs user-facing interactive inference (Module 19).
**Layer Norm.** Per-example normalization across the feature dimension, the stabilizing ingredient that lets transformers stack to hundreds of layers without exploding or collapsing activations (Module 13).
**Learning Rate.** The scalar multiplier on the gradient in each optimizer step; the single hyperparameter most likely to decide whether a run converges at all (Module 07).
**Linear Layer.** A parameterized affine transformation $y = xW^\top + b$, the building block every deep network is ultimately assembled from (Module 03).
**Little's Law.** The queueing identity $L = \lambda W$ relating concurrency, throughput, and latency; the tool for reasoning about whether a serving system is throughput-bound or latency-bound (Modules 19, 20).
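
Worked numbers, with illustrative service targets:

```python
# Little's Law: L = lambda * W (concurrency = throughput x latency).
latency_s = 0.25            # 250 ms per request, end to end
throughput = 400.0          # requests/second the service must sustain
concurrency = throughput * latency_s
print(concurrency)          # 100 requests must be in flight at any instant
```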
**Logits.** The raw unnormalized scores a classifier emits before Softmax; the input most numerically stable loss functions expect (Module 04).
**Loss.** A scalar measure of how wrong a model's predictions are on a batch; the quantity the optimizer minimizes (Module 04).
**Low-Rank Approximation.** Factoring a weight matrix $W$ as $UV^\top$ with intermediate rank $r \ll \min(m, n)$, cutting parameter count and compute while preserving most of the matrix's dominant behavior (Module 16).
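
An SVD sketch; truncating to the top $r$ singular directions is the best rank-$r$ approximation of $W$ in the Frobenius norm.

```python
import numpy as np

W = np.random.randn(768, 768).astype(np.float32)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64
A = U[:, :r] * s[:r]        # (768, r): fold singular values into U
B = Vt[:r]                  # (r, 768)
# Two skinny matmuls replace one square one:
# 2 * 768 * 64 = 98,304 params vs 768 * 768 = 589,824, a 6x reduction.
x = np.random.randn(1, 768).astype(np.float32)
y = (x @ A) @ B             # approximates x @ W
```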
## M
**Matmul.** Matrix multiplication; the O(n³) compute, O(n²) memory operation that dominates training and inference time and that every systems optimization ultimately defends (Module 01).
**Memoization.** Caching the result of a deterministic computation so later calls with the same inputs become a table lookup instead of a recompute; the basis for the KV cache (Module 18).
**Memory-Bound.** A kernel whose runtime is limited by the rate at which memory can deliver operands, not by the rate at which the ALUs can consume them; the regime where arithmetic intensity is low (Module 14).
**Memory Wall.** The widening gap between compute throughput and memory bandwidth that makes data movement, not math, the dominant cost of modern ML workloads (Module 15).
**Mini-batch.** A subset of the training set processed in a single forward/backward step; a compromise between the variance of stochastic updates and the memory cost of full-batch gradient descent (Module 05).
**MLP.** A multi-layer perceptron: alternating linear layers and nonlinearities; the feedforward sublayer inside every transformer block (Module 13).
**Multi-Head Attention.** Running several attention heads in parallel over independently projected query, key, and value subspaces, then concatenating; lets the model attend to multiple relational patterns at once (Module 12).
## O
**Optimizer.** The stateful object that consumes gradients and produces parameter updates, from plain SGD to Adam to AdamW (Module 07).
**Overfitting.** When a model's training loss keeps decreasing while its validation loss rises; the signal that the network has started memorizing instead of generalizing (Module 08).
## P
**PagedAttention.** An inference-time memory manager that stores the KV cache in fixed-size blocks and indirects through a page table, eliminating the fragmentation that naive contiguous caches suffer under variable sequence lengths (Module 18).
**Positional Encoding.** A signal added to token embeddings so attention, which is otherwise permutation-invariant, can distinguish position; sinusoidal, learned, and rotary variants all serve the same purpose (Module 11).
**Post-Training Quantization.** Converting a trained FP32 model to a lower-precision format (typically INT8) using a small calibration set, without retraining (Module 15).
**Profiling.** Measuring where a program actually spends its time and memory, by instrumenting regions with timers and allocators, so optimization effort is directed at the actual bottleneck (Module 14).
**Pruning.** Zeroing out weights deemed unimportant (by magnitude, by gradient, by structured group) to reduce the effective parameter count of a trained model (Module 16).
## Q
**Q/K/V.** The query, key, and value projections attention produces from the input; Q asks "what am I looking for," K advertises "what do I offer," V carries "what I will hand back" (Module 12).
**Quantization.** Replacing FP32 tensors with a lower-precision integer or narrow-float representation, reducing memory, bandwidth, and often compute cost at a small accuracy price (Module 15).
## R
**ReLU.** The rectified linear unit $\max(0, x)$; cheap, non-saturating, and the activation that made deep networks trainable at scale (Module 02).
**Residual Connection.** A shortcut that adds a layer's input to its output, giving the gradient a direct path to earlier parameters and making deep stacks optimizable (Module 13).
**Retain Grad.** The opt-in flag that tells autograd to keep a non-leaf tensor's gradient around after the backward pass, at the cost of the memory autograd normally reclaims (Module 06).
**Roofline Model.** A performance chart plotting achievable FLOP/s against arithmetic intensity, with a memory-bandwidth ceiling on the left and a peak-compute ceiling on the right; tells you which wall your kernel is hitting (Module 14).
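
The ridge point, where the two ceilings meet, drops out of two hardware numbers (illustrative values here):

```python
peak_flops = 20e12          # 20 TFLOP/s compute ceiling
bandwidth = 900e9           # 900 GB/s memory ceiling
ridge = peak_flops / bandwidth
print(ridge)                # ~22 FLOPs/byte: kernels below this are memory-bound
```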
## S
**Scale and Zero-Point.** The two parameters that map a floating-point range $[x_{\min}, x_{\max}]$ into an integer range during quantization: scale sets the step size, zero-point sets the integer that represents zero (Module 15).
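
A minimal affine-quantization sketch; it assumes the tensor has a nonzero range and skips the symmetric and per-channel variants.

```python
import numpy as np

def quantize_uint8(x):
    """Map an FP32 array onto [0, 255] with a scale and zero-point."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0                  # step size per integer level
    zero_point = int(round(-lo / scale))       # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```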
**SGD.** Stochastic Gradient Descent, the baseline optimizer: subtract the learning rate times the gradient from each parameter (Module 07).
**SIMD.** Single Instruction, Multiple Data; CPU vector instructions that apply one operation to a register full of values at once, the reason element-wise NumPy is orders of magnitude faster than a Python loop (Module 01).
**Softmax.** The exponentiate-and-normalize function that turns a vector of logits into a probability distribution; numerically stabilized in practice by subtracting the max before exponentiating (Modules 02, 12).
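
The stabilized version in NumPy; subtracting the max cancels in the ratio, so the result is unchanged while `exp` can no longer overflow.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```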
**Stride.** The number of bytes (or elements) to skip in memory to move one step along a given dimension; the per-axis metadata that makes transpose an O(1) view and lets reshape stay zero-copy when contiguous (Module 01).
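
Strides in action:

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
print(x.strides)   # (16, 4): 16 bytes to the next row, 4 to the next column
t = x.T            # transpose is a view: same buffer, swapped strides
print(t.strides)   # (4, 16), and no data was copied
t[0, 0] = 99.0
print(x[0, 0])     # 99.0: views alias their source
```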
## T
**Tensor.** The N-dimensional array abstraction every module in the book operates on: data buffer plus shape, dtype, and stride metadata (Module 01).
**Throughput.** The number of requests or tokens a system processes per unit time; the metric that governs batch-oriented training and offline inference (Module 19).
**Tokenization.** Mapping raw text to a sequence of integer token IDs a model can consume; the interface between the language and the tensors (Module 10).
**Train/Eval Mode.** The toggle on the model that switches dropout, batch norm, and other stochastic or batch-statistical layers between their training and inference behaviors (Module 08).
**Training Loop.** The outer program that iterates over epochs and batches, running forward, loss, backward, and optimizer step in sequence and periodically logging, checkpointing, and evaluating (Module 08).
**Transformer Block.** The residual sandwich of layer norm, multi-head attention, layer norm, and MLP that stacks N times to form a GPT or BERT-style model (Module 13).
## V
**Vectorization.** Expressing a computation as a whole-array operation so NumPy can dispatch it to SIMD or BLAS instead of interpreting a Python loop element by element (Module 17).
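
The two forms side by side; on a typical machine the vectorized call is orders of magnitude faster:

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

total = 0.0
for v in x:              # one Python-level dispatch per element
    total += float(v)

total_vec = x.sum()      # one call into SIMD-backed C
```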
**View.** A tensor that shares its data buffer with another tensor and only differs in shape or stride metadata; zero-copy, O(1) to construct, and silently aliased with its source (Module 01).
**Vocabulary.** The finite set of tokens a tokenizer knows about, with one integer ID per entry; its size sets the width of the input embedding table and the output logit layer (Module 10).