Wave 4 editorial content across 20 modules + new glossary back matter:
1. Module opener hooks (20 new 2-3 sentence paragraphs between the
chapter heading and Module Info callout). Every hook LEADS with
the systems angle (memory, bandwidth, arithmetic intensity,
cache, HBM, roofline, KV cache, hardware utilization, etc.) and
connects back to the ML story. Reinforces that this is a lab
guide for ML systems, not an ML-theory textbook.
2. Code-listing captions on substantive code blocks (roughly >10
lines, defines a class/function/algorithm). Populates Quarto's
List of Listings front matter. Combined across F1/F2/L/O
subagent waves: roughly 60 listings now carry
'**Listing N.M — Brief description**' captions.
3. Figure alt-text audit across 20 module diagrams. Most already
carried objective specific alt-text; a handful were rewritten
for precision.
4. Glossary in back matter (tinytorch/quarto/glossary.qmd + registered
in pdf/_quarto.yml). 90 alphabetical entries spanning tensor /
memory / autograd / training / architecture / optimization
terms. One-sentence definitions. Module cross-references where
the term is central. Lab-guide voice, not dictionary.
5. Style discipline: no em-dashes in prose (caption templates
'— Description' are the only exception, required by parser).
All agent outputs and the hand-revised hooks audited for em-dash
use.
6. SVG trailing-newline hygiene: 8 SVGs touched by the Gemini style
audit had lost their trailing newline. Restored per the SVG
file-hygiene rule.
---
title: "Glossary"
---

Alphabetical reference for the key technical terms introduced across the TinyTorch Lab Guide, framed in the voice the modules themselves use. Where a term is central to a specific module, that module is cited as the place to return to for depth.
## A
**Acceleration.** The practice of replacing slow Python loops with vectorized NumPy, cache-aware memory layouts, and BLAS-backed kernels so a computation runs closer to the hardware's peak throughput (Module 17).
**Activation Function.** An element-wise nonlinearity such as ReLU, Sigmoid, Tanh, or GELU applied between linear layers; without it, a stack of layers collapses algebraically into a single linear transform (Module 02).
**Activation Pinning.** The systems consequence of reverse-mode autograd: every forward activation must stay resident in memory until the backward pass consumes it, which sets a hard floor on training VRAM (Module 06).
**Adam.** An adaptive optimizer that maintains per-parameter first and second moment estimates of the gradient, trading extra optimizer state for faster, more robust convergence (Module 07).
**AdamW.** Adam with decoupled weight decay, where the L2 penalty is applied directly to the parameter rather than folded into the gradient; the default optimizer for modern transformer training (Module 07).
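
To make the decoupling concrete, here is a minimal NumPy sketch of one update step; the function and hyperparameter names are illustrative, not TinyTorch's API.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad**2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                 # bias correction, t >= 1
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # the Adam step
    param = param - lr * wd * param            # decoupled decay: the W in AdamW
    return param, m, v
```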
**AllReduce.** A collective communication primitive that sums tensors across devices and returns the result to all participants; required by synchronous data-parallel training (and by global-norm gradient clipping across shards), and often the step that bottlenecks distributed pipelines.
**Arithmetic Intensity.** The ratio of floating-point operations performed per byte of memory moved; the coordinate along the roofline model's x-axis that determines whether a kernel is memory-bound or compute-bound (Module 17).
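
A back-of-the-envelope comparison, assuming ideal reuse so each matrix element crosses the memory bus exactly once:

```python
n = 4096                      # square FP32 matmul: C = A @ B
flops = 2 * n**3              # one multiply plus one add per inner-loop step
bytes_moved = 3 * n**2 * 4    # read A, read B, write C, 4 bytes per element
print(flops / bytes_moved)    # n/6 ~ 683 FLOPs/byte: firmly compute-bound

# A vector add z = x + y does 1 FLOP per 12 bytes moved:
print(1 / 12)                 # ~0.08 FLOPs/byte: hopelessly memory-bound
```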
**Attention.** A mechanism that computes a weighted combination of value vectors using similarities between a query and a set of keys, enabling a model to look at every position in a sequence in parallel (Module 12).
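
A minimal single-head NumPy sketch, with `(seq, d)` inputs and no batch dimension or mask:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V                               # weighted blend of values
```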
**Autograd.** The automatic differentiation engine that records a computational graph during the forward pass and traverses it in reverse to produce gradients with respect to every parameter (Module 06).
**Autoregressive Generation.** Producing one token at a time, feeding each output back in as the next input; the decoding pattern that makes the KV cache a necessity rather than a nicety (Module 13).
## B
**Backward Pass.** The traversal of the computational graph from loss to inputs that applies the chain rule to accumulate gradients into every parameter tensor (Module 06).
**Batch.** A group of examples processed together so a single matrix multiply can amortize memory-bandwidth and kernel-launch overhead across many samples (Module 05).
**Batch Normalization.** A layer that normalizes activations using batch statistics during training and running statistics during inference, stabilizing deep networks at the cost of a train/eval mode distinction (Module 09).
**Benchmarking.** The disciplined measurement of latency, throughput, and memory under controlled conditions, with warmup, multiple runs, and variance reporting, so that a reported speedup is reproducible (Module 19).
**BF16.** Brain Floating Point 16, a 16-bit format that keeps FP32's 8-bit exponent range but truncates the mantissa to 7 bits, trading precision for memory bandwidth during training (Module 15).
**BLAS.** Basic Linear Algebra Subprograms, the decades-old library interface that underlies every production matmul; its Level 3 routines are tiled and cache-aware so dense linear algebra approaches peak FLOPS (Module 17).
**Broadcasting.** The right-to-left shape-alignment rule that lets a `(768,)` bias add into a `(32, 512, 768)` activation tensor without materializing the tiled copy, saving memory proportional to the broadcast dimensions (Module 01).
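
The entry's shapes, runnable:

```python
import numpy as np

x = np.zeros((32, 512, 768), dtype=np.float32)  # (batch, seq, features)
b = np.ones(768, dtype=np.float32)              # one bias per feature
y = x + b   # (32, 512, 768) + (768,) aligns right-to-left
# No (32, 512, 768) copy of b is materialized; NumPy strides over the
# same 768 values for every (batch, seq) position.
```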
**Byte Pair Encoding (BPE).** A subword tokenization algorithm that greedily merges the most frequent adjacent pairs, compressing text into a fixed vocabulary that stays robust to rare words (Module 10).
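
A sketch of a single merge step on raw characters; real tokenizers also record the merge table, work per word, and handle byte-level fallbacks.

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the most frequent adjacent pair into one symbol."""
    pairs = Counter(zip(tokens, tokens[1:]))
    a, b = max(pairs, key=pairs.get)           # greediest pair wins
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            out.append(a + b)                  # the new merged symbol
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(bpe_merge_step(list("low lower lowest")))  # ('l', 'o') merges into 'lo'
```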
## C
**Cache Tiling.** Breaking a matmul into blocks sized to fit in L1 or L2 cache so each loaded tile is reused many times before eviction, turning an O(n³) operation from memory-bound back into compute-bound (Module 01).
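
A blocked-loop sketch; it assumes square matrices with `n` divisible by `tile`, and a production kernel would tune `tile` to the actual cache sizes.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matmul: each loaded tile is reused across a whole
    block of C before it is evicted from cache."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```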
**Calibration.** Passing a small representative dataset through a model to collect the activation statistics that post-training quantization needs to pick scale and zero-point parameters (Module 15).
**Causal Mask.** An upper-triangular mask applied to attention scores that sets future positions to negative infinity before the softmax, so their attention weights come out zero; this enforces the autoregressive constraint that token $t$ cannot attend to token $t+1$ (Module 12).
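
In NumPy the mask is one `triu` call:

```python
import numpy as np

seq = 4
scores = np.random.randn(seq, seq)                      # raw attention scores
future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above diagonal
scores[future] = -np.inf   # softmax maps -inf to exactly zero weight
```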
**Chain Rule.** The calculus identity that lets autograd compose per-operation local gradients into a full end-to-end gradient, one multiplication per edge of the computational graph (Module 06).
**Checkpointing.** Writing optimizer state, model weights, and training metadata to disk at regular intervals so a multi-day run can survive a preemption or crash (Module 08).
**Compression Ratio.** The size of the original model divided by the size after pruning, distillation, or low-rank approximation; the headline number every deployment constraint is measured against (Module 16).
**Computational Graph.** The directed acyclic graph of tensor operations autograd records on the forward pass; its topology determines the order and memory cost of the backward pass (Module 06).
**Compute-Bound.** A kernel whose runtime is limited by the rate at which the ALUs can issue floating-point operations, not by the rate at which memory can feed them (Module 14).
**Contiguous Layout.** A tensor whose elements sit in memory in the order implied by its shape and row-major strides; iteration is sequential, cache lines are reused, and SIMD loads stay aligned (Module 01).
**Convergence.** The state in which additional optimizer steps no longer meaningfully reduce training loss; the signal that a run is either done or stuck (Module 07).
**Convolution.** A weight-sharing layer that slides a small kernel across spatial input, exploiting the locality and translation-equivariance of image data (Module 09).
**Cross-Entropy Loss.** The negative log-likelihood of the true class under the model's predicted distribution, the standard training signal for classification and language modeling (Module 04).
## D
**DataLoader.** An iterator that streams minibatches from a dataset, optionally shuffling, augmenting, and prefetching so the accelerator is never idle waiting for input (Module 05).
**Dropout.** A regularizer that zeroes a random subset of activations during training and rescales the remainder, preventing co-adaptation; active only in train mode (Module 03).
**Dtype.** The numerical type of a tensor's elements (float32, float16, bfloat16, int8); the single knob that most directly trades precision for memory and bandwidth (Module 01).
## E
**Embedding Table.** A learnable matrix of shape `(vocab_size, d_model)` whose rows map discrete token indices to dense vectors, the input side of every modern language model (Module 11).
**Epoch.** One full pass of the training loop over every example in the training set; the unit most schedules (learning rate, weight decay) are expressed in (Module 08).
## F
**Forward Pass.** The left-to-right evaluation of a model that turns an input batch into predictions and, during training, records the computational graph for the backward pass (Module 01).
**FP16.** Half-precision floating point, a 16-bit format with reduced exponent range; halves memory and doubles bandwidth relative to FP32, but requires loss scaling to avoid gradient underflow (Module 15).
## G
**GEMM.** General Matrix-Matrix Multiplication, the BLAS Level 3 routine that every linear layer's forward and backward passes ultimately dispatch to (Module 03).
**Gradient.** The partial derivative of the loss with respect to a parameter; the signal the optimizer consumes to decide which way and how far to step (Module 06).
**Gradient Accumulation.** Summing gradients from several micro-batches into a parameter's `.grad` buffer before calling `optimizer.step()`, simulating a larger batch on memory-constrained hardware (Module 06).
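
A runnable sketch against a PyTorch-style API; dividing each loss by `accum_steps` keeps the summed gradient equal to the large-batch average.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

accum_steps = 4                    # effective batch = 4 micro-batches of 8
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                # sums into .grad; no parameter update yet
    if (step + 1) % accum_steps == 0:
        optimizer.step()           # one update per accum_steps micro-batches
        optimizer.zero_grad()
```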
**Gradient Checkpointing.** Discarding selected forward activations and recomputing them during the backward pass, trading compute for peak memory during training (Module 18).
**Gradient Clipping.** Rescaling the gradient vector so its global norm stays below a threshold, preventing occasional outlier batches from blowing up the parameter update (Module 08).
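
A NumPy sketch of global-norm clipping; the epsilon guards against a zero-norm gradient.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their global L2 norm
    is at most max_norm; directions are preserved."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]
```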
## H
**HBM.** High Bandwidth Memory, the stacked DRAM directly attached to a GPU package; far faster than host DRAM but far slower than on-chip SRAM, which is why FlashAttention exists.
## I
**INT8.** An 8-bit integer representation used for quantized weights and activations; reduces model size 4x versus FP32 and lets inference hardware use integer ALUs that are cheaper and faster than floating-point units (Module 15).
**In-place Operation.** An operator that overwrites its input buffer rather than allocating a new one, saving memory bandwidth at the cost of breaking autograd whenever the input is still needed for the backward pass (Module 02).
## K
**Kernel Fusion.** Combining several element-wise operations into a single pass over memory so the activation is read and written once instead of once per op; one of the largest wins available to a framework (Module 17).
**Knowledge Distillation.** Training a smaller student network to match the output distribution of a larger teacher, compressing the teacher's learned behavior into a model with fewer parameters (Module 16).
**KV Cache.** The memoization of key and value projections across autoregressive steps, converting attention's per-step cost from O(N²) recomputation to O(N) reuse, at the price of cache memory that grows linearly with sequence length (Module 18).
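
A minimal append-only sketch; a real cache preallocates its buffers instead of concatenating, and keeps one K/V pair per layer and head.

```python
import numpy as np

class KVCache:
    """Grow-by-one cache: each decode step appends the new token's
    key and value instead of re-projecting the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d), dtype=np.float32)
        self.V = np.empty((0, d), dtype=np.float32)

    def append(self, k, v):         # k, v: (1, d) for the newest token
        self.K = np.concatenate([self.K, k])
        self.V = np.concatenate([self.V, v])
        return self.K, self.V       # attention reads the full history
```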
## L
**Latency.** The time to complete a single request end to end; the metric that governs user-facing interactive inference (Module 19).
**Layer Norm.** Per-example normalization across the feature dimension, the stabilizing ingredient that lets transformers stack to hundreds of layers without exploding or collapsing activations (Module 13).
**Learning Rate.** The scalar multiplier on the gradient in each optimizer step; the single hyperparameter most likely to decide whether a run converges at all (Module 07).
**Linear Layer.** A parameterized affine transformation $y = xW^\top + b$, the building block every deep network is ultimately assembled from (Module 03).
**Little's Law.** The queueing identity $L = \lambda W$ relating concurrency, throughput, and latency; the tool for reasoning about whether a serving system is throughput-bound or latency-bound (Modules 19, 20).
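
Worked numbers, with illustrative service targets:

```python
# Little's Law: L = lambda * W (concurrency = throughput x latency).
latency_s = 0.25            # 250 ms per request, end to end
throughput = 400.0          # requests/second the service must sustain
concurrency = throughput * latency_s
print(concurrency)          # 100 requests must be in flight at any instant
```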
**Logits.** The raw unnormalized scores a classifier emits before Softmax; the input most numerically stable loss functions expect (Module 04).
**Loss.** A scalar measure of how wrong a model's predictions are on a batch; the quantity the optimizer minimizes (Module 04).
**Low-Rank Approximation.** Factoring a weight matrix $W$ as $UV^\top$ with intermediate rank $r \ll \min(m, n)$, cutting parameter count and compute while preserving most of the matrix's dominant behavior (Module 16).
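
An SVD sketch; truncating to the top $r$ singular directions is the best rank-$r$ approximation of $W$ in the Frobenius norm.

```python
import numpy as np

W = np.random.randn(768, 768).astype(np.float32)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64
A = U[:, :r] * s[:r]        # (768, r): fold singular values into U
B = Vt[:r]                  # (r, 768)
# Two skinny matmuls replace one square one:
# 2 * 768 * 64 = 98,304 params vs 768 * 768 = 589,824, a 6x reduction.
x = np.random.randn(1, 768).astype(np.float32)
y = (x @ A) @ B             # approximates x @ W
```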
## M
**Matmul.** Matrix multiplication; the O(n³) compute, O(n²) memory operation that dominates training and inference time and that every systems optimization ultimately defends (Module 01).
**Memoization.** Caching the result of a deterministic computation so later calls with the same inputs become a table lookup instead of a recompute; the basis for the KV cache (Module 18).
**Memory-Bound.** A kernel whose runtime is limited by the rate at which memory can deliver operands, not by the rate at which the ALUs can consume them; the regime where arithmetic intensity is low (Module 14).
**Memory Wall.** The widening gap between compute throughput and memory bandwidth that makes data movement, not math, the dominant cost of modern ML workloads (Module 15).
**Mini-batch.** A subset of the training set processed in a single forward/backward step; a compromise between the variance of stochastic updates and the memory cost of full-batch gradient descent (Module 05).
**MLP.** A multi-layer perceptron: alternating linear layers and nonlinearities; the feedforward sublayer inside every transformer block (Module 13).
**Multi-Head Attention.** Running several attention heads in parallel over independently projected query, key, and value subspaces, then concatenating; lets the model attend to multiple relational patterns at once (Module 12).
## O
**Optimizer.** The stateful object that consumes gradients and produces parameter updates, from plain SGD to Adam to AdamW (Module 07).
**Overfitting.** When a model's training loss keeps decreasing while its validation loss rises; the signal that the network has started memorizing instead of generalizing (Module 08).
## P
**PagedAttention.** An inference-time memory manager that stores the KV cache in fixed-size blocks and indirects through a page table, eliminating the fragmentation that naive contiguous caches suffer under variable sequence lengths (Module 18).
**Positional Encoding.** A signal added to token embeddings so attention, which is otherwise permutation-invariant, can distinguish position; sinusoidal, learned, and rotary variants all serve the same purpose (Module 11).
**Post-Training Quantization.** Converting a trained FP32 model to a lower-precision format (typically INT8) using a small calibration set, without retraining (Module 15).
**Profiling.** Measuring where a program actually spends its time and memory, by instrumenting regions with timers and allocators, so optimization effort is directed at the actual bottleneck (Module 14).
**Pruning.** Zeroing out weights deemed unimportant (by magnitude, by gradient, by structured group) to reduce the effective parameter count of a trained model (Module 16).
## Q
**Q/K/V.** The query, key, and value projections attention produces from the input; Q asks "what am I looking for," K advertises "what do I offer," V carries "what I will hand back" (Module 12).
**Quantization.** Replacing FP32 tensors with a lower-precision integer or narrow-float representation, reducing memory, bandwidth, and often compute cost at a small accuracy price (Module 15).
## R
**ReLU.** The rectified linear unit $\max(0, x)$; cheap, non-saturating, and the activation that made deep networks trainable at scale (Module 02).
**Residual Connection.** A shortcut that adds a layer's input to its output, giving the gradient a direct path to earlier parameters and making deep stacks optimizable (Module 13).
**Retain Grad.** The opt-in flag that tells autograd to keep a non-leaf tensor's gradient around after the backward pass, at the cost of the memory autograd normally reclaims (Module 06).
**Roofline Model.** A performance chart plotting achievable FLOP/s against arithmetic intensity, with a memory-bandwidth ceiling on the left and a peak-compute ceiling on the right; tells you which wall your kernel is hitting (Module 14).
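
The ridge point, where the two ceilings meet, drops out of two hardware numbers (illustrative values here):

```python
peak_flops = 20e12          # 20 TFLOP/s compute ceiling
bandwidth = 900e9           # 900 GB/s memory ceiling
ridge = peak_flops / bandwidth
print(ridge)                # ~22 FLOPs/byte: kernels below this are memory-bound
```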
## S
**Scale and Zero-Point.** The two parameters that map a floating-point range $[x_{\min}, x_{\max}]$ into an integer range during quantization: scale sets the step size, zero-point sets the integer that represents zero (Module 15).
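
A minimal affine-quantization sketch; it assumes the tensor has a nonzero range and skips the symmetric and per-channel variants.

```python
import numpy as np

def quantize_uint8(x):
    """Map an FP32 array onto [0, 255] with a scale and zero-point."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0                  # step size per integer level
    zero_point = int(round(-lo / scale))       # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```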
**SGD.** Stochastic Gradient Descent, the baseline optimizer: subtract the learning rate times the gradient from each parameter (Module 07).
**SIMD.** Single Instruction, Multiple Data; CPU vector instructions that apply one operation to a register full of values at once, the reason element-wise NumPy is orders of magnitude faster than a Python loop (Module 01).
**Softmax.** The exponentiate-and-normalize function that turns a vector of logits into a probability distribution; numerically stabilized in practice by subtracting the max before exponentiating (Modules 02, 12).
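
The stabilized version in NumPy; subtracting the max cancels in the ratio, so the result is unchanged while `exp` can no longer overflow.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```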
**Stride.** The number of bytes (or elements) to skip in memory to move one step along a given dimension; the per-axis metadata that makes transpose an O(1) view and lets reshape stay zero-copy when contiguous (Module 01).
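
Strides in action:

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
print(x.strides)   # (16, 4): 16 bytes to the next row, 4 to the next column
t = x.T            # transpose is a view: same buffer, swapped strides
print(t.strides)   # (4, 16), and no data was copied
t[0, 0] = 99.0
print(x[0, 0])     # 99.0: views alias their source
```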
## T
**Tensor.** The N-dimensional array abstraction every module in the book operates on: data buffer plus shape, dtype, and stride metadata (Module 01).
**Throughput.** The number of requests or tokens a system processes per unit time; the metric that governs batch-oriented training and offline inference (Module 19).
**Tokenization.** Mapping raw text to a sequence of integer token IDs a model can consume; the interface between the language and the tensors (Module 10).
**Train/Eval Mode.** The toggle on the model that switches dropout, batch norm, and other stochastic or batch-statistical layers between their training and inference behaviors (Module 08).
**Training Loop.** The outer program that iterates over epochs and batches, running forward, loss, backward, and optimizer step in sequence and periodically logging, checkpointing, and evaluating (Module 08).
**Transformer Block.** The residual sandwich of layer norm, multi-head attention, layer norm, and MLP that stacks N times to form a GPT or BERT-style model (Module 13).
## V
**Vectorization.** Expressing a computation as a whole-array operation so NumPy can dispatch it to SIMD or BLAS instead of interpreting a Python loop element by element (Module 17).
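
The two forms side by side; on a typical machine the vectorized call is orders of magnitude faster:

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

total = 0.0
for v in x:              # one Python-level dispatch per element
    total += float(v)

total_vec = x.sum()      # one call into SIMD-backed C
```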
**View.** A tensor that shares its data buffer with another tensor and only differs in shape or stride metadata; zero-copy, O(1) to construct, and silently aliased with its source (Module 01).
**Vocabulary.** The finite set of tokens a tokenizer knows about, with one integer ID per entry; its size sets the width of the input embedding table and the output logit layer (Module 10).