---
title: "Mathematical Foundations"
subtitle: "The First-Principles Equations Behind Every MLSys·im Solver"
---
MLSys·im avoids "black box" heuristics. Every output traces back to one of the equations below, each grounded in a published systems paper. Before diving into code, read this page to understand *what* the 21 solvers resolving the [22 Systems Walls](architecture.qmd#the-22-systems-walls) are computing and *why*.
::: {.callout-tip}
## Reading these equations
Each solver in MLSys·im implements one or more of the models below.
Click any solver name to go directly to its API documentation.
:::
---
## 1. The Iron Law of ML Systems (Single Node)
*Implemented in [`mlsysim.core.solver.SingleNodeSolver`](api/core.solver.SingleNodeSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: The Roofline Bottleneck**
Hardware has two speed limits—how fast it can compute, and how fast it can move data from memory to the compute units. Your actual throughput is determined by whichever limit you hit first. This is why we take the *maximum* of two terms, not their sum.
**📚 Source:** @williams2009roofline
:::
$$
T = \max \left( \frac{\text{FLOPs}}{\text{Peak\_FLOPs} \times \eta},\ \frac{\text{Bytes}}{\text{Memory\_BW}} \right) + \text{Dispatch\_Tax}
$$
Where:
- $\eta$ is the hardware utilization efficiency (typically 0.25–0.55 in practice; MLSys·im defaults to 0.5, with 0.35 recommended for inference — see [Accuracy & Validation](accuracy.qmd) for guidance).
- $\text{Dispatch\_Tax}$ is the constant kernel-launch overhead (e.g., CUDA overhead, ~0.01–0.1 ms).
- If $\frac{\text{FLOPs}}{\text{Peak\_FLOPs} \times \eta} > \frac{\text{Bytes}}{\text{Memory\_BW}}$: **Compute-bound** — buy faster GPUs or increase arithmetic intensity.
- If $\frac{\text{Bytes}}{\text{Memory\_BW}}$ wins: **Memory-bound** — increase batch size or use operator fusion.
**Arithmetic Intensity** is the key ratio: $I = \text{FLOPs} / \text{Bytes}$.
The *roofline ridge point* is $I^* = \text{Peak\_FLOPs} / \text{Memory\_BW}$.
If $I > I^*$, you are compute-bound. If $I < I^*$, you are memory-bound.
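These relations are easy to sanity-check numerically. The sketch below is illustrative, not the `SingleNodeSolver` API; the function names and the A100-class numbers are assumptions for the example:

```python
def roofline_time(flops, bytes_moved, peak_flops, mem_bw, eta=0.5, dispatch_tax=1e-4):
    """Iron Law: the slower of compute and memory movement, plus kernel-launch overhead."""
    t_compute = flops / (peak_flops * eta)
    t_memory = bytes_moved / mem_bw
    return max(t_compute, t_memory) + dispatch_tax

def ridge_point(peak_flops, mem_bw):
    """Arithmetic intensity I* separating memory-bound from compute-bound kernels."""
    return peak_flops / mem_bw

# A100-class numbers (illustrative): 312 TFLOP/s fp16, 2.0 TB/s HBM
print(ridge_point(312e12, 2.0e12))  # → 156.0 FLOPs/byte
```

Any kernel with $I < 156$ FLOPs/byte on such a chip hits the memory roof first, no matter how fast the tensor cores are.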
---
## 2. Distributed Training (3D Parallelism)
*Implemented in [`mlsysim.core.solver.DistributedSolver`](api/core.solver.DistributedSolver.qmd).*
**Why an analytical model?** Real distributed training involves complex interactions between
computation, communication, and scheduling. Empirical profiling requires access to expensive
multi-GPU clusters and takes hours per configuration. MLSys·im instead decomposes the problem
into three independent overheads — data parallelism (gradient synchronization), tensor
parallelism (intra-layer communication), and pipeline parallelism (bubble idle time) — each
governed by a closed-form equation. This lets you evaluate thousands of parallelism
configurations in seconds to identify the best strategy *before* reserving cluster time.
The key insight is that each parallelism dimension introduces a **communication tax** that
can be modeled from first principles: message size, network bandwidth, and topology. The
single-GPU compute time comes from the roofline model (Section 1), and the distributed
overhead is additive on top.
### 2.1 Scaling Efficiency
The solver computes an overall **scaling efficiency** — the fraction of ideal linear speedup
actually achieved:
$$
\eta_{\text{scale}} = \frac{T_{\text{single}}}{T_{\text{single}} + T_{\text{dp}} + T_{\text{tp}} + T_{\text{bubble}}}
$$
Where $T_{\text{single}}$ is the per-GPU compute time (from the roofline model), and the
remaining terms are the communication and scheduling overheads derived below. An efficiency
of 80% on 256 GPUs means you get the equivalent throughput of ~205 GPUs — the rest is
spent on communication.
### 2.2 Ring All-Reduce (Data Parallelism)
After each training step, every GPU must synchronize its gradients with every other GPU.
The standard algorithm is **ring all-reduce**, which arranges GPUs in a logical ring and
passes gradient chunks around it in two phases.
For a model of size $M$ bytes distributed across $N$ accelerators connected in a ring topology
with inter-node bandwidth $BW$ and latency $L$:
$$
T_{\text{dp}} = 2(N-1) \cdot \left( \frac{M / N}{BW} + L \right)
$$
The factor of 2 arises because ring all-reduce has two phases: scatter-reduce and all-gather,
each requiring $N-1$ communication steps. Each step transfers $M/N$ bytes (one chunk of the
gradient), so the total data transferred per GPU approaches $2M$ as $N$ grows — meaning the
bandwidth cost is nearly independent of cluster size, which is why ring all-reduce scales
well.
**Implication**: All-reduce cost grows linearly with model size $M$, but its bandwidth component is asymptotically **constant** in $N$ — the factor $2(N-1)/N$ approaches 2 as $N$ grows, so adding more GPUs barely increases per-GPU communication time. (Only the latency term $2(N-1) \cdot L$ keeps growing with cluster size, and it is usually small.)
For very large models (70B+ parameters = ~140 GB gradients in fp16), communication dominates
at low batch sizes. Upgrading from 100 Gb Ethernet to InfiniBand NDR (400 Gb/s) can recover
10–30% scaling efficiency.
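The all-reduce model takes one line to evaluate. A minimal sketch (hypothetical function name, not the `DistributedSolver` API):

```python
def ring_allreduce_time(model_bytes, n_gpus, bw_bytes_per_s, latency_s):
    """Ring all-reduce: 2(N-1) steps, each moving a 1/N chunk of the gradients."""
    steps = 2 * (n_gpus - 1)
    return steps * (model_bytes / n_gpus / bw_bytes_per_s + latency_s)

# 140 GB fp16 gradients (70B params), 256 GPUs, 50 GB/s links, 5 us per-step latency
t_dp = ring_allreduce_time(140e9, 256, 50e9, 5e-6)
```

Try doubling `n_gpus`: the result barely moves, because the per-step chunk shrinks as fast as the step count grows — the scaling property noted above.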
### 2.3 Pipeline Parallelism Bubble
**Pipeline parallelism** splits a model's layers across multiple stages (nodes). Stage 1 processes layers 1–20, stage 2 processes layers 21–40, and so on. This allows models too large for a single GPU to be trained across multiple nodes.
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Shrinking the Pipeline Bubble**
In standard 1F1B pipeline parallelism, GPUs sit idle waiting for microbatches to traverse the network. You can't change the speed of light, but you *can* change the software schedule. By assigning multiple "virtual stages" ($V$) to a single GPU, we interleave the execution. While a GPU is waiting for the next microbatch of its *first* virtual stage, it can compute a microbatch for its *second* virtual stage, effectively hiding the network latency behind useful compute.
**📚 Source:** @narayanan2021efficient
:::
The cost of pipelining is a **pipeline bubble**: at the start of each batch, downstream stages sit idle while waiting for upstream stages to produce output. When a pipeline of depth $P$ processes $M$ microbatches with $V$ virtual stages per GPU, the fraction of time spent idle is:
$$
\text{Bubble Fraction} = \frac{P - 1}{V \times M + P - 1}
$$
The intuition: with $P$ stages and $M$ microbatches, the pipeline takes time to fill and drain. The solution is to either increase $M$ (more microbatches) or increase $V$ (interleaved schedules). Both make the startup and drain phases a smaller fraction of total time.
**Implication**: To keep the bubble below 5% using standard 1F1B ($V=1$), you need $M \geq 19 \cdot (P-1)$ microbatches. With a 4-stage pipeline ($P=4$), you need at least 57 microbatches. By using $V=2$ virtual stages, you cut the required microbatches in half.
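The microbatch requirement follows by solving the bubble equation for $M$. A sketch with illustrative names (not the solver API):

```python
import math

def bubble_fraction(p_stages, m_microbatches, v_virtual=1):
    """Idle fraction of a P-stage pipeline with M microbatches and V virtual stages."""
    return (p_stages - 1) / (v_virtual * m_microbatches + p_stages - 1)

def microbatches_for_bubble(p_stages, target, v_virtual=1):
    """Smallest M keeping the bubble fraction at or below `target`."""
    return math.ceil((p_stages - 1) * (1 - target) / (target * v_virtual))

print(microbatches_for_bubble(4, 0.05))               # → 57, i.e. 19 * (P-1)
print(microbatches_for_bubble(4, 0.05, v_virtual=2))  # → 29
```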
### 2.4 Expert Parallelism (Mixture of Experts)
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Breaking the Iron Law**
Standard dense Transformers obey a strict "Iron Law": if you double the parameters, you double the memory *and* the compute FLOPs. Mixture of Experts (MoE) breaks this law. It routes tokens only to specific "expert" subnetworks. This means your **Memory Bound** is dictated by the massive *Total Parameters*, but your **Compute Bound** is dictated only by the much smaller *Active Parameters*. The physical tradeoff is a massive network bandwidth tax (All-to-All communication) to route tokens to the right experts across the cluster.
**📚 Source:** @shazeer2017outrageously
:::
To model MoE, we move from 3D to **4D Parallelism**:
$$
\text{Data Parallelism} = \frac{\text{Total GPUs}}{TP \times PP \times EP}
$$
Where $EP$ is Expert Parallelism. If $EP > 1$, the solver adds an All-to-All communication penalty for token routing:
$$
T_{\text{all-to-all}} = \frac{N-1}{N} \times \frac{\text{Message Size}}{\text{Bandwidth}} + (N-1) \times \text{Latency}
$$
---
## 3. LLM Serving Lifecycle
*Implemented in [`mlsysim.core.solver.ServingSolver`](api/core.solver.ServingSolver.qmd).*
LLM autoregressive inference has two physically distinct phases. Understanding which phase
dominates is critical for capacity planning.
### 3.1 Pre-fill Phase (Compute-Bound)
The initial forward pass over the full prompt is compute-bound because all tokens are processed in parallel:
$$
\text{TTFT} = \frac{2 \times \text{Parameters} \times \text{Seq\_Len} \times \text{Batch}}{\text{Peak\_FLOPs} \times \eta} + \text{Dispatch\_Tax}
$$
The factor of 2 counts both the multiply and the add in each multiply-accumulate (MAC) operation.
### 3.2 Decoding Phase (Memory-Bound)
Each token decode step requires loading the entire model weight matrix plus the accumulated KV-cache:
$$
\text{ITL} = \frac{\text{Model\_Bytes} + \text{KV\_Cache\_Bytes}}{\text{Memory\_BW}}
$$
This phase is almost always **memory-bound** on current hardware: each decode step must stream the entire model (plus KV-cache) from memory while performing only matrix–vector products — roughly 1 FLOP per byte loaded at fp16, far below the ridge point of modern accelerators.
### 3.3 KV-Cache Size
**Simplified form** (contiguous allocation):
$$
\text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times S \times B \times b
$$
where $L$ is the number of layers, $H_{\text{kv}}$ is the number of KV heads (equals $H$ for MHA, less for GQA), $d$ is the head dimension, $S$ is the sequence length, $B$ is the batch size, and $b$ is the bytes per parameter.
**PagedAttention form** (paged allocation, as in vLLM):
$$
\text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b
$$
where $p$ is the page size in tokens. When $p = S$ (no paging), this reduces to the simplified form.
The factor of 2 counts both the K and V matrices. At fp16 (2 bytes/param), a 70B model with
a 4096-token context at batch=32 requires approximately **540 GB** of KV-cache — more than
a single H100 node can hold.
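Both KV-cache forms fit in one function. The sketch below uses illustrative names, and the 70B-like shape (80 layers, head_dim 128) is an assumption — exact totals depend on the architecture, particularly the number of KV heads:

```python
import math

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch,
                   bytes_per_param=2, page_size=None):
    """KV-cache footprint; pass page_size for the PagedAttention (paged) form."""
    tokens = seq_len if page_size is None else math.ceil(seq_len / page_size) * page_size
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_param

# Illustrative 70B-like shape: 80 layers, head_dim 128, 4096 context, batch 32, fp16
print(kv_cache_bytes(80, 64, 128, 4096, 32) / 1e9)  # full MHA (64 KV heads): ≈ 344 GB
print(kv_cache_bytes(80, 8, 128, 4096, 32) / 1e9)   # GQA (8 KV heads): ≈ 43 GB
```

The 8× gap between the two calls is exactly the $H_{\text{kv}}$ ratio — why GQA is now standard for serving.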
---
## 4. The Data Wall
*Implemented in [`mlsysim.core.solver.DataSolver`](api/core.solver.DataSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Data Ingestion Bottlenecks**
The "Data Wall" occurs when the storage or IO interconnect cannot supply data to the compute units fast enough to keep them saturated. This is common in high-throughput training (e.g., video data for autonomous driving) or when training from slow remote storage.
**📚 Source:** @mlsysbook2025, Chapter 4 (Data Engineering).
:::
The data supply bandwidth is limited by the minimum bandwidth across the storage hierarchy (SSD/HDD) and the IO interconnect (PCIe):
$$
\text{Supply\_BW} = \min(\text{Storage\_BW}, \text{IO\_Interconnect\_BW})
$$
The system is stalled if the required data rate exceeds the supply bandwidth:
$$
\text{Utilization} = \frac{\text{Required\_Data\_Rate}}{\text{Supply\_BW}}
$$
If $\text{Utilization} > 1.0$, the compute units will stall waiting for data, reducing effective throughput regardless of the model's arithmetic intensity.
---
## 5. Scaling Physics (Chinchilla Laws)
*Implemented in [`mlsysim.core.solver.ScalingSolver`](api/core.solver.ScalingSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Compute-Optimal Training**
Scaling laws describe how model performance improves with more compute ($C$), parameters ($P$), and data ($D$). The Chinchilla study found that most models are "undertrained" and that for a fixed compute budget, $P$ and $D$ should be scaled in equal proportions.
**📚 Source:** @hoffmann2022chinchilla
:::
The total floating-point operations required for training a Transformer is approximately:
$$
C \approx 6 \times P \times D
$$
The compute-optimal point (Chinchilla point) occurs when the number of training tokens is roughly 20 times the number of parameters:
$$
D \approx 20 \times P
$$
Given a compute budget $C$ (in FLOPs), the optimal parameter count is:
$$
P_{\text{opt}} = \sqrt{\frac{C}{120}}
$$
---
## 6. Cluster Orchestration (Little's Law)
*Implemented in [`mlsysim.core.solver.OrchestrationSolver`](api/core.solver.OrchestrationSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: The Wait Wall**
In a shared cluster, the time a researcher waits for a job to start is determined by the arrival rate of new jobs ($\lambda$) and the processing rate of the cluster ($\mu$). As utilization approaches 100%, wait times explode non-linearly.
**📚 Source:** @little1961proof
:::
Cluster utilization $\rho$ is the ratio of the arrival rate to the service rate:
$$
\rho = \frac{\lambda}{\mu}
$$
Using an M/D/1 queue model (Poisson arrivals, fixed job durations), the average wait time in the queue is:
$$
T_{\text{wait}} = \frac{\rho}{2\mu(1 - \rho)}
$$
As $\rho \to 1.0$, $T_{\text{wait}} \to \infty$. This illustrates why maintaining some "headroom" in cluster capacity is essential for researcher productivity.
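The non-linear explosion is easy to see numerically. A sketch (illustrative name, not the `OrchestrationSolver` API):

```python
def md1_wait_time(arrival_rate, service_rate):
    """Mean queue wait for an M/D/1 queue (Poisson arrivals, deterministic service)."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        return float("inf")  # queue grows without bound
    return rho / (2 * service_rate * (1 - rho))

# 1-hour jobs (mu = 1/hr): wait explodes as utilization approaches 1
for lam in (0.5, 0.9, 0.99):
    print(f"rho={lam:.2f}: wait = {md1_wait_time(lam, 1.0):.1f} h")
```

Going from 50% to 99% utilization multiplies the average wait by ~100×, not 2×.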
---
## 7. Model Compression (The Compression Tax)
*Implemented in [`mlsysim.core.solver.CompressionSolver`](api/core.solver.CompressionSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Accuracy vs. Efficiency**
Compressing a model via quantization or pruning reduces its memory footprint and increases inference speed, but often at the cost of task accuracy. This "Compression Tax" must be evaluated to ensure the model remains functional for its intended use.
**📚 Source:** @han2015deep
:::
The **Compression Ratio** ($R$) is the ratio of the original model size to the compressed size:
$$
R = \frac{\text{Size}_{\text{original}}}{\text{Size}_{\text{compressed}}}
$$
For **Quantization**, $R$ is determined by the bit-width ($b$):
$$
R_{\text{quant}} = \frac{32}{b}
$$
For **Pruning**, $R$ is determined by the sparsity ratio ($s$):
$$
R_{\text{pruning}} = \frac{1}{1 - s}
$$
The estimated accuracy impact ($\Delta A$) is non-linear and typically follows empirical heuristics derived from benchmark studies (e.g., INT8 quantization often yields <1% drop, while 4-bit quantization can yield 2–5% drop).
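The two ratio formulas in code form (illustrative names, not the `CompressionSolver` API):

```python
def quantization_ratio(bit_width):
    """R for quantizing fp32 weights down to bit_width bits."""
    return 32 / bit_width

def pruning_ratio(sparsity):
    """R for zeroing out a fraction `sparsity` of the weights."""
    return 1.0 / (1.0 - sparsity)

# INT8 quantization and 75% sparsity each give a 4x compression ratio
print(quantization_ratio(8))   # → 4.0
print(pruning_ratio(0.75))     # → 4.0
```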
---
## 8. Datacenter Sustainability
*Implemented in [`mlsysim.core.solver.SustainabilitySolver`](api/core.solver.SustainabilitySolver.qmd).*
### 8.1 Total Energy
$$
E = \text{IT\_Power} \times \text{Hours} \times \text{PUE}
$$
Power Usage Effectiveness (PUE) accounts for cooling and facility overhead. A PUE of 1.0 is
theoretical perfect efficiency; hyperscale datacenters typically achieve 1.1–1.4.
### 8.2 Carbon Footprint
$$
C = E \times \text{Carbon\_Intensity}
$$
Where $C$ is in $\text{kg CO}_2\text{e}$ and $\text{Carbon\_Intensity}$ is in $\text{g CO}_2\text{e/kWh}$,
sourced from IEA regional grid data. This value varies from ~20 g/kWh (Quebec hydro) to
~820 g/kWh (Poland coal)—a **~41× difference** for identical ML workloads.
---
## 9. Total Cost of Ownership (TCO)
*Implemented in [`mlsysim.core.solver.EconomicsSolver`](api/core.solver.EconomicsSolver.qmd).*
$$
\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}_{\text{power}} + \text{OpEx}_{\text{networking}} + \text{OpEx}_{\text{labor}}
$$
Where:
- $\text{CapEx}_{\text{amortized}} = \text{Hardware\_Cost} / \text{Depreciation\_Years}$
- $\text{OpEx}_{\text{power}} = E \times \text{Electricity\_Rate}$
---
## 10. Cluster Reliability (The Young-Daly Model)
*Implemented in [`mlsysim.core.solver.ReliabilitySolver`](api/core.solver.ReliabilitySolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: The Cost of Checkpointing**
When training massive models on thousands of GPUs for months, hardware failures are not a possibility; they are a statistical certainty. If a node fails, the job crashes and you lose all progress since the last checkpoint. You want to save checkpoints frequently to minimize lost work, but writing a 140GB checkpoint to remote storage takes time, pausing the training. The Young-Daly model calculates the optimal balance between *time wasted saving checkpoints* and *time wasted re-computing after a failure*.
**📚 Source:** @young1974first and @daly2006higher
:::
The optimal checkpoint interval $\tau_{\text{opt}}$ is defined by the Mean Time Between Failures ($M$) and the time it takes to write a single checkpoint ($\delta$):
$$
\tau_{\text{opt}} = \sqrt{2 \times \delta \times M}
$$
For a cluster, the collective $M$ is inversely proportional to the number of components. If a single node has an MTBF of 10,000 hours, a cluster of 1,000 nodes has an MTBF of just 10 hours ($10{,}000 / 1{,}000$).
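Putting the two facts together (illustrative names, not the `ReliabilitySolver` API):

```python
import math

def cluster_mtbf(node_mtbf_s, n_nodes):
    """Cluster MTBF shrinks in proportion to component count."""
    return node_mtbf_s / n_nodes

def young_daly_interval(checkpoint_write_s, mtbf_s):
    """Optimal seconds between checkpoints: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2 * checkpoint_write_s * mtbf_s)

# 1,000 nodes at 10,000 h each -> 10 h cluster MTBF; 5-minute checkpoint write
m = cluster_mtbf(10_000 * 3600, 1000)   # 36,000 s
tau = young_daly_interval(300, m)       # ≈ 4648 s: checkpoint roughly every 77 min
```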
---
## 11. Tail Latency (Erlang-C Queueing)
*Implemented in [`mlsysim.core.solver.TailLatencySolver`](api/core.solver.TailLatencySolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Why P99 Matters More Than Median**
Deploying inference at scale requires not just median performance but P99 tail latency guarantees. As server utilization increases from 80% to 95%, tail latency doesn't increase linearly — it explodes. This non-linearity surprises engineers accustomed to thinking of "high utilization" as desirable.
**📚 Source:** @dean2013tail
:::
For an inference server farm modeled as an M/M/$c$ queue with $c$ replicas and per-server utilization $\rho = \lambda / (c\mu)$:
$$
\mathbb{P}[\text{wait}] = \frac{(c\rho)^c / c! \cdot (1-\rho)^{-1}}{\sum_{k=0}^{c-1}(c\rho)^k/k! + (c\rho)^c/c! \cdot (1-\rho)^{-1}}
$$
P99 latency grows non-linearly as $\rho \to 1$. The TailLatencySolver computes P50 and P99 wait times and SLO violation probability as a function of arrival rate, service latency, and replica count.
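The Erlang-C expression is mechanical to evaluate. A minimal sketch with illustrative names (not the `TailLatencySolver` API); note the exponential waiting-time tail used for the P99 estimate is the standard M/M/$c$ result, an assumption beyond the formula displayed above:

```python
import math

def erlang_c_wait_prob(c, rho):
    """P[wait] for an M/M/c queue at per-server utilization rho = lambda / (c * mu)."""
    a = c * rho  # offered load in Erlangs
    top = a**c / math.factorial(c) / (1 - rho)
    bottom = sum(a**k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def p99_wait(c, rho, mu):
    """99th-percentile wait via the M/M/c tail P[W > t] = P[wait] * exp(-c*mu*(1-rho)*t)."""
    pw = erlang_c_wait_prob(c, rho)
    if pw <= 0.01:
        return 0.0  # 99% of requests do not wait at all
    return math.log(pw / 0.01) / (c * mu * (1 - rho))

# Utilization 0.80 vs 0.95 on 4 replicas: P[wait] jumps, P99 wait explodes
for rho in (0.80, 0.95):
    print(rho, erlang_c_wait_prob(4, rho), p99_wait(4, rho, mu=10.0))
```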
---
## 12. Weight Streaming (Wafer-Scale Inference)
*Implemented in [`mlsysim.core.solver.WeightStreamingSolver`](api/core.solver.WeightStreamingSolver.qmd).*
::: {.callout-note appearance="simple" icon=false}
**💡 Intuition: Inverting the Memory Wall**
Wafer-scale architectures (e.g., Cerebras CS-3) invert the GPU memory wall: activations reside on-wafer in SRAM while model weights stream from external MemoryX nodes. The bottleneck shifts from HBM bandwidth to injection interconnect bandwidth.
:::
Per-layer execution time is:
$$
T_{\text{layer}} = \max\!\left(\frac{|W_{\ell}|}{BW_{\text{inject}}},\; \frac{2|W_{\ell}| \times B}{\text{Peak} \times \eta}\right)
$$
Setting the two terms equal gives the optimal batch size, at which injection and compute perfectly overlap:
$$
B^{*} = \frac{\text{Peak} \times \eta}{2 \times BW_{\text{inject}}}
$$
---
## 13. Responsible Engineering (DP-SGD Overhead)
*Implemented in [`mlsysim.core.solver.ResponsibleEngineeringSolver`](api/core.solver.ResponsibleEngineeringSolver.qmd).*
Privacy and fairness guarantees impose measurable computational overhead. Differential privacy via DP-SGD requires per-sample gradient clipping and calibrated noise addition:
$$
\sigma \propto \frac{1}{\epsilon}
$$
The per-step clipping and noise addition incur a training slowdown of approximately 2–10×. Fairness constraints require sufficient representation of minority subgroups, demanding additional data proportional to $O(1/p_{\min})$ where $p_{\min}$ is the smallest subgroup prevalence.
---
## 14. Sensitivity Analysis (Binding Constraints)
*Implemented in [`mlsysim.core.solver.SensitivitySolver`](api/core.solver.SensitivitySolver.qmd).*
The SensitivitySolver identifies the binding constraint by computing partial derivatives of end-to-end latency with respect to each hardware parameter:
$$
\frac{\partial T}{\partial BW_{\text{mem}}}, \quad \frac{\partial T}{\partial \text{Peak}_{\text{FLOPS}}}, \quad \frac{\partial T}{\partial BW_{\text{net}}}, \quad \ldots
$$
The parameter with the largest (most negative) sensitivity is the binding constraint — the single upgrade that would yield the greatest performance improvement. This transforms "where should I invest?" from intuition into calculation.
---
## 15. Inverse Roofline Synthesis
*Implemented in [`mlsysim.core.solver.SynthesisSolver`](api/core.solver.SynthesisSolver.qmd).*
The SynthesisSolver inverts the analysis: given a workload and a service-level agreement, it derives the minimum hardware specifications required:
$$
BW_{\text{required}} = \frac{|W|}{T_{\text{target}}}, \qquad \text{FLOPS}_{\text{required}} = \frac{\text{OPs}}{T_{\text{target}} \times \eta}
$$
This enables hardware–software co-design: rather than asking "how fast is my system?" the practitioner asks "what system do I need?"
---
## 16. Software Efficiency (Wall 3: MFU Decomposition)
*Implemented in [`mlsysim.core.solver.EfficiencySolver`](api/core.solver.EfficiencySolver.qmd).*
The gap between peak and achieved FLOP/s is captured by the utilization parameter $\eta$:
$$
\eta = \frac{\text{Achieved\_FLOP/s}}{\text{Peak\_FLOP/s}}
$$
In practice, $\eta$ decomposes into multiplicative factors:
$$
\eta = \eta_{\text{occupancy}} \times \eta_{\text{fusion}} \times \eta_{\text{precision}} \times \eta_{\text{memory}}
$$
where occupancy captures warp-level parallelism, fusion captures kernel launch elimination, precision captures tensor core utilization, and memory captures data reuse. Well-optimized training (Megatron-LM, DeepSpeed) achieves $\eta \approx 0.35\text{–}0.55$; unoptimized inference may drop below 0.10.
---
## 17. Continuous Batching (Wall 5: PagedAttention)
*Implemented in [`mlsysim.core.solver.ContinuousBatchingSolver`](api/core.solver.ContinuousBatchingSolver.qmd).*
Static batching wastes memory by pre-allocating contiguous KV-cache for the maximum sequence length. PagedAttention allocates KV-cache in fixed-size pages:
$$
\text{KV\_paged} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b
$$
The memory savings come from eliminating internal fragmentation. The effective batch size $B_{\text{eff}}$ under a memory budget $M_{\text{budget}}$ is:
$$
B_{\text{eff}} = \left\lfloor \frac{M_{\text{budget}} - |W|}{2 \times L \times H_{\text{kv}} \times d \times S \times b} \right\rfloor
$$
where $|W|$ is the model weight footprint. Continuous batching dynamically inserts and removes requests as they complete, keeping $B_{\text{eff}}$ near maximum at all times.
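The effective-batch formula in code (illustrative names, not the `ContinuousBatchingSolver` API):

```python
def effective_batch_size(mem_budget_bytes, weight_bytes,
                         layers, kv_heads, head_dim, seq_len, bytes_per_param=2):
    """Largest batch whose KV-cache fits in memory alongside the weights."""
    per_seq_kv = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param
    return max(0, (mem_budget_bytes - weight_bytes) // per_seq_kv)

# 80 GB budget, 14 GB of fp16 weights (7B-like), 32 layers, 32 KV heads, 4096 context
b_eff = effective_batch_size(80_000_000_000, 14_000_000_000, 32, 32, 128, 4096)
```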
---
## 18. Data Transformation (Wall 9: CPU Preprocessing)
*Implemented in [`mlsysim.core.solver.TransformationSolver`](api/core.solver.TransformationSolver.qmd).*
CPU preprocessing (decode, resize, tokenize, augment) must keep pace with GPU consumption:
$$
T_{\text{transform}} = \frac{B \times S}{C_{\text{throughput}} \times W}
$$
where $B$ is the batch size, $S$ is the per-sample processing cost (e.g., JPEG decode + resize), $C_{\text{throughput}}$ is the single-core throughput, and $W$ is the number of CPU workers. The pipeline is stalled when:
$$
T_{\text{transform}} > T_{\text{compute}}
$$
Common mitigations: increase $W$ (more workers), cache preprocessed data, or use GPU-accelerated decoding (DALI).
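The stall condition can be inverted to size the worker pool. A sketch with illustrative names and units (per-sample cost in work units, per-core throughput in units/s); not the `TransformationSolver` API:

```python
import math

def transform_time(batch, per_sample_cost, core_throughput, n_workers):
    """Time for the CPU pool to prepare one batch: T = B * S / (C * W)."""
    return batch * per_sample_cost / (core_throughput * n_workers)

def workers_needed(batch, per_sample_cost, core_throughput, t_compute):
    """Smallest worker count that keeps the GPU fed (T_transform <= T_compute)."""
    return math.ceil(batch * per_sample_cost / (core_throughput * t_compute))
```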
---
## 19. Network Topology (Wall 10: Bisection Bandwidth)
*Implemented in [`mlsysim.core.solver.TopologySolver`](api/core.solver.TopologySolver.qmd).*
The effective bandwidth available for collective communication depends on the network topology and its oversubscription ratio:
$$
\text{BW}_{\text{eff}} = \frac{\text{BW}_{\text{link}} \times \beta}{\text{oversubscription}}
$$
where $\beta$ is the bisection bandwidth ratio (1.0 for full bisection in fat-tree topologies, <1.0 for oversubscribed networks). For a $k$-ary fat-tree with oversubscription ratio $r$:
$$
\text{BW}_{\text{bisection}} = \frac{k^2}{4r} \times \text{BW}_{\text{link}}
$$
Fat-tree topologies (Leiserson, 1985) provide full bisection bandwidth ($r=1$), making them ideal for AllReduce-heavy workloads. Dragonfly and torus topologies trade bisection bandwidth for lower cost at scale.
---
## 20. Inference-Time Scaling (Wall 12: Reasoning Compute)
*Implemented in [`mlsysim.core.solver.InferenceScalingSolver`](api/core.solver.InferenceScalingSolver.qmd).*
Chain-of-thought, tree search, and other inference-time scaling strategies multiply the compute cost by the number of reasoning steps $K$:
$$
T_{\text{reasoning}} = K \times T_{\text{step}}
$$
where $T_{\text{step}}$ is the latency of a single forward pass (from the roofline model). For best-of-$N$ sampling:
$$
T_{\text{best-of-N}} = N \times T_{\text{generate}} + T_{\text{verify}}
$$
The key insight from Snell et al. (2024) is that inference-time compute can substitute for training-time compute: a smaller model with $K$ reasoning steps may match a larger model with a single pass, at the cost of $K\times$ inference latency.
---
## 21. Checkpoint I/O (Wall 19: MFU Penalty)
*Implemented in [`mlsysim.core.solver.CheckpointSolver`](api/core.solver.CheckpointSolver.qmd).*
Periodic checkpointing imposes an I/O burst that pauses training:
$$
\text{MFU\_penalty} = \frac{T_{\text{write}}}{T_{\text{interval}}}
$$
where $T_{\text{write}}$ is the time to write one checkpoint (model weights + optimizer state) and $T_{\text{interval}}$ is the checkpoint interval. For Adam optimizer with fp16 weights:
$$
\text{Checkpoint\_Size} = |W| \times (1 + 2 + 2 + 2) = 7|W|
$$
The factor of 7 is relative to the fp16 weight footprint $|W|$: fp16 weights (1×), fp32 master weights (2×), and fp32 momentum and variance (2× each). The optimal checkpoint interval from the Young-Daly formula (Section 10) balances checkpoint overhead against expected rework from failures.
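Combining the penalty with the Young-Daly interval from Section 10 (illustrative names, not the `CheckpointSolver` API; checkpoint size is a parameter so any weight/optimizer layout can be plugged in):

```python
import math

def mfu_penalty(checkpoint_bytes, storage_bw, interval_s):
    """Fraction of training time lost to checkpoint writes."""
    return (checkpoint_bytes / storage_bw) / interval_s

def young_daly_penalty(checkpoint_bytes, storage_bw, mtbf_s):
    """MFU penalty at the Young-Daly optimal interval tau = sqrt(2 * delta * M)."""
    delta = checkpoint_bytes / storage_bw
    tau = math.sqrt(2 * delta * mtbf_s)
    return mfu_penalty(checkpoint_bytes, storage_bw, tau)

# 100 GB checkpoint over a 1 GB/s link (delta = 100 s), ~5.6 h cluster MTBF
print(young_daly_penalty(100e9, 1e9, 20_000))  # → 0.05, i.e. 5% of MFU lost to I/O
```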
---
::: {.callout-note}
## Limitations of First-Order Models
These equations are first-order analytical models. They assume:
(1) uniform memory access patterns, (2) no cache effects, (3) no network contention under
heavy load, and (4) linear scaling of throughput with batch size.
Real systems deviate from these assumptions. MLSys·im predictions are typically accurate
within ±20% of measured hardware performance — sufficient for systems intuition and
capacity planning, but not a substitute for empirical profiling.
For a detailed discussion of accuracy, limitations, and when to use MLSys·im vs. empirical profiling, see [Accuracy & Validation](accuracy.qmd).
:::