---
quiz: serving_quizzes.json
concepts: serving_concepts.yml
glossary: serving_glossary.json
engine: jupyter
---
# Model Serving {#sec-model-serving}
::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::
\noindent
:::
## Purpose {.unnumbered}
\begin{marginfigure}
\mlsysstack{35}{30}{25}{15}{90}{40}{20}{20}
\end{marginfigure}
_Why does serving invert every optimization priority that made training successful?_
Training and serving demand opposite physics. Training maximizes throughput (\(T\), in samples per second): large batches and long epochs where latency spikes get absorbed invisibly. Serving minimizes latency (\(L_{\text{lat}}\), in milliseconds per request): individual requests answered fast enough that a single slow response is a *broken product*. Training amortizes hardware costs across billions of examples; serving pays a tax on every request, where small inefficiencies compound into operational debt. This inversion is why models that train beautifully often serve poorly: the batch-heavy architectures and memory-intensive optimizations designed to saturate accelerators during training are fundamentally ill-suited for the bursty, latency-critical, cost-sensitive reality of production traffic.

But serving is more than a latency problem. A serving system must handle traffic that varies by orders of magnitude between peak and trough, route requests across model versions during progressive rollouts, degrade gracefully when upstream dependencies fail, and do all of this continuously—not for the duration of a training run but for the lifetime of the product.

Every model that proved its value during training and survived compression and benchmarking eventually arrives at the serving layer—the deployment and integration stage of the ML lifecycle—where the question shifts from "does it work?" to "does it work *reliably, at scale, under production conditions, every second of every day*?" The serving infrastructure is where ML systems finally meet users, and the engineering that sustains that meeting is qualitatively different from the engineering that created the model.
::: {.content-visible when-format="pdf"}
\newpage
:::
::: {.callout-tip title="Learning Objectives"}
- Explain the inversion from throughput optimization to latency minimization that distinguishes serving from training
- Decompose request latency into preprocessing, inference, and postprocessing phases to identify bottlenecks using the **latency budget framework**
- Apply **queuing theory** (**Little's Law**, **M/M/1 models**) and capacity planning to meet percentile latency SLOs
- Identify sources of **training-serving skew** and **cold start latency**, and select appropriate prevention and mitigation strategies
- Select batching and runtime strategies based on traffic patterns (**Server**, **SingleStream**, **MultiStream**, **Offline**), latency constraints, and cost requirements
- Evaluate the memory-bandwidth and **KV-cache** constraints unique to LLM serving, including **TTFT**/**TPOT** metrics, **continuous batching**, and **PagedAttention**
- Evaluate deployment tradeoffs across precision, runtime selection, and infrastructure cost
:::
## Serving Paradigm {#sec-model-serving-serving-paradigm-9634}
Serving\index{Serving!production deployment}\index{Model Serving!paradigm shift} marks the transition from model development to production deployment. The four deployment paradigms introduced in @sec-ml-systems (Cloud, Edge, Mobile, and TinyML) each impose distinct serving challenges, but all share a common inversion: the throughput-to-latency shift introduced in the Purpose. This inversion has concrete engineering implications that ripple through every technique established in prior chapters. The Iron Law of ML Systems (@sec-introduction-iron-law-ml-systems-c32a) undergoes a decisive shift: the latency term\index{Latency!serving constraint} ($L_{\text{lat}}$), representing the irreducible overhead of request scheduling, network round-trips, and system orchestration, becomes the dominant constraint rather than a rounding error. @sec-benchmarking measured performance under controlled conditions, but serving faces traffic patterns that no benchmark could anticipate; @sec-model-compression provided quantization methods that reduced model size, but serving must confirm those optimizations preserve accuracy under real traffic distributions. These revalidations define the *serving inversion*\index{Serving Inversion!throughput to latency}.
::: {.callout-perspective title="The Serving Inversion"}
Applying the **D·A·M taxonomy** reveals how deployment inverts the engineering priorities:
* **Data (Information)**: In training, the goal is **Volume** (shuffling billions of samples). In serving, the goal is **Freshness** (processing one request *right now*).
* **Algorithm (Logic)**: In training, the math is **Mutable** (updating weights via backprop). In serving, the math is **Frozen** (fixed weights, forward pass only).
* **Machine (Physics)**: In training, the goal is **Utilization** (keeping GPUs at 100% to saturate throughput). In serving, the goal is **Headroom** (keeping GPUs at 40--60% to absorb traffic spikes before tail latency explodes; observe the exponential rise in @fig-tail-latency-explosion).
:::
::: {#fig-tail-latency-explosion fig-env="figure" fig-pos="htb" fig-cap="**The Tail Latency Explosion**: Request Latency vs. System Utilization ($\\rho$). While mean latency (Blue) remains moderate, tail latency (Red, p99) explodes once utilization passes the 'Knee' at ~70%. This uses a simple M/M/1 approximation (p99 ≈ 4.6× mean), so the curve is illustrative rather than workload-specific." fig-alt="Line plot showing latency growing with utilization. Blue line (Mean) rises gradually then steeply. Red line (Tail, p99) curves upward sharply at 70% utilization. Shaded regions indicate 'Safe Zone' and 'Danger Zone'."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ TAIL LATENCY EXPLOSION FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-tail-latency-explosion — illustrates why tail latency diverges
# │ from mean latency as system utilization approaches 100%
# │
# │ Goal: Visually demonstrate the nonlinear explosion of p99 latency near
# │ saturation using an M/M/1 queueing model approximation.
# │ Show: Mean vs. p99 latency curves, with Safe Zone and Danger Zone annotations.
# │ How: Compute mean = 1/(1-ρ) and p99 ≈ 4.6× mean; plot on shared axes.
# │
# │ Imports: numpy (np), mlsysim.core.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsysim import viz
fig, ax, COLORS, plt = viz.setup_plot()
# =============================================================================
# PLOT: The Tail Latency Explosion
# =============================================================================
utilization = np.linspace(0, 0.95, 100)
mean_latency = 1 / (1 - utilization)
p99_latency = mean_latency * 4.6
ax.plot(utilization, mean_latency, '--', color=COLORS['BlueLine'], label='Mean Latency', linewidth=2)
ax.plot(utilization, p99_latency, '-', color=COLORS['RedLine'], label='Tail Latency (p99)', linewidth=2.5)
ax.set_xlabel('System Utilization (%)')
ax.set_ylabel('Request Latency (normalized to service time)')
ax.set_xlim(0, 1.0)
ax.set_ylim(0, 50)
ax.axvspan(0, 0.5, color=COLORS['GreenL'], alpha=0.2)
ax.text(0.25, 5, "Safe Zone", color=COLORS['GreenLine'], fontweight='bold', ha='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.axvspan(0.7, 1.0, color=COLORS['RedL'], alpha=0.2)
ax.text(0.85, 40, "Danger Zone\n(Queue Explosion)", color=COLORS['RedLine'], fontweight='bold', ha='center', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.annotate("The Knee", xy=(0.7, 15), xytext=(0.5, 25),
arrowprops=dict(facecolor=COLORS['primary'], arrowstyle='->', lw=1.5), fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_xticklabels(['0%', '20%', '40%', '60%', '80%', '100%'])
ax.legend(loc='upper left', fontsize=8)
plt.show()
```
:::
The consequences of ignoring this inversion become apparent during a *traffic spike* that pushes the system beyond what it was designed to handle.
```{python}
#| label: black-friday-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BLACK FRIDAY TRAFFIC SPIKE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The 'Black Friday' Traffic Spike"
# │
# │ Goal: Demonstrate the nonlinear failure mode of serving systems under load.
# │ Show: That a 10× traffic spike causes system collapse, not just 10× latency.
# │ How: Model queue explosion using Little's Law parameters.
# │
# │ Imports: (none)
# │ Exports: bf_latency_ms_str, bf_qps_normal_str, bf_qps_spike_str,
# │ bf_spike_factor_str, bf_collapse_latency_s_str
# └─────────────────────────────────────────────────────────────────────────────
class BlackFridayCalc:
"""Models the nonlinear failure mode of serving systems under a 10× traffic spike."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
bf_latency_ms_value = 50 # normal operation latency (ms)
bf_qps_normal_value = 1000 # normal queries per second
bf_qps_spike_value = 10000 # Black Friday peak QPS
bf_spike_factor_value = 10 # spike multiplier (10x)
bf_collapse_latency_s_value = 10 # latency during collapse (seconds)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
bf_latency_ms_str = f"{bf_latency_ms_value}" # e.g. "50" ms
bf_qps_normal_str = f"{bf_qps_normal_value:,}" # e.g. "1,000" QPS
bf_qps_spike_str = f"{bf_qps_spike_value:,}" # e.g. "10,000" QPS
bf_spike_factor_str = f"{bf_spike_factor_value}" # e.g. "10" x
bf_collapse_latency_s_str = f"{bf_collapse_latency_s_value}" # e.g. "10" seconds
```
::: {.callout-example title="The 'Black Friday' Traffic Spike"}
**The Scenario**: An e-commerce recommendation system runs comfortably at `{python} BlackFridayCalc.bf_latency_ms_str` ms latency with `{python} BlackFridayCalc.bf_qps_normal_str` queries per second (QPS).
**The Event**: On Black Friday, traffic spikes `{python} BlackFridayCalc.bf_spike_factor_str`$\times$ to `{python} BlackFridayCalc.bf_qps_spike_str` QPS.
**The Failure**: The system does not slow down `{python} BlackFridayCalc.bf_spike_factor_str`$\times$. It **collapses**. Latency hits `{python} BlackFridayCalc.bf_collapse_latency_s_str` seconds, then requests start timing out. The servers are 100% utilized, but *useful* throughput drops to near zero because most completed requests have already timed out from the client's perspective.
**The Physics**: This is Little's Law and queueing theory in action. As utilization approaches 100%, queue lengths grow exponentially, not linearly. The system spends more time managing the queue (context switching, thrashing) than doing useful work.
**The Fix**:
1. **Load Shedding**: Reject excess requests immediately to keep the queue short.
2. **Autoscaling**: Spin up more replicas *before* utilization hits the "knee" of the curve.
3. **Degradation**: Serve cached/dumber recommendations to reduce compute cost per query.
:::
Trace the curve in @fig-tail-latency-explosion and notice how latency remains manageable until utilization crosses roughly 70%, then explodes—this is *why* production systems must run at relatively low utilization (40--60%) to guarantee stable tail latency\index{Tail Latency!utilization threshold} (p99). For a mathematical treatment of long-tailed distributions and why P99 latency becomes the *median* user experience at scale, see @sec-data-foundations-distributions-long-tail-901f. The curve is a simple queueing approximation intended for intuition rather than a specific workload.
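To make the knee concrete, the sketch below evaluates the same M/M/1 approximation the figure uses, mean latency $\approx$ service time $/(1-\rho)$ and p99 $\approx 4.6\times$ the mean, for an assumed 10 ms service time. The numbers are illustrative rather than workload-specific.

```{.python}
# Illustrative M/M/1 sketch matching the figure's approximation:
# mean latency = service_time / (1 - rho), p99 ~ 4.6 x mean.
SERVICE_TIME_MS = 10.0  # assumed time to process one request with an empty queue

def mm1_latency(rho: float, service_ms: float = SERVICE_TIME_MS) -> tuple[float, float]:
    """Return (mean, p99) request latency in ms at utilization rho (0 <= rho < 1)."""
    mean = service_ms / (1.0 - rho)
    return mean, 4.6 * mean

for rho in (0.50, 0.70, 0.90, 0.95):
    mean, p99 = mm1_latency(rho)
    print(f"utilization {rho:.0%}: mean ~ {mean:6.1f} ms, p99 ~ {p99:6.1f} ms")
# utilization 50%: mean ~   20.0 ms, p99 ~   92.0 ms
# utilization 70%: mean ~   33.3 ms, p99 ~  153.3 ms
# utilization 90%: mean ~  100.0 ms, p99 ~  460.0 ms
# utilization 95%: mean ~  200.0 ms, p99 ~  920.0 ms
```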
Beyond the technical limits of latency, the economics of serving have undergone a radical transformation. As models become more efficient and hardware becomes more specialized, the cost of "intelligence" is collapsing[^fn-jevons-paradox]. To grasp the speed of this collapse, examine the log-scale price trajectory in @fig-intelligence-deflation, which tracks public API list prices as a market proxy.
[^fn-jevons-paradox]: **Jevons Paradox**: William Stanley Jevons observed in 1865 that efficiency improvements in coal-powered steam engines *increased* total coal consumption by making steam power economically viable for applications previously too costly. The same dynamic governs AI inference: each 10$\times$ cost reduction opens application classes that were economically infeasible at the previous price point, expanding aggregate demand by more than the efficiency gain. This is why cheaper inference reliably increases, not decreases, total GPU fleet demand — efficiency and demand are complements in AI, not substitutes. \index{Jevons Paradox!inference demand}
::: {#fig-intelligence-deflation fig-env="figure" fig-pos="htb" fig-cap="**Intelligence Deflation**: Cost per 1M output tokens (USD) over time (Log Scale). Prices are based on public API list prices (2020--2025) and are intended as a market trend indicator, not a controlled comparison. The cost of token generation has collapsed by multiple orders of magnitude, transforming the economics of automated AI workflows." fig-alt="Line plot showing token pricing collapsing from \$20/M tokens in 2020 to <\$0.10/M tokens in 2025. Log scale highlights the deflationary trend with models from OpenAI, Anthropic, Google, and DeepSeek."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ INTELLIGENCE DEFLATION FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-intelligence-deflation — log-scale scatter plot of public API
# │ token prices from 2020 to 2025, annotated with model names
# │
# │ Goal: Visualize the ~10× per-18-month collapse in inference token cost that
# │ is transforming AI serving economics.
# │ Show: Named model price points on a log scale with deflation trend line.
# │ How: Fit a log-linear trend through representative price points; scatter
# │ remaining models; annotate each with model name.
# │
# │ Imports: numpy (np), mlsysim.core.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsysim import viz
fig, ax, COLORS, plt = viz.setup_plot()
# =============================================================================
# DATA: Token pricing over time
# =============================================================================
data = [
(2020.5, 20.0, "GPT-3 (Davinci)"), (2023.1, 2.0, "GPT-3.5 Turbo"),
(2023.2, 30.0, "GPT-4 (Original)"), (2024.2, 15.0, "Claude 3 Opus"),
(2024.2, 0.25, "Claude 3 Haiku"), (2024.3, 5.0, "GPT-4o"),
(2024.4, 0.075, "Gemini 1.5 Flash"), (2024.6, 0.15, "GPT-4o-mini"),
(2024.9, 0.27, "DeepSeek-V3")
]
data.sort(key=lambda x: x[0])
years = np.array([d[0] for d in data])
prices = np.array([d[1] for d in data])
labels = [d[2] for d in data]
# =============================================================================
# PLOT: Intelligence Deflation
# =============================================================================
trend_years = np.array([2020.5, 2023.1, 2024.2, 2024.4, 2024.9])
trend_prices = np.array([20.0, 2.0, 0.25, 0.075, 0.27])
slope, intercept = np.polyfit(trend_years, np.log10(trend_prices), 1)
line_years = np.linspace(2020, 2025.5, 100)
line_prices = 10**(slope * line_years + intercept)
ax.plot(line_years, line_prices, '--', color=COLORS['grid'], linewidth=1.5, label='Deflation Trend', zorder=1)
ax.scatter(years, prices, color=COLORS['GreenLine'], s=50, zorder=3, edgecolors='white', linewidth=1.5)
for y, p, l in zip(years, prices, labels):
off_x, off_y, ha, va = 5, 5, 'left', 'bottom'
if "Haiku" in l: off_x, off_y, ha = -8, 8, 'right'
elif "Flash" in l: off_x, off_y, ha, va = -8, -15, 'right', 'top'
ax.annotate(l, (y, p), xytext=(off_x, off_y), textcoords='offset points',
fontsize=8, fontweight='bold', ha=ha, va=va, color=COLORS['primary'],
bbox=dict(facecolor='white', alpha=0.7, edgecolor='none', pad=1))
ax.set_yscale('log')
ax.set_yticks([100, 10, 1, 0.1, 0.01])
ax.set_yticklabels(['$100', '$10', '$1', '$0.10', '$0.01'])
ax.set_xlabel('Year')
ax.set_ylabel('Price per 1M Tokens (USD)')
ax.set_ylim(0.01, 500)
ax.set_xlim(2020, 2025.5)
ax.text(2021, 0.05, "Trend: ~10× Cheaper\nEvery 18 Months", color=COLORS['grid'], fontsize=9, style='italic', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
plt.show()
```
:::
These priorities motivate a formal definition of model serving.
::: {.callout-definition title="Model Serving"}
***Model Serving***\index{Model Serving!definition} is the operational phase that provides model predictions to end-users or downstream systems under strict latency constraints.
1. **Significance (Quantitative):** It inverts the throughput priority ($\eta$) of training into a **Latency Constraint ($L_{\text{lat}}$)**, requiring an architectural stack designed to minimize the **Tail Latency** (p99) of individual inferences.
2. **Distinction (Durable):** Unlike **Model Training**, which processes large, predictable batches of data, Model Serving must handle **Stochastic Request Patterns** and unpredictable load.
3. **Common Pitfall:** A frequent misconception is that serving is "just the forward pass." In reality, it is a **Distributed System Problem**: the model execution is only one component of a stack that includes request routing, load balancing, and data transformation.
:::
The SLO[^fn-slo-sla-serving] defines the latency target that shapes every architectural decision in the serving stack.
[^fn-slo-sla-serving]: **Service Level Objective (SLO) vs. Service Level Agreement (SLA)**: An SLO is an *internal* target (e.g., "p99 latency under 50 ms"); an SLA is an *external* contractual commitment with financial penalties for violation. SLOs are set tighter than SLAs to provide a safety margin. For ML serving, both model accuracy and inference latency contribute to SLOs, creating multi-dimensional optimization targets where improving one dimension (e.g., deploying a larger model for accuracy) can violate the other (latency). \index{SLO (Service Level Objective)!vs. SLA}
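To make the SLO concrete, percentile targets are evaluated over measured request latencies, and a single slow outlier can push the tail past the target even when the median looks healthy. The sketch below checks a p99 target against a small set of hypothetical latency samples.

```{.python}
import numpy as np

# Hypothetical per-request latencies (ms) from one monitoring window; one slow
# outlier (210 ms) dominates the tail even though the median is healthy.
latencies_ms = np.array([12, 14, 15, 15, 16, 18, 22, 35, 48, 210], dtype=float)

slo_p99_ms = 50.0  # internal target, e.g. "p99 latency under 50 ms"
p50, p99 = np.percentile(latencies_ms, [50, 99])

status = "met" if p99 <= slo_p99_ms else "violated"
print(f"p50 = {p50:.0f} ms, p99 = {p99:.0f} ms -> SLO {status} (target {slo_p99_ms:.0f} ms)")
# p50 = 17 ms, p99 = 195 ms -> SLO violated (target 50 ms)
```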
Serving systems must execute a complete inference pipeline under latency constraints, not just the neural network computation. A common misconception is that "inference time" equals "serving time," but the neural network is only one stage in a longer pipeline. Follow the stages in @fig-serving-inference-pipeline from left to right: raw inputs pass through preprocessing (traditional computing), neural network inference (deep learning), and postprocessing (traditional computing) before producing final outputs. Any of these stages can become the latency bottleneck. @sec-model-serving-latency-budget-ef40 quantifies exactly where time goes, revealing a counterintuitive result about which stages dominate.
::: {#fig-serving-inference-pipeline fig-env="figure" fig-pos="htb" fig-cap="**The Inference Pipeline**: ML serving systems transform raw inputs into final outputs through sequential stages: preprocessing, neural network computation, and postprocessing. The neural network represents just one component; preprocessing and postprocessing rely on traditional computing and often dominate total latency in optimized systems." fig-alt="Flow diagram showing six connected boxes: Raw Input, Preprocessing, Neural Network, Raw Output, Postprocessing, Final Output. Preprocessing and postprocessing are labeled Traditional Computing; neural network is labeled Deep Learning."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n},line width=0.75pt]
\tikzset{%
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=3pt,
node distance=0.6,
draw=GreenLine, line width=0.75pt,
fill=GreenL,
align=flush center,
minimum width=15mm,
minimum height=10mm
},
}
%
\node[Box](B1){Raw\\ Input};
\node[Box,right=of B1](B2){Pre-processing};
\node[Box,node distance=1, right=of B2,fill=BlueL,draw=BlueLine](B3){Neural\\ Network};
\node[Box,node distance=1, right=of B3,fill=VioletL2,draw=VioletLine2](B4){Raw\\ Output};
\node[Box,right=of B4,fill=VioletL2,draw=VioletLine2](B5){Post-processing};
\node[Box, right=of B5,fill=VioletL2,draw=VioletLine2](B6){Final\\ Output};
%
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)--(B3);
\draw[Line,-latex](B3)--(B4);
\draw[Line,-latex](B4)--(B5);
\draw[Line,-latex](B5)--(B6);
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=5mm,yshift=2mm,
fill=BackColor,fit=(B1)(B2),line width=0.75pt](BB){};
\node[below=3pt of BB.north,anchor=north]{Traditional Computing};
%
\scoped[on background layer]
\node[draw=OrangeLine,inner xsep=4mm,inner ysep=5mm,yshift=2mm,
fill=OrangeL!70!red!10,fit=(B3),line width=0.75pt](BB){};
\node[below=3pt of BB.north,anchor=north]{Deep Learning};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=5mm,yshift=2mm,
fill=BackColor,fit=(B4)(B6),line width=0.75pt](BB){};
\node[below=3pt of BB.north,anchor=north]{Traditional Computing};
\end{tikzpicture}
```
:::
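Instrumenting each stage separately is the first step toward a latency budget. The sketch below shows the pattern with placeholder stage functions; in a real system they would wrap image decoding, the framework's forward pass, and response formatting.

```{.python}
import time

def timed(stage_fn, x):
    """Run one pipeline stage and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = stage_fn(x)
    return result, (time.perf_counter() - start) * 1e3

# Placeholder stages: stand-ins for decode/resize/normalize, the model's
# forward pass, and label decoding plus response serialization, respectively.
def preprocess(raw):  return raw
def run_model(x):     return x
def postprocess(y):   return y

def handle_request(raw_input):
    x, t_pre    = timed(preprocess, raw_input)
    y, t_inf    = timed(run_model, x)
    out, t_post = timed(postprocess, y)
    # Log the per-stage budget for every request to find the true bottleneck.
    budget = {"preprocess_ms": t_pre, "inference_ms": t_inf, "postprocess_ms": t_post}
    return out, budget
```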
This chapter develops the engineering principles needed to orchestrate this pipeline under production constraints. It first establishes the system fundamentals: serving architectures, server anatomy, and the protocols connecting clients to models. It then traces the request lifecycle to reveal where latency accumulates, and turns to the optimization strategies that maximize throughput under these constraints.
### Static vs Dynamic Inference {#sec-model-serving-static-vs-dynamic-inference-e864}
The preceding examples explain *why* serving systems must maintain capacity headroom. However, before optimizing *how* to reduce inference latency, a prior question must be addressed: *when* should predictions be computed at all? The first architectural decision in any serving system is whether predictions happen before or during user requests [@google2024staticdynamic]. This choice shapes system design, cost structure, and capability boundaries.
#### Static Inference {#sec-model-serving-static-inference-35f4}
Static inference\index{Static Inference!pre-computed predictions} (also called offline or batch inference) pre-computes predictions for anticipated inputs and stores them for retrieval. Consider a recommendation system that generates predictions for all user-item pairs nightly. When a user requests recommendations, the system retrieves pre-computed results from a lookup table rather than running inference. This approach eliminates inference latency entirely since results already exist, enables quality verification before deployment, and reduces serving costs. However, static inference cannot handle novel inputs that were not anticipated during the batch computation and introduces hours or days of latency when models update.
#### Dynamic Inference {#sec-model-serving-dynamic-inference-d2d5}
Dynamic inference\index{Dynamic Inference!real-time prediction} (also called online or real-time inference) computes predictions on demand when requests arrive. This handles any input, including rare edge cases and novel combinations, and immediately reflects model updates. The cost is strict latency requirements that constrain model complexity and demand robust monitoring infrastructure.
```{python}
#| label: static-batch-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ STATIC VS DYNAMIC INFERENCE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Static vs Dynamic inference narrative (photo organization example)
# │
# │ Goal: Contrast the economics of static vs. dynamic inference.
# │ Show: That static pre-computation is superior for predictable inputs.
# │ How: Compare total batch time to per-request latency for photo classification.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: n_photos_str, inference_ms_str, batch_total_s_str,
# │ dynamic_latency_budget_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class StaticBatchCalc:
"""Contrasts static vs. dynamic inference economics for photo classification."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
n_photos_value = 10_000 # photos in user library
inference_ms_value = 5 # ResNet-50 inference time (ms)
dynamic_latency_budget_ms_value = 100 # real-time latency budget (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
batch_total_s_value = n_photos_value * inference_ms_value / 1000
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
n_photos_str = f"{n_photos_value:,}" # e.g. "10,000" photos
inference_ms_str = f"{inference_ms_value}" # e.g. "5" ms
batch_total_s_str = fmt(batch_total_s_value, precision=0, commas=False)# e.g. "50" seconds
dynamic_latency_budget_ms_str = f"{dynamic_latency_budget_ms_value}" # e.g. "100" ms
```
For our ResNet-50 image classifier, consider two deployment scenarios. A **static approach** suits a photo organization app that pre-classifies all images in a user's library overnight. With `{python} StaticBatchCalc.n_photos_str` photos and `{python} StaticBatchCalc.inference_ms_str` ms inference each, batch processing takes ~`{python} StaticBatchCalc.batch_total_s_str` seconds total, and users see instant classification when browsing. A **dynamic approach** suits a content moderation API that must classify user-uploaded images in real-time, with each image requiring the full preprocessing→inference→postprocessing pipeline and a `{python} StaticBatchCalc.dynamic_latency_budget_ms_str`ms latency budget. Most production image classification systems use a **hybrid approach**: frequently requested images (popular products, known memes) are pre-classified and cached, while novel uploads trigger dynamic inference.
The choice between static and dynamic serving has direct economic implications. Stricter latency requirements directly translate into higher infrastructure costs, and quantifying *the cost of latency* in dollar terms reveals how much infrastructure premium each millisecond of latency reduction demands.
```{python}
#| label: cost-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COST OF LATENCY CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Cost of Latency" (Serving Paradigm section)
# │
# │ Goal: Quantify the economic tradeoff between response time and hardware bill.
# │ Show: That reducing latency by 50% can increase costs by 4x.
# │ How: Calculate cost per million queries across different batch sizes.
# │
# │ Imports: mlsysim.core.constants, mlsysim.book
# │ Exports: gpu_cost_per_hour_str, latency_a_ms_str, throughput_a_rps_str,
# │ latency_b_ms_str, throughput_b_rps_str, cost_a_str, cost_b_str,
# │ cost_increase_str, cost_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import SEC_PER_HOUR, MILLION
from mlsysim.fmt import fmt, check
class CostLatencyCalc:
"""Quantifies the economic tradeoff between latency and hardware cost per million queries."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
gpu_cost_per_hour_value = 4.0 # GPU rental cost ($/hour)
latency_a_ms_value = 5 # Scenario A: low latency (ms)
throughput_a_rps_value = 200 # Scenario A: throughput (req/s)
latency_b_ms_value = 10 # Scenario B: higher latency (ms)
throughput_b_rps_value = 800 # Scenario B: throughput (req/s)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
queries_per_hour_a_value = throughput_a_rps_value * SEC_PER_HOUR
cost_per_million_a_value = gpu_cost_per_hour_value / (queries_per_hour_a_value / MILLION)
queries_per_hour_b_value = throughput_b_rps_value * SEC_PER_HOUR
cost_per_million_b_value = gpu_cost_per_hour_value / (queries_per_hour_b_value / MILLION)
cost_increase_pct_value = (cost_per_million_a_value / cost_per_million_b_value - 1) * 100
cost_ratio_value = cost_per_million_a_value / cost_per_million_b_value
# ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
check(cost_ratio_value > 1, "Scenario A (low latency) must cost more per query than Scenario B.")
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
gpu_cost_per_hour_str = fmt(gpu_cost_per_hour_value, precision=0, commas=False)
latency_a_ms_str = f"{latency_a_ms_value}"
throughput_a_rps_str = f"{throughput_a_rps_value}"
latency_b_ms_str = f"{latency_b_ms_value}"
throughput_b_rps_str = f"{throughput_b_rps_value}"
cost_a_str = fmt(cost_per_million_a_value, precision=2, commas=False)
cost_b_str = fmt(cost_per_million_b_value, precision=2, commas=False)
cost_increase_str = fmt(cost_increase_pct_value, precision=0, commas=False)
cost_ratio_str = fmt(cost_ratio_value, precision=0, commas=False)
```
::: {.callout-notebook #notebook-cost-latency title="The Cost of Latency"}
Latency constraints directly dictate infrastructure costs. Consider a GPU server renting for USD `{python} CostLatencyCalc.gpu_cost_per_hour_str`/hour.
**Scenario A (Low Latency):** Batch size 1.
* Latency: `{python} CostLatencyCalc.latency_a_ms_str` ms.
* Throughput: `{python} CostLatencyCalc.throughput_a_rps_str` req/s.
* Cost per million queries: **USD `{python} CostLatencyCalc.cost_a_str`**.
**Scenario B (High Throughput):** Batch size 8.
* Latency: `{python} CostLatencyCalc.latency_b_ms_str` ms (doubled due to batching overhead).
* Throughput: `{python} CostLatencyCalc.throughput_b_rps_str` req/s (quadrupled due to parallel efficiency).
* Cost per million queries: **USD `{python} CostLatencyCalc.cost_b_str`**.
**The Trade-off:** Reducing latency from `{python} CostLatencyCalc.latency_b_ms_str` ms to `{python} CostLatencyCalc.latency_a_ms_str` ms increases the hardware bill by **`{python} CostLatencyCalc.cost_increase_str`%**. Engineers must quantify whether that `{python} CostLatencyCalc.latency_a_ms_str` ms speedup generates enough business value to justify the `{python} CostLatencyCalc.cost_ratio_str`$\times$ cost increase.
:::
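The arithmetic behind this callout is the unit-economics calculation serving teams repeat constantly: cost per million queries is hourly hardware cost divided by millions of queries served per hour. The sketch below reproduces it with the same illustrative numbers.

```{.python}
def cost_per_million_queries(gpu_cost_per_hour: float, throughput_rps: float) -> float:
    """Dollars per one million queries at a sustained throughput (requests/second)."""
    queries_per_hour = throughput_rps * 3600
    return gpu_cost_per_hour / (queries_per_hour / 1_000_000)

# Same illustrative numbers as the callout: a $4/hour GPU server.
cost_batch1 = cost_per_million_queries(4.0, throughput_rps=200)  # Scenario A, batch size 1
cost_batch8 = cost_per_million_queries(4.0, throughput_rps=800)  # Scenario B, batch size 8
print(f"batch-1: ${cost_batch1:.2f}/M queries, batch-8: ${cost_batch8:.2f}/M queries "
      f"({cost_batch1 / cost_batch8:.0f}x premium for low latency)")
# batch-1: $5.56/M queries, batch-8: $1.39/M queries (4x premium for low latency)
```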
Most production systems combine both approaches. Common queries hit a cache populated by batch inference while uncommon requests trigger dynamic computation. Understanding this spectrum matters because it determines which subsequent optimization strategies apply. Static inference optimizes for throughput during batch computation and storage efficiency for serving. Dynamic inference optimizes for per-request latency under concurrent load, which requires understanding *where* time goes within each request.
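A minimal sketch of that hybrid pattern, using hypothetical placeholder names, looks like this:

```{.python}
# Hybrid serving sketch: serve from a cache populated by nightly batch inference
# when possible, and fall back to dynamic inference for novel inputs.
# `run_dynamic_inference` and the cache contents are hypothetical placeholders.
precomputed: dict[str, str] = {"img_001": "cat", "img_002": "dog"}  # filled by batch job

def run_dynamic_inference(key: str) -> str:
    return "unknown"  # placeholder for the full preprocess -> model -> postprocess path

def classify(key: str) -> str:
    if key in precomputed:                   # static path: dictionary lookup, microseconds
        return precomputed[key]
    label = run_dynamic_inference(key)       # dynamic path: full pipeline, tens of ms
    precomputed[key] = label                 # optionally cache for repeat requests
    return label
```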
The static-versus-dynamic decision is the first of several architectural choices that shape serving system design. Equally important is *where* the model executes, since deployment context constrains every subsequent optimization.
::: {.callout-perspective title="Looking Ahead: The Rise of Inference-Time Compute (System 2)"}
Traditional serving optimizes for minimizing latency ($L_{\text{lat}} \to 0$). Emerging "Reasoning Models" (like OpenAI o1) invert this goal, deliberately spending more compute cycles ("thinking") to improve answer quality. Individual token generation remains memory-bandwidth-bound, but these models generate far more tokens per request (often 10--100$\times$ more internal reasoning tokens), dramatically increasing the total compute and energy spent per query. The aggregate effect brings "Training-like" compute budgets into the Serving phase, even though each token is still governed by the memory wall.
:::
### The Spectrum of Serving Architectures {#sec-model-serving-spectrum-serving-architectures-8966}
Although "serving" often implies a networked server processing API requests, the architectural pattern varies drastically by deployment environment. @sec-ml-systems-deployment-paradigm-framework-0d25 introduced the four deployment paradigms (Cloud, Edge, Mobile, and TinyML) and the physical constraints (the light barrier, the power wall, and the memory wall) that give rise to them. Those constraints do not disappear at serving time; they *intensify*, because serving adds latency SLOs and cost pressure on top of the hardware limits that training could absorb through patience. The same model may require radically different serving strategies depending on *where* it executes.
#### Networked Serving (Cloud/Datacenter) {#sec-model-serving-networked-serving-clouddatacenter-0328}
The model\index{Serving!cloud/datacenter}\index{Microservice!model serving} runs as a standalone service (microservice), the deployment paradigm @sec-ml-systems-cloud-ml-maximizing-computational-power-a338 characterized as trading latency for virtually unlimited compute. The primary interface is the network (HTTP/gRPC). Optimization focuses on **throughput** (batching) and **concurrency**.
* *Key Constraint:* Network bandwidth and serialization cost.
* *Typical Hardware:* NVIDIA GPUs (V100, A100, H100), Google TPUs, AWS Inferentia.
* *Cold Start:*\index{Cold Start!cloud serving} Seconds to minutes (container startup, model loading, warmup).
#### Application-Embedded Serving (Mobile/Edge) {#sec-model-serving-applicationembedded-serving-mobileedge-8bd1}
The model\index{Serving!mobile/edge}\index{Edge Inference!embedded serving} runs within the user application process (e.g., a smartphone app using CoreML or TensorFlow Lite), the embedded paradigm @sec-ml-systems-edge-ml-reducing-latency-privacy-risk-2625 and @sec-ml-systems-mobile-ml-personal-offline-intelligence-0983 analyzed for its latency, privacy, and offline advantages. There is no "server." The interface is a function call. Optimization focuses on **energy** and **responsiveness** (SingleStream).
* *Key Advantage:* **Zero-Copy Inference**\index{Zero-Copy Inference!mobile optimization}. When data moves through a system, each copy consumes CPU cycles and memory bandwidth. In cloud serving, a camera frame might be copied four times: from network buffer to application memory, then to a preprocessing buffer, then to GPU-accessible memory, and finally to GPU VRAM. Mobile NPUs can eliminate most of these copies by sharing memory directly with the camera hardware. The camera writes pixels into a buffer that the NPU reads directly, avoiding the CPU entirely. This reduces both latency (no copy operations) and energy (memory copies consume significant power). The mechanism requires hardware support: the camera, CPU, and NPU must share a unified memory architecture, which modern mobile SoCs like Apple's M-series and Qualcomm Snapdragon provide.
* *Typical Hardware:* Mobile NPUs (Apple Neural Engine, Qualcomm Hexagon), embedded GPUs (Jetson).
* *Cold Start:* Milliseconds (model already in app memory); first inference may trigger JIT compilation (100--500 ms).
* *Power Budget:* 1--5 W sustained, with thermal throttling after prolonged inference.
#### Bare-Metal Serving (TinyML) {#sec-model-serving-baremetal-serving-tinyml-28cf}
The model\index{Serving!TinyML}\index{TinyML!bare-metal serving} is compiled into the firmware of a microcontroller, the extreme end of the deployment spectrum @sec-ml-systems-tinyml-ubiquitous-sensing-scale-a67b introduced as ubiquitous sensing at microwatt power budgets. There is no operating system or dynamic memory allocator. "Serving" is a tight loop reading sensors and invoking the interpreter. Optimization focuses on **static memory usage** (fitting in SRAM).
* *Key Difference:* All memory is pre-allocated (Tensor Arena)\index{Tensor Arena!TinyML memory}. Dynamic batching is impossible.
* *Typical Hardware:* ARM Cortex-M series, ESP32, specialized TinyML accelerators.
* *Cold Start:* Microseconds (model weights in flash, tensor arena pre-allocated).
* *Power Budget:* Microwatts to milliwatts; battery operation for months or years.
@tbl-serving-spectrum summarizes *how* these deployment contexts shape serving system design:
| **Characteristic** | **Cloud/Datacenter** | **Mobile/Edge** | **TinyML** |
|:---------------------|:---------------------|:---------------------|:-----------------|
| **Latency Target** | 10--100 ms | 20--50 ms | 1--100 ms |
| **Batch Size** | 1--128 (dynamic) | 1 (fixed) | 1 (fixed) |
| **Memory** | 16--80 GB VRAM | 2--8 GB shared | 256 KB--2 MB SRAM |
| **Power** | 300--700 W | 1--10 W | 1--100 mW |
| **Update Mechanism** | Container deploy | App store update | Firmware OTA |
| **Failure Mode** | Retry/failover | Graceful degradation | Silent or reset |
| **Monitoring** | Full telemetry | Limited analytics | Heartbeat only |
: **Serving Architecture Spectrum**: The deployment paradigm selected in @sec-ml-systems-comparative-analysis-paradigm-selection-bf66 shapes every aspect of serving system design. Cloud systems optimize for throughput with dynamic batching; mobile systems optimize for energy with fixed batch-1; TinyML systems operate under extreme memory and power constraints with no dynamic allocation. The physical walls (light, power, memory) that created these paradigms now dictate the serving constraints each must satisfy. {#tbl-serving-spectrum}
To make these architectural differences concrete, consider *how* a single model must adapt to each deployment context:
```{python}
#| label: resnet-spectrum-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 ACROSS THE SERVING SPECTRUM
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Across the Serving Spectrum"
# │
# │ Goal: Contrast serving requirements across Cloud, Mobile, and TinyML.
# │ Show: That the same model requires different formats and architectures.
# │ How: Calculate model sizes and compare NPU vs. CPU efficiency.
# │
# │ Imports: mlsys, mlsysim.constants, mlsysim.book
# │ Exports: cloud_*, mobile_*, tiny_* formatted strings
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Models, Systems, Archetypes
from mlsysim.core.constants import BYTES_FP16, BYTES_INT8
from mlsysim.fmt import fmt, check
# ┌── LEGO ───────────────────────────────────────────────
class ResNetServingSpectrum:
"""
Namespace for ResNet-50 Serving Spectrum comparison.
Scenario: Mapping the same architecture (or alternatives) to Cloud, Mobile, TinyML.
"""
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
m_resnet = Models.ResNet50
m_mobilenet = Models.MobileNetV2
s_cloud = Archetypes.Cloud_V100
s_mobile = Systems.Mobile
s_tiny = Archetypes.TinyML_M7
# Cloud (V100) Performance - Source: MLPerf/Vendor reports
cloud_inf_b1_ms = 1.4
cloud_inf_b16_ms = 14.0
cloud_throughput = 1143
cloud_vram_gb = 2
# Mobile (Smartphone) Performance
mobile_inf_npu_ms = 12.0
mobile_inf_cpu_ms = 45.0
mobile_throughput = 80
mobile_energy_npu_mj = 0.8
mobile_energy_cpu_mj = 4.2
# TinyML (Cortex-M7) Performance
tiny_inf_ms = 120.0
tiny_energy_mj = 12.0
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Step 1: Calculate sizes using the Digital Twins
cloud_size_mb = m_resnet.size_in_bytes(BYTES_FP16).m_as('MB')
mobile_size_mb = m_resnet.size_in_bytes(BYTES_INT8).m_as('MB')
tiny_original_mb = m_resnet.size_in_bytes(BYTES_INT8).m_as('MB')
tiny_alt_mb = m_mobilenet.size_in_bytes(BYTES_INT8).m_as('MB')
# Step 2: TinyML feasibility check
tiny_limit_mb = s_tiny.ram.m_as('MB')
tiny_feasibility = tiny_original_mb < tiny_limit_mb
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(not tiny_feasibility,
f"ResNet-50 ({tiny_original_mb:.1f}MB) should NOT fit on TinyML (<{tiny_limit_mb:.1f}MB).")
check(mobile_energy_cpu_mj >= mobile_energy_npu_mj * 3,
"NPU should be significantly more energy efficient than CPU.")
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
# System names
cloud_name = s_cloud.name
mobile_name = s_mobile.name
tiny_name = s_tiny.name
cloud_model_mb_str = fmt(cloud_size_mb, precision=0)
cloud_inf_b1_ms_str = f"{cloud_inf_b1_ms:.1f}"
cloud_inf_b16_ms_str = f"{cloud_inf_b16_ms:.0f}"
cloud_throughput_str = f"{cloud_throughput:,}"
cloud_vram_gb_str = f"{cloud_vram_gb}"
mobile_model_mb_str = fmt(mobile_size_mb, precision=0)
mobile_inf_npu_ms_str = f"{mobile_inf_npu_ms:.0f}"
mobile_inf_cpu_ms_str = f"{mobile_inf_cpu_ms:.0f}"
mobile_throughput_str = f"{mobile_throughput}"
mobile_energy_npu_mj_str = f"{mobile_energy_npu_mj:.1f}"
mobile_energy_cpu_mj_str = f"{mobile_energy_cpu_mj:.1f}"
mobile_mem_mb_str = "150"
tiny_model_mb_str = fmt(tiny_original_mb, precision=0)
tiny_alt_mb_str = fmt(tiny_alt_mb, precision=1)
tiny_inf_ms_str = f"{tiny_inf_ms:.0f}"
tiny_throughput_str = "8"
tiny_arena_kb_str = "320"
tiny_sram_kb_str = fmt(s_tiny.ram.m_as('KiB'), precision=0)
tiny_energy_mj_str = f"{tiny_energy_mj:.0f}"
```
::: {.callout-perspective #perspective-resnet-serving title="ResNet-50 Across the Serving Spectrum"}
The same ResNet-50 architecture requires dramatically different serving strategies across deployment contexts:
**`{python} ResNetServingSpectrum.cloud_name`:**
- Model format: TensorRT FP16 engine (`{python} ResNetServingSpectrum.cloud_model_mb_str`MB)
- Inference: `{python} ResNetServingSpectrum.cloud_inf_b1_ms_str`ms at batch-1, `{python} ResNetServingSpectrum.cloud_inf_b16_ms_str`ms at batch-16
- Throughput: `{python} ResNetServingSpectrum.cloud_throughput_str` images/second (batched)
- Memory: `{python} ResNetServingSpectrum.cloud_vram_gb_str`GB VRAM (model + activations for batch-32)
**`{python} ResNetServingSpectrum.mobile_name`:**
- Model format: TensorFlow Lite INT8 (`{python} ResNetServingSpectrum.mobile_model_mb_str`MB)
- Inference: `{python} ResNetServingSpectrum.mobile_inf_npu_ms_str`ms at batch-1 (NPU), `{python} ResNetServingSpectrum.mobile_inf_cpu_ms_str`ms (CPU fallback)
- Throughput: ~`{python} ResNetServingSpectrum.mobile_throughput_str` images/second (single-stream)
- Memory: `{python} ResNetServingSpectrum.mobile_mem_mb_str`MB peak (shared with app)
- Energy: `{python} ResNetServingSpectrum.mobile_energy_npu_mj_str`mJ per inference (NPU), `{python} ResNetServingSpectrum.mobile_energy_cpu_mj_str`mJ (CPU)
**`{python} ResNetServingSpectrum.tiny_name`:**
- Model format: Not feasible; ResNet-50 requires `{python} ResNetServingSpectrum.tiny_model_mb_str`MB weights
- Alternative: MobileNetV2-0.35 quantized to INT8 (`{python} ResNetServingSpectrum.tiny_alt_mb_str`MB)
- Inference: `{python} ResNetServingSpectrum.tiny_inf_ms_str`ms at batch-1
- Throughput: ~`{python} ResNetServingSpectrum.tiny_throughput_str` images/second
- Memory: `{python} ResNetServingSpectrum.tiny_arena_kb_str`KB tensor arena (fits in `{python} ResNetServingSpectrum.tiny_sram_kb_str`KB SRAM)
- Energy: `{python} ResNetServingSpectrum.tiny_energy_mj_str`mJ per inference
**Key insight**: The "same model" claim is misleading: each deployment requires different optimization and often different architectures entirely. TinyML serving cannot use ResNet-50; it requires architectures designed for the constraints from the start.
:::
### The Load Balancer Layer {#sec-model-serving-load-balancer-layer-9c4d}
The preceding spectrum focused on *how* deployment context shapes serving constraints, from datacenter GPUs to microcontroller SRAM. When traffic exceeds what a single machine can handle, cloud and datacenter deployments that run multiple replicas of the same model require an additional infrastructure layer: the load balancer. Production serving systems place load balancers\index{Load Balancer!serving infrastructure} between clients and model servers, providing three essential functions for serving infrastructure.
Request distribution, the first function, routes incoming requests to available model replicas using algorithms like round-robin or least-connections. For latency-sensitive ML serving, algorithms that route away from slow or overloaded replicas improve tail latency. The second, health monitoring\index{Health Monitoring!replica readiness}, continuously verifies that replicas are ready to serve, routing traffic away from unhealthy instances. For ML systems, health checks must verify both process liveness and model readiness, confirming that weights are loaded and warmup is complete. The third, deployment support, enables safe model updates by gradually shifting traffic between versions. @sec-ml-operations examines deployment strategies including canary testing, blue-green deployments, and shadow mode validation.
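The liveness/readiness distinction is worth making explicit in code. The sketch below uses hypothetical names; production servers such as NVIDIA Triton and TensorFlow Serving expose equivalent checks through built-in health endpoints that a load balancer polls.

```{.python}
# Sketch of the liveness vs. readiness distinction for a single model replica.
# Names and the warmup routine are illustrative assumptions.
class ReplicaHealth:
    def __init__(self):
        self.model = None
        self.warmed_up = False

    def load_and_warm(self, load_fn, warmup_input):
        self.model = load_fn()        # load weights into memory
        self.model(warmup_input)      # run one inference to trigger JIT/kernel autotuning
        self.warmed_up = True

    def live(self) -> bool:
        return True                   # the process is up and answering probes

    def ready(self) -> bool:
        # Advertise readiness only after weights are loaded AND warmup has run;
        # the load balancer should route traffic on this signal, not on liveness.
        return self.model is not None and self.warmed_up
```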
For single-machine serving with multiple model instances, such as running several ONNX Runtime sessions, the framework and operating system handle request queuing. The full complexity of load balancing becomes necessary when scaling to distributed inference systems, where multiple machines serve the same model. The implementation details of request distribution algorithms and multi-replica architectures belong to that distributed context.
When capacity planning considers "the server" in this chapter, it means the single machine's model serving capacity. The queuing dynamics analyzed in @sec-model-serving-queuing-theory-tail-latency-29a6 apply to understanding single-machine behavior and determining when scaling to multiple machines becomes necessary.
While load balancers distribute requests across replicas, achieving predictable latency also requires controlling what happens *within* each machine. The operating system environment introduces its own sources of variability.
### Deterministic Latency and Resource Isolation {#sec-model-serving-deterministic-latency-resource-isolation-4d1c}
An inference server does not operate in isolation. On a single machine, the operating system manages multiple competing processes (logging agents, monitoring tools, and system interrupts) that can intermittently steal CPU cycles from the inference pipeline. These "noisy neighbors" are a primary source of **latency jitter**, where the time required to process identical requests varies significantly, causing the 99th percentile (P99) latency to spike even when the hardware is under-utilized. The tail latency explosion from @fig-tail-latency-explosion illustrates the same spike, but here the trigger is resource contention rather than queuing.
Achieving deterministic performance\index{Latency!deterministic}\index{Resource Isolation!serving} on a single node requires isolating the inference process from the operating system's normal resource-sharing behavior. The most impactful technique is CPU affinity (pinning)\index{CPU Affinity!latency reduction}, which restricts the inference server's threads to specific physical cores. Without pinning, the OS freely migrates threads between cores, evicting warm cache lines and introducing 10--50 μs context-switch penalties that appear as latency jitter. Pinning eliminates this migration, ensuring that preprocessing always has immediate access to computational resources and that the CPU cache remains warm between requests.
Memory locking (`mlock`)\index{Memory Locking!mlock} addresses a related but distinct source of jitter. By default, the OS can page any memory region to disk under memory pressure. If the GPU's DMA engine begins reading model weights from a region that has been paged out, the transfer stalls until the data is faulted back into RAM, a penalty measured in milliseconds rather than microseconds. Locking model weights and KV caches in physical RAM guarantees consistent access times, though the trade-off is that pinned memory cannot be reclaimed by other processes.
The third technique, interrupt shielding\index{Interrupt Shielding!latency isolation}, completes the isolation picture. Network and storage interrupts routed to inference cores can preempt GPU command submission at unpredictable moments. Steering these interrupts to non-inference cores ensures that bursts of incoming traffic do not disrupt the GPU's command stream, which is particularly important for maintaining stable tail latency under load.
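A minimal, Linux-only sketch of the first two techniques follows. The core IDs are illustrative assumptions, and interrupt shielding is normally configured at the operating-system level (for example via `/proc/irq/*/smp_affinity`) rather than from application code.

```{.python}
import ctypes
import os

# 1. CPU affinity: pin this process (and its threads) to dedicated cores 2-5,
#    leaving cores 0-1 for the OS, logging agents, and interrupt handling.
os.sched_setaffinity(0, {2, 3, 4, 5})

# 2. Memory locking: prevent the OS from paging our address space (model weights,
#    buffers) out to disk. On Linux, MCL_CURRENT | MCL_FUTURE == 1 | 2.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.mlockall(1 | 2) != 0:
    raise OSError(ctypes.get_errno(), "mlockall failed (may require CAP_IPC_LOCK)")
```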
These isolation principles transform a simple "model script" into a **deterministic service**, a transition essential for safety-critical applications like autonomous driving or real-time industrial control. The deployment spectrum, load balancing, and resource isolation define *where* models serve and *what* infrastructure supports them. The remaining question is *how* the serving software itself is organized, specifically what components comprise an inference server and how they coordinate to turn irregular user traffic into efficient hardware utilization.
## Serving System Architecture {#sec-model-serving-serving-system-architecture-4879}
User requests arrive in unpredictable bursts while accelerators demand steady, uniformly-sized batches. Bridging this gap requires more than a Python script calling `model.predict()`; it requires a specialized software architecture that absorbs traffic variability, forms efficient batches, and keeps hardware saturated without violating latency SLOs.
### Internal Architecture and Request Flow {#sec-model-serving-anatomy-inference-server-f12e}
Model optimization focuses on the mathematical artifact, while model serving requires a specialized software architecture to manage high-frequency request streams and hardware utilization. An inference server\index{Inference Server!architecture}[^fn-inference-server-serving] (such as NVIDIA Triton, TensorFlow Serving\index{TensorFlow Serving}, or TorchServe) is not a simple wrapper around a model script; it is a high-performance scheduler that manages concurrency, memory, and data movement.
[^fn-inference-server-serving]: **Inference Server**: Google's TensorFlow Serving (open-sourced February 2016) pioneered the separation of model logic from serving infrastructure; NVIDIA's Triton (GA March 2019) extended this to multi-framework support. The critical design insight is that dynamic batching within these servers improves GPU utilization by up to 70% compared to naive single-request serving, transforming the GPU from an idle-waiting device into a throughput engine. Without this scheduler layer, serving a ResNet-50 at batch size 1 wastes 85% of available compute. \index{Inference Server!architecture}
The internal anatomy of these servers reveals *how* they bridge the gap between irregular user traffic and the highly regular, batch-oriented requirements of accelerators. The core challenge is that user requests arrive unpredictably (one millisecond apart, then five seconds of silence), while GPUs perform best with steady streams of uniformly-sized batches.
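A dynamic batcher resolves exactly this tension: it holds arriving requests just long enough to form an efficient batch, but never longer than a small timeout. The sketch below uses illustrative names; production servers implement this logic inside their scheduling layer.

```{.python}
import queue
import time

def dynamic_batcher(request_queue: queue.Queue, max_batch: int = 16,
                    timeout_ms: float = 5.0):
    """Yield batches of requests, trading a small queueing delay for GPU efficiency."""
    while True:
        batch = [request_queue.get()]              # block until at least one request arrives
        deadline = time.perf_counter() + timeout_ms / 1e3
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break                              # timeout: ship a partial batch
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        yield batch                                # hand the batch to the inference runner
```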
Every request traverses a multi-stage pipeline designed to maximize hardware throughput while minimizing latency overhead. Walk through the six stages in @fig-server-anatomy to see how each component absorbs a different source of complexity.
::: {#fig-server-anatomy fig-env="figure" fig-pos="htb" fig-cap="**Inference Server Anatomy**: A modern inference server decouples network handling from accelerator execution through a staged pipeline. Each stage isolates a concern, from absorbing bursty traffic to forming efficient batches, so the hardware accelerator stays highly utilized despite irregular arrival patterns." fig-alt="Flowchart showing 6-stage inference server pipeline: Client to Network Ingress to Request Queue (cylinder) to Dynamic Batcher, then down to Inference Runner to Accelerator. Arrows connect stages sequentially."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
Box/.style={draw=none,minimum width=20mm, minimum height=15mm, node distance=18mm},
Arr/.style={-{Triangle[width=10pt,length=8pt]}, line width=5pt,cyan!40,shorten >=1pt, shorten <=2pt},
Box2/.style={align=flush center, inner xsep=2pt,draw=OrangeLine,
font=\footnotesize\usefont{T1}{phv}{m}{n},
line width=0.75pt, rounded corners,fill=OrangeL!30, text width=16mm,
minimum width=16mm, minimum height=10mm},
LineA/.style = {violet!60,{Circle[line width=1.0pt,fill=white,length=5.5pt]}-,line width=1.5pt,shorten <=-3pt}
}
%laptop
\tikzset{
pics/laptop/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[shift={($(0,0)+(0,0)$)},scale=\scalefac,every node/.append style={transform shape}]
\node[rounded corners=2pt,rectangle,minimum width=60,minimum height=37,
fill=\filllcolor!60,line width=\Linewidth,draw=black](EKV)at(0,0.53){};
%
\node[draw=black,rounded corners=2pt,rectangle,minimum width=53,minimum height=30,
fill=\filllcolor!10,line width=\Linewidth,](EK)at(0,0.53){};
\coordinate(SM1)at($(EK.south west)+(0.15,0.5)$);
\coordinate(SM2)at($(EK.south east)+(-1.1,0.5)$);
\coordinate(OK1)at($(EK.220)+(0,0.7)$);
\coordinate(OK2)at($(EK.240)+(0,0.7)$);
\node[fill=black,inner sep=0pt,ellipse,minimum width=2pt,minimum height=3pt](OKO1)at(OK1){};
\node[fill=black,inner sep=0pt,ellipse,minimum width=2pt,minimum height=3pt](OKO2)at(OK2){};
\draw[line width=1.4pt](SM1)to [bend right=45](SM2);
%%
\coordinate(4BL)at($(EK.south west)+(0.95,0.3)$);
\def\n{5} % broj boksova
\def\w{0.12} % širina boksa (mm)
\def\h{0.5} % visina boksa (mm)
\def\gap{0.05} % razmak između boksova (mm)
% niz boksova
\foreach \i in {0,...,4} {
\pgfmathsetmacro{\x}{\i*(\w+\gap)}
% popuna (klipujemo unutar ivica)
\begin{scope}
\clip[] ($(4BL)+(\x,0)$) rectangle ++(\w,\h);
\fill[gray!10]($(4BL)+(\x,0)$) rectangle ++(\w,\h*1);
\fill[fill=\filllcirclecolor]($(4BL)+(\x,0)$) rectangle ++(\w,\h*\Level);
\end{scope}
% kontura preko
\draw[line width=0.6pt,draw=black]($(4BL)+(\x,0)$) rectangle ++(\w,\h);
}
%
\draw[fill=\filllcolor!60!black!30,line width=\Linewidth](-1.00,-0.1)--(1.0,-0.1)--(1.28,-0.6)--(-1.28,-0.6)--cycle;
\draw[fill=\filllcolor!60!black!30,line width=\Linewidth](1.28,-0.6)--(-1.28,-0.6)arc[start angle=180, end angle=270, radius=4pt]--(1.14,-0.73)
arc[start angle=270, end angle=355, radius=4pt]--cycle;
\draw[fill=\filllcolor!30!black!10,line width=\Linewidth](-0.95,-0.17)--(0.95,-0.17)--(1.03,-0.34)--(-1.03,-0.34)--cycle;
\draw[fill=\filllcolor!30!black!20,line width=\Linewidth](-0.16,-0.52)--(0.16,-0.52)--(0.14,-0.42)--(-0.14,-0.42)--cycle;
\end{scope}
}
}
}
\tikzset {
pics/gatewey/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=GAT,scale=\scalefac, every node/.append style={transform shape}]
\def\rI{4mm}
\def\rII{2.8mm}
\def\rIII{1.6mm}
\draw[draw=\drawcolor,line width=0.8*\Linewidth](0,0)--(0,0.38)--(1.2,0.38)--(1.2,0)--cycle;
\draw[draw=\drawcolor,line width=\Linewidth](0.6,0.4)--(0.6,0.9);
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(60:\rI) arc[start angle=60, end angle=-60, radius=\rI];
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(50:\rII) arc[start angle=50, end angle=-50, radius=\rII];
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(30:\rIII) arc[start angle=30, end angle=-30, radius=\rIII];
%
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(120:\rI) arc[start angle=120, end angle=240, radius=\rI];
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(130:\rII) arc[start angle=130, end angle=230, radius=\rII];
\draw[draw=\drawcolor, line width=\Linewidth] (0.6,0.9)+(150:\rIII) arc[start angle=150, end angle=210, radius=\rIII];
\fill[fill=\filllcolor](0.6,0.9)circle (1.5pt);
\foreach\i in{0.15,0.3,0.45,0.6}{
\fill[fill=\filllcolor](\i,0.19)circle (1.5pt);
}
\fill[fill=\filllcolor](1,0.19)circle (2pt);
\end{scope}
}
}
}
\tikzset {
pics/cpu/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box = CPU,scale=\scalefac, every node/.append style={transform shape}]
\node[fill=\filllcolor,minimum width=66, minimum height=66,
rounded corners=2,outer sep=2pt] (C1) {};
\node[fill=\filllcirclecolor,minimum width=54, minimum height=54] (C2) {\bfseries\Large GPU};
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=\filllcolor,minimum width=4, minimum height=15,
inner sep=0pt,anchor=south](GO\y)at($(C1.north west)!\x!(C1.north east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=\filllcolor,minimum width=4, minimum height=15,
inner sep=0pt,anchor=north](DO\y)at($(C1.south west)!\x!(C1.south east)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=\filllcolor,minimum width=15, minimum height=4,
inner sep=0pt,anchor=east](LE\y)at($(C1.north west)!\x!(C1.south west)$){};
}
\foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
\node[fill=\filllcolor,minimum width=15, minimum height=4,
inner sep=0pt,anchor=west](DE\y)at($(C1.north east)!\x!(C1.south east)$){};
}
\end{scope}
}
}
}
\tikzset{pics/brain/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=BRAIN,scale=\scalefac, every node/.append style={transform shape}]
\fill[fill=\filllcolor!50](0.1,-0.5)to[out=0,in=180](0.33,-0.5)
to[out=0,in=270](0.45,-0.38)to(0.45,-0.18)
to[out=40,in=240](0.57,-0.13)to[out=110,in=310](0.52,-0.05)
to[out=130,in=290](0.44,0.15)to[out=90,in=340,distance=8](0.08,0.69)
to[out=160,in=80](-0.42,-0.15)to(-0.48,-0.7)to(0.07,-0.7)to(0.1,-0.5)
to(-0.10,-0.42)to[out=310,in=180](0.1,-0.5);
\draw[draw=\drawcolor,line width=\Linewidth](0.1,-0.5)to[out=0,in=180](0.33,-0.5)
to[out=0,in=270](0.45,-0.38)to(0.45,-0.18)
to[out=40,in=240](0.57,-0.13)to[out=110,in=310](0.52,-0.05)
to[out=130,in=290](0.44,0.15)to[out=90,in=340,distance=8](0.08,0.69)
to(-0.42,-0.15)to(-0.48,-0.7)
(0.07,-0.7)to(0.1,-0.5)
(-0.10,-0.42)to[out=310,in=180](0.1,-0.5);
\draw[fill=\filllcolor,line width=\Linewidth](-0.3,-0.10)to(0.08,0.60)
to[out=60,in=50,distance=3](-0.1,0.69)to[out=160,in=80](-0.26,0.59)to[out=170,in=90](-0.46,0.42)
to[out=170,in=110](-0.54,0.25)to[out=210,in=150](-0.54,0.04)
to[out=240,in=130](-0.52,-0.1)to[out=300,in=240]cycle;
\draw[fill=\filllcolor,line width=\Linewidth]
(-0.04,0.64)to[out=120,in=0](-0.1,0.69)(-0.19,0.52)to[out=120,in=330](-0.26,0.59)
(-0.4,0.33)to[out=150,in=280](-0.46,0.42)
%
(-0.44,-0.03)to[bend left=30](-0.34,-0.04)
(-0.33,0.08)to[bend left=40](-0.37,0.2) (-0.37,0.12)to[bend left=40](-0.45,0.14)
(-0.26,0.2)to[bend left=30](-0.24,0.13)
(-0.16,0.32)to[bend right=30](-0.27,0.3)to[bend right=30](-0.29,0.38)
(-0.13,0.49)to[bend left=30](-0.04,0.51);
\draw[rounded corners=0.8pt,line width=\Linewidth,\drawcircle,-{Circle[fill=\filllcirclecolor,length=3.5pt]}](-0.23,0.03)--(-0.15,-0.03)--(-0.19,-0.18)--(-0.04,-0.28);
\draw[rounded corners=0.8pt,line width=\Linewidth,\drawcircle,-{Circle[fill=\filllcirclecolor,length=3.5pt]}](-0.17,0.13)--(-0.04,0.05)--(-0.06,-0.06)--(0.14,-0.11);
\draw[rounded corners=0.8pt,line width=\Linewidth,\drawcircle,-{Circle[fill=\filllcirclecolor,length=3.5pt]}](-0.12,0.23)--(0.31,0.0);
\draw[rounded corners=0.8pt,line width=\Linewidth,\drawcircle,-{Circle[fill=\filllcirclecolor,length=3.5pt]}](-0.07,0.32)--(0.06,0.26)--(0.16,0.33)--(0.34,0.2);
\draw[rounded corners=0.8pt,line width=\Linewidth,\drawcircle,-{Circle[fill=\filllcirclecolor,length=3.5pt]}](-0.01,0.43)--(0.06,0.39)--(0.18,0.51)--(0.31,0.4);
\end{scope}
}
}
}
%inbox
\tikzset{%
pics/inbox/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=INBOX,scale=\scalefac, every node/.append style={transform shape}]
\node[line width=\Linewidth,draw=\drawcolor,fill=\filllcolor!50,
rectangle,rounded corners=3pt,minimum width=14mm,minimum height=10mm]at(0,0.3){};
\node[line width=\Linewidth,draw=\drawcolor,fill=\filllcolor,
rectangle,rounded corners=3pt,minimum width=15mm,minimum height=10mm]at(0,0.1){};
\node[line width=\Linewidth,draw=\drawcolor,fill=\filllcolor!50,
rectangle,rounded corners=3pt,minimum width=17mm,minimum height=10mm]
at(0,-0.1){};
\draw[line width=\Linewidth,draw=\drawcolor,fill=\filllcirclecolor,rounded corners=2pt](-0.92,0.05)--
(-0.92,-0.78)--(0.92,-0.78)--(0.92,0.05)--(0.40,0.05)--(0.32,-0.2)--(-0.29,-0.2)--(-0.40,0.05)--cycle;
\node[single arrow, line width=\Linewidth,draw=black,fill=green!80!black!50, rotate=270,
minimum width = 15pt, single arrow head extend=6pt,
minimum height=10mm]at(0,0.5) {}; % length of arrow
\end{scope}
}
}
}
%funnel
\tikzset{%
pics/funnel/.style = {
code = {
\pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=FUNNEL,scale=\scalefac, every node/.append style={transform shape}]
\draw[fill=\filllcolor!50,line width=\Linewidth,draw=\drawcolor](-0.12,-0.81)--(-0.19,-0.25)--(-0.7,0.41)--(0.7,0.41)--(0.19,-0.25)--(0.12,-0.81)--cycle;
\draw[fill=\filllcolor!50,line width=\Linewidth,draw=\drawcolor](-0.19,-0.25)--(0.08,-0.25);
\draw[fill=\filllcolor!50,line width=\Linewidth,draw=\drawcolor](0.16,-0.09)--(0.41,0.31);
%
\node[line width=\Linewidth,draw=\drawcolor,fill=\filllcolor!50,inner sep=1pt,
rectangle,rounded corners=2pt,minimum width=16mm,minimum height=5pt]at(0,0.5){};
%
\foreach \i in{-0.5,0,0.5}{
\node[single arrow, line width=0.8*\Linewidth,draw=black,fill=cyan!90!black!30, rotate=270,inner sep=1pt,
minimum width =9pt, single arrow head extend=2pt,
minimum height=3.5mm]at(\i,0.83) {}; % length of arrow
}
\node[single arrow,line width=0.8*\Linewidth,draw=black,fill=\filllcirclecolor, rotate=270,inner sep=1pt,
minimum width =11pt, single arrow head extend=2pt,
minimum height=4mm]at(0,-1.05) {}; % length of arrow
\end{scope}
}
}
}
\pgfkeys{
/channel/.cd,
Depth/.store in=\Depth,
Height/.store in=\Height,
Width/.store in=\Width,
Level/.store in=\Level,
filllcirclecolor/.store in=\filllcirclecolor,
filllcolor/.store in=\filllcolor,
drawcolor/.store in=\drawcolor,
drawcircle/.store in=\drawcircle,
scalefac/.store in=\scalefac,
Linewidth/.store in=\Linewidth,
picname/.store in=\picname,
filllcolor=BrownLine,
filllcirclecolor=cyan!40,
drawcolor=black,
drawcircle=violet,
scalefac=1,
Level=0.52,
Linewidth=0.5pt,
Depth=1.3,
Height=0.8,
Width=1.1,
picname=C
}
\node[Box, fill=white](B1){};
\pic[shift={(0,-0.10)}] at (B1){laptop={scalefac=0.67,picname=1,drawcolor=GreenD,
filllcolor=GreenD!70!,Linewidth=0.75pt, filllcirclecolor=red!80}};
\node[Box, fill=white,right=of B1,minimum width=16mm](B2){};
\pic[shift={(-0.68,-0.57)}] at (B2){gatewey={scalefac=1.1,picname=1,drawcolor=green!50!black,
filllcolor=red!,Linewidth=1.5pt, filllcirclecolor=red!80}};
\node[Box, fill=white,right=of B2,minimum width=14mm](B3){};
\pic[shift={(0,0.07)}] at (B3){brain={scalefac=1.0,picname=1,filllcolor=orange!30!, Linewidth=1pt}};
\node[Box, fill=white,right=of B3,minimum width=16mm](B4){};
\pic[shift={(0,-0.06)}] at (B4){inbox={scalefac=0.7,picname=1,Linewidth=1.0pt,
filllcolor=BrownL,drawcolor=black,filllcirclecolor=orange!70!yellow!80}};
\node[Box, fill=white,below=1 of B4,minimum width=17mm,minimum height=22mm](B5){};
\pic[shift={(0,0.17)}] at (B5){funnel={scalefac=0.9,picname=1,Linewidth=1.0pt,
filllcolor=BrownL,drawcolor=black,filllcirclecolor=green!70!yellow!80}};
\node[Box, fill=white,below=0.8 of B5,minimum width=17mm](B6){};
\pic[shift={(0,0)}] at (B6){cpu={scalefac=0.4,picname=1,drawcolor=GreenD,
filllcolor=BlueD!70!,Linewidth=0.75pt, filllcirclecolor=brown!20}};
\draw[Arr](B1)--(B2);
\draw[Arr](B2)--(B3);
\draw[Arr](B3)--(B4);
\draw[Arr,shorten <=5pt](B4)--(B5);
\draw[Arr](B5)--(B6);
\draw[violet,line width=1.5pt](B1.south west)--coordinate(S1)(B1.south east);
\draw[violet,line width=1.5pt](B2.south west)--coordinate(S2)(B2.south east);
\draw[violet,line width=1.5pt](B3.south west)--coordinate(S3)(B3.south east);
\draw[violet,line width=1.5pt](B4.south west)--coordinate(S4)(B4.south east);
\draw[violet,line width=1.5pt](B5.south west)--coordinate[pos=0.25](S5)(B5.north west);
\draw[violet,line width=1.5pt](B6.south west)--coordinate(S6)(B6.north west);
\node[Box2,anchor=north,below=0.6 of S1](CR){Client\\(Request)};
\draw[LineA](S1)--(CR);
\node[Box2,text width=25mm,right=0.75 of CR](NI){Network Ingress\\(HTTP/gRPC)};
\draw[LineA](S2)--(NI);
\draw[LineA](S3)--++(210:1.15);
\node[Box2,right=0.55 of NI](RQT){Request\\Queue};
\node[Box2,right=0.4 of RQT](DBT){Dynamic\\Batcher};
\draw[LineA](S4)--(DBT);
\node[Box2,left=of S5,text width=26mm](IRT){Inference Runner\\(TensorRT/ONNX)};
\draw[LineA](S5)--(IRT);
\draw[LineA](S6)--++(180:1.0)coordinate(AC);
\node[Box2,text width=26mm,left=of S6]{Accelerator\\(GPU/TPU)};
% Labels
\node[above=0pt of B3, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Request Buffering};
\node[above=0pt of B4, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Throughput Opt.};
\node[right=12pt of B5.center, align=left,font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Execution\\ Opt.};
\end{tikzpicture}
```
:::
This architecture serves three functions. First, *concurrency management*: servers use asynchronous event loops or thread pools to handle thousands of concurrent client connections without blocking, ensuring that network I/O wait times do not idle the accelerator. Second, *request transformation*\index{Request Transformation!tensor formats}: the server converts network payloads (JSON/Protobuf) into the specific tensor formats required by the optimized model runtime. Image tensors, for example, can be stored as NCHW[^fn-nchw-nhwc-serving]\index{NCHW!tensor layout} (batch, channels, height, width) or NHWC\index{NHWC!tensor layout} (batch, height, width, channels). PyTorch and TensorRT prefer NCHW because it places channel data contiguously, enabling efficient convolution on GPUs. TensorFlow defaults to NHWC, which is more efficient on CPUs.
[^fn-nchw-nhwc-serving]: **NCHW and NHWC (Tensor Memory Layouts)**: These acronyms encode the memory layout order of 4D image tensors: N (batch), C (channels), H (height), W (width). NCHW places all values for one channel contiguously, enabling vectorized convolution on GPUs; NHWC interleaves channels at each spatial position, aligning better with CPU SIMD instructions. A format mismatch between client and server silently corrupts inference: the model interprets pixel rows as color channels, producing garbage outputs without raising errors. \index{NCHW!tensor layout}
Third, *model management*: inference servers manage the lifecycle of models, including loading weights into VRAM, managing versioning, and ensuring that warmup inferences are completed before exposing the model to live traffic.
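The layout mismatch described in the second function fails silently, so it is worth seeing concretely. The sketch below converts a batch between NHWC and NCHW with NumPy; the shapes are illustrative, and a production server would perform the equivalent transpose on the deserialized payload before handing tensors to the runtime.

```{.python}
# Illustrative sketch: converting between NHWC and NCHW tensor layouts.
# Shapes and values are made up for demonstration only.
import numpy as np

batch = np.random.rand(8, 224, 224, 3).astype(np.float32)  # NHWC: (N, H, W, C)

# Transpose to NCHW: (N, C, H, W), the layout PyTorch/TensorRT prefer
batch_nchw = np.transpose(batch, (0, 3, 1, 2))
assert batch_nchw.shape == (8, 3, 224, 224)

# np.transpose returns a view; ascontiguousarray materializes the new
# memory layout so each channel's data is actually stored contiguously.
batch_nchw = np.ascontiguousarray(batch_nchw)

# Converting back (e.g., for an NHWC-preferring CPU consumer)
batch_nhwc = np.transpose(batch_nchw, (0, 2, 3, 1))
assert batch_nhwc.shape == (8, 224, 224, 3)
```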
Of these components, the scheduler deserves special attention because it embodies the core serving tradeoff between throughput and latency.
### The Scheduler: Where Throughput Meets Latency {#sec-model-serving-scheduler-throughput-meets-latency-d022}
The **Scheduler**\index{Scheduler!inference server} is the "brain" of the inference server. It implements the dynamic batching logic discussed in @sec-model-serving-throughput-optimization-18d1. The scheduler must decide whether to run a single request immediately to minimize its latency or wait 5 milliseconds for a second request and process them together to maximize throughput.
Systems designers use the **Batching Window**\index{Batching Window!latency-throughput tradeoff} parameter to tune this trade-off. A window of 0 ms optimizes for pure latency (no batching), while a window of 10--50 ms is common for high-throughput cloud services. This decision determines the "duty cycle" of the GPU, the percentage of time the hardware is actually computing versus waiting for work.
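To make the scheduler's decision concrete, the sketch below implements the core batching-window loop under simplifying assumptions: a single thread-safe `queue.Queue` of incoming requests and no shape bucketing or padding. Production schedulers (such as Triton's dynamic batcher) layer priorities and per-model configuration on top of this logic.

```{.python}
# Minimal sketch of a batching-window scheduler. The queue, window, and
# batch-size limits are illustrative; real servers add shape bucketing,
# priorities, and per-model configuration on top of this loop.
import queue
import time

def collect_batch(request_queue, max_batch_size=32, window_ms=5.0):
    """Block for the first request, then gather more until the window
    expires or the batch is full."""
    batch = [request_queue.get()]               # wait for at least one request
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                # window closed: run what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break                                # no further arrivals in the window
    return batch
```

With `window_ms=0` the loop returns immediately after the first arrival (pure latency mode); widening the window toward 10--50 ms lets batches fill and trades per-request latency for throughput.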
### Interface Protocols and Serialization {#sec-model-serving-interface-protocols-serialization-5510}
The mechanism used to transport data between client and server directly affects the latency budget. Model inference is often highly optimized, yet the cost of moving data into the model (serialization and network protocol overhead) can become the dominant bottleneck, especially for lightweight models where inference time is small.
#### The Serialization Bottleneck {#sec-model-serving-serialization-bottleneck-aaa0}
Text-based\index{Serialization!overhead} formats like JSON are ubiquitous but computationally expensive. Parsing a JSON object requires reading every byte, validating syntax, and converting text representations into machine-native types. For high-throughput systems, this consumes CPU cycles that could otherwise be used for request handling or preprocessing.
\index{FlatBuffers!zero-copy serialization}
Binary formats like Protocol Buffers[^fn-protobuf-serialization] (Protobuf) or FlatBuffers[^fn-flatbuffers-zerocopy] reduce this overhead by designing the wire format to map directly to in-memory data structures. This enables "zero-copy" deserialization in optimal cases, where the network buffer can be used directly without allocating new memory.
[^fn-protobuf-serialization]: **Protocol Buffers (Protobuf)**: Protobuf uses a pre-defined schema (from a `.proto` file) to encode data into a compact binary format, eliminating the need to parse field names or type information. While this makes deserialization 20--100$\times$ faster than with text formats like JSON, its wire format is not identical to a C++ object's in-memory layout. This distinction means it still requires a final parsing step and cannot achieve the true "zero-copy" access that FlatBuffers enables. \index{Protocol Buffers!serialization}
\index{FlatBuffers!etymology}
[^fn-flatbuffers-zerocopy]: **FlatBuffers**: The "flat" in the name describes the design: the binary buffer serves simultaneously as the serialized and in-memory representation, requiring no parsing or unpacking. For ML inference, this enables true zero-copy access to tensor metadata---the serving system reads tensor shapes and data pointers directly from the network buffer without allocating new memory, reducing per-request serialization overhead to near zero. TensorFlow Lite adopted FlatBuffers as its model format for exactly this reason. \index{FlatBuffers!zero-copy serialization}
#### REST vs gRPC {#sec-model-serving-rest-vs-grpc-c7b7}
Two dominant paradigms define modern serving interfaces, each with distinct system characteristics. REST (Representational State Transfer)\index{REST!HTTP/1.1 protocol} typically uses HTTP/1.1 and JSON. It is universally supported, human-readable, and stateless, making it the default choice for public-facing APIs. However, REST's statelessness forces re-sending context with every call; for LLM serving, where a conversation context can exceed 10 KB of token IDs, this per-request overhead compounds at high QPS. Standard HTTP/1.1 also requires a new TCP handshake for each request (unless keep-alive is carefully tuned), and JSON serialization adds significant latency for numerical data like tensors.
In contrast, gRPC (gRPC Remote Procedure Call)\index{gRPC!inference protocol}[^fn-grpc-inference] uses HTTP/2 and Protobuf\index{Protocol Buffers!serialization}. HTTP/2 enables multiplexing multiple requests over a single persistent TCP connection, eliminating handshake latency and allowing efficient binary streaming. Protobuf provides strict type safety and efficient binary serialization, making it the standard for internal service-to-service communication where latency is critical.
[^fn-grpc-inference]: **gRPC (gRPC Remote Procedure Call)**: Evolved from Google's internal Stubby framework, gRPC was designed to minimize the overhead of the billions of inter-service calls made per second. It achieves this by pairing HTTP/2 for persistent connection multiplexing with Protobuf for efficient binary serialization, directly addressing the handshake and parsing latencies inherent to REST/JSON. This combination yields a ~$10\times$ reduction in serialization overhead, making it the standard for internal APIs where such costs are a significant fraction of the latency budget. \index{gRPC!inference protocol}
The following example compares *JSON vs Protobuf serialization*.
```{python}
#| label: serialization-comparison-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ JSON VS PROTOBUF SERIALIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "JSON vs Protobuf Serialization"
# │
# │ Goal: Quantify the serialization tax in high-throughput inference.
# │ Show: The 10× efficiency gain of Protobuf over JSON for vector data.
# │ How: Calculate parsing overhead and wire size for a 1000-float payload.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: serial_floats_str, json_size_str, json_parse_str, protobuf_size_str,
# │ protobuf_parse_str, requests_per_sec_str, efficiency_gain_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
# ┌── LEGO ───────────────────────────────────────────────
class SerializationEfficiency:
"""
Namespace for Serialization Efficiency calculation.
Scenario: Comparing JSON vs Protobuf for a 1000-float payload.
"""
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
floats_count = 1000
json_parse_us = 50.0
proto_parse_us = 5.0
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
efficiency_gain = json_parse_us / proto_parse_us
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(efficiency_gain >= 5, f"Protobuf gain ({efficiency_gain:.1f}x) is too small to justify switching.")
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
serial_floats_str = f"{floats_count:,}"
json_size_str = "9"
json_parse_str = f"{int(json_parse_us)}"
protobuf_size_str = "4"
protobuf_parse_str = f"{int(proto_parse_us)}"
requests_per_sec_str = "10,000"
efficiency_gain_str = fmt(efficiency_gain, precision=0, commas=False)
```
::: {.callout-notebook title="JSON vs Protobuf Serialization"}
Consider a request payload containing `{python} SerializationEfficiency.serial_floats_str` floating point numbers (e.g., an embedding vector).
* **JSON**: Uses ~`{python} SerializationEfficiency.json_size_str` KB on the wire. Requires ~`{python} SerializationEfficiency.json_parse_str` μs to parse.
* **Protobuf**: Uses ~`{python} SerializationEfficiency.protobuf_size_str` KB on the wire. Requires ~`{python} SerializationEfficiency.protobuf_parse_str` μs to parse.
For a system processing `{python} SerializationEfficiency.requests_per_sec_str` requests per second, switching to Protobuf saves nearly half a core of CPU time in serialization overhead alone. This `{python} SerializationEfficiency.efficiency_gain_str`$\times$ efficiency gain makes gRPC essential for high-throughput internal microservices.
:::
The system choice is clear: use REST for public APIs to maximize developer accessibility, and use gRPC for high-performance internal communication to minimize the serialization tax.
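The size half of this serialization tax is easy to verify directly. The sketch below compares JSON text against a raw binary encoding of the same 1000-float vector; NumPy bytes stand in for a Protobuf payload (which would require a compiled schema), and absolute timings are machine-dependent, but the size and parse-cost ordering matches the callout above.

```{.python}
# Rough, machine-dependent comparison of text vs binary encoding for a
# 1000-float embedding. NumPy's raw bytes stand in for a Protobuf payload.
import json
import time
import numpy as np

vec = np.random.rand(1000).astype(np.float32)

json_payload = json.dumps(vec.tolist()).encode("utf-8")
binary_payload = vec.tobytes()                               # 4 KB: 1000 x 4 bytes

print(f"JSON size:   {len(json_payload) / 1024:.1f} KB")
print(f"Binary size: {len(binary_payload) / 1024:.1f} KB")

def time_it(fn, iters=1000):
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6       # microseconds

print(f"JSON parse:   {time_it(lambda: json.loads(json_payload)):.0f} us")
print(f"Binary parse: {time_it(lambda: np.frombuffer(binary_payload, dtype=np.float32)):.0f} us")
```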
The architectural components and protocols examined so far describe *how* serving systems are built. Understanding *why* certain configurations perform better requires analyzing what happens to individual requests as they traverse these components.
## Request Lifecycle {#sec-model-serving-request-lifecycle-d9c6}
A single HTTP request carrying a 224$\times$224 JPEG image arrives at an inference server. Between the moment the first byte enters the network stack and the moment the classification result leaves, that request traverses six pipeline stages, each consuming milliseconds that the user experiences as wait time. Understanding *where* time goes within each request is essential for effective optimization: one cannot improve what one does not measure.
### The Latency Budget {#sec-model-serving-latency-budget-ef40}
For dynamic inference systems\index{Latency Budget!optimization objectives}, the serving inversion established in @sec-model-serving-serving-paradigm-9634 has concrete implications for system design [@gujarati2020serving]. A serving system with 1000 ms per-request latency has failed, even if it achieves excellent throughput.
```{python}
#| label: tail-latency-ratio-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ TAIL LATENCY RATIO
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency Budget introduction paragraph
# │
# │ Goal: Demonstrate why mean latency is a misleading metric for user experience.
# │ Show: That p99 users can wait 40× longer than the median.
# │ How: Calculate the ratio between mean and tail response times.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: tail_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class TailLatencyRatioCalc:
"""Demonstrates why mean latency misleads by showing the p99-to-mean ratio."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
mean_latency_ms_value = 50 # mean latency (ms)
p99_latency_ms_value = 2000 # p99 latency (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
tail_ratio_value = p99_latency_ms_value / mean_latency_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
tail_ratio_str = fmt(tail_ratio_value, precision=0, commas=False) # e.g. "40" times
```
The metrics that matter change from aggregate throughput to latency distributions. Mean latency reveals little about user experience; p50\index{p50 Latency!median response time}, p95\index{p95 Latency!percentile target}, and p99 latencies\index{Latency!percentiles (p50, p95, p99)} reveal *how* the system performs across the full range of requests. If the mean latency is 50 ms but p99 is 2 seconds, one in a hundred users waits `{python} TailLatencyRatioCalc.tail_ratio_str` times longer than average. For consumer-facing applications, these tail latencies often determine user satisfaction and retention.[^fn-tail-latency-serving]
[^fn-tail-latency-serving]: **Tail Latency**: Unlike averages, percentile latencies reveal the performance impact of system outliers common in ML serving, such as model cache misses or garbage collection pauses. These rare, high-latency requests disproportionately harm user satisfaction and directly impact revenue. Foundational studies at Google and Amazon quantified this relationship, finding that 100 ms of added latency cost ~1% in sales, establishing percentile targets (p95, p99) as the critical metrics for service quality. \index{Tail Latency!revenue impact}
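The percentile metrics themselves are straightforward to compute from raw measurements. The sketch below uses a synthetic long-tailed latency distribution (the parameters are illustrative, not taken from any real service) to show how the mean and the tail diverge.

```{.python}
# Synthetic latency samples with a long tail, to illustrate why mean
# latency and percentile latency tell different stories.
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=np.log(40), sigma=0.6, size=100_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"mean = {latencies_ms.mean():.0f} ms")
print(f"p50  = {p50:.0f} ms, p95 = {p95:.0f} ms, p99 = {p99:.0f} ms")
```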
Managing these percentile constraints requires decomposing the total allowed response time into a *latency budget*\index{Latency Budget!request lifecycle breakdown} that allocates time across each processing phase.
::: {.callout-definition title="Latency Budget"}
***Latency Budget***\index{Latency Budget!definition} is the **Time Capital** allocated to a request, strictly bounded by the end-to-end **Service Level Objective (SLO)**.
1. **Significance (Quantitative):** It acts as a **Zero-Sum Constraint System** where any milliseconds consumed by serialization or network overhead directly reduce the computational budget ($L_{\text{lat}}$) available for model inference.
2. **Distinction (Durable):** Unlike **Average Latency**, which hides variance, a Latency Budget is a **Hard Bound** that must be maintained for the slowest requests (e.g., p99).
3. **Common Pitfall:** A frequent misconception is that the "model" has the entire budget. In reality, the model often has less than **50% of the total budget**; the remainder is consumed by the **Request Lifecycle** (DNS, TLS, Load Balancing, Serialization).
:::
Before computing a full budget, we pose the foundational *latency analysis questions* that every serving engineer must answer.
::: {.callout-notebook title="ResNet-50: Latency Analysis Questions"}
Serving is about optimizing the **Tail Latency** under load.
**The Physics of Latency**
Consider these foundational questions:
1. **Queuing Theory**: Why do latency spikes occur non-linearly as utilization approaches 100%? The M/M/1 queue model explains this behavior.
2. **Batching Trade-offs**: Why does increasing batch size improve throughput (images/sec) yet degrade latency (ms/request)?
**Optimization Targets**
3. **The Bottleneck**: In a highly optimized inference server, why does **Preprocessing** often consume more time than the model itself?
:::
Every serving request decomposes into three phases that each consume part of the latency budget. Preprocessing\index{Preprocessing!latency impact} transforms raw input such as image bytes or text strings into model-ready tensors. Inference\index{Inference!pipeline phase} executes the model computation. Postprocessing\index{Postprocessing!response formatting} transforms model outputs into user-facing responses.
Faster hardware does not automatically mean faster serving\index{Amdahl's Law!preprocessing bottleneck}. In practice, preprocessing and postprocessing often dominate total latency. Studies of production systems show preprocessing consuming 60 to 70 percent of total request time when inference runs on optimized accelerators [@nvidia_triton]. Optimizing only the inference phase yields diminishing returns when the surrounding pipeline remains bottlenecked on CPU operations.
### Latency Distribution Analysis {#sec-model-serving-latency-distribution-analysis-b0f8}
Understanding *where* time goes requires instrumenting each phase independently. A *ResNet-50 latency budget breakdown* reveals exactly how each millisecond is spent when our classifier receives a JPEG image:
```{python}
#| label: latency-table-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LATENCY BUDGET BREAKDOWN TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Latency Budget Breakdown" table
# │
# │ Goal: Decompose the request lifecycle into processing phases.
# │ Show: That non-inference tasks (JPEG decode, resize) consume 50% of the latency budget.
# │ How: Sum millisecond-scale components for a standard vision inference request.
# │
# │ Imports: (none)
# │ Exports: l_jpeg_str, l_resize_str, l_norm_str, l_transfer_str, l_inf_str,
# │ l_post_str, l_total_str, p_jpeg_str, p_resize_str, p_norm_str,
# │ p_transfer_str, p_inf_str, p_post_str
# └─────────────────────────────────────────────────────────────────────────────
class LatencyTableCalc:
"""Decomposes the ResNet-50 request lifecycle into processing phases with percentages."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
l_jpeg_value = 3.0 # JPEG decode (ms)
l_resize_value = 1.0 # resize to 224×224 (ms)
l_norm_value = 0.5 # normalize (mean/std) (ms)
l_transfer_value = 0.5 # CPU→GPU transfer (ms)
l_inf_value = 5.0 # ResNet-50 forward pass (ms)
l_post_value = 0.1 # softmax + top-5 (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
l_total_value = l_jpeg_value + l_resize_value + l_norm_value + l_transfer_value + l_inf_value + l_post_value
p_jpeg_value = l_jpeg_value / l_total_value * 100
p_resize_value = l_resize_value / l_total_value * 100
p_norm_value = l_norm_value / l_total_value * 100
p_transfer_value = l_transfer_value / l_total_value * 100
p_inf_value = l_inf_value / l_total_value * 100
p_post_value = l_post_value / l_total_value * 100
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
l_jpeg_str = f"{l_jpeg_value:.1f}ms"
l_resize_str = f"{l_resize_value:.1f}ms"
l_norm_str = f"{l_norm_value:.1f}ms"
l_transfer_str = f"{l_transfer_value:.1f}ms"
l_inf_str = f"{l_inf_value:.1f}ms"
l_post_str = f"{l_post_value:.1f}ms"
l_total_str = f"{l_total_value:.1f}ms"
p_jpeg_str = f"{p_jpeg_value:.0f}%"
p_resize_str = f"{p_resize_value:.0f}%"
p_norm_str = f"{p_norm_value:.0f}%"
p_transfer_str = f"{p_transfer_value:.0f}%"
p_inf_str = f"{p_inf_value:.0f}%"
p_post_str = f"~{p_post_value:.0f}%"
```
```{python}
#| label: latency-budget-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PREPROCESSING SHARE OF LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency distribution narrative (key insight paragraph)
# │
# │ Goal: Demonstrate the shifting bottleneck from inference to preprocessing.
# │ Show: That optimized inference (TensorRT) makes preprocessing 68% of total latency.
# │ How: Compare phase durations before and after model acceleration.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: preprocess_ms_str, cpu_gpu_ms_str, resnet_inference_ms_str,
# │ tensorrt_inference_ms_str, model_10x_ms_str, total_latency_str,
# │ preprocess_pct_str, tensorrt_preprocess_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class LatencyBudgetCalc:
"""Demonstrates the shifting bottleneck from inference to preprocessing after TensorRT optimization."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
jpeg_decode_ms_value = 3.0 # JPEG decode (ms)
resize_ms_value = 1.0 # resize (ms)
normalize_ms_value = 0.5 # normalize (ms)
cpu_gpu_ms_value = 0.5 # CPU→GPU transfer (ms)
resnet_inference_ms_value = 5.0 # PyTorch inference (ms)
postprocess_ms_value = 0.1 # postprocessing (ms)
tensorrt_inference_ms_value = 2.0 # TensorRT optimized inference (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
preprocess_ms_value = jpeg_decode_ms_value + resize_ms_value + normalize_ms_value
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value
preprocess_pct_value = preprocess_ms_value / total_latency_ms_value * 100
tensorrt_total_ms_value = preprocess_ms_value + cpu_gpu_ms_value + tensorrt_inference_ms_value + postprocess_ms_value
tensorrt_preprocess_pct_value = preprocess_ms_value / tensorrt_total_ms_value * 100
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
preprocess_ms_str = fmt(preprocess_ms_value, precision=1, commas=False)
cpu_gpu_ms_str = fmt(cpu_gpu_ms_value, precision=1, commas=False)
resnet_inference_ms_str = fmt(resnet_inference_ms_value, precision=0, commas=False)
tensorrt_inference_ms_str = fmt(tensorrt_inference_ms_value, precision=0, commas=False)
model_10x_ms_str = fmt(resnet_inference_ms_value / 10, precision=1, commas=False)
total_latency_str = fmt(total_latency_ms_value, precision=1, commas=False)
preprocess_pct_str = fmt(preprocess_pct_value, precision=0, commas=False)
tensorrt_preprocess_pct_str = fmt(tensorrt_preprocess_pct_value, precision=0, commas=False)
```
::: {.callout-notebook title="ResNet-50: Latency Budget Breakdown"}
A typical serving request for our ResNet-50 classifier shows the following latency distribution:
| **Phase** | **Operation** | **Time** | **Percentage** |
|:-------------------|:---------------------------|:--------------------------------------------|:-------------------------------------------|
| **Preprocessing** | JPEG decode | `{python} LatencyTableCalc.l_jpeg_str` | `{python} LatencyTableCalc.p_jpeg_str` |
| **Preprocessing** | Resize to $224\times224$ | `{python} LatencyTableCalc.l_resize_str` | `{python} LatencyTableCalc.p_resize_str` |
| **Preprocessing** | Normalize (mean/std) | `{python} LatencyTableCalc.l_norm_str` | `{python} LatencyTableCalc.p_norm_str` |
| **Data Transfer** | CPU→GPU copy | `{python} LatencyTableCalc.l_transfer_str` | `{python} LatencyTableCalc.p_transfer_str` |
| **Inference** | **ResNet-50 forward pass** | **`{python} LatencyTableCalc.l_inf_str`** | **`{python} LatencyTableCalc.p_inf_str`** |
| **Postprocessing** | Softmax + top-5 | `{python} LatencyTableCalc.l_post_str` | `{python} LatencyTableCalc.p_post_str` |
| **Total** | | **`{python} LatencyTableCalc.l_total_str`** | **100%** |
Key insight: preprocessing consumes `{python} LatencyBudgetCalc.preprocess_pct_str`% of latency despite model inference being the computationally intensive phase. With TensorRT optimization reducing inference to `{python} LatencyBudgetCalc.tensorrt_inference_ms_str` ms, preprocessing would dominate at `{python} LatencyBudgetCalc.tensorrt_preprocess_pct_str`%.
:::
The ResNet example represents compute-bound inference where math dominates. Recommendation systems exhibit a different bottleneck profile entirely.
```{python}
#| label: dlrm-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DLRM SERVING LATENCY (IO-BOUND EXAMPLE)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Lighthouse example "DLRM Serving"
# │
# │ Goal: Contrast compute-bound and I/O-bound serving bottlenecks.
# │ Show: That embedding lookups consume 67% of recommendation latency.
# │ How: Model DLRM latency across parsing, embedding, and MLP phases.
# │
# │ Imports: (none)
# │ Exports: dlrm_input_str, dlrm_embed_str, dlrm_mlp_str, dlrm_post_str,
# │ dlrm_total_str
# └─────────────────────────────────────────────────────────────────────────────
class DlrmLatencyCalc:
"""Models DLRM serving latency to contrast I/O-bound vs. compute-bound bottlenecks."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
dlrm_input_ms_value = 0.5 # request parsing (CPU) (ms)
dlrm_embed_ms_value = 6.0 # embedding lookups (memory BW) (ms)
dlrm_mlp_ms_value = 1.5 # MLP forward pass (compute) (ms)
dlrm_post_ms_value = 1.0 # ranking & filtering (CPU) (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
dlrm_total_ms_value = dlrm_input_ms_value + dlrm_embed_ms_value + dlrm_mlp_ms_value + dlrm_post_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
dlrm_input_str = f"{dlrm_input_ms_value}ms"
dlrm_embed_str = f"{dlrm_embed_ms_value}ms"
dlrm_mlp_str = f"{dlrm_mlp_ms_value}ms"
dlrm_post_str = f"{dlrm_post_ms_value}ms"
dlrm_total_str = f"{dlrm_total_ms_value}ms"
```
::: {.callout-lighthouse title="Lighthouse Example: DLRM Serving"}
**The Scenario**: Serving a Recommendation System (DLRM) with a 10 ms P99 latency budget.
**The Contrast**: While ResNet-50 serving is limited by math (CNN ops), DLRM serving is strictly limited by I/O and memory capacity.
| **Phase** | **Operation** | **Time** | **Bottleneck** |
|:-------------------|:-----------------------------|:----------------------------------------------|:---------------|
| **Input Parsing** | Request parsing | `{python} DlrmLatencyCalc.dlrm_input_str` | CPU |
| **Embedding Look** | **Fetch 100+ dense vectors** | **`{python} DlrmLatencyCalc.dlrm_embed_str`** | **Memory BW** |
| **Inference** | MLP forward pass | `{python} DlrmLatencyCalc.dlrm_mlp_str` | Compute |
| **Postprocessing** | Ranking & Filtering | `{python} DlrmLatencyCalc.dlrm_post_str` | CPU |
| **Total** | | **`{python} DlrmLatencyCalc.dlrm_total_str`** | |
**Key Systems Insight**:
In DLRM, the "Inference" (MLP) is only ~15% of the latency. The majority of time is spent in embedding lookups, retrieving massive 128-dim vectors from terabyte-scale tables. This is an IO-bound workload where adding more GPUs does not help unless memory bandwidth and capacity also increase.
:::
This breakdown reveals why straightforward optimization efforts often fail. Engineers focus on model optimization (quantization, pruning) because that is where ML expertise applies, but the actual bottleneck is image decoding running on CPU. Adopting *the quantitative approach to serving* exposes these hidden bottlenecks before engineering effort is misallocated.
```{python}
#| label: amdahl-serving-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL'S LAW IN SERVING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Quantitative Approach to Serving"
# │
# │ Goal: Demonstrate why model-only optimization yields diminishing returns.
# │ Show: That a 10× model speedup produces only 1.8× end-to-end improvement.
# │ How: Apply Amdahl's Law using the non-inference latency share.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: non_model_pct_str, optimized_total_str, amdahl_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class AmdahlServingCalc:
"""Applies Amdahl's Law to show that 10× model speedup yields only ~1.8× end-to-end improvement."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
# Re-derived from latency-budget-calc constants
preprocess_ms_value = 3.0 + 1.0 + 0.5 # JPEG decode + resize + normalize
cpu_gpu_ms_value = 0.5 # CPU→GPU transfer
resnet_inference_ms_value = 5.0 # PyTorch inference
postprocess_ms_value = 0.1 # postprocessing
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value
non_model_ms_value = preprocess_ms_value + cpu_gpu_ms_value
non_model_pct_value = non_model_ms_value / total_latency_ms_value * 100
model_10x_ms_value = resnet_inference_ms_value / 10
optimized_total_ms_value = non_model_ms_value + model_10x_ms_value + postprocess_ms_value
amdahl_speedup_value = total_latency_ms_value / optimized_total_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
non_model_pct_str = fmt(non_model_pct_value, precision=0, commas=False)
optimized_total_str = fmt(optimized_total_ms_value, precision=1, commas=False)
amdahl_speedup_str = fmt(amdahl_speedup_value, precision=1, commas=False)
```
::: {.callout-notebook title="The Quantitative Approach to Serving"}
**Amdahl's Law at Work** (see @sec-machine-foundations-amdahls-law-gustafsons-law-b741 for the formal derivation): preprocessing (`{python} LatencyBudgetCalc.preprocess_ms_str` ms) and data transfer (`{python} LatencyBudgetCalc.cpu_gpu_ms_str` ms) consume `{python} AmdahlServingCalc.non_model_pct_str`% of total latency. Optimizing the model 10$\times$ faster (`{python} LatencyBudgetCalc.resnet_inference_ms_str` ms → `{python} LatencyBudgetCalc.model_10x_ms_str` ms) yields only `{python} AmdahlServingCalc.amdahl_speedup_str`$\times$ end-to-end speedup (from `{python} LatencyBudgetCalc.total_latency_str` ms to `{python} AmdahlServingCalc.optimized_total_str` ms). This is why focusing exclusively on model optimization (quantization, pruning) often disappoints: the bottleneck is elsewhere.
**DSA Efficiency**: General-purpose CPUs achieve only 1--2% of peak performance at batch-1 because instruction overhead dominates. DSAs like TPUs and Tensor Cores replace complex logic with dense MAC arrays, achieving 10--100$\times$ higher arithmetic intensity. This makes hardware acceleration a requirement for economically viable serving.
**Engineering Implication**: Profile before optimizing. If preprocessing dominates, GPU-accelerated pipelines (NVIDIA DALI) may outperform model quantization.
:::
Moving preprocessing to GPU\index{GPU Preprocessing!accelerated pipelines} can reduce total latency by 6$\times$ in some pipelines by eliminating CPU-GPU data transfers between stages [@nvidia_triton].
Effective optimization targets the largest time consumers first.
#### The Serving Tax Bill {#sec-model-serving-serving-tax-bill-dc6c}
Beyond the model execution itself, every request pays a "tax" to the serving infrastructure. @tbl-serving-tax quantifies these overheads for a typical high-performance inference request (e.g., ResNet-50 classification).
| **Tax Component** | **Typical Cost** | **Scaling Behavior** | **Tax Evasion Strategy** |
|:------------------|----------------------:|:---------------------|:--------------------------------|
| **Network I/O**    | 1--5 ms                | Linear with payload  | Compression, Region Colocation  |
| **Serialization**  | 50--500 $\mu\text{s}$  | Linear with payload  | gRPC/Protobuf (vs JSON)         |
| **Queuing**        | 0.1--10 ms             | Exponential w/ load  | Dynamic Batching, Autoscaling   |
| **Dispatch**       | 10--50 $\mu\text{s}$   | Constant per batch   | Kernel Fusion (reduce launches) |
| **Data Copy**      | 100--500 $\mu\text{s}$ | Linear with tensor   | Zero-Copy / Shared Memory       |
: **The Serving Tax Bill**: A breakdown of non-inference latency sources. While individual components like serialization seem small ($<1$ ms), they compound. In a 5 ms inference service, this "tax" can easily consume 50% of the latency budget. The primary engineering goal is to drive these costs to zero through architectural choices like gRPC and Zero-Copy data paths. {#tbl-serving-tax}
#### The Killer Microseconds Problem {#sec-model-serving-killer-microseconds-problem-bc00}
Barroso, Patterson, and colleagues identified a critical gap in *how* systems handle latency at different time scales\index{Killer Microseconds!latency gap} [@barroso2017attack]. Operations in the microsecond range are too short for traditional OS scheduling (which operates at millisecond granularity) yet too long to simply spin-wait without wasting CPU cycles. This "killer microseconds" regime dominates modern serving workloads. Consider the compound effect visible in @tbl-serving-tax: serialization at 50 μs, dispatch at 10--50 μs, and data copy at 100--500 μs are each individually negligible, but for a 5 ms inference service, these microsecond-scale overheads collectively consume half the latency budget. No single overhead justifies optimization in isolation, yet together they determine whether the system meets its SLO.
The latency budget framework provides a systematic approach to this compound problem. Measurement comes first: without per-phase instrumentation, engineers cannot distinguish a preprocessing bottleneck from a serialization bottleneck, and optimization effort gets misallocated to the most visible component (the model) rather than the most expensive one. Once measurement reveals the true distribution of time, engineering effort should flow proportionally—a phase consuming 50% of latency deserves more attention than one consuming 5%, regardless of which feels more tractable. Architectural changes such as GPU-accelerated preprocessing or aggressive batching can shift work between phases entirely, sometimes eliminating a bottleneck rather than merely reducing it.
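A minimal sketch of such per-phase instrumentation appears below. The four stage callables are passed in as parameters because the specific decode, preprocessing, inference, and postprocessing implementations vary by deployment; the point is the timing structure, not the stages themselves.

```{.python}
# Minimal per-phase instrumentation sketch. The stage functions are passed
# in; in a real server they wrap the actual pipeline implementations.
import time
from contextlib import contextmanager

def handle_request(raw_bytes, decode, preprocess, infer, postprocess):
    """Run one request and record per-phase wall-clock time in milliseconds."""
    timings_ms = {}

    @contextmanager
    def phase(name):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings_ms[name] = (time.perf_counter() - start) * 1000

    with phase("decode"):
        image = decode(raw_bytes)
    with phase("preprocess"):
        tensor = preprocess(image)
    with phase("inference"):
        logits = infer(tensor)
    with phase("postprocess"):
        result = postprocess(logits)
    return result, timings_ms
```

Exporting these per-phase timings as histograms, rather than a single end-to-end number, is what makes percentile breakdowns like the tables above possible in practice.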
### Resolution and Input Size Tradeoffs {#sec-model-serving-resolution-input-size-tradeoffs-155d}
Input resolution affects both preprocessing and inference latency, but the relationship differs depending on whether the system is compute-bound\index{Compute-Bound!resolution scaling} (limited by arithmetic throughput) or memory-bound\index{Memory-Bound!activation tensors} (limited by data movement). A compute-bound system slows proportionally to increased computation; a memory-bound system may show minimal slowdown if activation tensors still fit in fast memory. @sec-hardware-acceleration covers this distinction in depth through roofline model analysis; understanding it is essential for making informed resolution decisions.
For compute-bound models, @eq-resolution-throughput formalizes how throughput ($X$) scales inversely with resolution squared:
$$\frac{X(r_2)}{X(r_1)} = \left(\frac{r_1}{r_2}\right)^2$$ {#eq-resolution-throughput}
```{python}
#| label: resolution-scaling-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESOLUTION SCALING SLOWDOWN
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Resolution and input size tradeoffs narrative
# │
# │ Goal: Quantify the relationship between input resolution and latency.
# │ Show: The quadratic relationship between resolution and computation time.
# │ How: Calculate theoretical slowdown for doubled resolution.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: r1_str, r2_str, theoretical_str, measured_slowdown_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class ResolutionScalingCalc:
"""Quantifies the quadratic relationship between input resolution and inference slowdown."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
r1_value = 224 # original resolution
r2_value = 448 # doubled resolution
measured_slowdown_value = 3.6 # actual measured slowdown
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
theoretical_slowdown_value = (r2_value / r1_value) ** 2
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
r1_str = f"{r1_value}"
r2_str = f"{r2_value}"
theoretical_str = fmt(theoretical_slowdown_value, precision=0, commas=False)
measured_slowdown_str = f"{measured_slowdown_value:.1f}"
```
Doubling resolution from `{python} ResolutionScalingCalc.r1_str` to `{python} ResolutionScalingCalc.r2_str` theoretically yields `{python} ResolutionScalingCalc.theoretical_str`$\times$ slowdown (measured: `{python} ResolutionScalingCalc.measured_slowdown_str`$\times$ due to fixed overhead amortization). However, at high resolutions, models transition from compute-bound to memory-bound as activation tensors exceed cache capacity. @tbl-resolution-bottleneck quantifies this transition for ResNet-50, showing how arithmetic intensity decreases with resolution:
```{python}
#| label: resolution-bottleneck-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESOLUTION AND COMPUTE BOTTLENECK TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-resolution-bottleneck (Resolution and Compute Bottleneck)
# │
# │ Goal: Demonstrate the shift from compute-bound to memory-bound operation.
# │ Show: That increasing resolution decreases arithmetic intensity.
# │ How: Compare activation sizes and FLOPs per element against the ridge point.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: act_*_mb_str, ai_*_str, ridge_point_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class ResolutionBottleneckCalc:
"""Shows that increasing resolution decreases arithmetic intensity, shifting from compute- to memory-bound."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
act_224_mb_value = 12.5 # 224×224 activation size (MB)
act_384_mb_value = 36.8 # 384×384 activation size (MB)
act_512_mb_value = 65.5 # 512×512 activation size (MB)
act_640_mb_value = 102.4 # 640×640 activation size (MB)
ai_224_value = 85 # 224×224 arithmetic intensity (FLOPs/byte)
ai_384_value = 49 # 384×384 arithmetic intensity (FLOPs/byte)
ai_512_value = 28 # 512×512 arithmetic intensity (FLOPs/byte)
ai_640_value = 18 # 640×640 arithmetic intensity (FLOPs/byte)
ridge_point_value = 16 # V100 ridge point (FLOPs/byte)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
act_224_mb_str = f"{act_224_mb_value:.1f}"
act_384_mb_str = f"{act_384_mb_value:.1f}"
act_512_mb_str = f"{act_512_mb_value:.1f}"
act_640_mb_str = f"{act_640_mb_value:.1f}"
ai_224_str = f"{ai_224_value}"
ai_384_str = f"{ai_384_value}"
ai_512_str = f"{ai_512_value}"
ai_640_str = f"{ai_640_value}"
ridge_point_str = f"{ridge_point_value}"
```
The resulting shift from compute-bound to memory-bound operation is evident in @tbl-resolution-bottleneck:
| **Resolution** | **Activation Size** | **Arith. Intensity** | **Bottleneck** |
|:-------------------|-----------------------------------------------------:|----------------------------------------------------------:|:---------------|
| **$224\times224$** | `{python} ResolutionBottleneckCalc.act_224_mb_str`MB | `{python} ResolutionBottleneckCalc.ai_224_str` FLOPs/byte | Compute |
| **$384\times384$** | `{python} ResolutionBottleneckCalc.act_384_mb_str`MB | `{python} ResolutionBottleneckCalc.ai_384_str` FLOPs/byte | Transitional |
| **$512\times512$** | `{python} ResolutionBottleneckCalc.act_512_mb_str`MB | `{python} ResolutionBottleneckCalc.ai_512_str` FLOPs/byte | Memory BW |
| **$640\times640$** | `{python} ResolutionBottleneckCalc.act_640_mb_str`MB | `{python} ResolutionBottleneckCalc.ai_640_str` FLOPs/byte | Memory BW |
: **Resolution and Compute Bottleneck**: ResNet-50 arithmetic intensity decreases with resolution as activation sizes grow. For a V100 PCIe (15.7 TFLOPS FP32, 900 GB/s bandwidth), the ridge point is approximately 16 FLOPs/byte. At $224\times224,$ compute dominates; by $512\times512,$ memory bandwidth becomes the limiting factor. {#tbl-resolution-bottleneck}
#### Resolution Strategies in Production {#sec-model-serving-deploymentspecific-resolution-decisions-1d76}
Different deployment contexts impose distinct resolution requirements shaped by their dominant constraints. Mobile applications often accept lower resolution ($224\times224$) for object detection in camera viewfinders, where latency and battery life outweigh marginal accuracy gains. Medical imaging sits at the opposite extreme, requiring $512\times512$ or higher for diagnostic accuracy, with relaxed latency requirements that permit the additional compute. Autonomous vehicles split the difference by using multiple resolutions for different tasks: low resolution for rapid detection across wide fields of view and high-resolution crops for fine-grained recognition of detected objects. Cloud APIs face yet another challenge—they typically receive images at whatever resolution the client uploads and must handle the resulting range gracefully. This variability makes cloud APIs ideal candidates for adaptive resolution strategies, where the system selects resolution dynamically based on content characteristics.
```{python}
#| label: adaptive-resolution-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ADAPTIVE RESOLUTION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Adaptive Resolution paragraph (content-based selection)
# │
# │ Goal: Demonstrate the throughput gain from content-aware resolution.
# │ Show: A 1.4× throughput improvement while maintaining high accuracy.
# │ How: List benchmark results for adaptive vs. static high resolution.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: adaptive_throughput_improvement_str, adaptive_accuracy_retention_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class AdaptiveResolutionCalc:
"""Demonstrates 1.4× throughput gain from content-aware resolution selection."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
adaptive_throughput_improvement_value = 1.4 # throughput gain factor
adaptive_accuracy_retention_value = 99.2 # accuracy retention (%)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
adaptive_throughput_improvement_str = fmt(adaptive_throughput_improvement_value, precision=1, commas=False)
adaptive_accuracy_retention_str = fmt(adaptive_accuracy_retention_value, precision=1, commas=False)
```
#### Adaptive Resolution {#sec-model-serving-adaptive-resolution-cb4e}
Production systems\index{Adaptive Resolution!content-based selection} can select resolution dynamically based on content. One approach runs a lightweight classifier at $128\times128$ to categorize content type, then selects task-appropriate resolution with documents at $512\times512,$ landscapes at $224\times224,$ and faces at $384\times384.$ This achieves `{python} AdaptiveResolutionCalc.adaptive_throughput_improvement_str`$\times$ throughput improvement with `{python} AdaptiveResolutionCalc.adaptive_accuracy_retention_str` percent accuracy retention versus fixed high resolution. This pattern trades preprocessing cost from running the lightweight classifier for inference savings on the main model.
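A sketch of this routing pattern is shown below. The lightweight classifier, resizer, and main model runner are passed in as callables, and the category-to-resolution table simply mirrors the illustrative numbers above.

```{.python}
# Sketch of content-aware resolution selection. The stage callables are
# supplied by the caller; the table mirrors the example in the text.
RESOLUTION_BY_CONTENT = {
    "document": (512, 512),
    "landscape": (224, 224),
    "face": (384, 384),
}

def adaptive_infer(image, tiny_classifier, resize, run_model,
                   default_resolution=(384, 384)):
    # Cheap low-resolution pass decides how much detail the main model needs
    content_type = tiny_classifier(resize(image, (128, 128)))
    target = RESOLUTION_BY_CONTENT.get(content_type, default_resolution)
    return run_model(resize(image, target))
```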
The latency analysis so far has focused on sequential processing: one request completing before the next begins. The preprocessing, inference, and postprocessing stages use different hardware resources. This separation creates an opportunity to process multiple requests simultaneously.
### Hardware Utilization and Request Pipelining {#sec-model-serving-utilization-request-pipelining-c61c}
The preceding analysis examined where time goes within individual pipeline stages. Optimizing each stage in isolation, however, misses a critical opportunity: the stages use different hardware resources. The latency budget analysis in @sec-model-serving-latency-budget-ef40 reveals that model inference is only one component of the request lifecycle. From a hardware perspective, the primary goal of a serving system is to maximize the **duty cycle** of the accelerator, the percentage of time the GPU is performing useful computation.
In a serialized serving system, the hardware sits idle during network I/O and CPU-based preprocessing. High-performance serving systems use **Request Pipelining**\index{Request Pipelining!GPU utilization} to overlap these stages, ensuring the GPU is fed a continuous stream of tensors.
#### Overlapping I/O and Compute {#sec-model-serving-overlapping-io-compute-966c}
The two timing diagrams in @fig-serving-pipeline-timing illustrate the impact of pipelining. In the serial case (A), each request must complete its entire lifecycle (Network $\rightarrow$ CPU Preprocessing $\rightarrow$ GPU Inference $\rightarrow$ Postprocessing) before the next request begins, and the grey idle gaps leave the GPU unused for more than 50% of the time. In the pipelined case (B), those gaps disappear.
::: {#fig-serving-pipeline-timing fig-env="figure" fig-pos="htb" fig-cap="**Request Pipelining**: Pipelining hides latency by overlapping independent operations across different hardware resources. In pipelined execution (B), the CPU processes the next request's data while the GPU executes the current request's inference. This increases the GPU duty cycle toward 100%, effectively doubling or tripling throughput on the same hardware without changing the model." fig-alt="Two timing diagrams. A (Serial): alternating CPU preprocessing, GPU inference, and idle blocks in sequence. B (Pipelined): two parallel rows where CPU preprocessing overlaps with GPU inference, eliminating idle time."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.8]
\definecolor{CPUColor}{RGB}{173,216,230}
\definecolor{GPUColor}{RGB}{144,238,144}
\definecolor{WaitColor}{RGB}{240,240,240}
\tikzset{
Pre/.style={align=flush center, draw=black,
font=\small\usefont{T1}{phv}{m}{n}, node distance=-1pt,
line width=0.75pt, fill=cyan!20, text width=15mm,
minimum width=16mm, minimum height=7mm},
Gpu/.style={Pre,fill=GPUColor!60},
Idle/.style={Pre,fill=WaitColor}
}
\node[Pre](R1){Pre};
\node[Gpu,right=of R1](R2){GPU};
\node[Idle,right=of R2](R3){Idle};
\node[Pre,right=of R3](R4){Pre};
\node[Gpu,right=of R4](R5){GPU};
%
\node[Pre,below=1.25of R1](R11){Pre1};
\node[Pre,right=of R11](R12){Pre 2};
\node[Pre,right=of R12](R13){Pre 3};
\node[Pre,right=of R13](R14){Pre 4};
%
\node[Gpu,below=of R12](R21){GPU 1};
\node[Gpu,right=of R21](R22){GPU 2};
\node[Gpu,right=of R22](R23){GPU 3};
\node[Gpu,right=of R23](R24){GPU 4};
%
\node[draw=none,fit=(R1)(R5)](T1){};
\node[above=0pt of T1]{\textbf{A. Serial Execution} (Low Utilization)};
\node[draw=none,fit=(R11)(R24)](T2){};
\node[above=0pt of T2]{\textbf{B. Pipelined Execution} (High Utilization)};
\end{tikzpicture}
```
:::
Pipelining is enabled by **Asynchronous I/O**\index{Asynchronous I/O!pipelining} and **Concurrency Models**\index{Concurrency!serving models}. Instead of waiting for a GPU kernel to finish, the server's CPU thread submits the work to the GPU's command queue and immediately begins preprocessing the next incoming request.
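The sketch below shows this overlap schematically with `asyncio`: CPU preprocessing for the next request runs in a worker thread while the previous request's inference task completes. The `preprocess` and `gpu_infer` callables are stand-ins for real pipeline stages, with `gpu_infer` assumed to be an asynchronous, non-blocking submission.

```{.python}
# Schematic of pipelined serving: CPU preprocessing of request N+1 overlaps
# with inference of request N. `preprocess` is a blocking CPU function run
# in a worker thread; `gpu_infer` is an async (non-blocking) coroutine.
import asyncio

async def pipelined_server(requests, preprocess, gpu_infer):
    loop = asyncio.get_running_loop()
    pending = None              # inference task for the previous request
    results = []
    for req in requests:
        # While this await yields to the event loop, the previous request's
        # gpu_infer task keeps making progress concurrently.
        tensor = await loop.run_in_executor(None, preprocess, req)
        if pending is not None:
            results.append(await pending)
        pending = asyncio.create_task(gpu_infer(tensor))
    if pending is not None:
        results.append(await pending)
    return results
```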
#### The Systems Metric: Hardware Duty Cycle {#sec-model-serving-systems-metric-hardware-duty-cycle-7530}
In the "Quantitative Approach" to ML systems, we define the efficiency of a serving system by its ability to saturate the bottleneck resource. For most ML systems, this is the GPU's compute cores or memory bandwidth. We quantify this in @eq-system-efficiency:
$$\text{System Efficiency} = \frac{\sum T_{\text{compute}}}{\text{Wall Clock Time} \times \text{Resource Count}}$$ {#eq-system-efficiency}
If a ResNet-50 request takes 10 ms total (5 ms GPU, 5 ms CPU), a serial system achieves only 50% efficiency. By pipelining just two requests, efficiency approaches 100% (assuming the CPU can keep up with the GPU). If the CPU is too slow to feed the GPU, the system becomes CPU-bound, and further model optimization provides zero throughput gain—a direct application of Amdahl's Law (introduced in @sec-ml-systems) to serving: if preprocessing consumes 50% of latency, maximum speedup is 2$\times$ regardless of how fast the model runs.
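A quick numeric check of @eq-system-efficiency for this two-stage example, assuming a steady stream of 100 requests:

```{.python}
# Worked check of the duty-cycle arithmetic for the 5 ms CPU + 5 ms GPU example.
gpu_ms, cpu_ms, n_requests = 5.0, 5.0, 100

serial_wall_clock = n_requests * (cpu_ms + gpu_ms)     # stages never overlap
pipelined_wall_clock = cpu_ms + n_requests * gpu_ms    # GPU stays busy after the first request

gpu_busy = n_requests * gpu_ms
print(f"Serial GPU efficiency:    {gpu_busy / serial_wall_clock:.0%}")     # 50%
print(f"Pipelined GPU efficiency: {gpu_busy / pipelined_wall_clock:.0%}")  # ~99%
```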
### Postprocessing {#sec-model-serving-postprocessing-3b24}
The request lifecycle concludes with postprocessing\index{Postprocessing!logits to predictions}, the phase that transforms model outputs into actionable results. A neural network produces raw tensors (floating-point arrays that carry no inherent meaning to applications or users). A 0.95 probability becomes a confident "dog" label only after postprocessing converts it; a sequence of token IDs becomes readable text; a bounding box tensor becomes a highlighted region in an image. Postprocessing significantly impacts both latency and the usefulness of predictions.
#### From Logits to Predictions {#sec-model-serving-logits-predictions-09df}
Classification models output logits\index{Logits!classification output} or probabilities across classes. Converting these raw outputs to predictions involves several steps. The simplest is argmax selection\index{Argmax!prediction selection}, which returns the highest-probability class. Thresholding applies a confidence cutoff, returning predictions only when the model is sufficiently certain. Top-k extraction returns multiple high-probability classes with their scores, useful when applications need ranked alternatives. Calibration adjusts raw probabilities to better reflect true likelihoods—a step that adds computation but is essential when downstream systems make decisions based on confidence scores.
For ResNet-50 image classification, typical postprocessing includes transforming logits to probabilities, extracting top predictions, and formatting responses. @lst-resnet-postprocessing shows a complete postprocessing pipeline with timing annotations, demonstrating each step from raw logits to API-ready response. Total postprocessing time is approximately 0.1 ms, negligible compared to preprocessing and inference.
::: {#lst-resnet-postprocessing lst-cap="**ResNet-50 Postprocessing**: Transforms raw logits to calibrated probabilities, extracts top-k predictions, and formats the API response."}
```{.python}
import torch

# Transform raw logits to calibrated probabilities
# Input: logits tensor of shape (batch_size, 1000) - one score per
# ImageNet class
probs = torch.softmax(
    logits, dim=-1
)  # Normalize to sum=1; ~0.05ms on GPU
# Extract top-5 predictions for the single request at batch index 0
# topk returns (values, indices) sorted by probability
top5_probs, top5_indices = probs[0].topk(5)  # ~0.02ms; GPU operation
# Map class indices to human-readable labels
# IMAGENET_CLASSES: list of 1000 class names from synset mapping
labels = [
    IMAGENET_CLASSES[int(i)] for i in top5_indices
]  # ~0.01ms; CPU lookup
# Format response with predictions and metadata for API contract
response = {
"predictions": [
{"label": label, "confidence": float(prob)}
for label, prob in zip(labels, top5_probs)
],
"model_version": "resnet50-v2.1", # Client-side version tracking
"inference_time_ms": 5.2, # Observability for latency monitoring
}
```
:::
Each step adds latency but improves response utility. Calibration in particular can add significant computation but is necessary when downstream systems make decisions based on confidence scores.
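As a concrete illustration of calibration, temperature scaling rescales logits by a scalar fit offline on held-out validation data; the sketch below assumes that temperature has already been estimated and simply applies it at serving time.

::: {#lst-temperature-scaling-sketch lst-cap="**Temperature Scaling (Sketch)**: Applies a pre-fit temperature to logits before softmax so confidences better reflect accuracy."}
```{.python}
import torch


def calibrated_probabilities(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Temperature scaling: divide logits by T (fit offline on validation data)
    before softmax so confidence scores better match observed accuracy."""
    return torch.softmax(logits / temperature, dim=-1)


# T > 1 softens overconfident predictions; T = 1 leaves them unchanged.
# probs = calibrated_probabilities(logits, temperature=1.8)
```
:::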
#### Output Formatting {#sec-model-serving-output-formatting-753f}
Production systems rarely return raw predictions. Outputs must conform to API contracts that specify JSON serialization schemas, confidence score formatting, and thresholding rules. Error handling must address edge cases: the system must define behavior when no prediction exceeds the confidence threshold or when the input appears out-of-distribution. Response metadata (model version, inference time, feature attributions) enables downstream monitoring and debugging.
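A minimal sketch of this formatting and edge-case handling is shown below; the field names, threshold, and status values are illustrative choices rather than a standard API contract.

::: {#lst-output-formatting-sketch lst-cap="**Output Formatting (Sketch)**: Applies a confidence threshold, handles the no-prediction edge case, and attaches metadata."}
```{.python}
def format_response(labels, probs, threshold=0.2, model_version="resnet50-v2.1"):
    """Apply the API contract: thresholding, JSON-safe types, and metadata."""
    predictions = [
        {"label": label, "confidence": round(float(p), 4)}
        for label, p in zip(labels, probs)
        if float(p) >= threshold
    ]
    if not predictions:  # Edge case: no class clears the confidence threshold
        return {"predictions": [], "status": "low_confidence",
                "model_version": model_version}
    return {"predictions": predictions, "status": "ok",
            "model_version": model_version}
```
:::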
The latency budget analysis reveals *where* time goes within a single request. Production systems, however, do not process requests in isolation: they must handle hundreds or thousands of concurrent requests competing for finite resources. Understanding this concurrency requires a different analytical framework.
## Queuing Theory {#sec-model-serving-queuing-theory-tail-latency-29a6}
The preceding lifecycle analysis assumed sequential processing. In production, concurrent requests compete for finite resources, and queuing theory\index{Queuing Theory!serving systems} predicts how this competition affects latency. These principles explain the counterintuitive behavior that causes well-provisioned systems to violate latency SLOs when load increases modestly.
### Queuing Fundamentals {#sec-model-serving-queuing-fundamentals-10d3}
Serving engineers routinely face a concrete question: given a latency SLO\index{SLO (Service Level Objective)!capacity planning} and an expected request rate, *how* many GPUs must be provisioned? Answering this question requires predicting *how* latency changes as load increases, which is precisely what queuing theory provides. Two mathematical foundations govern serving system behavior. Little's Law (@sec-machine-foundations-littles-law-21a3) relates queue depth to throughput. The M/M/1 model predicts how latency degrades under load. Together, they provide the quantitative framework for capacity planning.
### Little's Law {#sec-model-serving-littles-law-9352}
```{python}
#| label: littles-law-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LITTLE'S LAW CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Little's Law" (Capacity Planning section)
# │
# │ Goal: Connect observable request rates to hardware capacity requirements.
# │ Show: The required concurrency (memory) to sustain a 1000 QPS target.
# │ How: Apply Little's Law (L = λW) using throughput and latency SLO.
# │
# │ Imports: mlsysim.core.constants, mlsysim.book
# │ Exports: littles_lambda_str, littles_w_ms_str, littles_w_str, littles_l_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import MS_PER_SEC
from mlsysim.fmt import fmt, check
# ┌── LEGO ───────────────────────────────────────────────
class CapacityPlanningAnchor:
"""
Namespace for serving capacity anchor.
"""
qps_target = 1000
slo_ms = 50
concurrency_slots = int(qps_target * (slo_ms / 1000))
qps_str = f"{qps_target}"
slo_str = f"{slo_ms}ms"
slots_str = f"{concurrency_slots}"
# ┌── LEGO ───────────────────────────────────────────────
class CapacityPlanning:
"""
Namespace for Little's Law Capacity calculation.
Scenario: Determining concurrency requirements for a 1000 QPS target.
"""
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
lambda_qps = 1000.0
latency_slo_s = 0.050 # 50ms
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Step 1: L = lambda * W
concurrency = lambda_qps * latency_slo_s
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(concurrency == 50, f"Math broken: 1000 * 0.05 should be 50, got {concurrency}")
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
littles_lambda_str = f"{lambda_qps:,.0f}"
littles_w_ms_str = f"{int(latency_slo_s * MS_PER_SEC)}"
littles_w_str = fmt(latency_slo_s, precision=2, commas=False)
littles_l_str = fmt(concurrency, precision=0, commas=False)
```
Serving engineers need a tool that connects observable metrics to capacity requirements. The most celebrated result in queuing theory is Little's Law,\index{Little's Law!concurrency calculation}\index{Little's Law!L=λW formula}[^fn-littles-law-serving] which @eq-littles-law expresses as a simple relationship between three quantities in any stable system. Concretely, a server targeting `{python} CapacityPlanningAnchor.qps_str` QPS with a `{python} CapacityPlanningAnchor.slo_str` SLO requires `{python} CapacityPlanningAnchor.slots_str` concurrent request slots, which sets the hard memory floor for activation storage on that node.
[^fn-littles-law-serving]: **Little's Law**: John D.C. Little proved in 1961 that $L = \lambda W$ holds for *any* stable system regardless of arrival distribution, service distribution, or scheduling discipline. This universality is why it anchors ML capacity planning: the formula requires no assumptions about whether requests arrive in bursts, whether inference times vary, or whether the scheduler batches aggressively. The only requirement is stability ($\lambda < \mu$), and when that condition breaks, no amount of optimization prevents queue divergence. \index{Little's Law!capacity planning}
$$L = \lambda \cdot W$$ {#eq-littles-law}
where $L$ is the average number of requests in the system, $\lambda$ is the arrival rate (requests per second), and $W$ is the average time each request spends in the system.
::: {.callout-perspective title="Notation Alert: L vs. Latency"}
In queuing theory, $L$ traditionally denotes the *length* of the queue (number of items in the system), and $W$ denotes *wait time* (time in system per request). Elsewhere in this book, we use $L_{\text{lat}}$ for latency with descriptive subscripts ($L_{\text{lat,wait}}$, $L_{\text{lat,compute}}$) to denote latency components. To preserve standard queuing notation, we retain $L$ for queue length and $W$ for time in system in this section. In the batching analysis that follows (@sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d), $L_{\text{lat,wait}}$ corresponds to the queueing wait component $W_q$, and $L_{\text{lat,compute}}$ includes inference time.
:::
This relationship holds regardless of arrival distribution, service time distribution, or scheduling policy. The following notebook quantifies this capacity relationship through a practical application of *Little's Law*.
::: {.callout-theorem #notebook-littles-law title="Little's Law"}
**The Capacity Physics**: How much memory does a system need to serve 1,000 queries per second?
**The Law**: $L = \lambda W$ (Concurrency = Throughput$\times$ Latency) (see @sec-machine-foundations-littles-law-21a3 for the derivation).
**Scenario**:
* **Throughput Target (lambda)**: `{python} CapacityPlanning.littles_lambda_str` requests/sec.
* **Latency Target (W)**: `{python} CapacityPlanning.littles_w_ms_str` ms (0.05 s).
**The Calculation**:
L = `{python} CapacityPlanning.littles_lambda_str`$\times$ `{python} CapacityPlanning.littles_w_str` = **`{python} CapacityPlanning.littles_l_str` concurrent requests**
**The Constraint**: The server *must* have enough RAM to hold `{python} CapacityPlanning.littles_l_str` requests simultaneously (batch size + queue).
* If the GPU runs out of memory at Batch Size 32, the system physically **cannot** hit 1,000 QPS at 50 ms latency.
* The only options are to reduce latency ($W$) or add more memory ($L$).
:::
```{python}
#| label: batching-tax-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ THE BATCHING TAX CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Batching Tax" section
# │
# │ Goal: Quantify the wait-time cost of larger batches using queuing theory.
# │ Show: That batch-32 adds significant wait time compared to batch-1.
# │ How: Calculate formation delay given arrival rate and batch size.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: wait_time_b1_ms_str, wait_time_b32_ms_str, lat_b1_ms_str, lat_b32_ms_str, penalty_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt, check
# ┌── LEGO ───────────────────────────────────────────────
class BatchingTax:
"""
Namespace for The Batching Tax calculation.
Scenario: Comparing wait times for B=1 vs B=32 at 500 QPS.
"""
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
lambda_qps = 500.0
# Inference times (ms)
t_inf_b1 = 2.0
t_inf_b32 = 15.0
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Step 1: Batch 1
w_form_b1 = (1-1) / (2 * lambda_qps) * 1000 # 0ms
lat_b1 = w_form_b1 + t_inf_b1
# Batch 32
# Step 2: Formation Delay ~ (B-1) / (2 * lambda)
w_form_b32 = (32-1) / (2 * lambda_qps) * 1000 # ~31ms
lat_b32 = w_form_b32 + t_inf_b32
penalty_ratio = lat_b32 / lat_b1
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(lat_b32 > lat_b1 * 10, f"Batch-32 penalty ({lat_b32:.1f}ms) should be significant.")
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
wait_time_b1_ms_str = f"{int(w_form_b1)}"
wait_time_b32_ms_str = f"{int(w_form_b32)}"
lat_b1_ms_str = f"{lat_b1:.1f}"
lat_b32_ms_str = f"{lat_b32:.1f}"
penalty_ratio_str = f"{int(penalty_ratio)}"
```
### The Batching Tax: The Latency-Throughput Frontier {#sec-model-serving-batching-tax}
While Little's Law relates queue depth to throughput, it does not account for the **Batching Tax**\index{Batching Tax!latency penalty}: the deliberate delay introduced to maximize hardware utilization. In the tradition of quantitative systems, we analyze this as a **Queuing Delay** problem.
When an inference server batches requests, it introduces two distinct sources of latency:
1. **Batch Formation Delay ($W_{form}$)**: The time the first request in a batch waits for the last request to arrive.
2. **Inference Inflation ($W_{inf}$)**: The increase in execution time when the GPU processes $B$ samples instead of 1.
The resulting **Latency-Throughput Pareto Frontier**\index{Pareto Frontier!serving trade-offs} is the set of configurations where one cannot improve throughput without paying a "Tax" in increased wait time. We can quantify the total wait time for a batch size $B$ and arrival rate $\lambda$ as @eq-batching-tax:
$$ W_{total} \approx \underbrace{ \frac{B-1}{2\lambda} }_{\text{Formation Delay}} + \underbrace{ T_{inf}(B) }_{\text{Inference Time}} $$ {#eq-batching-tax}
This equation reveals the "Cost of Throughput." Increasing $B$ to saturate the GPU amortizes the hardware cost, but inflates the per-request latency. Concretely, at 500 QPS, moving from batch-1 to batch-32 increases wait-time from **`{python} BatchingTax.wait_time_b1_ms_str`ms** to **`{python} BatchingTax.wait_time_b32_ms_str`ms**, contributing to a **`{python} BatchingTax.penalty_ratio_str`$\times$** total latency penalty (`{python} BatchingTax.lat_b1_ms_str`ms → `{python} BatchingTax.lat_b32_ms_str`ms). For a systems engineer, this tax is the primary regulator of **Economic Efficiency**: the engineer chooses the batch size that maximizes throughput (minimizing cost per query) without violating the **Latency SLO** ($L_{\text{lat}}$).
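In practice, the engineer can sweep candidate batch sizes against @eq-batching-tax and keep the largest one that still meets the SLO. The sketch below does exactly that; the inference-time model `t_inf` is a stand-in for profiled measurements, and the arrival rate and SLO are illustrative.

::: {#lst-batch-size-sweep-sketch lst-cap="**Batch Size Sweep (Sketch)**: Finds the largest batch size whose formation delay plus inference time meets the latency SLO."}
```{.python}
def t_inf(batch):
    """Illustrative inference-time model (ms): fixed overhead plus per-sample cost."""
    return 1.5 + 0.4 * batch


def total_latency_ms(batch, arrival_qps):
    """Formation delay (B-1)/(2*lambda) plus inference time, per @eq-batching-tax."""
    formation_ms = (batch - 1) / (2 * arrival_qps) * 1000
    return formation_ms + t_inf(batch)


arrival_qps, slo_ms = 500, 25

# Largest batch whose end-to-end latency still meets the SLO (16 for these numbers)
best_batch = max(
    b for b in (1, 2, 4, 8, 16, 32, 64)
    if total_latency_ms(b, arrival_qps) <= slo_ms
)
```
:::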
Little's Law has immediate practical implications. If an inference service averages 10 ms per request ($W = 0.01$s) and the system shows 50 concurrent requests on average ($L = 50$), then the arrival rate must be $\lambda = L/W = 5000$ requests per second. Conversely, if the system must limit concurrent requests to 10 (perhaps due to GPU memory constraints) and the service time is 10 ms, it can sustain at most 1000 requests per second.
### The Utilization-Latency Relationship {#sec-model-serving-utilizationlatency-relationship-a2f0}
Little's Law describes average system behavior, but it does not reveal *how* latency changes as load approaches capacity. To answer the critical question of *how* much spare capacity a serving system needs, we turn to the M/M/1 queue model.[^fn-mm1-erlang-serving] For a system with Poisson arrivals\index{Poisson Arrivals!queuing model} and exponential service times, the average time in system follows:
[^fn-mm1-erlang-serving]: **M/M/1 Queue**: Queuing theory originated with Agner Krarup Erlang's 1909 analysis of the Copenhagen Telephone Exchange, where call arrivals genuinely were memoryless (Poisson). The M/M/1 model's exponential service time assumption fit telephony well but overpredicts variance for ML inference, where fixed-architecture forward passes produce near-constant service times. This mismatch is *useful*: M/M/1 overestimates wait times by roughly 2$\times$ compared to the more realistic M/D/1, providing a built-in safety margin for capacity planning. \index{M/M/1 Queue!Erlang origin}
$$W = \frac{1}{\mu - \lambda} = \frac{\text{service time}}{1 - \rho}$$ {#eq-mm1-wait}
where $\mu$ is the service rate (requests per second the server can handle), and $\rho = \lambda/\mu$ is the utilization\index{Utilization!latency relationship}\index{M/M/1 Queue!wait time formula} (fraction of time the server is busy).
This equation reveals why serving systems exhibit nonlinear behavior: small increases in load near capacity cause disproportionate latency increases[^fn-queuing-divergence]. @tbl-utilization-latency quantifies this relationship, showing how average time in system grows rapidly as utilization approaches 100%.
[^fn-queuing-divergence]: **Super-Linear Latency Divergence**: The 70% utilization threshold follows directly from M/M/1 queuing theory: mean response time $E[T] = \frac{1/\mu}{1-\rho}$, where $\rho = \lambda/\mu$ is utilization. The $(1-\rho)^{-1}$ term diverges as $\rho \to 1$: at $\rho = 0.7$, mean response time is already $3.3\times$ the base service time; at $\rho = 0.9$ it is $10\times$. This is not a conservative heuristic but a mathematical inevitability — there is no "stretching" from 80% to 90% utilization without disproportionate tail latency growth. \index{Queuing Theory!utilization divergence}
The M/M/1 model assumes exponentially distributed service times, but ML inference typically has near-constant service time for fixed batch sizes, making the M/D/1\index{M/D/1 Queue!deterministic service} (deterministic service) model more accurate in practice. We use M/M/1 here because it yields closed-form solutions and produces conservative estimates. For M/D/1 queues, average wait time is approximately half of M/M/1 at the same utilization, which matters for capacity planning: M/M/1 analysis will slightly over-provision, erring on the side of meeting SLOs rather than violating them.[^fn-kendall-notation-serving]
[^fn-kendall-notation-serving]: **Kendall Notation**: In the A/S/c (Arrival/Service/servers) system, "M" signifies a Markovian (memoryless) process and "D" means deterministic. The text selects M/M/1 rather than the more realistic M/D/1 because M/M/1's conservative bias is a *feature* for capacity planning: it overestimates wait times by roughly 2$\times$, building a 10--30% safety margin against variance surprises. The cost of over-provisioning by that margin is far lower than the cost of an SLA miss at the p99 tail when service time variance spikes unexpectedly. \index{Kendall Notation!queuing models}
| **Utilization ($\rho$)** | **Latency Multiple** | **Example (5 ms service)** |
|:-------------------------|---------------------:|---------------------------:|
| 50% | 2.0$\times$ | 10 ms |
| 70% | 3.3$\times$ | 17 ms |
| 80% | 5.0$\times$ | 25 ms |
| 90% | 10.0$\times$ | 50 ms |
| 95% | 20.0$\times$ | 100 ms |
: **Utilization-Latency Relationship**: Average **time in system** (wait + service) as a multiple of service time for an M/M/1 queue. At 50% utilization, time in system is 2$\times$ service time; at 90%, it reaches 10$\times$. This nonlinear growth explains why systems that perform well at moderate load suddenly violate SLOs when traffic increases: moving from 80% to 90% utilization doubles latency. {#tbl-utilization-latency}
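The table entries follow directly from @eq-mm1-wait, and the M/D/1 adjustment discussed above is a one-line change; the sketch below reproduces both for a 5 ms service time.

::: {#lst-mm1-md1-sketch lst-cap="**M/M/1 vs. M/D/1 (Sketch)**: Reproduces the utilization-latency table and the deterministic-service adjustment for a 5 ms service time."}
```{.python}
def mm1_time_in_system(service_ms, rho):
    """M/M/1 mean time in system: service / (1 - rho), per @eq-mm1-wait."""
    return service_ms / (1 - rho)


def md1_time_in_system(service_ms, rho):
    """M/D/1 (deterministic service): the wait component is half of M/M/1's."""
    wait_mm1 = mm1_time_in_system(service_ms, rho) - service_ms
    return service_ms + wait_mm1 / 2


for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    # At rho=0.9: M/M/1 gives 50 ms (10x service time); M/D/1 gives 27.5 ms
    print(rho, mm1_time_in_system(5, rho), md1_time_in_system(5, rho))
```
:::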
### Multi-Server Considerations {#sec-model-serving-multiserver-considerations-00fc}
The preceding analysis focuses on a single ML node (one machine serving inference requests). This scope aligns with this book's focus on mastering the basic unit of ML systems. Single-node queuing dynamics are prerequisite to effective scaling. Engineers cannot optimize a distributed system without first understanding the behavior of its components.
#### When Single-Node Analysis Applies {#sec-model-serving-singlenode-analysis-applies-305d}
M/M/1 analysis remains the foundation for:
- **Right-sizing individual nodes**: Determining whether a single GPU can meet latency SLOs at expected traffic
- **Identifying the scaling trigger**: Calculating when traffic exceeds single-node capacity
- **Cost-effective provisioning**: Avoiding premature scale-out that wastes resources
For traffic exceeding single-node capacity, production systems deploy multiple replicas behind a load balancer. The M/M/c queuing model\index{M/M/c Queue!multi-server} extends M/M/1 to c parallel servers, showing that multiple replicas\index{Replica!tail latency improvement} dramatically improve tail latency: the probability of all servers being simultaneously slow drops exponentially with server count. At c=4 replicas and moderate utilization, p99 latency can be 3$\times$ lower than the single-server case at the same total throughput. This chapter establishes single-node serving foundations; distributed inference systems (model sharding across GPUs, tensor parallelism, pipeline parallelism) introduce coordination overhead and consistency challenges that require advanced scaling principles beyond our scope here.
### Tail Latency {#sec-model-serving-tail-latency-5376}
Production SLOs\index{SLO (Service Level Objective)!percentile targets} typically specify percentile targets (p95, p99) rather than averages because tail latency determines user experience for the slowest requests [@dean2013tail]. For an M/M/1 queue, the p99 latency follows:
$$W_{p99} \approx \frac{\text{service time}}{1 - \rho} \cdot \ln\left(\frac{1}{1 - 0.99}\right) \approx \frac{4.6 \cdot \text{service time}}{1 - \rho}$$ {#eq-p99-latency}
At 70 percent utilization, p99 latency is approximately fifteen times the service time ($4.6 / 0.3 \approx 15.3$), while average latency is only 3.3 times. For the M/D/1 model (more representative of ML inference with near-constant service times), p99 values are roughly half these M/M/1 estimates. This explains *why* systems that seem healthy with low average latency can have unacceptable tail latency, since the average hides the experience of the unluckiest requests.
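The same calculation extends to the tail using @eq-p99-latency; the sketch below compares mean and p99 time in system for a 5 ms service time at several utilization levels.

::: {#lst-p99-sketch lst-cap="**Tail Latency Estimate (Sketch)**: Compares mean and p99 time in system from the M/M/1 approximation."}
```{.python}
import math


def p99_time_in_system(service_ms, rho):
    """Approximate M/M/1 p99: ln(100) ~ 4.6 times the mean time in system."""
    return math.log(1 / (1 - 0.99)) * service_ms / (1 - rho)


service_ms = 5
for rho in (0.5, 0.7, 0.9):
    mean = service_ms / (1 - rho)
    # At rho=0.7: mean ~ 16.7 ms, p99 ~ 77 ms (~15x the 5 ms service time)
    print(rho, mean, p99_time_in_system(service_ms, rho))
```
:::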
#### The Tail at Scale Problem {#sec-model-serving-tail-scale-problem-958d}
Dean and Barroso's analysis reveals *why* tail latency\index{Tail at Scale!fan-out amplification} becomes critical as systems scale beyond single machines [@dean2013tail]. When requests fan out to multiple servers, the probability of experiencing at least one slow response grows rapidly with server count. This "tail at scale" effect makes individual server tail latency critical for overall system performance.
For single-machine serving, this principle has two implications. First, tail latency on individual machines matters because it will compound when systems eventually scale. Second, the tail-tolerant techniques described below (hedging, graceful degradation) provide value even on single machines and become indispensable at scale.
Tail-tolerant techniques such as request hedging send redundant requests after a timeout, accepting whichever response arrives first. Backup requests and load balancing away from slow servers directly address latency variance. These techniques apply to single-machine serving with multiple GPU streams or model replicas, and become essential when scaling to distributed inference systems.
With the queuing model and tail latency analysis established, we can now apply these tools to a concrete capacity planning exercise.
```{python}
#| label: capacity-planning-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 CAPACITY PLANNING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Capacity Planning"
# │
# │ Goal: Demonstrate the complete capacity planning workflow.
# │ Show: That 8 V100 GPUs are needed for 5000 QPS at 50ms p99 with N+1 redundancy.
# │ How: Apply queuing theory to find safe utilization and required service rate.
# │
# │ Imports: math, mlsysim.book (fmt)
# │ Exports: cp_* and mm1_* formatted strings
# └─────────────────────────────────────────────────────────────────────────────
import math
from mlsysim.fmt import fmt
class CapacityPlanningCalc:
"""ResNet-50 capacity planning: GPUs needed for 5000 QPS at 50ms p99 with N+1 redundancy."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
cp_peak_qps_value = 5000 # peak traffic (QPS)
cp_service_ms_value = 5 # TensorRT FP16 service time (ms)
cp_p99_target_ms_value = 50 # p99 latency SLO (ms)
cp_rho_safe_value = 0.72 # safe utilization (M/D/1 adjusted)
cp_v100_throughput_value = 1143 # V100 throughput at batch-16 (img/s)
cp_headroom_value = 1.3 # 30% headroom for variance
cp_fp32_bits_value = 32 # FP32 bit width
cp_int8_bits_value = 8 # INT8 bit width
mm1_p99_factor_value = 4.6 # p99 multiplier for M/M/1
mm1_rho_example_value = 0.7 # example utilization
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
cp_mu_required_value = cp_peak_qps_value / cp_rho_safe_value
cp_gpus_raw_value = cp_mu_required_value / cp_v100_throughput_value
cp_gpus_ceil_value = math.ceil(cp_gpus_raw_value)
cp_final_raw_value = cp_gpus_ceil_value * cp_headroom_value
cp_final_ceil_value = math.ceil(cp_final_raw_value)
cp_gpus_after_fail_value = cp_final_ceil_value - 1
cp_util_after_fail_value = (cp_peak_qps_value / cp_v100_throughput_value) / cp_gpus_after_fail_value * 100
cp_precision_ratio_value = cp_fp32_bits_value // cp_int8_bits_value
mm1_wait_factor_value = mm1_p99_factor_value / (1 - mm1_rho_example_value)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
cp_mu_required_str = fmt(cp_mu_required_value, precision=0, commas=False)
cp_gpus_raw_str = fmt(cp_gpus_raw_value, precision=1, commas=False)
cp_gpus_ceil_str = f"{cp_gpus_ceil_value}"
cp_final_raw_str = fmt(cp_final_raw_value, precision=1, commas=False)
cp_final_ceil_str = f"{cp_final_ceil_value}"
cp_gpus_after_fail_str = f"{cp_gpus_after_fail_value}"
cp_util_after_fail_str = fmt(cp_util_after_fail_value, precision=1, commas=False)
cp_peak_qps_str = f"{cp_peak_qps_value:,}"
cp_v100_throughput_str = f"{cp_v100_throughput_value:,}"
cp_rho_safe_str = fmt(cp_rho_safe_value, precision=2, commas=False)
cp_rho_safe_pct_str = f"{cp_rho_safe_value * 100:.0f}"
cp_mu_required_comma_str = f"{cp_mu_required_value:,.0f}"
cp_precision_ratio_str = f"{cp_precision_ratio_value}"
mm1_p99_factor_str = f"{mm1_p99_factor_value}"
mm1_wait_factor_str = fmt(mm1_wait_factor_value, precision=1, commas=False)
```
We can formalize this through *ResNet-50 capacity planning*.
::: {.callout-notebook title="ResNet-50 Capacity Planning"}
Consider designing a ResNet-50 serving system with these requirements:
- **Target p99 latency**: 50 ms
- **Peak expected traffic**: `{python} CapacityPlanningCalc.cp_peak_qps_str` requests per second
- **Service time** (TensorRT FP16): 5 ms
#### Step 1: Find Safe Utilization {.unnumbered}
From @eq-p99-latency, $W_{p99}$ ≈ `{python} CapacityPlanningCalc.mm1_p99_factor_str`$\times$ service time / $(1 - \rho)$. Setting $W_{p99}$ ≤ 50 ms with 5 ms service time gives $\rho$ ≤ 1 $-$ (`{python} CapacityPlanningCalc.mm1_p99_factor_str`$\times$ 5)/50 = 0.54. However, the M/M/1 model is conservative for ML inference, which has near-deterministic service times (closer to M/D/1). For M/D/1 queues, average wait is roughly half of M/M/1 at the same utilization, allowing a higher safe operating point. Using the M/D/1-adjusted threshold yields $\rho$ ≤ `{python} CapacityPlanningCalc.cp_rho_safe_str` (`{python} CapacityPlanningCalc.cp_rho_safe_pct_str`% maximum utilization).
#### Step 2: Calculate Required Service Rate {.unnumbered}
mu_required = `{python} CapacityPlanningCalc.cp_peak_qps_str` / `{python} CapacityPlanningCalc.cp_rho_safe_str` = `{python} CapacityPlanningCalc.cp_mu_required_str` requests/second
#### Step 3: Determine GPU Count {.unnumbered}
Single V100 throughput at batch=16: `{python} CapacityPlanningCalc.cp_v100_throughput_str` images/second
GPUs needed = `{python} CapacityPlanningCalc.cp_mu_required_str` / `{python} CapacityPlanningCalc.cp_v100_throughput_str` = `{python} CapacityPlanningCalc.cp_gpus_raw_str` → `{python} CapacityPlanningCalc.cp_gpus_ceil_str` GPUs
#### Step 4: Add Headroom for Variance {.unnumbered}
Production systems add 30% headroom for traffic spikes and variance:
Final count = `{python} CapacityPlanningCalc.cp_gpus_ceil_str`$\times$ 1.3 = `{python} CapacityPlanningCalc.cp_final_raw_str` → `{python} CapacityPlanningCalc.cp_final_ceil_str` GPUs
#### Step 5: Verify Fault Tolerance {.unnumbered}
The 30% headroom addresses traffic variance, but production systems also need fault tolerance. With `{python} CapacityPlanningCalc.cp_final_ceil_str` GPUs, losing one leaves `{python} CapacityPlanningCalc.cp_gpus_after_fail_str` GPUs handling `{python} CapacityPlanningCalc.cp_peak_qps_str` QPS:
Utilization after failure = (`{python} CapacityPlanningCalc.cp_peak_qps_str` / `{python} CapacityPlanningCalc.cp_v100_throughput_str`) / `{python} CapacityPlanningCalc.cp_gpus_after_fail_str` = `{python} CapacityPlanningCalc.cp_util_after_fail_str`%
This remains well below the `{python} CapacityPlanningCalc.cp_rho_safe_pct_str`% safe utilization threshold, confirming N+1 redundancy is satisfied. For stricter fault tolerance requirements, N+2 redundancy (tolerating two simultaneous failures) would require 11--12 GPUs.
**Result**: Provision `{python} CapacityPlanningCalc.cp_final_ceil_str` V100 GPUs to serve `{python} CapacityPlanningCalc.cp_peak_qps_str` QPS at 50 ms p99 latency with N+1 fault tolerance.
:::
The queuing analysis explains the capacity planning approach detailed in @sec-model-serving-capacity-planning-96a3 and connects directly to the MLPerf Server scenario. @sec-benchmarking explains how MLPerf measures throughput only for requests meeting the latency SLO: a system achieving 10,000 QPS but violating the SLO on 5% of requests reports only 9,500 valid QPS.
### Tail-Tolerant Techniques {#sec-model-serving-tailtolerant-techniques-066e}
Eliminating all sources of latency variability is often impractical. Production systems instead employ techniques that tolerate variability while still meeting SLOs [@dean2013tail; @dean2012rapid]. These techniques treat latency variance as a given and design around it.
#### Hedged Requests {#sec-model-serving-hedged-requests-b923}
When\index{Hedged Requests!tail tolerance} a request has not completed within the expected time, the system sends a duplicate request to another server.[^fn-hedging-tail-tolerance] The client uses whichever response arrives first and cancels the other. For ML serving, this means maintaining multiple model replicas and routing slow requests to alternative replicas. The overhead is modest: if the system hedges at the 95th percentile, only 5% of requests generate duplicates, increasing load by only 5% while dramatically reducing tail latency.
[^fn-hedging-tail-tolerance]: **Hedging**: The term is borrowed from finance, where an offsetting bet reduces risk; here, the redundant request is a bet against a slow server. This is not free: for ML systems, the losing hedged request still occupies a GPU for one full inference cycle because CUDA kernels cannot be preempted. Thus, the 5% load increase from hedging at the 95th percentile translates directly to a 5% waste in GPU compute for those requests. \index{Hedging!tail tolerance}
CUDA kernels cannot be interrupted mid-execution, so cancelling a duplicate is not always possible. When a hedged request completes, the server can check a cancellation flag before launching inference on the duplicate, accept the wasted compute if its kernel is already in flight, or use request prioritization to deprioritize the duplicate. Since hedging typically applies only to the slowest 5 percent of requests, the overhead from occasional wasted compute remains acceptable.
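A minimal sketch of hedging with `asyncio` appears below: the request goes to a primary replica, and a duplicate is issued to a backup only if the primary has not responded within the hedge delay. The `call_replica` coroutine and the 20 ms hedge delay are placeholders for illustration.

::: {#lst-hedged-request-sketch lst-cap="**Hedged Requests (Sketch)**: Sends a duplicate to a backup replica only after the primary exceeds the hedge delay."}
```{.python}
import asyncio


async def hedged_request(call_replica, primary, backup, hedge_delay_s=0.020):
    """Issue a duplicate request only if the primary is slower than hedge_delay_s."""
    first = asyncio.create_task(call_replica(primary))
    try:
        # Fast path: primary responds before the hedge timer fires (most requests)
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_delay_s)
    except asyncio.TimeoutError:
        # Slow path: race the still-running primary against a backup replica
        second = asyncio.create_task(call_replica(backup))
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # Best effort; an in-flight GPU kernel cannot be preempted
        return next(iter(done)).result()
```
:::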
#### Tied Requests {#sec-model-serving-tied-requests-961c}
Tied requests\index{Tied Requests!latency reduction} send the request to multiple servers simultaneously, but include a tag allowing servers to cancel execution once another server begins processing. This eliminates the delay of waiting to detect a slow response before hedging. For inference servers with significant startup overhead from model loading and memory allocation, tied requests ensure at least one server begins immediately.
#### Canary Requests {#sec-model-serving-canary-requests-83b2}
For\index{Canary Requests!fan-out protection} requests that fan out to many backends, first send the request to a small subset of 1 to 2 servers.[^fn-canary-fanout] If these return within expected time, send to the remainder. If the canary is slow, the system can take corrective action by retrying elsewhere or using cached results before committing to the full fan-out. This prevents a single slow backend from stalling an entire distributed inference request.
\index{Canary!etymology}
[^fn-canary-fanout]: **Canary**: Named for the coal mine practice (early 1900s--1980s) of using birds whose high metabolic rate made them sensitive to toxic gases before concentrations became lethal to humans. In ML serving, canary requests serve the same early-warning function for fan-out queries: by testing 1--2 backends before committing to the full fan-out, the system detects slow or failing replicas before a single straggler stalls the entire distributed inference request---a critical protection when fan-out width means tail latency grows with the *maximum* of all backend response times. \index{Canary!fan-out protection}
#### Graceful Degradation {#sec-model-serving-graceful-degradation-d1d8}
When\index{Graceful Degradation!overload handling} load exceeds capacity, return approximate results rather than timing out. For classification, return cached predictions for similar inputs. For generative models, return shorter outputs. For ensemble systems, return predictions from a subset of models. This maintains responsiveness at the cost of some accuracy, which users often prefer to outright failures.
#### Admission Control {#sec-model-serving-admission-control-c852}
When\index{Admission Control!queue depth threshold} traffic exceeds capacity, accepting all requests can trigger widespread SLO violations. Admission control proactively rejects requests when queue depth exceeds a threshold, returning immediate 503 responses rather than accepting requests that are likely to timeout. This sacrifices throughput to protect latency for admitted requests.
A practical starting point for setting the threshold is 2 to 3 times service time multiplied by the number of workers. For a system with 4 workers and 10 ms service time, this yields a queue depth threshold of 80 to 120 requests. Adaptive admission control adjusts thresholds based on observed p99 latency, tightening when latency increases above target and relaxing when latency remains healthy.
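A sketch of this queue-depth rule follows; the 2.5$\times$ multiplier sits in the middle of the 2 to 3 range above, and in an adaptive implementation it would be adjusted from observed p99 latency.

::: {#lst-admission-control-sketch lst-cap="**Admission Control (Sketch)**: Rejects new requests once queue depth exceeds a service-time-based threshold."}
```{.python}
def admission_threshold(num_workers: int, service_time_ms: float,
                        multiplier: float = 2.5) -> int:
    """Queue-depth threshold ~ multiplier x service time x workers (heuristic)."""
    return int(multiplier * service_time_ms * num_workers)


def admit(queue_depth: int, threshold: int) -> bool:
    """Reject (HTTP 503) once the queue is deep enough that new requests
    would likely time out anyway, protecting latency for admitted requests."""
    return queue_depth < threshold


# 4 workers x 10 ms service time -> threshold of ~100 queued requests
threshold = admission_threshold(num_workers=4, service_time_ms=10)
```
:::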
#### Retry Storm Prevention {#sec-model-serving-retry-storm-prevention-4bf0}
A subtle\index{Retry Storm!load shedding coordination} failure mode occurs when all replicas are overloaded simultaneously. If the load balancer retries rejected requests at other replicas that are also overloaded, retry traffic amplifies the overload. Coordinated load shedding addresses this by sharing load information across replicas, enabling system-wide decisions about which requests to accept. When global load exceeds capacity, replicas collectively reject the same fraction of requests rather than each rejecting independently and triggering retries.
These techniques become essential at scale when fan-out amplification makes individual server tail latency visible to users. Single-machine serving systems can implement hedged and tied requests across GPU streams or model replicas. The queuing analysis here assumes FIFO processing, but production systems often implement priority scheduling such as deadline-aware or shortest-job-first approaches to further reduce tail latency for heterogeneous workloads [@harchol2013performance].
The tail-tolerant techniques examined in this section optimize the flow of requests through a functioning serving system. The queuing analysis, however, assumes two critical preconditions: that models are loaded and ready to process requests, and that predictions match what was validated during development. In production, this assumption fails regularly: during deployments, new instances must load models from scratch; during scaling events, cold start latency affects the first requests to new replicas; and when preprocessing pipelines diverge from training, accuracy silently degrades. The next section examines these lifecycle challenges that must be solved before queuing optimization becomes relevant.
::: {.callout-checkpoint title="Queuing and SLO Headroom"}
Latency SLOs are not enforced by "fast inference" alone; they are enforced by *headroom*.
- [ ] **Little's Law**: Can you use \(L = \lambda W\) to explain why rising queue depth implies rising latency even if per-request compute time is unchanged?
- [ ] **Utilization cliff**: Can you explain why latency grows non-linearly as utilization \(\rho\) approaches 1, and why production systems target a conservative \(\rho\) rather than "100% busy"?
- [ ] **Wait vs. compute**: Given an end-to-end latency budget, can you separate \(L_{\text{lat,compute}}\) from \(L_{\text{lat,wait}}\) and explain which one queuing theory primarily predicts?
- [ ] **Capacity planning**: Can you explain why a throughput number is only "real" if requests still meet the percentile latency SLO under load?
:::
## Model Lifecycle Management {#sec-model-serving-model-lifecycle-management-ff2e}
Queuing theory and tail-tolerant techniques optimize the steady-state flow of requests, but they cannot help if the system never reaches steady state. A newly deployed replica that takes 35 seconds to compile its TensorRT engine violates every SLO during that window. A model whose OpenCV-based serving pipeline resizes images differently than the PIL-based training pipeline silently drops 5 percentage points of accuracy—a degradation invisible to latency dashboards. *These lifecycle failures are not edge cases; they occur at every deployment, every scaling event, and every framework migration.* Addressing them requires engineering discipline in two areas: getting models ready to serve (cold start and initialization) and keeping predictions faithful to what was validated (training-serving skew).
### Training-Serving Skew {#sec-model-serving-trainingserving-skew-7b99}
A model that performed well during validation may silently degrade when deployed. This phenomenon, known as **training-serving skew**\index{Training-Serving Skew!silent accuracy degradation}, represents one of the most subtle failure modes in production ML because it is invisible to latency monitoring and exception tracking.
::: {.callout-definition title="Training-Serving Skew"}
***Training-Serving Skew***\index{Training-Serving Skew!definition} is the **Distributional Divergence** between the training and inference environments caused by inconsistent logic or state.
1. **Significance (Quantitative):** It violates the **Consistency Imperative**, causing **Silent Accuracy Degradation** proportional to the difference in the transformation functions ($f_{train}(x) \neq f_{serve}(x)$).
2. **Distinction (Durable):** Unlike **Data Drift** (which is an **External Shift** in the environment), Training-Serving Skew is an **Internal Failure** of the engineering stack.
3. **Common Pitfall:** A frequent misconception is that skew is "found" by looking for errors. In reality, it is **Invisible to Exceptions**: the system runs perfectly and the latency is low, but the predictions are statistically wrong.
:::
@sec-ml-operations provides comprehensive coverage of skew diagnosis, monitoring, and organizational prevention strategies. Here we focus on the *serving-specific* manifestation: **preprocessing divergence**\index{Preprocessing Divergence!training vs serving}. This occurs when the real-time inference pipeline processes raw data differently than the batch training pipeline, a common failure mode when training uses Python/Pandas while serving uses C++/Java or optimized inference servers. Unlike data drift (which @sec-ml-operations addresses through monitoring), preprocessing divergence is deterministic and preventable through careful engineering.
::: {.callout-example title="ResNet-50: Image Preprocessing Skew"}
For ResNet-50 serving, common sources of skew include:
**Resize interpolation**\index{Resize Interpolation!skew source}: Training uses PIL.BILINEAR while OpenCV defaults to cv2.INTER_LINEAR. These produce pixel-level differences that can shift accuracy by 0.5--1%.
**Color space handling**: JPEG loading in different libraries may produce BGR vs RGB ordering. If the model trained on RGB but serves BGR inputs, predictions are essentially random.
**Normalization constants**: ImageNet normalization uses specific mean/std values. Using `mean=[0.5, 0.5, 0.5]` instead of `mean=[0.485, 0.456, 0.406]` shifts inputs out of the training distribution.
**Prevention**: The safest approach is to export the exact preprocessing code used during training and run it identically in serving, or use a framework like NVIDIA DALI that can help standardize preprocessing across training and serving environments.
:::
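One low-overhead way to apply the prevention advice above is to define the transform once, in a module imported by both the training and serving code paths. A minimal sketch using torchvision follows; the module name and layout are illustrative.

::: {#lst-shared-preprocessing-sketch lst-cap="**Shared Preprocessing (Sketch)**: A single transform definition imported by both training and serving to prevent skew."}
```{.python}
# shared_preprocessing.py -- imported by BOTH the training and serving code paths
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]  # Single source of truth for constants
IMAGENET_STD = [0.229, 0.224, 0.225]


def build_transform():
    """Identical resize, crop, and normalization for training evaluation and serving."""
    return transforms.Compose([
        transforms.Resize(256),        # PIL bilinear interpolation by default
        transforms.CenterCrop(224),
        transforms.ToTensor(),         # RGB PIL image -> float tensor in [0, 1]
        transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    ])
```
:::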
### Cold Start and Initialization Dynamics {#sec-model-serving-model-loading-initialization-cc5a}
With preprocessing pipelines designed to avoid training-serving skew, the next challenge is getting models ready to serve. Before processing any request, models must load from storage into memory and prepare for inference [@romero2021infaas]. This initialization latency, known as **cold start**\index{Cold Start!scaling events}, affects system responsiveness during deployments, scaling events, and recovery from failures.
::: {.callout-definition title="Cold Start"}
***Cold Start***\index{Cold Start!definition} is the **Initialization Latency** incurred when instantiating a new model replica.
1. **Significance (Quantitative):** It represents the fixed cost of **State Hydration** (loading weights, compiling graphs), which can take seconds or minutes, effectively blocking the system's ability to scale elastically in response to traffic bursts.
2. **Distinction (Durable):** Unlike **Inference Latency** ($L_{\text{lat}}$), which is a **Per-Request Cost**, Cold Start is a **Per-Replica Cost** that occurs only during deployment or scaling events.
3. **Common Pitfall:** A frequent misconception is that Cold Start is "just loading weights." In reality, it includes **Graph Compilation** and **Memory Allocation**, which can often take longer than the data transfer itself ($BW$).
:::
Cold start dynamics determine whether systems meet latency requirements from the moment they begin serving traffic. A *cold start timeline* for a representative model reveals where each phase contributes to total initialization latency.
Cold start\index{Cold Start!anatomy} latency compounds from multiple sources, each adding to the time between deployment and serving readiness. Weight loading\index{Weight Loading!cold start} reads model parameters from disk or network storage. Graph compilation\index{Graph Compilation!JIT overhead} performs just-in-time compilation of operations for the specific hardware. Memory allocation reserves GPU memory for activations and intermediate values. Warmup\index{Warmup!cache population}[^fn-warmup-coldstart] execution performs initial inferences that populate caches and trigger lazy initialization.
[^fn-warmup-coldstart]: **Warmup**: Borrowed from JIT compilation, where initial executions compile hot paths into optimized machine code. For ML serving, warmup inferences trigger CUDA kernel compilation, cuDNN algorithm auto-tuning, and memory pool allocation that frameworks defer until first use. Without warmup, the first live request absorbs all of this setup, running over 100$\times$ slower than steady state. During autoscaling events, this means new replicas can violate SLOs for their first several seconds of traffic. \index{Warmup!cold start mitigation}
```{python}
#| label: cold-start-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COLD START TIMELINE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Cold Start Timeline"
# │
# │ Goal: Decompose the cold start latency of serverless inference.
# │ Show: That pre-compiling models reduces cold start from ~35s to ~1.5s.
# │ How: Sum phase durations for data transfer, CUDA initialization, and model loading.
# │
# │ Imports: (none)
# │ Exports: cs_*_str formatted strings for timeline table
# └─────────────────────────────────────────────────────────────────────────────
class ColdStartCalc:
"""Decomposes cold start latency showing pre-compiled TensorRT reduces startup from ~35s to ~1.5s."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
cs_ssd_value = 0.5 # weight loading from SSD (s)
cs_s3_value = 4.0 # weight loading from S3 (s)
cs_cuda_value = 0.4 # CUDA context initialization (s)
cs_compile_value = 30.0 # TensorRT compilation (s)
cs_warmup_value = 0.2 # warmup inferences (s)
cs_runtime_overhead_value = 0.4 # runtime overhead (s)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
cs_local_total_value = cs_ssd_value + cs_cuda_value + cs_warmup_value + cs_runtime_overhead_value
cs_cloud_total_value = cs_s3_value + cs_cuda_value + cs_compile_value + cs_warmup_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
cs_local_str = f"~{cs_local_total_value:.1f}s"
cs_cloud_str = f"~{cs_cloud_total_value:.0f}s"
cs_ssd_str = f"{cs_ssd_value}s"
cs_s3_str = "3-5s" # range for S3
cs_cuda_str = "0.3-0.5s" # range for CUDA
cs_compile_str = "15-30s" # range for TensorRT
cs_warmup_str = f"{cs_warmup_value}s"
```
::: {.callout-notebook title="ResNet-50: Cold Start Timeline"}
| **Phase** | **Duration** | **Notes** |
|:--------------------------------|:------------------------------------------|:--------------------------------------------------|
| **Weight loading (SSD)** | `{python} ColdStartCalc.cs_ssd_str` | 98 MB FP32 weights from local storage |
| **Weight loading (S3)** | `{python} ColdStartCalc.cs_s3_str` | Network latency dominates for cloud storage |
| **CUDA context** | `{python} ColdStartCalc.cs_cuda_str` | GPU driver initialization and memory setup |
| **TensorRT compilation** | `{python} ColdStartCalc.cs_compile_str` | Converts PyTorch model to optimized engine |
| **Warmup (10 inferences)** | `{python} ColdStartCalc.cs_warmup_str` | Triggers remaining lazy initialization |
| **Total (local, optimized)** | **`{python} ColdStartCalc.cs_local_str`** | With pre-compiled TensorRT engine, warm container |
| **Total (cloud, first deploy)** | **`{python} ColdStartCalc.cs_cloud_str`** | Including compilation from cold state |
**Key insight**: Pre-compiling models and storing the optimized engine eliminates the 30-second compilation phase on subsequent deployments.
:::
The CUDA context[^fn-cuda-context-serving]\index{CUDA Context!initialization overhead} is the first cost in the cold start timeline. Before any GPU operation, the CUDA runtime must establish a *context*: a data structure that tracks memory allocations, loaded kernels, and device state. Creating a context requires communicating with the GPU driver and allocating GPU memory for internal bookkeeping. This one-time cost (0.3--0.5 s) affects every new process that uses the GPU. CUDA 11+ introduced lazy initialization that defers some setup until first use, reducing apparent startup time but shifting cost to the first inference.
CUDA MPS (Multi-Process Service)[^fn-cuda-mps-serving]\index{CUDA MPS!context sharing} addresses the context overhead for multi-model deployments. Normally, each process creates its own CUDA context, and the GPU time-slices between contexts. MPS allows multiple processes to share a single context, eliminating redundant initialization and enabling concurrent kernel execution. For serving systems running multiple model replicas, MPS can reduce aggregate cold start time and improve GPU utilization. The trade-off is reduced isolation: a crash in one process can affect others sharing the MPS server.
[^fn-cuda-context-serving]: **CUDA (Compute Unified Device Architecture)**: NVIDIA's parallel computing platform (released June 2007), named for its goal of unifying diverse GPU shader models into a single general-purpose architecture. Before CUDA, GPU programming required disguising computations as graphics operations. The CUDA context---the data structure tracking memory allocations, loaded kernels, and device state---is the runtime's per-process gateway to GPU resources, and its 0.3--0.5 s creation cost makes it a significant component of cold start latency for serverless inference. \index{CUDA!context overhead}
[^fn-cuda-mps-serving]: **CUDA MPS (Multi-Process Service)**: Introduced in CUDA 5.0 (2012), MPS creates a daemon that mediates GPU access through a shared CUDA context, enabling true concurrent kernel execution rather than time-sliced scheduling between separate contexts. For multi-model serving, MPS eliminates redundant context initialization and allows replicas to share GPU streaming multiprocessors efficiently. The trade-off is fault isolation: all clients share one context, so a segfault in one process can corrupt GPU state for all others---a risk that MIG (hardware-level isolation) eliminates at the cost of fixed partition granularity. \index{CUDA MPS!multi-model serving}
Without warmup, the first real request triggers compilation and memory allocation mid-inference, often causing timeout failures. A request that normally takes 5 ms might require 500 ms during cold start, violating SLOs and degrading user experience.
### Loading Strategies {#sec-model-serving-loading-strategies-eb38}
Different loading strategies trade off cold start duration against serving performance and memory efficiency. The simplest approach, *full loading*\index{Model Loading!full loading}, reads the entire model into memory before serving begins. This maximizes inference speed since all weights are immediately available, but extends cold start duration and limits model size to available memory. The approach is appropriate when cold start latency is acceptable and models comfortably fit in memory.
When models are too large for immediate full loading, *memory mapping*\index{Memory Mapping!on-demand loading}\index{mmap!model loading} offers an alternative by mapping model files directly into the address space and loading pages on demand as accessed. This reduces cold start time since inference can begin before the full model loads, but causes unpredictable latency as pages fault in during initial requests. Memory mapping works well for infrequently accessed model components but can cause latency spikes if critical weights are not preloaded.
A third strategy, *lazy initialization*\index{Lazy Initialization!deferred compilation}, defers compilation and allocation until first use. This minimizes startup time but shifts latency to the first request. Production systems often combine lazy initialization with synthetic warmup requests to trigger initialization before real traffic arrives.
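A minimal warmup sketch follows: the replica runs a handful of synthetic inferences and synchronizes with the GPU before reporting itself ready, so compilation, autotuning, and memory-pool allocation happen off the critical path. The ResNet-style input shape and iteration count are illustrative.

::: {#lst-warmup-sketch lst-cap="**Warmup (Sketch)**: Runs synthetic inferences before the replica reports ready, moving lazy initialization off the critical path."}
```{.python}
import torch


def warm_up(model, device="cuda", iterations=10):
    """Trigger lazy initialization (kernel compilation, cuDNN autotuning,
    memory pools) with synthetic inputs before serving real traffic."""
    model.eval()
    dummy = torch.zeros(1, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(iterations):
            model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()  # Ensure all deferred setup has actually finished
    # Only after this should the replica pass its readiness probe and take traffic
```
:::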
### Model Caching Infrastructure {#sec-model-serving-model-caching-infrastructure-4f1a}
Production systems cache model weights at the infrastructure level to reduce cold start for common deployment scenarios. One approach, *container image embedding*\index{Model Caching!container embedding}, bundles model weights directly in the container image. This produces a single deployment artifact and eliminates network fetches at startup, but creates large images (often 10-50 GB) that slow container pulls and consume registry storage. This approach works best for models that rarely update.
For organizations with many models and frequent updates, a *shared filesystem* (EFS, GCS FUSE) containing model weights provides a more flexible alternative. Multiple replicas share cached weights, and updates propagate immediately without redeployment. The tradeoff is that network latency affects cold start, and filesystem availability becomes a critical dependency.
When cold start latency is critical for high-traffic models, *node-local SSD caching*\index{SSD Cache!model loading} pre-populates local SSDs on inference nodes with frequently-used models. This approach provides fast loading (500 MB/s+ for NVMe) without network dependency, but requires cache management to handle model updates and capacity limits. The choice among these strategies depends on model update frequency: infrequent updates favor container embedding, frequent updates favor shared filesystem, and performance-critical deployments benefit from local caching with background refresh.
### Multi-Model Serving {#sec-model-serving-multimodel-serving-a9c1}
Production systems often serve multiple models from a single machine\index{Multi-Model Serving!GPU memory management}, whether different model versions for A/B testing, ensemble components, or entirely different models sharing infrastructure. GPU memory becomes the limiting resource, requiring careful management strategies.
Three strategies address multi-model memory management. Time-multiplexing\index{Time-Multiplexing!model swapping} loads one model at a time and swaps based on request routing—simple but introduces swap latency. Memory sharing\index{Memory Sharing!GPU multi-model} partitions GPU memory among models, limiting concurrent execution count but enabling more models to remain resident. Model virtualization, as implemented by frameworks like Triton, manages model lifecycle automatically, loading and unloading models based on traffic patterns [@nvidia2024triton]. The choice depends on request patterns: if models receive traffic evenly, concurrent loading works; if traffic is bursty and model-specific, time-multiplexing with intelligent preloading reduces average latency while maximizing GPU utilization.
#### Multi-Stream Execution {#sec-model-serving-multistream-execution-1b1f}
When multiple models or multiple instances of the same model must run concurrently on a single GPU, the hardware must partition resources between them. NVIDIA's Multi-Instance GPU[^fn-mig-multimodel]\index{MIG (Multi-Instance GPU)!hardware isolation} technology enables hardware-level isolation, dividing an A100 into up to 7 independent GPU instances, each with dedicated memory and compute resources. MIG is available on A100, A30 (up to 4 instances), H100, H200, and newer data center GPUs. For older GPUs such as V100 or T4, CUDA stream scheduling provides time-multiplexed sharing without hardware isolation.
[^fn-mig-multimodel]: **MIG (Multi-Instance GPU)**: Introduced with NVIDIA's A100 (Ampere, 2020), MIG partitions a single physical GPU into up to seven independent instances, each with dedicated streaming multiprocessors, memory controllers, and L2 cache. Unlike software sharing (MPS or time-slicing), MIG provides hardware-level isolation: a runaway kernel in one partition cannot affect another's performance or memory. The trade-off is granularity---partitions must follow fixed profiles (e.g., 1g.5gb, 2g.10gb on A100), so resources cannot be divided arbitrarily. For multi-model serving, MIG eliminates the "noisy neighbor" problem, enabling per-model SLO guarantees on shared hardware. \index{MIG!hardware isolation}
The choice depends on whether the priority is consistent latency (MIG) or maximum utilization (shared CUDA streams).
#### Model Swapping and Host Memory {#sec-model-serving-model-swapping-host-memory-c54f}
When the aggregate size of all models exceeds GPU memory capacity, the serving system must swap models between host memory (DRAM)\index{DRAM!host memory} and device memory (VRAM)\index{VRAM!device memory} on demand. This introduces a new latency component determined by the PCIe bus bandwidth\index{PCIe Bandwidth!model swapping}.
```{python}
#| label: model-swap-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MODEL SWAP TIME
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Model swapping and host memory discussion
# │
# │ Goal: Quantify the latency cost of model swapping.
# │ Show: That loading a 10 GB model over PCIe takes 300 ms, exceeding most SLOs.
# │ How: Calculate transfer duration using PCIe Gen4 bandwidth constants.
# │
# │ Imports: mlsysim.core.constants (PCIE_GEN4_BW), mlsysim.book (fmt)
# │ Exports: model_size_gb_str, pcie_bw_gbs_str, model_swap_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import PCIE_GEN4_BW, GB, second
from mlsysim.fmt import fmt
class ModelSwapCalc:
"""Quantifies the latency cost of swapping a 10 GB model over PCIe Gen4."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
model_size_gb_value = 10 # model size (GB)
pcie_bw_gbs_value = PCIE_GEN4_BW.m_as(GB / second) # PCIe Gen4 x16 bandwidth
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
model_swap_ms_value = model_size_gb_value / pcie_bw_gbs_value * 1000
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
model_size_gb_str = f"{model_size_gb_value}"
pcie_bw_gbs_str = fmt(pcie_bw_gbs_value, precision=0, commas=False)
model_swap_ms_str = fmt(model_swap_ms_value, precision=0, commas=False)
```
For a `{python} ModelSwapCalc.model_size_gb_str` GB model on PCIe Gen4 x16 (`{python} ModelSwapCalc.pcie_bw_gbs_str` GB/s theoretical bandwidth), loading takes at least:
$T_{\text{load}}$ = `{python} ModelSwapCalc.model_size_gb_str` GB / `{python} ModelSwapCalc.pcie_bw_gbs_str` GB/s ≈ `{python} ModelSwapCalc.model_swap_ms_str` ms
To mitigate this, systems use *pinned memory*\index{Pinned Memory!DMA transfer} (page-locked host memory). By default, the operating system can move ("page") any memory region to disk when RAM is under pressure. This creates a problem for GPU transfers: if the GPU's DMA (Direct Memory Access) engine begins reading a memory region that gets paged out mid-transfer, the transfer fails or stalls. To avoid this, the CPU must first copy data to a temporary pinned buffer before the GPU can safely read it, adding both latency and CPU overhead.
Pinning memory instructs the OS to keep that region permanently in physical RAM. The GPU's DMA engine can then transfer data directly from the pinned region at full PCIe bandwidth without CPU involvement. The trade-off is that pinned memory reduces the RAM available for other processes and cannot be reclaimed under memory pressure. For model serving, the performance gain (2--3$\times$ faster transfers) typically justifies pinning model weights and frequently-used input buffers, while leaving less critical memory pageable.
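A minimal PyTorch microbenchmark sketch makes the difference concrete; the 1 GiB buffer size is an arbitrary illustrative choice, and the measured gap depends on the platform's PCIe generation and chipset.

```{.python}
import time
import torch

N = 1024**3 // 4                                      # ~1 GiB of float32 data
pageable = torch.empty(N)                             # ordinary (pageable) host memory
pinned = torch.empty(N, pin_memory=True)              # page-locked host memory

def host_to_device_ms(src):
    torch.cuda.synchronize()
    start = time.perf_counter()
    src.to("cuda", non_blocking=True)                 # host-to-device copy over PCIe
    torch.cuda.synchronize()                          # wait for the copy to land
    return (time.perf_counter() - start) * 1000

print(f"pageable: {host_to_device_ms(pageable):.0f} ms, "
      f"pinned: {host_to_device_ms(pinned):.0f} ms")
```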
The lifecycle management strategies examined so far ensure models are ready to serve: loaded into memory, warmed up, and producing predictions consistent with training. With these prerequisites satisfied, the queuing dynamics from @sec-model-serving-queuing-theory-tail-latency-29a6 become relevant. The next optimization opportunity lies in how requests are grouped for processing, which directly affects both the throughput and latency terms in our queuing equations.
## Throughput Optimization {#sec-model-serving-throughput-optimization-18d1}
Consider a ResNet-50 classifier running on a V100 GPU at batch size 1: the GPU processes one image, then sits idle while the CPU fetches and preprocesses the next—achieving only 15% hardware utilization and 200 images per second. The same GPU processing 32 images at once reaches 95% utilization and 1,280 images per second, a 6.4$\times$ throughput improvement on identical hardware. The difference is batching, the core lever for improving serving economics. Batching\index{Batching!training vs serving}\index{Batching!throughput optimization}[^fn-batch-serving-tradeoff] differs sharply between training and serving [@crankshaw2017clipper]. Training batches maximize throughput by processing hundreds or thousands of samples together with no concern for individual sample latency. Serving batches must balance throughput against individual request latency, typically processing single digits of requests together while ensuring no request waits too long. This adaptive approach is called **dynamic batching** because the system adjusts batch composition in real time based on arriving requests.
[^fn-batch-serving-tradeoff]: **Batch**: From Old French *bache* (a quantity baked at one time), entering computing in the 1950s for jobs processed together without human interaction. The ML serving usage preserves the original trade-off: grouping requests amortizes fixed costs (kernel launch, weight loading) across multiple inputs, but each request must wait for the batch to fill. In training, batches of 256--4096 are routine; in serving, batches above 8--32 typically violate latency SLOs, making the serving batch a fundamentally different optimization target. \index{Batching!etymology}
::: {.callout-definition title="Dynamic Batching"}
***Dynamic Batching***\index{Dynamic Batching!definition} is the runtime optimization of trading **Latency** for **Throughput** under stochastic arrival patterns.
1. **Significance (Quantitative):** By buffering requests into a **Batching Window**, the scheduler amortizes fixed overheads ($L_{\text{lat}}$) across multiple inputs, pushing the system away from the memory-bound regime ($BW$) toward the compute-bound regime ($R_{\text{peak}}$).
2. **Distinction (Durable):** Unlike **Static Batching**, which is fixed during training, Dynamic Batching adaptively adjusts the batch size at **Inference Time** based on real-time traffic volume.
3. **Common Pitfall:** A frequent misconception is that batching "always helps." In reality, there is a **Latency-Throughput Pareto Frontier**: if the batching window is too large, the increased **Queuing Delay** may violate the system's SLO before the throughput gains are realized.
:::
### Why Batching Helps {#sec-model-serving-batching-helps-f1dc}
Modern accelerators achieve peak efficiency only at sufficient batch sizes\index{GPU Utilization!batch size dependency} [@shen2019nexus]. A single inference request leaves most compute units idle because GPUs are designed for parallel execution across thousands of threads. Batching amortizes fixed costs across multiple requests and enables parallel execution across the batch dimension.
Two fixed costs dominate at small batch sizes. **Kernel launch overhead**\index{Kernel Launch Overhead!fixed cost}[^fn-kernel-launch-serving] is the time for the CPU to prepare and submit work to the GPU. Each layer in a neural network typically requires a separate kernel launch: the CPU must assemble kernel parameters, copy them to GPU-accessible memory, and signal the GPU to begin execution. This overhead is typically 520 μs per kernel, independent of batch size. ResNet-50 has approximately 50 layers, so kernel launch alone adds 2501000 μs per inference. At batch size 1, this overhead may exceed the actual compute time; at batch size 32, the same overhead is amortized across 32 images. **Weight loading**\index{Weight Loading!memory efficiency} reads model parameters from GPU memory (VRAM) to the compute units. At batch size 1, the GPU reads all weights to process one image; at batch size 32, the same weight read processes 32 images, achieving 32$\times$ better memory efficiency. Measuring *batching efficiency* on a concrete model quantifies how these fixed costs amortize in practice.
[^fn-kernel-launch-serving]: **Kernel (GPU)**: CUDA borrowed this term from operating systems circa 2007 because GPU functions represent the computational "core" of parallel algorithms. Unlike OS kernels that run continuously, GPU kernels are discrete units of parallel work launched by the CPU. Each launch carries 5--20 $\mu$s of overhead independent of batch size---negligible for large training batches but dominant at batch-1 serving, where a 50-layer model accumulates 250--1000 $\mu$s of pure launch overhead per inference. \index{Kernel Launch!serving overhead}
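A back-of-the-envelope sketch, using the 50-layer count and the 5 to 20 microsecond per-kernel launch cost quoted above, shows how quickly batching drives the per-image launch overhead toward irrelevance.

```{.python}
LAYERS = 50                       # kernel launches per forward pass (ResNet-50)
LAUNCH_US = (5, 20)               # per-kernel launch overhead range (microseconds)

for batch in (1, 8, 32):
    low = LAYERS * LAUNCH_US[0] / batch
    high = LAYERS * LAUNCH_US[1] / batch
    print(f"batch={batch:>2}: launch overhead per image = {low:.0f}-{high:.0f} us")
# batch= 1: 250-1000 us per image (comparable to the compute itself)
# batch=32: 8-31 us per image (negligible next to the math)
```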
```{python}
#| label: batch-throughput-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING THROUGHPUT AND LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Batching Efficiency" key insight
# │
# │ Goal: Quantify the throughput-latency trade-off of batching.
# │ Show: That batch-32 achieves 6.4× throughput at the cost of 7× higher latency.
# │ How: Contrast batch-1 and batch-32 performance including window wait times.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: throughput_ratio_str, batch_window_ms_str, batch32_inference_ms_str,
# │ batch32_total_str, batch1_inference_total_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class BatchThroughputCalc:
"""Quantifies the 6.4× throughput gain of batch-32 over batch-1 and its latency cost."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
batch1_throughput_value = 200 # batch-1 throughput (img/s)
batch32_throughput_value = 1280 # batch-32 throughput (img/s)
batch32_inference_ms_value = 25.0 # batch-32 inference time (ms)
batch_window_ms_value = 10.0 # batching window (ms)
batch1_inference_total_ms_value = 5.0 # batch-1 total latency (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
throughput_ratio_value = batch32_throughput_value / batch1_throughput_value
batch32_total_ms_value = batch_window_ms_value + batch32_inference_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
throughput_ratio_str = fmt(throughput_ratio_value, precision=1, commas=False)
batch_window_ms_str = fmt(batch_window_ms_value, precision=0, commas=False)
batch32_inference_ms_str = fmt(batch32_inference_ms_value, precision=0, commas=False)
batch32_total_str = fmt(batch32_total_ms_value, precision=0, commas=False)
batch1_inference_total_ms_str = fmt(batch1_inference_total_ms_value, precision=0, commas=False)
```
::: {.callout-notebook title="ResNet-50 Batching Efficiency"}
The throughput-latency tradeoff for ResNet-50 on a V100 GPU illustrates the power of batching:
| **Batch Size** | **Inference Time*** | **Per-Image Compute** | **Throughput** | **GPU Util.** |
|:---------------|--------------------:|----------------------:|---------------:|--------------:|
| 1 | 5.0 ms | 5.0 ms | 200 img/s | 15% |
| 4 | 7.2 ms | 1.8 ms | 556 img/s | 42% |
| 8 | 9.1 ms | 1.1 ms | 879 img/s | 65% |
| 16 | 14.0 ms | 0.9 ms | 1,143 img/s | 85% |
| 32 | 25.0 ms | 0.8 ms | 1,280 img/s | 95% |
Note: Times shown are pure inference time, excluding queue wait. @sec-model-serving-traffic-patterns-batching-strategy-2e6b analyzes how user-perceived latency includes batching window wait.
**Key insight**: Batch size 32 achieves `{python} BatchThroughputCalc.throughput_ratio_str`$\times$ higher throughput than batch size 1. However, user-perceived latency includes both queue wait and inference time. With a `{python} BatchThroughputCalc.batch_window_ms_str` ms batching window and `{python} BatchThroughputCalc.batch32_inference_ms_str` ms inference, total latency reaches `{python} BatchThroughputCalc.batch32_total_str` ms versus `{python} BatchThroughputCalc.batch1_inference_total_ms_str` ms at batch size 1.
:::
The table reveals the throughput-latency tradeoff in stark terms: larger batches dramatically improve hardware efficiency but increase per-request latency. In practice, the optimal batch size depends on both the latency Service Level Objective (SLO) and the arrival rate of requests. The question facing every serving engineer is therefore quantitative: determining the largest batch size that still meets a given latency SLO. The following analysis shows how to find *the batching sweet spot*.
```{python}
#| label: batching-sweetspot-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING SWEET SPOT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Batching Sweet Spot"
# │
# │ Goal: Demonstrate the economic "sweet spot" for batching.
# │ Show: That batch-8 yields 3× throughput gain while remaining within typical SLOs.
# │ How: Model throughput and latency for small batches (1 to 8).
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: batch1_ms_str, batch1_imgs_str, batch8_* strings, latency_increase_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class BatchingSweetspotCalc:
"""Demonstrates that batch-8 yields ~3× throughput within a 20ms SLO budget."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
batch1_ms_value = 5.0 # batch-1 inference (ms)
batch1_imgs_value = 200 # batch-1 throughput (img/s)
batch8_wait_ms_value = 5.0 # batch-8 wait time (ms)
batch8_inference_ms_value = 9.0 # batch-8 inference time (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
batch8_user_latency_ms_value = batch8_wait_ms_value + batch8_inference_ms_value
batch8_throughput_value = 8 / (batch8_user_latency_ms_value / 1000)
latency_increase_value = batch8_user_latency_ms_value / batch1_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
batch1_ms_str = fmt(batch1_ms_value, precision=0, commas=False)
batch1_imgs_str = f"{batch1_imgs_value}"
batch8_wait_ms_str = fmt(batch8_wait_ms_value, precision=0, commas=False)
batch8_inference_ms_str = fmt(batch8_inference_ms_value, precision=0, commas=False)
batch8_user_latency_str = fmt(batch8_user_latency_ms_value, precision=0, commas=False)
batch8_throughput_str = fmt(batch8_throughput_value, precision=0, commas=False)
latency_increase_str = fmt(latency_increase_value, precision=0, commas=False)
```
::: {.callout-notebook title="The Batching Sweet Spot"}
**Problem**: A ResNet-50 model is served at batch=1, leaving the GPU mostly idle (15% utilization). The goal is to increase throughput to reduce cost, subject to a **20 ms** latency budget.
**The Math**:
1. **Baseline (Batch 1)**: Inference = **`{python} BatchingSweetspotCalc.batch1_ms_str` ms**. Throughput = **`{python} BatchingSweetspotCalc.batch1_imgs_str` img/s**.
2. **Optimized (Batch 8)**:
- **Wait Time**: A **`{python} BatchingSweetspotCalc.batch8_wait_ms_str` ms** batching window collects requests.
- **Inference Time**: Batch 8 inference takes **`{python} BatchingSweetspotCalc.batch8_inference_ms_str` ms**.
- **User Latency**: `{python} BatchingSweetspotCalc.batch8_wait_ms_str` ms (wait) + `{python} BatchingSweetspotCalc.batch8_inference_ms_str` ms (compute) = **`{python} BatchingSweetspotCalc.batch8_user_latency_str` ms**.
- **Throughput**: 8 img / `{python} BatchingSweetspotCalc.batch8_user_latency_str` ms ≈ **`{python} BatchingSweetspotCalc.batch8_throughput_str` img/s**.
**The Systems Conclusion**: By accepting a **`{python} BatchingSweetspotCalc.latency_increase_str`$\times$ increase in latency** (`{python} BatchingSweetspotCalc.batch1_ms_str` ms → `{python} BatchingSweetspotCalc.batch8_user_latency_str` ms), the system achieves nearly **`{python} BatchingSweetspotCalc.latency_increase_str`$\times$ higher throughput** on the same hardware. As long as `{python} BatchingSweetspotCalc.batch8_user_latency_str` ms remains under the 20 ms budget, this is "free" capacity. This trade-off is the primary lever of serving economics.
:::
The **"Knee"** in @fig-throughput-latency-knee marks the point where the blue throughput curve begins to plateau just as the orange latency curve starts its sharp upward spike. This is the optimal operating point: push batch size beyond the knee and queuing delays dominate; staying below it leaves hardware capacity on the table. The numbers are representative rather than tied to a single benchmark.
::: {#fig-throughput-latency-knee fig-env="figure" fig-pos="htb" fig-cap="**The Throughput-Latency Knee.** Batch Size vs. Throughput (Blue) and Latency (Orange). Throughput increases with batch size as hardware utilization improves, but eventually saturates. Latency remains relatively flat until the 'Knee,' after which it spikes due to queuing. Values are representative and depend on model/hardware." fig-alt="Dual-axis line chart. Blue line (Throughput) rises and plateaus. Orange line (Latency) stays low then spikes upward. A vertical line marks the optimal point where throughput is high before latency explodes."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ THROUGHPUT-LATENCY KNEE FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-throughput-latency-knee — dual-axis plot preceding the
# │ dynamic batching section, illustrating the throughput/latency tradeoff
# │
# │ Goal: Show that throughput saturates and latency spikes at the same batch
# │ size threshold, revealing the optimal operating point.
# │ Show: Throughput (req/s) and latency (ms) as a function of batch size on
# │ log-scale x-axis with an annotated optimal point.
# │ How: Plot tabular BATCHING_DATA on dual y-axes; mark optimal with axvline.
# │
# │ Imports: pandas (pd), mlsysim.core.viz (viz)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
import pandas as pd
from mlsysim import viz
fig, ax1, COLORS, plt = viz.setup_plot()
# =============================================================================
# DATA
# =============================================================================
BATCHING_DATA = [
{'BatchSize': 1, 'Throughput': 64, 'Latency': 15.6},
{'BatchSize': 2, 'Throughput': 120, 'Latency': 16.5},
{'BatchSize': 4, 'Throughput': 230, 'Latency': 17.4},
{'BatchSize': 8, 'Throughput': 404, 'Latency': 19.8},
{'BatchSize': 16, 'Throughput': 650, 'Latency': 24.6},
{'BatchSize': 32, 'Throughput': 935, 'Latency': 34.2},
{'BatchSize': 64, 'Throughput': 1100, 'Latency': 60.0},
{'BatchSize': 128, 'Throughput': 1143, 'Latency': 136.8},
{'BatchSize': 256, 'Throughput': 1150, 'Latency': 300.0}
]
df = pd.DataFrame(BATCHING_DATA)
# =============================================================================
# PLOT: The Throughput-Latency Knee
# =============================================================================
color_tp, color_lat = COLORS['BlueLine'], COLORS['OrangeLine']
ax1.plot(df['BatchSize'], df['Throughput'], 'o-', color=color_tp, label='Throughput')
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Throughput (Requests/sec)', color=color_tp, fontweight='bold')
ax1.tick_params(axis='y', labelcolor=color_tp)
ax1.set_xscale('log', base=2)
ax2 = ax1.twinx()
ax2.plot(df['BatchSize'], df['Latency'], 's-', color=color_lat, label='Latency')
ax2.set_ylabel('Latency (ms)', color=color_lat, fontweight='bold', rotation=270, labelpad=15)
ax2.tick_params(axis='y', labelcolor=color_lat)
ax2.spines['right'].set_visible(True)
ax2.spines['top'].set_visible(False)
optimal_idx = 5
ax1.axvline(df['BatchSize'].iloc[optimal_idx], color='gray', linestyle='--', alpha=0.5)
ax1.text(df['BatchSize'].iloc[optimal_idx], 200, " Optimal\n Point", ha='right', color='gray', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
plt.show()
```
:::
The efficiency gains from batching come at a cost: requests must wait for the batch to form. This creates a direct tension between throughput optimization (larger batches) and latency minimization (immediate processing). The different batching strategies and their tradeoffs govern how engineers tune this balance.
### Static vs Dynamic Batching {#sec-model-serving-static-vs-dynamic-batching-fd0a}
Static batching\index{Static Batching!fixed batch size} waits for a fixed batch size before processing. Simple to implement but problematic in practice: during low traffic, requests wait indefinitely for a full batch, and during high traffic, large batches increase per-request latency.
Dynamic batching\index{Dynamic Batching!time window} addresses these limitations by collecting requests within a time window and processing whatever has arrived when the window closes [@olston2017tensorflow]. This bounds maximum wait time regardless of traffic level. The window size represents a direct tradeoff: shorter windows reduce latency but sacrifice throughput; longer windows improve throughput but increase latency.
Typical configurations use windows of 550 ms with maximum batch sizes of 832 for latency-sensitive applications. The optimal configuration depends on request arrival patterns, model characteristics, and latency requirements.
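A minimal sketch of the collection loop, assuming requests arrive on a standard thread-safe queue and that the window starts when the first request arrives, illustrates how the window bounds maximum wait regardless of traffic.

```{.python}
import queue
import time

def collect_dynamic_batch(request_queue, max_batch=32, window_ms=10):
    """Collect up to max_batch requests, waiting at most window_ms in total."""
    batch = [request_queue.get()]                 # block until the first request
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # window closed: serve what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break                                 # window expired while waiting
    return batch
```

Static batching corresponds to dropping the deadline entirely and looping until `max_batch` requests arrive, which is exactly the indefinite-wait failure mode described above.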
### Dynamic Batching Latency-Throughput Trade-offs {#sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d}
Dynamic batching introduces a quantifiable tension between throughput optimization and latency constraints, revealing *why latency spikes under load* and enabling systematic configuration decisions rather than trial-and-error tuning.
::: {.callout-notebook title="Why Latency Spikes Under Load"}
**Recall** from @sec-model-serving-littles-law-9352: Little's Law ($L = \lambda W$) governs all stable queues. When hardware is saturated (throughput $\lambda$ is maxed out), any increase in traffic increases queue depth ($L$). Since $\lambda$ cannot grow, **latency ($W$) must grow linearly with queue depth**. This is why **admission control** (rejecting requests when $L$ exceeds a threshold) is the only way to preserve latency during overload.
:::
@eq-batching-latency decomposes the total user-perceived latency for a batched request into two components:
$$L_{\text{lat}} = L_{\text{lat,wait}} + L_{\text{lat,compute}}(b)$$ {#eq-batching-latency}
where $L_{\text{lat,wait}}$ is the time spent waiting in the batching queue (corresponding to $L_{queue}$ in the overall latency budget) and $L_{\text{lat,compute}}(b)$ is the inference time for batch size $b$ (encompassing $L_{infer}$ plus portions of $L_{pre}$ and $L_{post}$). The batching window $T$ bounds wait time ($L_{\text{lat,wait}} \leq T$), while batch size affects compute time through GPU utilization characteristics.
#### Quantitative Analysis of Batching {#sec-model-serving-queue-waiting-time-analysis-8d5c}
For Poisson arrivals with rate $\lambda$ and batching window $T$, requests arrive uniformly within the window. A request arriving at time $t$ within the window waits $T - t$ for the batch to close. @eq-avg-wait shows that the average wait time is simply half the window:
$$E[L_{\text{lat,wait}}] = \frac{T}{2}$$ {#eq-avg-wait}
```{python}
#| label: batching-budget-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING WINDOW LATENCY BUDGET
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Batching window latency budget analysis
# │
# │ Goal: Demonstrate how batching windows consume the latency budget.
# │ Show: That a 20ms window consumes 20% of a 50ms SLO before computation.
# │ How: Calculate average wait time assuming uniform request arrival.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: avg_wait_str, budget_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class BatchingBudgetCalc:
"""Shows that a 20ms batching window consumes 20% of a 50ms SLO before any computation."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
batch_window_ms_value = 20 # batching window (ms)
slo_ms_value = 50 # latency SLO (ms)
inference_ms_value = 5 # inference time (ms)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
avg_wait_ms_value = batch_window_ms_value / 2
budget_pct_value = avg_wait_ms_value / slo_ms_value * 100
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
avg_wait_str = fmt(avg_wait_ms_value, precision=0, commas=False)
budget_pct_str = fmt(budget_pct_value, precision=0, commas=False)
```
This simple relationship has direct implications. A 20 ms batching window adds `{python} BatchingBudgetCalc.avg_wait_str` ms average latency regardless of batch size achieved. For a 50 ms latency SLO with 5 ms inference, the batching window consumes `{python} BatchingBudgetCalc.budget_pct_str`% of the latency budget before any computation begins.
#### Batch Size Distribution {#sec-model-serving-batch-size-distribution-b3d3}
The number of requests collected during window $T$ follows a Poisson distribution with mean $\lambda T$. @eq-batch-distribution formalizes this relationship:
$$P(\text{batch size} = k) = \frac{(\lambda T)^k e^{-\lambda T}}{k!}$$ {#eq-batch-distribution}
@tbl-batch-variability quantifies this variability, showing how batch size fluctuates for different traffic levels with a fixed 10 ms window:
| **Arrival Rate** | **Mean Batch** | **Std Dev** | **P(batch=0)** | **P(batch≥2$\times$ mean)** |
|:-----------------|---------------:|------------:|---------------:|----------------------------:|
| **50 QPS** | 0.5 | 0.7 | 61% | 39% |
| **200 QPS** | 2.0 | 1.4 | 14% | 14% |
| **500 QPS** | 5.0 | 2.2 | 0.7% | 3% |
| **1000 QPS** | 10.0 | 3.2 | 0.005% | 0.3% |
: **Batch Size Variability**: At low traffic, batching windows frequently contain zero requests (wasted GPU cycles). At moderate traffic, batch sizes fluctuate significantly around the mean. High traffic provides more stable batching, and the probability of batches exceeding twice the mean size decreases as traffic grows (from 39% at 50 QPS to 0.3% at 1000 QPS), reflecting the law of large numbers. {#tbl-batch-variability}
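The table's probabilities follow directly from the Poisson model; a short sketch (assuming SciPy is available) reproduces them for the same 10 ms window.

```{.python}
from scipy.stats import poisson

WINDOW_S = 0.010                                    # 10 ms batching window
for qps in (50, 200, 500, 1000):
    mean = qps * WINDOW_S                           # E[batch size] = lambda * T
    p_empty = poisson.pmf(0, mean)                  # window closes with no requests
    p_large = poisson.sf(2 * mean - 1, mean)        # P(batch >= 2 * mean)
    print(f"{qps:>4} QPS: mean={mean:4.1f}  "
          f"P(empty)={p_empty:.3%}  P(>=2x mean)={p_large:.1%}")
```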
#### Throughput Maximization Strategy {#sec-model-serving-throughput-maximization-strategy-27f5}
Throughput optimization requires maximizing the number of requests processed per unit time. For a system with service time $S(b)$ for batch size $b$, throughput follows @eq-batch-throughput:
$$\text{Throughput}(b) = \frac{b}{T + S(b)}$$ {#eq-batch-throughput}
The numerator increases linearly with batch size while the denominator increases sub-linearly (due to GPU parallelism). This creates an optimal batch size that balances these competing effects.
For ResNet-50 on a V100 GPU, service time approximately scales as $S(b) = 5\text{ms} + 0.6b$ (5 ms fixed overhead plus 0.6 ms per additional image in the batch). This linear approximation captures the dominant trend; actual service times may deviate slightly due to memory hierarchy effects. With $T = 10$ms batching window:
```{python}
#| label: batching-analysis-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING THROUGHPUT ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-batching-throughput and "Iron Law of Batching Efficiency" callout
# │
# │ Goal: Quantify the efficiency gains from high-batch serving.
# │ Show: That batch-32 improves utilization from 11% to 79% over batch-1.
# │ How: Contrast throughput and latency while applying Iron Law efficiency terms.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: throughput_*, b1_*, b32_*, il_*, T_window_str
# │ service_time_value (function, used by slo-violation-calc)
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
# Helper functions at module scope (used by class body and slo-violation-calc)
def service_time_value(b, _fixed=5.0, _per_image=0.6):
return _fixed + _per_image * b
def total_latency_value(b, _T=10.0, _fixed=5.0, _per_image=0.6):
return _T + _fixed + _per_image * b
def throughput_value(b, _T=10.0, _fixed=5.0, _per_image=0.6):
return b / ((_T + _fixed + _per_image * b) / 1000)
class BatchingAnalysisCalc:
"""Quantifies batching efficiency: batch-32 achieves 14.6× throughput gain over batch-1."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
T_window_value = 10.0 # batching window (ms)
fixed_overhead_ms_value = 5.0 # fixed overhead (ms)
per_image_ms_value = 0.6 # per-image compute (ms)
batch_sizes_value = [1, 4, 8, 16, 32]
il_overhead_ms_value = 5.0 # Iron Law overhead (ms)
il_compute_b1_ms_value = 0.6 # batch-1 compute (ms)
il_compute_b32_ms_value = 19.2 # batch-32 compute (ms)
il_threshold_pct_value = 10 # efficiency threshold (%)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# for-loop avoids Python 3 class-scope comprehension issue with batch_sizes_value
throughputs_value = {}
for b in batch_sizes_value:
throughputs_value[b] = throughput_value(b)
latencies_value = {}
for b in batch_sizes_value:
latencies_value[b] = total_latency_value(b)
throughput_increase_value = throughputs_value[32] / throughputs_value[1]
il_eff_b1_pct_value = int(il_compute_b1_ms_value / (il_overhead_ms_value + il_compute_b1_ms_value) * 100)
il_eff_b32_pct_value = int(il_compute_b32_ms_value / (il_overhead_ms_value + il_compute_b32_ms_value) * 100)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
throughput_increase_str = fmt(throughput_increase_value, precision=1, commas=False)
b1_throughput_str = fmt(throughputs_value[1], precision=0, commas=False)
b32_throughput_str = fmt(throughputs_value[32], precision=0, commas=False)
b1_latency_str = fmt(latencies_value[1], precision=1, commas=False)
b32_latency_str = fmt(latencies_value[32], precision=1, commas=False)
il_overhead_str = fmt(il_overhead_ms_value, precision=0, commas=False)
il_compute_b1_str = f"{il_compute_b1_ms_value:.1f}"
il_compute_b32_str = f"{il_compute_b32_ms_value:.1f}"
il_eff_b1_str = f"{il_eff_b1_pct_value}"
il_eff_b32_str = f"{il_eff_b32_pct_value}"
il_threshold_str = fmt(il_threshold_pct_value, precision=0, commas=False)
T_window_str = fmt(T_window_value, precision=0, commas=False)
# service_time_value, total_latency_value, throughput_value already at module scope
```
| **Batch Size** | **Service Time** | **Total Latency** | **Throughput** | **Efficiency** |
|:---------------|-----------------:|------------------:|---------------:|:---------------|
| 1 | 5.6 ms | 15.6 ms | 64 img/s | Low |
| 4 | 7.4 ms | 17.4 ms | 230 img/s | Moderate |
| 8 | 9.8 ms | 19.8 ms | 404 img/s | Good |
| 16 | 14.6 ms | 24.6 ms | 650 img/s | High |
| 32 | 24.2 ms | 34.2 ms | 935 img/s | Maximum |
: **Batching Throughput Analysis**: ResNet-50 throughput on V100 with 10 ms batching window. Throughput increases 14.6$\times$ from batch size 1 to 32 (64 to 935 img/s), but total latency more than doubles (15.6 ms to 34.2 ms). The optimal configuration depends on whether the latency SLO or throughput target is the binding constraint. {#tbl-batching-throughput}
The throughput gains in @tbl-batching-throughput trace directly back to *the Iron Law of batching efficiency*, the framework established in @sec-model-training-iron-law-training-performance-a53f, where batching amortizes the fixed overhead term.
::: {.callout-notebook title="The Iron Law of Batching Efficiency"}
**The Iron Law Connection:**
In serving, we maximize throughput by amortizing the **Latency Term** ($L_{\text{lat}}$), as shown in @eq-compute-time:
$$ T = \frac{O}{R_{\text{peak}} \cdot \eta} + L_{\text{lat}} $$ {#eq-compute-time}
**Deriving the Sweet Spot:**
* **Case 1 (Batch 1):** Overhead (`{python} BatchingAnalysisCalc.il_overhead_str` ms) ≈ Compute (`{python} BatchingAnalysisCalc.il_compute_b1_str` ms). Efficiency ≈ `{python} BatchingAnalysisCalc.il_eff_b1_str`%. The GPU is mostly waiting.
* **Case 2 (Batch 32):** Overhead (`{python} BatchingAnalysisCalc.il_overhead_str` ms) ≪ Compute (`{python} BatchingAnalysisCalc.il_compute_b32_str` ms). Efficiency ≈ `{python} BatchingAnalysisCalc.il_eff_b32_str`%. The GPU is crunching numbers.
**The Golden Rule:** Increase batch size until the **Latency Term** becomes negligible (< `{python} BatchingAnalysisCalc.il_threshold_str`% of total time). Beyond this point, additional batching yields minimal throughput but imposes a linear latency penalty.
:::
#### Latency-Constrained Optimization {#sec-model-serving-latencyconstrained-optimization-8f66}
When latency SLOs provide the binding constraint, the optimization problem becomes finding the maximum batch size that meets the SLO. For a latency target $L_{\text{lat,target}}$ and average wait time $T/2$, @eq-latency-constrained-batch defines the maximum allowable batch size using a first-order **average** latency approximation:
$$b_{\text{max}} = \max\{b : \frac{T}{2} + S(b) \leq L_{\text{lat,target}}\}$$ {#eq-latency-constrained-batch}
Consider a 50 ms p95 latency SLO for ResNet-50 serving (using this mean-based approximation as a starting point):
```{python}
#| label: latency-constrained-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LATENCY-CONSTRAINED OPTIMIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency-constrained optimization narrative, comparing conservative
# │ vs aggressive batching window scenarios for a 50ms p95 SLO
# │
# │ Goal: Demonstrate diminishing returns for large batching windows.
# │ Show: That aggressive windows gain only ~12% throughput while adding 10ms wait.
# │ How: Contrast throughput and average wait for conservative vs. long windows.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: s1_window_ms_str, s1_wait_ms_str, s1_budget_ms_str,
# │ s1_max_batch_str, s1_batch_str, s1_throughput_str,
# │ s2_window_ms_str, s2_wait_ms_str, s2_budget_ms_str,
# │ s2_batch_str, s2_throughput_str,
# │ throughput_gain_pct_str, latency_avg_increase_ms_str,
# │ latency_p99_increase_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt

def check(condition, message):
    """Guard helper: raise if a computed invariant does not hold."""
    assert condition, message
# ┌── LEGO ───────────────────────────────────────────────
class BatchingOptimization:
"""
Namespace for Latency-Constrained Batching Optimization.
Scenario: Comparing 5ms (Conservative) vs 25ms (Aggressive) batching windows.
"""
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
# Scenario 1 (Conservative)
s1_window = 5.0
s1_batch = 32
s1_tput = 1140.0
# Scenario 2 (Aggressive)
s2_window = 25.0
s2_batch = 48
s2_tput = 1280.0
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
# Step 1: Avg wait = Window / 2
s1_wait = s1_window / 2
s2_wait = s2_window / 2
# Step 2: Budget (target 50ms)
s1_budget = 50 - s1_wait
s2_budget = 50 - s2_wait
# Step 3: Trade-off metrics
tput_gain = ((s2_tput / s1_tput) - 1) * 100
latency_increase = s2_wait - s1_wait
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
check(tput_gain <= 25, f"Aggressive batching gained too much throughput ({tput_gain:.1f}%). Diminishing returns not shown.")
check(latency_increase >= 5, "Latency penalty is too small to be a concern.")
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
s1_window_ms_str = f"{int(s1_window)}"
s1_wait_ms_str = f"{s1_wait:.0f}"
s1_budget_ms_str = f"{s1_budget:.0f}"
s1_max_batch_str = "70" # Theoretical ceiling
s1_batch_str = f"{s1_batch}"
s1_throughput_str = f"{int(s1_tput):,}"
s2_window_ms_str = f"{int(s2_window)}"
s2_wait_ms_str = f"{s2_wait:.1f}"
s2_budget_ms_str = f"{s2_budget:.1f}"
s2_batch_str = f"{s2_batch}"
s2_throughput_str = f"{int(s2_tput):,}"
throughput_gain_pct_str = f"{tput_gain:.0f}"
latency_avg_increase_ms_str = f"{latency_increase:.0f}"
# Simplified P99 increase for prose consistency
latency_p99_increase_ms_str = f"{int(s2_window - s1_window)}"
```
**Scenario 1: Conservative window (T = `{python} BatchingOptimization.s1_window_ms_str`ms)**
- Average wait: `{python} BatchingOptimization.s1_wait_ms_str`ms
- Latency budget for inference: `{python} BatchingOptimization.s1_budget_ms_str`ms
- Maximum batch size: `{python} BatchingOptimization.s1_max_batch_str` (but typically capped at 32 for memory)
- Achieved throughput: ~`{python} BatchingOptimization.s1_throughput_str` img/s (batch=`{python} BatchingOptimization.s1_batch_str`)
**Scenario 2: Aggressive window (T = `{python} BatchingOptimization.s2_window_ms_str`ms)**
- Average wait: `{python} BatchingOptimization.s2_wait_ms_str`ms
- Latency budget for inference: `{python} BatchingOptimization.s2_budget_ms_str`ms
- Maximum batch size: `{python} BatchingOptimization.s2_batch_str`
- Achieved throughput: ~`{python} BatchingOptimization.s2_throughput_str` img/s (batch=`{python} BatchingOptimization.s2_batch_str`)
The aggressive window achieves only `{python} BatchingOptimization.throughput_gain_pct_str`% higher throughput but increases average latency by `{python} BatchingOptimization.latency_avg_increase_ms_str`ms and p99 latency by `{python} BatchingOptimization.latency_p99_increase_ms_str`ms. As the trend in @tbl-batching-throughput already suggested, for latency-sensitive applications the conservative window provides a better user experience at modest throughput cost.
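As a cross-check, a small solver over @eq-latency-constrained-batch using the linear service-time model from above ($S(b) = 5\text{ ms} + 0.6b$ ms) recovers Scenario 1's theoretical ceiling of 70 and shows how a memory cap binds first; the function names are illustrative.

```{.python}
def service_time_ms(b, fixed=5.0, per_image=0.6):
    """Linear service-time model: S(b) = 5 ms + 0.6 ms per image."""
    return fixed + per_image * b

def max_batch_for_slo(window_ms, slo_ms, memory_cap=None):
    """Largest b whose mean wait (T/2) plus service time fits inside the SLO."""
    b = 0
    while window_ms / 2 + service_time_ms(b + 1) <= slo_ms:
        b += 1
    return min(b, memory_cap) if memory_cap is not None else b

print(max_batch_for_slo(window_ms=5, slo_ms=50))                  # 70
print(max_batch_for_slo(window_ms=5, slo_ms=50, memory_cap=32))   # 32
```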
#### SLO Violation Analysis {#sec-model-serving-slo-violation-analysis-6ebf}
Batch size variability causes SLO violations even when mean latency appears safe. The p99 latency includes both worst-case wait time (full window) and worst-case batch size (governed by Poisson tail). @eq-p99-batch-latency captures this relationship:
$$L_{\text{lat,p99}} \approx T + S(b_{p99})$$ {#eq-p99-batch-latency}
```{python}
#| label: slo-violation-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SLO VIOLATION ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: SLO violation analysis narrative, estimating p99 latency impact
# │ from Poisson-driven batch size variability
# │
# │ Goal: Demonstrate why provisioning on mean latency causes SLO violations.
# │ Show: That p99 latency is about 1.7× the mean due to batch size variance.
# │ How: Model request arrival and batch assembly to compare mean vs. tail response times.
# │
# │ Imports: mlsysim.book (fmt); service_time_value from batching-analysis-calc
# │ Exports: qps_str, T_slo_str, mean_wait_str, mean_batch_str,
# │ mean_service_str, mean_latency_str, p99_service_str,
# │ p99_latency_str, p99_ratio_str, p99_batch_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class SloViolationCalc:
    """p99 latency is about 1.7× the mean due to batch size variance from Poisson arrivals."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
qps_value = 500
T_slo_value = 10.0
p99_batch_value = 11
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# service_time_value is a module-level function from batching-analysis-calc
mean_batch_value = qps_value * (T_slo_value / 1000)
mean_wait_value = T_slo_value / 2
mean_service_value = service_time_value(int(mean_batch_value))
mean_latency_value = mean_wait_value + mean_service_value
p99_service_value = service_time_value(p99_batch_value)
p99_latency_value = T_slo_value + p99_service_value
p99_to_mean_value = p99_latency_value / mean_latency_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
qps_str = f"{qps_value}"
T_slo_str = fmt(T_slo_value, precision=0, commas=False)
mean_wait_str = fmt(mean_wait_value, precision=0, commas=False)
mean_batch_str = fmt(mean_batch_value, precision=0, commas=False)
mean_service_str = fmt(mean_service_value, precision=1, commas=False)
mean_latency_str = fmt(mean_latency_value, precision=0, commas=False)
p99_service_str = fmt(p99_service_value, precision=1, commas=False)
p99_latency_str = fmt(p99_latency_value, precision=1, commas=False)
p99_ratio_str = fmt(p99_to_mean_value, precision=2, commas=False)
p99_batch_str = f"{p99_batch_value}"
```
where $b_{p99}$ is the 99th percentile batch size. For $\lambda$ = `{python} SloViolationCalc.qps_str` QPS and $T$ = `{python} SloViolationCalc.T_slo_str` ms:
- Mean batch size: `{python} SloViolationCalc.mean_batch_str`
- p99 batch size: `{python} SloViolationCalc.p99_batch_str` (from Poisson distribution)
- Mean latency: `{python} SloViolationCalc.mean_wait_str` ms + `{python} SloViolationCalc.mean_service_str` ms = `{python} SloViolationCalc.mean_latency_str` ms
- p99 latency: `{python} SloViolationCalc.T_slo_str` ms + `{python} SloViolationCalc.p99_service_str` ms = `{python} SloViolationCalc.p99_latency_str` ms
The p99 latency is `{python} SloViolationCalc.p99_ratio_str`$\times$ the mean, reflecting both wait time variance and batch size variance. Systems that provision based on mean latency will experience SLO violations.
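A few lines reproduce this tail estimate (assuming SciPy is available): the p99 batch size comes from the Poisson quantile, and the p99 latency pairs it with a full-window wait per @eq-p99-batch-latency.

```{.python}
from scipy.stats import poisson

QPS, WINDOW_MS = 500, 10.0
service = lambda b: 5.0 + 0.6 * b                   # same S(b) as above

mean_batch = QPS * WINDOW_MS / 1000                 # 5.0
b_p99 = int(poisson.ppf(0.99, mean_batch))          # 11

mean_latency = WINDOW_MS / 2 + service(mean_batch)  # half-window wait + S(5)
p99_latency = WINDOW_MS + service(b_p99)            # full-window wait + S(11)
print(f"mean {mean_latency:.1f} ms, p99 {p99_latency:.1f} ms, "
      f"ratio {p99_latency / mean_latency:.2f}x")
```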
::: {.callout-perspective title="Practitioner's Perspective: The Latency-Throughput Trade-off"}
In systems engineering interviews and architecture reviews, the most common pitfall is discussing "inference speed" without specifying batch size.
* **Batch-1 Regime**: Optimized for latency. Relevant for real-time interaction (e.g., typing helpers, robotics). The bottleneck is usually Python overhead or memory bandwidth.
* **Batch-32 Regime**: Optimized for throughput. Relevant for offline processing or high-traffic services. The bottleneck is usually compute (FLOPS).
**The Professional Response**: When asked "how fast is this model?", always clarify: "Are we optimizing for single-stream latency (Batch 1) or maximum throughput (Batch N)?" This distinction demonstrates systems maturity.
:::
#### Adaptive Batching Windows {#sec-model-serving-adaptive-batching-windows-c404}
Fixed batching windows waste latency budget during high traffic when large batches form quickly. @lst-adaptive-batching demonstrates how adaptive strategies adjust the window based on queue depth.
::: {#lst-adaptive-batching lst-cap="**Adaptive Batching Window**: Dynamically adjusts batch timeout based on queue depth and arrival rate, reducing average latency by 27% compared to fixed windows while maintaining throughput."}
```{.python}
def adaptive_batching_window(queue_depth, arrival_rate_qps, slo_ms):
    """Compute the batching window (in ms) from current system state."""
    target_batch_size = 16  # Optimal batch for GPU utilization
    # Fast path: batch already full, close immediately to minimize latency
    if queue_depth >= target_batch_size:
        return 0.0
    # Reserve 30% of the latency budget for batching; the rest covers inference
    max_wait_ms = slo_ms * 0.3
    # Estimate time to accumulate the target batch at the current arrival rate
    if arrival_rate_qps > 0:
        requests_needed = target_batch_size - queue_depth
        estimated_wait_ms = requests_needed / arrival_rate_qps * 1000
        # Wait only as long as needed, bounded by the SLO-constrained maximum
        return min(estimated_wait_ms, max_wait_ms)
    # No measurable traffic: use the full budget to accumulate a batch
    return max_wait_ms
```
:::
This approach reduces average wait time during high traffic while maintaining batch sizes. For traffic varying between 2001000 QPS:
- Fixed window (10 ms): Average latency 15 ms, throughput 650 img/s
- Adaptive window: Average latency 11 ms (27% reduction), throughput 680 img/s (5% improvement)
The interplay between window size and batch limits creates a space of possible configurations, each representing a different balance between throughput and latency.
The batching configuration space forms a Pareto frontier\index{Pareto Frontier!throughput-latency} where improving throughput requires accepting higher latency. @tbl-pareto-batching traces this frontier across five representative configurations:
| **Window (ms)** | **Max Batch** | **Avg Latency** | **p99 Latency** | **Throughput** | **Configuration** |
|:----------------|--------------:|----------------:|----------------:|---------------:|:---------------------|
| 2 | 16 | 8 ms | 18 ms | 890 img/s | Ultra-low latency |
| 5 | 32 | 10 ms | 22 ms | 1,140 img/s | Balanced |
| 10 | 32 | 15 ms | 35 ms | 1,240 img/s | Moderate latency |
| 20 | 64 | 23 ms | 52 ms | 1,310 img/s | Throughput-optimized |
| 50 | 128 | 38 ms | 98 ms | 1,350 img/s | Maximum throughput |
: **Batching Pareto Frontier**: Each configuration represents a different point on the throughput-latency trade-off curve. Moving from 2 ms to 50 ms windows improves throughput by only 52% while increasing p99 latency by 5.4$\times$. Diminishing returns make aggressive batching costly for latency-sensitive applications. {#tbl-pareto-batching}
#### Practical Configuration Guidelines {#sec-model-serving-practical-configuration-guidelines-9791}
The Pareto frontier in @tbl-pareto-batching illustrates why these guidelines matter: moving from a 2 ms to a 50 ms window improves throughput by only 52% while increasing p99 latency by 5.4$\times$. Principled batching configuration avoids this region of diminishing returns by working backward from the latency budget. Allocating 20 to 30 percent of the SLO to batching wait time leaves the remainder for inference and overhead, which bounds the maximum window at $T_{\text{max}} = 0.3 \times L_{\text{lat,SLO}}$. The traffic estimate that feeds this calculation should use the p95 arrival rate rather than the average, because batching windows tuned for average traffic produce oversized batches during spikes—precisely when SLO headroom matters most. GPU memory imposes a hard ceiling on batch size independent of the latency constraint, since activation memory scales linearly with the batch dimension. Finally, monitoring the actual batch size distribution in production reveals whether initial traffic assumptions hold; high variance signals that the window needs adaptive tuning rather than a fixed configuration.
For ResNet-50 with 50 ms SLO and 500 QPS traffic:
```{python}
#| label: practical-config-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PRACTICAL BATCHING CONFIGURATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Practical configuration guidelines example — calculating optimal
# │ batch window and size for ResNet-50 with 50ms SLO at 500 QPS
# │
# │ Goal: Outline the systematic procedure for deriving a production batching config.
# │ Show: How allocating 30% of the SLO budget to batching yields a safe 12ms window.
# │ How: Calculate expected batch size from QPS and cap by memory limits.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: pc_slo_ms_str, pc_qps_str, pc_batch_budget_ms_str,
# │ pc_max_window_ms_str, pc_expected_batch_str,
# │ pc_mem_limit_batch_str, pc_config_window_ms_str,
# │ pc_config_batch_str, pc_predicted_p99_ms_str,
# │ pc_predicted_throughput_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class PracticalConfigCalc:
"""Derives a production batching config: 30% SLO budget yields a 12ms window for 500 QPS."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
pc_slo_ms_value = 50
pc_qps_value = 500
pc_budget_pct_value = 0.3
pc_mem_limit_batch_value = 32
pc_config_window_ms_value = 12 # tuned value
pc_config_batch_value = 32
pc_predicted_p99_ms_value = 43
pc_predicted_throughput_value = 1180
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
pc_batch_budget_ms_value = pc_slo_ms_value * pc_budget_pct_value
pc_max_window_ms_value = pc_batch_budget_ms_value
pc_expected_batch_value = pc_qps_value * (pc_max_window_ms_value / 1000)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
pc_slo_ms_str = f"{pc_slo_ms_value}"
pc_qps_str = f"{pc_qps_value}"
pc_batch_budget_ms_str = f"{pc_batch_budget_ms_value:.0f}"
pc_max_window_ms_str = f"{pc_max_window_ms_value:.0f}"
pc_expected_batch_str = f"{pc_expected_batch_value}"
pc_mem_limit_batch_str = f"{pc_mem_limit_batch_value}"
pc_config_window_ms_str = f"{pc_config_window_ms_value}"
pc_config_batch_str = f"{pc_config_batch_value}"
pc_predicted_p99_ms_str = f"{pc_predicted_p99_ms_value}"
pc_predicted_throughput_str = f"{pc_predicted_throughput_value:,}"
```
- Latency budget for batching: `{python} PracticalConfigCalc.pc_batch_budget_ms_str`ms
- Maximum window: `{python} PracticalConfigCalc.pc_max_window_ms_str`ms
- Expected batch size: `{python} PracticalConfigCalc.pc_expected_batch_str`
- Maximum batch size: `{python} PracticalConfigCalc.pc_mem_limit_batch_str` (memory limit)
- Configuration: T = `{python} PracticalConfigCalc.pc_config_window_ms_str`ms, b_max = `{python} PracticalConfigCalc.pc_config_batch_str`
- Predicted p99 latency: `{python} PracticalConfigCalc.pc_predicted_p99_ms_str`ms (within SLO)
- Predicted throughput: `{python} PracticalConfigCalc.pc_predicted_throughput_str` img/s
### Continuous Batching {#sec-model-serving-continuous-batching-8bb6}
Autoregressive models\index{Autoregressive Models!token generation} like language models generate outputs token by token—each new token depends on all previously generated tokens, so generation is inherently sequential. The dynamic batching examined in @sec-model-serving-throughput-optimization-18d1 assumes fixed-length outputs. LLMs violate this assumption: if one sequence in a batch of 8 finishes after 10 tokens while others need 100 tokens, 90 percent of the iterations in that sequence's slot are wasted\index{Sequence Length Variability!batch waste} [@yu2022orca].
Continuous batching[^fn-continuous-batching-llm]\index{Continuous Batching!LLM serving} (also called iteration-level batching) addresses this waste by allowing new requests to join a batch between generation steps and completed sequences to exit [@kwon2023vllm]. The system manages batch composition dynamically at each decoding iteration rather than forming static batches that persist for the entire generation process.
The mechanism works as follows: when a sequence generates its end-of-sequence token, its slot becomes immediately available. A waiting request can fill that slot for the next iteration rather than waiting for the entire batch to complete. Similarly, the system can add new requests to available slots without interrupting ongoing generation.
[^fn-continuous-batching-llm]: **Continuous Batching**: Also called "iteration-level batching" (Orca, 2022) and "in-flight batching" (NVIDIA TensorRT-LLM). The key insight is scheduling granularity: traditional batching commits to a fixed batch for an entire generation sequence (potentially hundreds of iterations), while continuous batching reschedules at every token-generation step---analogous to preemptive OS process scheduling versus run-to-completion. This finer granularity eliminates the waste from variable-length sequences, where a batch slot occupied by a completed sequence sits idle until all other sequences finish. \index{Continuous Batching!scheduling granularity}
This dynamic approach maintains high GPU utilization even when sequence lengths vary by 100$\times$ or more.
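A toy scheduler sketch captures the iteration-level idea; the `Sequence` class and the target lengths below are stand-ins for real generation requests, not any particular framework's API.

```{.python}
from collections import deque

class Sequence:
    """Stand-in for a generation request with a known output length."""
    def __init__(self, target_len):
        self.target_len, self.generated = target_len, 0
    def step(self):
        self.generated += 1                       # one decode iteration
    @property
    def done(self):
        return self.generated >= self.target_len

def continuous_batching_loop(waiting, max_slots=8):
    """Recompose the batch at every decode step instead of once per batch."""
    active, steps = [], 0
    while active or waiting:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        for seq in active:
            seq.step()                            # one token per active sequence
        # Retire finished sequences immediately; their slots free up for the
        # next iteration rather than idling until the whole batch completes.
        active = [seq for seq in active if not seq.done]
        steps += 1
    return steps

reqs = deque(Sequence(n) for n in [10, 100, 25, 60, 10, 80, 40, 15,
                                   90, 30, 10, 70, 55, 20, 100, 45])
print(continuous_batching_loop(reqs))             # decode iterations required
```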
Systems implementing continuous batching, such as vLLM[^fn-vllm-paging] and TensorRT-LLM, achieve 2--4$\times$ higher throughput than traditional static batching [@agrawal2024sarathi]. The improvement comes from two sources: eliminating wasted compute on completed sequences and reducing average wait time for new requests. For production language model serving where response lengths vary from single tokens to thousands, continuous batching has become essential for cost-effective deployment.
[^fn-vllm-paging]: **vLLM (Virtual LLM)**: An open-source serving system that enables continuous batching via its PagedAttention algorithm. Inspired by OS virtual memory, this technique eliminates the severe memory fragmentation that constrains static batching. By avoiding the 40-50% memory waste common in prior systems, vLLM directly achieves the 2--4$\times$ throughput improvement. \index{vLLM!virtual memory analogy}
Memory management adds complexity to continuous batching. As sequences enter and exit the batch, the key-value cache that stores attention context must be dynamically allocated and freed. Consider what happens when sequences of varying lengths share GPU memory: a 100-token sequence completes and releases its cache, but a new 150-token sequence cannot use that space because it needs a larger contiguous block. Over time, small unusable gaps accumulate between allocated regions, eventually preventing new sequences from starting even when total free memory appears sufficient. This *memory fragmentation*\index{Memory Fragmentation!KV cache} can waste 40 to 50 percent of available memory in naive implementations, severely limiting the concurrent batch size that determines throughput.
#### PagedAttention {#sec-model-serving-pagedattention-b8d4}
PagedAttention\index{Continuous Batching!iteration-level}\index{PagedAttention!memory fragmentation solution},[^fn-pagedattention-serving] introduced in vLLM, solves this fragmentation problem by applying operating system virtual memory concepts to GPU memory [@kwon2023vllm]. Instead of allocating one contiguous block per sequence, PagedAttention divides the KV cache into fixed-size *pages* (typically 16 tokens each). A sequence's cache consists of pointers to non-contiguous pages scattered across GPU memory. When a sequence completes, its pages return to a free list and can be reused by any new sequence, regardless of length. This approach achieves near-zero fragmentation: vLLM reports memory utilization above 95% compared to 5060% for contiguous allocation schemes. The overhead is modest (one pointer lookup per page during attention computation), making PagedAttention the standard for production LLM serving.
[^fn-pagedattention-serving]: **PagedAttention**: The name directly references OS virtual memory paging, first implemented on the Atlas computer at Manchester (1962) to solve the same problem---programs needed more memory than physically available, and contiguous allocation wasted space. Introduced at SOSP 2023, PagedAttention applies this six-decade-old abstraction to GPU memory: before it, LLM serving systems wasted 60--80% of KV cache memory due to fragmentation and over-reservation. PagedAttention reduces waste to under 4%, enabling 2--4$\times$ higher throughput on the same hardware. \index{PagedAttention!memory efficiency}
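A toy allocator sketch conveys the core bookkeeping, with the class and method names invented for illustration: pages come from a shared free list, a per-sequence block table maps logical token positions to physical pages, and a finished sequence's pages are immediately reusable by any other sequence.

```{.python}
class PagedKVCache:
    """Toy fixed-size-page KV cache allocator in the spirit of PagedAttention."""
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))    # shared free list of page ids
        self.block_tables = {}                      # sequence id -> list of page ids
        self.lengths = {}                           # sequence id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV-cache space for one newly generated token."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.page_size == 0:            # last page full (or first token)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_pages.pop())     # any free page will do
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Sequence finished: its pages return to the free list immediately."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because pages need not be contiguous, a newly admitted 150-token sequence can reuse the scattered pages released by a shorter one, which is exactly the case that defeats contiguous allocation above.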
The batching and memory techniques covered here establish the foundation for LLM serving, but several advanced topics warrant additional study:
::: {.callout-perspective title="LLM Serving: Beyond the Fundamentals"}
Language model serving introduces challenges beyond the batching and memory principles established here. The key-value cache that stores attention context scales with sequence length and batch size, often exceeding the model weights themselves in memory consumption. Techniques like speculative decoding\index{Speculative Decoding!latency reduction}\index{Speculative Decoding!draft model verification} use small draft models to propose multiple tokens that the target model verifies in parallel, achieving 2--3$\times$ latency reduction for interactive applications. Weight-only quantization (INT4 weights with FP16 activations) proves more effective than activation quantization for memory-bandwidth-bound LLM inference.
These LLM-specific optimizations build directly on the foundations this chapter establishes: queuing theory governs request scheduling, batching tradeoffs determine throughput-latency curves, and precision selection follows the same accuracy-efficiency principles. The serving fundamentals apply universally; LLM serving adds domain-specific techniques atop this foundation. Advanced treatments provide detailed coverage of KV cache optimization, including advanced techniques for multi-tenant serving and distributed inference.
:::
Continuous batching is the dominant technique for LLM serving, yet not all deployment scenarios benefit from batching. The sophisticated techniques examined so far (from dynamic batching windows to PagedAttention) optimize for high-throughput server workloads. These techniques introduce complexity and latency overhead that may not be justified for all deployment contexts. The practical question is *when* batching hurts rather than helps.
#### When Not to Batch {#sec-model-serving-batch-12a4}
Some\index{Batching!when to avoid} scenarios require single-request processing. Ultra-low latency requirements\index{Ultra-Low Latency!no batching}, where p99 latency must stay under 10 ms, make any batching delay unacceptable. Highly variable request sizes create padding overhead that wastes compute, since the smallest input in a batch must be padded to match the largest. Memory constraints also become binding when models already consume most GPU memory, since batch activations scale linearly with batch size and can trigger out-of-memory errors.
### Session Affinity Constraints {#sec-model-serving-session-affinity-constraints-8b1f}
When requests from the same user or session should route to the same replica, batching becomes constrained. Session affinity, also called sticky sessions, matters for three main reasons.
The most impactful case is KV-cache reuse\index{KV Cache!session reuse}\index{KV Cache!multi-turn conversations} in conversational AI, where the key-value cache from previous turns dramatically speeds up multi-turn conversations. Routing a follow-up request to a different replica forfeits this cached context, increasing latency by 2 to 5 times for long conversations.
A second driver is user-specific models\index{Personalized Models!user adapters}: some systems serve personalized models or adapters per user, and routing requests to the replica that has already loaded that user's adapter avoids repeated loading overhead. Similarly, stateful preprocessing that maintains tokenizer caches or session-specific normalization requires rebuilding state when requests route to a different replica.
The tension with batching is clear since strict affinity\index{Session Affinity!sticky sessions} constrains which requests can be batched together, potentially reducing batch sizes and GPU utilization. Production systems often implement soft affinity\index{Soft Affinity!load balancing} where requests prefer their assigned replica but can overflow to others when that replica is overloaded. This preserves most affinity benefits while maintaining load balance.
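A minimal routing sketch, with hypothetical replica names and a single load threshold, shows the soft-affinity compromise: hash the session to a home replica, but overflow to the least-loaded replica when the home is saturated.

```{.python}
import hashlib

def route(session_id, replicas, load, overflow_threshold=0.9):
    """Soft session affinity: prefer the hashed home replica, overflow when hot."""
    home = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % len(replicas)
    if load[home] < overflow_threshold:
        return replicas[home]                # keeps KV cache / adapters warm
    # Home replica saturated: trade cache reuse for load balance.
    return replicas[min(range(len(replicas)), key=load.__getitem__)]

replicas = ["replica-0", "replica-1", "replica-2"]
load = [0.95, 0.40, 0.70]                    # fraction of capacity in use
print(route("user-1234", replicas, load))    # home replica, or least-loaded if hot
```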
### Traffic Patterns and Batching Strategy {#sec-model-serving-traffic-patterns-batching-strategy-2e6b}
The optimal batching strategy depends critically on how requests arrive. Different deployment contexts exhibit distinct arrival patterns, each requiring different batching approaches. The MLPerf inference benchmark codifies these patterns into four scenarios that directly map to real-world deployments, as @sec-benchmarking explains in detail.
#### Server Traffic (Poisson Arrivals) {#sec-model-serving-server-traffic-poisson-arrivals-5d26}
Cloud APIs\index{Server Traffic!Poisson process} and web services typically receive requests following a Poisson process,[^fn-poisson-serving] where arrivals are independent and uniformly distributed over time. @eq-poisson-batch expresses the expected batch size for Poisson arrivals with rate $\lambda$ and batching window $T$:
[^fn-poisson-serving]: **Poisson Process**: Named after French mathematician Siméon Denis Poisson (1781--1840), this stochastic model describes events occurring continuously and independently at a constant average rate. The key property for serving: variance equals the mean, so batch sizes fluctuate significantly at moderate traffic---with $\lambda=200$ req/s and a 10 ms window, expected batch size is 2 but roughly 13.5% of windows ($e^{-2}$) will be empty (wasted GPU cycles). This variance is why batching windows must be tuned probabilistically rather than set from average traffic alone. \index{Poisson Process!serving arrivals}
$$E[\text{batch size}] = \lambda \cdot T$$ {#eq-poisson-batch}
The variance equals the mean (a property of Poisson distributions), so batch sizes fluctuate significantly at moderate traffic. With $\lambda = 200$ requests/second and $T = 10$ ms, expected batch size is 2, but roughly 13.5% of windows ($e^{-2}$) will have zero requests (wasted compute cycles) while others may have 4 or more.
The optimal batching window balances waiting cost against throughput benefit. @eq-optimal-window defines this optimum:
$$T_{\text{optimal}} = \min\left(L_{\text{lat,SLO}} - S, \sqrt{\frac{S}{\lambda}}\right)$$ {#eq-optimal-window}
where $L_{\text{lat,SLO}}$ is the latency SLO and $S$ is the service time. A counterintuitive result emerges from this equation: as traffic increases, the optimal window decreases while achieved batch sizes still grow. @tbl-traffic-adaptive demonstrates this phenomenon across four traffic levels.
| **Arrival Rate** | **Optimal Window** | **Avg Batch Size** | **p99 Latency** |
|:-----------------|-------------------:|-------------------:|----------------:|
| **100 QPS** | 20 ms | 2.0 | 45 ms |
| **500 QPS** | 8 ms | 4.0 | 42 ms |
| **1,000 QPS** | 5 ms | 5.0 | 38 ms |
| **5,000 QPS** | 2 ms | 10.0 | 35 ms |
: **Traffic-Adaptive Batching**: Higher traffic enables shorter windows while still achieving larger batches. The optimal window decreases even as batch sizes grow because more requests arrive per unit time. {#tbl-traffic-adaptive}
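The window formula in @eq-optimal-window is straightforward to script. The sketch below assumes an illustrative service time and latency SLO (not the exact parameters behind @tbl-traffic-adaptive) and reproduces the qualitative trend: the optimal window shrinks as traffic grows while the expected batch size $\lambda T$ still increases.

```python
# Traffic-adaptive window sketch. Service time and SLO are assumed values
# chosen for illustration; they reproduce the trend, not the table's numbers.
import math

SERVICE_TIME_S = 0.025   # S: per-batch service time (assumption)
LATENCY_SLO_S = 0.100    # L_lat,SLO: end-to-end latency target (assumption)

def optimal_window(arrival_rate_qps: float) -> float:
    """T_optimal = min(SLO - S, sqrt(S / lambda)), per @eq-optimal-window."""
    return min(LATENCY_SLO_S - SERVICE_TIME_S,
               math.sqrt(SERVICE_TIME_S / arrival_rate_qps))

for qps in (100, 500, 1_000, 5_000):
    window = optimal_window(qps)
    expected_batch = qps * window          # E[batch size] = lambda * T (@eq-poisson-batch)
    p_empty = math.exp(-expected_batch)    # Poisson probability of an empty window
    print(f"{qps:>5} QPS: window {window * 1e3:5.1f} ms, "
          f"E[batch] {expected_batch:4.1f}, P(empty window) {p_empty:.2f}")
```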
#### Streaming Traffic (Correlated Arrivals) {#sec-model-serving-streaming-traffic-correlated-arrivals-32b6}
Autonomous vehicles\index{Streaming Traffic!sensor synchronization}, video analytics, and robotics systems receive inputs from multiple synchronized sensors. The *multi-camera autonomous vehicle serving* example below illustrates this scenario.
::: {.callout-notebook title="Multi-Camera Autonomous Vehicle Serving"}
Consider a vehicle with 6 cameras capturing at 30 FPS, requiring spatial fusion:
**Timeline for processing frame set N:**
| **Time** | **Event** |
|:----------|:----------------------------------|
| T = 0 ms | Cameras begin capturing frame N |
| T = 8 ms | Camera 1 frame arrives |
| T = 10 ms | Cameras 2-5 frames arrive |
| T = 15 ms | Camera 6 arrives (jitter) |
| T = 15 ms | Batch inference begins (6 images) |
| T = 25 ms | Inference complete |
| T = 32 ms | Result ready for planning module |
**Key constraints:**
- Hard deadline: 33 ms per frame set (real-time requirement)
- Batch size: Fixed at 6 (one per camera)
- Synchronization budget: 12 ms of 33 ms total (36% for jitter tolerance)
- Timeout policy: If camera frame not received by T+20 ms, use previous frame
Unlike Poisson traffic where dynamic batching optimizes throughput, streaming traffic requires synchronization policies that handle sensor jitter while meeting hard deadlines.
:::
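The timeout policy in the timeline above can be expressed as a short synchronization loop. The sketch below is illustrative: the `frame_queue` interface (a queue of `(camera_id, frame)` tuples) is an assumption, and error handling is omitted.

```python
# Frame-set synchronization sketch: gather one frame per camera, reusing the
# previous frame for any camera that misses the jitter deadline.
import queue
import time

NUM_CAMERAS = 6
SYNC_DEADLINE_S = 0.020   # per the timeout policy above

def collect_frame_set(frame_queue: "queue.Queue", previous_frames: dict) -> dict:
    frames = dict(previous_frames)        # start from frame set N-1
    fresh = set()
    deadline = time.monotonic() + SYNC_DEADLINE_S
    while len(fresh) < NUM_CAMERAS:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                          # late cameras keep their stale frame
        try:
            cam_id, frame = frame_queue.get(timeout=remaining)
        except queue.Empty:
            break
        frames[cam_id] = frame
        fresh.add(cam_id)
    return frames                          # fixed batch of 6, ready for fused inference
```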
#### Single-User Traffic (Sequential Arrivals) {#sec-model-serving-singleuser-traffic-sequential-arrivals-78da}
Mobile\index{Single-User Traffic!mobile serving}\index{SingleStream!MLPerf scenario} and embedded applications serve one user at a time, with requests arriving only after the previous result is consumed. The *ResNet-50 mobile serving* example below makes these constraints concrete.
```{python}
#| label: mobile-serving-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILE SERVING LATENCY AND ENERGY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Mobile Serving" — single-user traffic pattern
# │
# │ Goal: Contrast latency and energy costs for mobile inference.
# │ Show: That JPEG decode dominates the energy budget, exceeding NPU inference.
# │ How: Model latency and Joules per request for a complete vision pipeline.
# │
# │ Imports: (none)
# │ Exports: m_cam_ms_str, m_jpeg_ms_str, m_resize_ms_str, m_npu_ms_str,
# │ m_ui_ms_str, m_total_ms_str, m_cam_mj_str, m_jpeg_mj_str,
# │ m_resize_mj_str, m_npu_mj_str, m_ui_mj_str, m_total_mj_str
# └─────────────────────────────────────────────────────────────────────────────
class MobileServingCalc:
"""Mobile vision pipeline: JPEG decode dominates energy; NPU handles inference at 82% utilization."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
m_cam_ms_value = 8
m_jpeg_ms_value = 15
m_resize_ms_value = 5
m_npu_ms_value = 12
m_ui_ms_value = 5
m_cam_mj_value = 0.08
m_jpeg_mj_value = 1.5
m_resize_mj_value = 0.4
m_npu_mj_value = 0.8
m_ui_mj_value = 0.2
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
m_total_ms_value = (
m_cam_ms_value + m_jpeg_ms_value + m_resize_ms_value + m_npu_ms_value + m_ui_ms_value
)
m_total_mj_value = (
m_cam_mj_value + m_jpeg_mj_value + m_resize_mj_value + m_npu_mj_value + m_ui_mj_value
)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
m_cam_ms_str = f"{m_cam_ms_value}ms"
m_jpeg_ms_str = f"{m_jpeg_ms_value}ms"
m_resize_ms_str = f"{m_resize_ms_value}ms"
m_npu_ms_str = f"{m_npu_ms_value}ms"
m_ui_ms_str = f"{m_ui_ms_value}ms"
m_total_ms_str = f"{m_total_ms_value}ms"
m_cam_mj_str = f"{m_cam_mj_value:.2f}mJ"
m_jpeg_mj_str = f"{m_jpeg_mj_value:.1f}mJ"
m_resize_mj_str = f"{m_resize_mj_value:.1f}mJ"
m_npu_mj_str = f"{m_npu_mj_value:.1f}mJ"
m_ui_mj_str = f"{m_ui_mj_value:.1f}mJ"
m_total_mj_str = f"{m_total_mj_value:.1f}mJ"
```
::: {.callout-notebook title="ResNet-50: Mobile Serving"}
| **Phase** | **Duration** | **Energy** | **Notes** |
|:-----------------------|:------------------------------------------------|:------------------------------------------------|:------------------|
| **Camera buffer read** | `{python} MobileServingCalc.m_cam_ms_str` | `{python} MobileServingCalc.m_cam_mj_str` | System API |
| **JPEG decode (CPU)** | `{python} MobileServingCalc.m_jpeg_ms_str` | `{python} MobileServingCalc.m_jpeg_mj_str` | Single-threaded |
| **Resize + Normalize** | `{python} MobileServingCalc.m_resize_ms_str` | `{python} MobileServingCalc.m_resize_mj_str` | CPU preprocessing |
| **NPU inference** | `{python} MobileServingCalc.m_npu_ms_str` | `{python} MobileServingCalc.m_npu_mj_str` | 82% utilization |
| **Post-process + UI** | `{python} MobileServingCalc.m_ui_ms_str` | `{python} MobileServingCalc.m_ui_mj_str` | Result rendering |
| **Total** | **`{python} MobileServingCalc.m_total_ms_str`** | **`{python} MobileServingCalc.m_total_mj_str`** | 22 FPS sustained |
**Key metrics for ML node serving:**
- **Energy per inference**: 3.0mJ enables ~12 million inferences per 10Wh battery (typical smartphone)
- **Thermal budget**: At 3.0mJ/45 ms = 67mW sustained, indefinite operation without throttling
- **NPU vs CPU tradeoff**: CPU fallback uses 4.2mJ (1.4$\times$ energy) at 85 ms (1.9$\times$ latency)
- **Memory footprint**: 150 MB peak (model + activations), competing with app memory
**Critical insight**: Even at batch size 1, the mobile NPU achieves 82% utilization because its compute capacity matches single-image workloads. This differs from datacenter GPUs, which achieve only 15% utilization at batch size 1 because their massive parallelism requires larger batches to saturate.
:::
#### Mobile Serving Constraints {#sec-model-serving-mobile-serving-constraints-eb68}
Unlike cloud serving where cost dominates, mobile serving faces three related constraints that shape optimization strategy:
1. **Energy Budget**\index{Energy Budget!mobile inference}: Each inference depletes battery. A photo app running continuous inference at 22 FPS drains 240mW, acceptable for active use but problematic for background processing. The optimization target shifts from throughput to energy-per-inference.
2. **Thermal Throttling**\index{Thermal Throttling!mobile serving}: Sustained high-power operation triggers thermal management. When the SoC reaches thermal limits (typically 45°C junction), the OS reduces NPU frequency by 30--50%, degrading both latency and throughput. Bursty workloads that allow cooling between bursts outperform sustained maximum throughput.
3. **Memory Constraints**\index{Memory Constraints!mobile RAM}: Mobile devices share limited RAM between applications. A model consuming 500 MB may be evicted during background operation, requiring reload (cold start) that adds 200--500 ms latency. Even a 150 MB footprint becomes problematic when the model must coexist with other app components. Memory-efficient quantization directly improves user experience through faster model restoration, and memory-mapped model loading (@sec-model-serving-loading-strategies-eb38) helps further by loading pages on demand rather than requiring the full model in memory.
These constraints make mobile serving optimization qualitatively different from cloud optimization. The goal is not maximum throughput but **sustainable performance**, maintaining acceptable latency without thermal throttling or excessive battery drain.
@tbl-traffic-patterns-summary maps the four MLPerf scenarios to their deployment contexts and optimal batching strategies, providing a decision framework for serving system design.
| **Scenario** | **Context** | **Strategy** | **Focus** |
|:-----------------------------------------------------|:------------------------------------|:------------------------------|:-----------------------------------------|
| **Server**\index{Server Scenario!MLPerf} | Cloud APIs, web services | Dynamic batching with timeout | Window tuning, utilization-latency curve |
| **MultiStream**\index{MultiStream!MLPerf scenario} | Autonomous driving, video analytics | Synchronized sensor fusion | Jitter handling, deadline guarantees |
| **SingleStream** | Mobile apps, embedded devices | No batching (batch=1) | Preprocessing, power efficiency |
| **Offline**\index{Offline Inference!MLPerf scenario} | Batch processing, data pipelines | Maximum batch size | Throughput, hardware utilization |
: **Traffic Patterns and Batching Strategies**: The four MLPerf inference scenarios map to distinct deployment contexts. Server traffic (cloud APIs) uses dynamic batching with timeout; MultiStream (autonomous driving) uses synchronized sensor fusion; SingleStream (mobile) processes requests individually; Offline (batch processing) maximizes batch size for throughput. {#tbl-traffic-patterns-summary}
::: {.callout-checkpoint title="Batching and Traffic Patterns"}
Batching is the primary lever for serving economics, but the optimal strategy depends on context.
- [ ] **Throughput-latency tradeoff**: Can you explain why batch size 32 achieves 6$\times$ higher throughput than batch size 1, yet a production system with a 20 ms SLO might still choose batch size 8?
- [ ] **Dynamic vs. static batching**: Can you describe why static batching (waiting for a full batch) fails under variable traffic, and how dynamic batching with a time window solves this?
- [ ] **Traffic pattern matching**: Given a deployment scenario (e.g., cloud API, autonomous vehicle, mobile app), can you select the appropriate MLPerf scenario and explain why that batching strategy fits?
- [ ] **Adaptive windows**: Can you explain why the optimal batching window *decreases* as traffic *increases*, even though batch sizes grow?
:::
The batching strategies examined so far share a critical assumption: each request produces a single, fixed-size output---one classification label, one bounding box, one embedding vector. This assumption governs the queuing math, the Pareto frontier analysis, and the traffic-adaptive window tuning. The fastest-growing category of serving workloads, however, violates this assumption entirely. Large language models generate outputs token by token, with each token depending on every previous one. A single request may produce hundreds or thousands of tokens over seconds of elapsed time, yet must feel responsive from the first token onward. This fundamental shift from fixed-output to variable-length, streaming-output serving demands new metrics, new memory management strategies, and new batching techniques that build on---but substantially extend---the foundations established above.
## LLM Serving {#sec-model-serving-llm-serving-b8bf}
Large language models\index{LLM Serving!token generation} introduce three properties absent from traditional serving: *autoregressive generation*[^fn-autoregressive-serving] (each token depends on all previous tokens, making output inherently sequential), *variable-length output* (response length is unknown at request time, invalidating fixed-batch assumptions), and *stateful memory* (the key-value cache grows with each generated token, creating dynamic memory pressure that traditional models never face). Together, these properties create a qualitatively different serving challenge. The p50, p95, and p99 metrics that govern classification serving still matter, but they apply to different *phases* of the request---the initial prompt processing and the subsequent token generation. The foundational principles of queuing theory, batching tradeoffs, and latency budgets apply universally; LLM serving adds domain-specific techniques atop this foundation.
[^fn-autoregressive-serving]: **Autoregressive**: From Greek *auto-* (self) and Latin *regressus* (a going back)---the output "regresses" on itself. George Udny Yule introduced autoregressive models in 1927 for analyzing sunspot cycles. In language modeling, each output token conditions on all previously generated tokens, creating a serial dependency that prevents the parallelism exploited during training. This serial bottleneck explains why LLM serving is memory-bandwidth-bound rather than compute-bound: the model weights must be read from memory once per token, regardless of available compute capacity. \index{Autoregressive!serial bottleneck}
### Performance Metrics: TTFT and TPOT {#sec-model-serving-performance-metrics-ttft-tpot-b009}
Generative models produce a stream of tokens rather than a single output tensor. This streaming nature requires dedicated *LLM performance metrics* that reflect the internal state transition from "prefill" (processing input) to "decode" (generating output). The two key measures are *Time to First Token (TTFT)* and *Time Per Output Token (TPOT)*, which capture responsiveness and fluidity respectively.
::: {.callout-definition title="LLM Performance Metrics"}
***LLM Performance Metrics***\index{LLM Performance Metrics!definition} are the two-dimensional measurements of latency for streaming autoregressive generation.
1. **Significance (Quantitative):** They decompose user-perceived latency into **Time to First Token (TTFT)** (governed by the compute-bound **Prefill Phase**) and **Time Per Output Token (TPOT)** (governed by the memory-bandwidth-bound **Decode Phase**).
2. **Distinction (Durable):** Unlike **Fixed-Output Metrics** (e.g., end-to-end latency), LLM metrics measure the **Fluidity of Generation**, acknowledging that the user experience depends on the *rhythm* of token arrival.
3. **Common Pitfall:** A frequent misconception is that a low TTFT alone makes a model "fast." In reality, a model can have a fast TTFT but a sluggish TPOT (if the **Memory Wall ($BW$)** is the bottleneck), leading to a frustrating user experience where the answer starts quickly but "stutters" thereafter.
:::
These two metrics capture distinct user experience aspects.
::: {.callout-lighthouse title="LLM Serving Latency Targets"}
A production-grade LLM service typically targets the following SLOs:
- **TTFT**: < 500 ms (for a 1000-token prompt)
- **TPOT**: < 50 ms (equivalent to ~20 tokens/second, faster than human reading speed)
- **Throughput**: > 1000 tokens/second aggregate across all users
:::
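Measuring these SLOs requires timestamping the token stream on the client side. A minimal sketch, assuming a hypothetical `stream_tokens(prompt)` generator that yields tokens as they arrive:

```python
# TTFT/TPOT measurement sketch. `stream_tokens` is a hypothetical streaming
# client; any generator that yields tokens as they arrive works here.
import time

def measure_stream(stream_tokens, prompt: str):
    start = time.monotonic()
    arrival_times = [time.monotonic() for _ in stream_tokens(prompt)]

    ttft = arrival_times[0] - start                      # responsiveness
    if len(arrival_times) > 1:                           # fluidity
        tpot = (arrival_times[-1] - arrival_times[0]) / (len(arrival_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# The SLOs above are met when ttft < 0.5 s and tpot < 0.05 s.
```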
### Decoding Strategies {#sec-model-serving-decoding-strategies-afe8}
Generative models require decoding strategies that trade off quality, diversity, and latency. The choice of decoding strategy dramatically affects both output quality and computational cost.
The simplest approach, greedy decoding\index{Greedy Decoding!LLM generation}, selects the highest-probability token at each step. It is fast but often produces repetitive, low-quality outputs because it cannot recover from early mistakes. Beam search\index{Beam Search!decoding strategy}\index{Beam Search!candidate sequences} improves quality by maintaining multiple candidate sequences and selecting the highest-scoring complete sequence, though it multiplies computation by the beam width. Sampling\index{Sampling!temperature, top-k, top-p} with temperature, top-k, and top-p parameters introduces randomness for diversity [@holtzman2020curious]. Temperature scales logits before softmax. Top-k limits sampling to the k highest-probability tokens. Top-p, also called nucleus sampling\index{Nucleus Sampling!top-p}, limits sampling to tokens comprising probability mass p.
The choice presents latency tradeoffs [@meister2020beam]. Beam search with width 5 takes roughly 5$\times$ the compute of greedy decoding. Sampling adds minimal overhead but requires careful parameter tuning to balance quality and coherence.
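These strategies compose into a single sampling step over the model's output logits. The sketch below implements temperature scaling followed by top-k and top-p truncation; the default parameter values are illustrative, not recommendations.

```python
# One decoding step: temperature scaling, then top-k and top-p (nucleus)
# truncation, then sampling from the renormalized survivors.
import numpy as np

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature scaling
    order = np.argsort(logits)[::-1]                              # most probable first
    probs = np.exp(logits[order] - logits[order][0])
    probs /= probs.sum()

    keep = min(top_k, len(probs))                                 # top-k cutoff
    cumulative = np.cumsum(probs[:keep])
    keep = min(keep, int(np.searchsorted(cumulative, top_p)) + 1) # smallest set with mass >= p

    probs = probs[:keep] / probs[:keep].sum()                     # renormalize survivors
    return int(order[rng.choice(keep, p=probs)])
```

Greedy decoding is the degenerate case `top_k=1`; beam search replaces the single sample with a ranked set of candidate sequences.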
Production LLM systems\index{Streaming Responses!LLM serving}\index{Chunked HTTP!streaming tokens} return tokens as they are produced rather than waiting for complete generation. This transforms the user experience: a 2-second total generation feels responsive when tokens stream continuously, but feels broken when users stare at a blank screen for 2 seconds. Streaming requires infrastructure support for chunked HTTP responses and client-side incremental rendering. The latency profile shifts accordingly: TTFT determines when output starts appearing (responsiveness), while TPOT determines the perceived generation speed (fluidity).
### Memory and KV Cache {#sec-model-serving-memory-kv-cache-d1ea}
Generative inference requires managing the **KV Cache**[^fn-kv-cache-serving]\index{KV Cache!LLM memory}\index{KV Cache!sequence length scaling}, a stateful memory structure that grows with sequence length. Unlike traditional models where memory usage is constant per batch, LLM memory usage is dynamic. Each generated token adds to the context window, consuming additional GPU memory through state accumulation, and variable-length sequences can lead to memory fragmentation if not managed explicitly.
[^fn-kv-cache-serving]: **KV Cache (Key-Value Cache)**: To avoid redundant work, the system caches the Key and Value vectors from previous tokens, which remain valid throughout generation. This design choice is the direct cause of the dynamic memory growth described; the cache's size grows linearly with every generated token, making memory management, not computation, the primary constraint. For a 70B model, this state can consume over 1.3 MB per token, meaning a batch of 32 requests at an 8,000-token context requires ~330 GB of memory—far exceeding the model weights themselves. \index{KV Cache!memory scaling}
The continuous batching and PagedAttention techniques covered in @sec-model-serving-continuous-batching-8bb6 address these challenges.
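The footnote's memory figure follows from a back-of-envelope formula: each token stores one key and one value vector per layer and KV head. The configuration below is an illustrative 70B-class setup chosen to land near the ~1.3 MB/token figure, not the layout of any specific released model.

```python
# KV cache sizing sketch (layer/head counts are illustrative assumptions).
def kv_bytes_per_token(layers=80, kv_heads=32, head_dim=128, bytes_per_elem=2):
    # One K and one V vector per layer and KV head, at FP16 (2 bytes/element).
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token()                 # ~1.3 MB per token
batch, context = 32, 8_000
total_gb = per_token * batch * context / 1e9     # ~335 GB for the whole batch
print(f"{per_token / 1e6:.2f} MB/token -> {total_gb:.0f} GB at batch {batch}, {context} tokens")
```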
#### Prefix Caching and Memory Offloading {#sec-model-serving-prefix-caching-offloading}
The memory pressure from KV caches can be further mitigated through architectural strategies that exploit request patterns. **Prefix Caching**\index{Prefix Caching!KV cache reuse} stores the KV cache of common instruction prefixes (such as a 2,000-token system prompt or a shared RAG context), allowing many independent requests to reuse the same pre-computed hidden states. This eliminates redundant prefill compute ($R_{\text{peak}}$) and reduces memory traffic ($BW$). For multi-turn conversations, this "caching of the past" allows the system to process only the *new* tokens in each turn.
When the aggregate KV cache exceeds GPU VRAM, systems can employ **KV Cache Offloading**\index{KV Cache!offloading}. This strategy spills inactive or low-priority context windows to host CPU RAM or NVMe SSD, freeing VRAM for active generation. While retrieving offloaded context introduces a latency "tax" due to PCIe bandwidth limits (@sec-model-serving-model-swapping-host-memory-c54f), it prevents Out-of-Memory (OOM) failures and enables handling much larger context windows than the hardware could otherwise support.
Advanced techniques including speculative decoding[^fn-speculative-decoding] and distributed parallelism are covered in specialized treatments of large-scale systems.
[^fn-speculative-decoding]: **Speculative Decoding**: A small "draft" model generates $k$ candidate tokens autoregressively; the large target model then verifies all $k$ in a single parallel forward pass. When the draft model's proposals are accepted at rate $\alpha$, effective throughput scales as $k \cdot \alpha$ — but verification is parallel, so wall-clock cost is approximately one large-model step regardless of $k$. At $\alpha = 0.8$ with $k = 4$, speculative decoding delivers roughly 3.2$\times$ throughput improvement over sequential decoding without modifying the target model. This breaks the serial autoregressive bottleneck at the runtime layer, not the architecture layer. \index{Speculative Decoding!throughput}
The computational intensity of managing KV caches across concurrent requests raises a broader question about the energy cost of each token generated. Unlike classification models where energy per inference is constant, LLM energy consumption scales with response length—every generated token requires reading the entire model from memory. Quantifying *the carbon cost of a chat* translates these hardware demands into energy and carbon metrics that make the environmental impact concrete.
```{python}
#| label: h100-tdp-bw-specs
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ H100 TDP AND BANDWIDTH SPECIFICATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Carbon Cost of Chat" callout (~10 lines below: GpuSpecs.h100_tdp),
# │ Layer Fusion footnote (~120 lines below: GpuSpecs.h100_bw_tbs)
# │
# │ Goal: Provide H100 TDP and memory bandwidth for energy and runtime analysis.
# │ Show: The power envelope and bandwidth ceiling of the H100.
# │ How: Retrieve constants from Hardware Digital Twins.
# │
# │ Imports: mlsysim (Hardware), mlsysim.constants (TB, second, watt)
# │ Exports: GpuSpecs.h100_tdp, GpuSpecs.h100_bw_tbs
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.constants import TB, second, watt
class GpuSpecs:
"""Hardware specifications used by downstream prose and callouts."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
h_h100 = Hardware.Cloud.H100
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
h100_bw_tbs_value = h_h100.memory_bw.m_as(TB / second)
h100_tdp_value = h_h100.tdp.m_as(watt)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
h100_bw_tbs = f"{h100_bw_tbs_value:.2f}" # e.g. "3.35" TB/s
h100_tdp = f"{h100_tdp_value:.0f}" # e.g. "700" W
```
```{python}
#| label: carbon-cost-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CARBON COST OF CHAT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Carbon Cost of a Chat" - energy footprint of LLM serving
# │
# │ Goal: Quantify the energy cost per LLM token.
# │ Show: That poor utilization causes 10× higher energy consumption per token.
# │ How: Calculate Joules per token based on TDP and concurrent request volume.
# │
# │ Imports: mlsysim.core.constants (H100_TDP, energy comparisons)
# │ Exports: cc_concurrent_str, cc_tokens_req_str, cc_total_tokens_str,
# │ cc_host_overhead_str, cc_total_power_str, cc_joules_token_str,
# │ cc_response_tokens_str, cc_response_joules_str, cc_smartphone_str,
# │ cc_boiling_str, cc_low_util_str, cc_idle_power_str, cc_low_util_joules_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import H100_TDP, ENERGY_SMARTPHONE_CHARGE_J, ENERGY_BOILING_WATER_J, watt, joule
from mlsysim.fmt import fmt
class CarbonCostCalc:
"""Energy cost per LLM token: poor utilization causes 10× higher Joules/token."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
cc_concurrent_req_value = 114 # concurrent requests per H100
cc_tokens_per_sec_req_value = 7.5 # tokens/sec per request (decode phase)
cc_host_overhead_w_value = 300 # host server power overhead (W)
cc_response_tokens_value = 500 # typical response length
cc_low_util_pct_value = 10 # poor utilization scenario (%)
cc_idle_power_w_value = 300 # GPU idle power (W)
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
_h100_tdp_w = H100_TDP.m_as(watt)
cc_total_tokens_sec_value = cc_concurrent_req_value * cc_tokens_per_sec_req_value
cc_total_power_w_value = _h100_tdp_w + cc_host_overhead_w_value
cc_joules_per_token_value = cc_total_power_w_value / cc_total_tokens_sec_value
cc_response_joules_value = cc_joules_per_token_value * cc_response_tokens_value
_smartphone_joules = ENERGY_SMARTPHONE_CHARGE_J.m_as(joule)
_boiling_joules = ENERGY_BOILING_WATER_J.m_as(joule)
cc_low_util_tokens_sec_value = cc_total_tokens_sec_value * (cc_low_util_pct_value / 100)
cc_low_util_joules_value = cc_idle_power_w_value / cc_low_util_tokens_sec_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
cc_concurrent_str = fmt(cc_concurrent_req_value, precision=0, commas=False)
cc_tokens_req_str = fmt(cc_tokens_per_sec_req_value, precision=1, commas=False)
cc_total_tokens_str = fmt(cc_total_tokens_sec_value, precision=0, commas=False)
cc_host_overhead_str = fmt(cc_host_overhead_w_value, precision=0, commas=False)
cc_total_power_str = fmt(cc_total_power_w_value, precision=0, commas=False)
cc_joules_token_str = fmt(cc_joules_per_token_value, precision=2, commas=False)
cc_response_tokens_str = fmt(cc_response_tokens_value, precision=0, commas=False)
cc_response_joules_str = fmt(cc_response_joules_value, precision=0, commas=False)
cc_smartphone_str = f"{_smartphone_joules:,}"
cc_boiling_str = f"{_boiling_joules:,}"
cc_low_util_str = fmt(cc_low_util_pct_value, precision=0, commas=False)
cc_idle_power_str = fmt(cc_idle_power_w_value, precision=0, commas=False)
cc_low_util_joules_str = fmt(cc_low_util_joules_value, precision=1, commas=False)
```
::: {.callout-notebook #notebook-carbon-chat title="The Carbon Cost of a Chat"}
**Joules per Token: The Green Metric**:
As LLMs scale, energy efficiency becomes a first-class operational metric alongside latency. For an H100 GPU (`{python} GpuSpecs.h100_tdp`W TDP), we can quantify the energy footprint of serving:
1. **Throughput**: `{python} CarbonCostCalc.cc_concurrent_str` concurrent requests $\times$ `{python} CarbonCostCalc.cc_tokens_req_str` tokens/sec/req ≈ **`{python} CarbonCostCalc.cc_total_tokens_str` tokens/sec**.
2. **Power**: `{python} GpuSpecs.h100_tdp` W (GPU) + `{python} CarbonCostCalc.cc_host_overhead_str` W (Host/Overhead) = **`{python} CarbonCostCalc.cc_total_power_str` W**.
3. **Energy per Token**:
`{python} CarbonCostCalc.cc_total_power_str` Joules/sec / `{python} CarbonCostCalc.cc_total_tokens_str` tokens/sec ≈ **`{python} CarbonCostCalc.cc_joules_token_str` Joules/token**
**The Systems Conclusion**: A typical `{python} CarbonCostCalc.cc_response_tokens_str`-token response consumes ≈ **`{python} CarbonCostCalc.cc_response_joules_str` Joules**.
- For comparison, charging a smartphone consumes ≈ `{python} CarbonCostCalc.cc_smartphone_str` Joules.
- Boiling a cup of water consumes ≈ `{python} CarbonCostCalc.cc_boiling_str` Joules.
**The Engineering Lever**: The primary way to reduce Joules/Token is to **increase hardware utilization** and **eliminate redundant compute**. If the GPU sits at `{python} CarbonCostCalc.cc_low_util_str`% utilization due to poor batching, the "Idle Power" is still ~`{python} CarbonCostCalc.cc_idle_power_str` W, causing the energy-per-token to skyrocket to **>`{python} CarbonCostCalc.cc_low_util_joules_str` Joules**. Furthermore, architectural optimizations like **Prefix Caching** skip the energy-intensive prefill phase for shared context, directly reducing the energy footprint of RAG and chat applications. MLOps is not just about speed; it is about sustainability through efficiency.
:::
::: {.callout-checkpoint title="LLM Serving Fundamentals"}
LLM serving introduces constraints absent from traditional model serving.
- [ ] **TTFT vs. TPOT**: Can you explain why these two metrics capture different user experience aspects (responsiveness vs. fluidity) and why they are governed by different hardware bottlenecks (compute vs. memory bandwidth)?
- [ ] **Memory wall**: Can you explain why adding more compute cores yields zero latency improvement for token generation, and why only faster memory or smaller models help? (The Llama-3 case study in @sec-model-serving-production-case-study-serving-llama38b-0499 quantifies this relationship.)
- [ ] **Continuous batching**: Can you explain why traditional static batching wastes compute when sequence lengths vary, and how iteration-level batching solves this?
- [ ] **PagedAttention**: Can you explain the memory fragmentation problem in KV cache management and how borrowing virtual memory concepts from OS design achieves near-zero waste?
- [ ] **Prefix Caching**: Can you explain how caching the KV states of common instruction prefixes reduces redundant computation and speeds up RAG or multi-turn applications?
:::
## Inference Runtime Selection {#sec-model-serving-inference-runtime-selection-5eef}
The batching strategies and LLM-specific techniques examined in preceding sections determine *how* requests are grouped and processed. These strategies assume an underlying execution engine that actually runs the model computations—an assumption that matters enormously. The token generation time (@eq-token-generation-time) and the latency budgets established earlier are achievable only if the runtime efficiently maps operations to hardware. The inference runtime, the software layer that orchestrates tensor operations and manages hardware resources, can vary by an order of magnitude in performance for identical models. Choosing appropriately requires understanding the tradeoffs between framework-native serving, general-purpose optimization, and specialized inference engines.
### Runtime Ecosystem and Configuration {#sec-model-serving-frameworknative-serving-da62}
PyTorch and TensorFlow models can serve directly using their native runtimes. This approach maximizes compatibility (any model that trains will serve) and simplifies the deployment pipeline (no export or conversion step). Framework runtimes include training functionality that adds overhead, and default execution paths may not exploit hardware-specific optimizations.
TorchScript and TensorFlow SavedModel formats enable ahead-of-time compilation and graph optimization, improving over eager execution while maintaining framework compatibility. These formats represent the first step toward deployment optimization without abandoning the familiar framework ecosystem.
#### General-Purpose Optimization {#sec-model-serving-generalpurpose-optimization-9ec2}
ONNX Runtime[^fn-onnx-runtime-serving]\index{ONNX Runtime!cross-platform inference} provides a hardware-agnostic optimization layer [@onnxruntime2024]. Models export to ONNX format, then ONNX Runtime applies graph optimizations and selects execution providers for the target hardware. This enables single-format deployment across CPUs, GPUs, and specialized accelerators.
[^fn-onnx-runtime-serving]: **ONNX Runtime**: Microsoft's inference engine (released December 2018) acts as a hardware abstraction layer: the same ONNX model runs on CPUs, NVIDIA GPUs, AMD GPUs, or custom accelerators through pluggable "execution providers." ONNX Runtime applies framework-agnostic graph optimizations---constant folding, redundant node elimination, operator fusion---that benefit all targets. This cross-platform capability avoids maintaining separate optimization pipelines per hardware target, accepting a 5--15% throughput loss versus TensorRT for vision models, offset by the ability to retarget the same `.onnx` artifact across CPU/GPU/NPU without recompilation---a flexibility premium that matters most in heterogeneous device fleets where recompiling per-target is measured in engineer-days. \index{ONNX Runtime!cross-platform serving}
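The export-then-serve workflow is short in code. A minimal sketch follows; the file name, input shape, and CPU-only provider are illustrative, and a GPU deployment would list `CUDAExecutionProvider` first.

```python
# Export a PyTorch model to ONNX, then serve it through ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}})   # allow variable batch size

session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
logits = session.run(["logits"], {"input": x})[0]
```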
#### Specialized Inference Engines {#sec-model-serving-specialized-inference-engines-475f}
TensorRT\index{Inference Engine!specialized}\index{TensorRT!GPU optimization}[^fn-tensorrt-gpu-serving] (NVIDIA GPUs), OpenVINO[^fn-openvino-serving]\index{OpenVINO!Intel optimization} (Intel hardware), and similar engines optimize specifically for their target hardware [@nvidia2024tensorrt; @chen2018tvm]. They apply aggressive optimizations that framework-native runtimes cannot safely perform:
[^fn-openvino-serving]: **OpenVINO (Open Visual Inference and Neural network Optimization)**: An Intel-specific engine that bypasses framework abstractions to map computations directly onto proprietary hardware instructions like AVX-512 and AMX. This direct hardware targeting is an "aggressive" optimization because it abandons the portability that framework-native runtimes must guarantee, allowing it to exploit specialized kernels unsafe for general execution. The resulting 2--5$\times$ speedup over standard CPU execution makes dedicated CPU serving economically viable for models under ~500M parameters. \index{OpenVINO!Intel inference}
[^fn-tensorrt-gpu-serving]: **TensorRT**: It abandons the portability of general-purpose frameworks by requiring a build phase that recompiles the model for a *specific* target GPU architecture (e.g., an H100). This hardware lock-in allows for aggressive, irreversible optimizations like layer fusion that are unsafe for a framework that must run on any hardware. The resulting non-portable engine can deliver 2--5$\times$ lower latency, directly reducing the number of GPUs required to meet a throughput target. \index{TensorRT!GPU optimization}
Layer fusion[^fn-layer-fusion-serving]\index{Layer Fusion!kernel optimization} combines multiple sequential operations into a single GPU kernel. Consider a common pattern: convolution → batch normalization → ReLU activation. Without fusion, this requires three kernel launches, three round-trips to GPU memory (write conv output, read for batchnorm, write batchnorm output, read for ReLU), and three sets of intermediate tensors. Fusion combines all three into one kernel that reads inputs once, computes the combined result in registers, and writes final outputs once. This eliminates kernel launch overhead (15--60 μs saved per fusion) and reduces memory traffic by 2--3$\times$. TensorRT automatically detects and fuses common patterns; a typical ResNet-50 reduces from ~50 kernels to ~15 after fusion.
[^fn-layer-fusion-serving]: **Layer Fusion**: Analogous to loop fusion in compiler optimization, where adjacent loops over the same array are combined to reduce memory traffic. Kernel fusion applies the identical principle to GPU operations: sequential kernels that write and re-read intermediate tensors from HBM are merged into a single kernel that keeps data in registers. The savings compound---a typical ResNet-50 has ~35 fusible operation pairs, and each eliminated HBM round-trip saves 1--3 $\mu$s at `{python} GpuSpecs.h100_bw_tbs` TB/s bandwidth, converting memory-bound chains into compute-bound fused kernels. \index{Layer Fusion!compiler analogy}
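Fusion is normally performed by the compiler, but its simplest instance, folding batch normalization into the preceding convolution, can be written by hand. The sketch below assumes an eval-mode `Conv2d`/`BatchNorm2d` pair with frozen statistics; PyTorch's `torch.nn.utils.fusion` module provides a comparable built-in helper.

```python
# Folding BatchNorm into Conv2d: the simplest form of layer fusion, removing
# one kernel launch and one intermediate tensor round-trip to global memory.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)         # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused   # conv -> bn now runs as a single convolution kernel
```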
Kernel auto-tuning\index{Kernel Auto-Tuning!algorithm selection}\index{Algorithm Selection!convolution implementations} selects the fastest algorithm for each operation on the specific GPU. A single convolution can be implemented using dozens of algorithms (direct, FFT-based, Winograd, various tiling strategies), each optimal for different input sizes and GPU architectures. Auto-tuning benchmarks each candidate and caches the winner, trading compilation time for runtime performance.
These optimizations typically achieve 2--5$\times$ speedup over framework-native serving but require explicit export and may not support all operations. A *runtime comparison* on a standard model quantifies these gains across the optimization spectrum.
```{python}
#| label: runtime-comparison-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RUNTIME COMPARISON
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Runtime Comparison" — specialized inference
# │ engines section
# │
# │ Goal: Demonstrate the speedup spectrum across inference runtimes.
# │ Show: That hardware-specific engines (TensorRT) yield up to 9× speedup over eager PyTorch.
# │ How: Compare benchmarked latencies for JIT, ONNX, and TensorRT at multiple precisions.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: rt_pytorch_ms_str, rt_torchscript_ms_str, rt_onnx_ms_str,
# │ rt_trt_fp32_ms_str, rt_trt_fp16_ms_str, rt_trt_int8_ms_str,
# │ rt_pytorch_speedup_str, rt_torchscript_speedup_str,
# │ rt_onnx_speedup_str, rt_trt_fp32_speedup_str,
# │ rt_trt_fp16_speedup_str, rt_trt_int8_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class RuntimeComparisonCalc:
"""ResNet-50 runtime comparison: TensorRT INT8 achieves up to 9× speedup over eager PyTorch."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
rt_pytorch_ms_value = 8.5
rt_torchscript_ms_value = 6.2
rt_onnx_ms_value = 5.1
rt_trt_fp32_ms_value = 2.8
rt_trt_fp16_ms_value = 1.4
rt_trt_int8_ms_value = 0.9
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
rt_pytorch_speedup_value = 1.0
rt_torchscript_speedup_value = rt_pytorch_ms_value / rt_torchscript_ms_value
rt_onnx_speedup_value = rt_pytorch_ms_value / rt_onnx_ms_value
rt_trt_fp32_speedup_value = rt_pytorch_ms_value / rt_trt_fp32_ms_value
rt_trt_fp16_speedup_value = rt_pytorch_ms_value / rt_trt_fp16_ms_value
rt_trt_int8_speedup_value = rt_pytorch_ms_value / rt_trt_int8_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
rt_pytorch_ms_str = f"{rt_pytorch_ms_value:.1f}"
rt_torchscript_ms_str = f"{rt_torchscript_ms_value:.1f}"
rt_onnx_ms_str = f"{rt_onnx_ms_value:.1f}"
rt_trt_fp32_ms_str = f"{rt_trt_fp32_ms_value:.1f}"
rt_trt_fp16_ms_str = f"{rt_trt_fp16_ms_value:.1f}"
rt_trt_int8_ms_str = f"{rt_trt_int8_ms_value:.1f}"
rt_pytorch_speedup_str = fmt(rt_pytorch_speedup_value, precision=1, commas=False)
rt_torchscript_speedup_str = fmt(rt_torchscript_speedup_value, precision=1, commas=False)
rt_onnx_speedup_str = fmt(rt_onnx_speedup_value, precision=1, commas=False)
rt_trt_fp32_speedup_str = fmt(rt_trt_fp32_speedup_value, precision=1, commas=False)
rt_trt_fp16_speedup_str = fmt(rt_trt_fp16_speedup_value, precision=1, commas=False)
rt_trt_int8_speedup_str = fmt(rt_trt_int8_speedup_value, precision=1, commas=False)
```
::: {.callout-notebook title="ResNet-50: Runtime Comparison"}
Performance comparison for ResNet-50 inference on V100 GPU (batch size 1):
| **Runtime** | **Latency** | **Speedup** | **Notes** |
|:----------------|----------------------------------------------------------:|--------------------------------------------------------------------:|:--------------------------|
| PyTorch (eager) | `{python} RuntimeComparisonCalc.rt_pytorch_ms_str` ms | `{python} RuntimeComparisonCalc.rt_pytorch_speedup_str`$\times$ | Baseline, no optimization |
| TorchScript | `{python} RuntimeComparisonCalc.rt_torchscript_ms_str` ms | `{python} RuntimeComparisonCalc.rt_torchscript_speedup_str`$\times$ | JIT compilation |
| ONNX Runtime | `{python} RuntimeComparisonCalc.rt_onnx_ms_str` ms | `{python} RuntimeComparisonCalc.rt_onnx_speedup_str`$\times$ | Cross-platform |
| TensorRT FP32 | `{python} RuntimeComparisonCalc.rt_trt_fp32_ms_str` ms | `{python} RuntimeComparisonCalc.rt_trt_fp32_speedup_str`$\times$ | NVIDIA-specific |
| TensorRT FP16 | `{python} RuntimeComparisonCalc.rt_trt_fp16_ms_str` ms | `{python} RuntimeComparisonCalc.rt_trt_fp16_speedup_str`$\times$ | Tensor Core acceleration |
| TensorRT INT8 | `{python} RuntimeComparisonCalc.rt_trt_int8_ms_str` ms | `{python} RuntimeComparisonCalc.rt_trt_int8_speedup_str`$\times$ | Requires calibration |
**Key insight**: The `{python} RuntimeComparisonCalc.rt_trt_int8_speedup_str`$\times$ speedup from TensorRT INT8 comes at the cost of: (1) quantization calibration data, (2) potential accuracy loss (<1% for ResNet-50), and (3) NVIDIA-specific deployment.
:::
The optimization-compatibility tradeoff is inherent. More aggressive optimization yields better performance yet increases deployment complexity and may introduce numerical differences from training. The choice depends on latency requirements, deployment constraints, and available engineering resources.
#### Runtime Configuration {#sec-model-serving-runtime-configuration-492b}
Beyond runtime selection, configuration choices significantly impact serving performance. Thread pool sizing controls parallelism for CPU inference—too few threads leave cores idle, while too many cause contention. Memory allocation strategies (pre-allocated buffers versus dynamic allocation) trade startup cost against flexibility. Execution provider selection prioritizes which hardware backends handle each operation, and graph optimization level trades compilation time for runtime performance. Production deployments require systematic experimentation to find optimal configurations for specific models and hardware combinations, measuring their impact on latency distributions rather than relying on defaults.
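These knobs are concrete in every runtime's API. A sketch using ONNX Runtime follows; the thread counts and model path are placeholders, not tuned values.

```python
# Runtime configuration sketch: thread pools, graph optimization level, and
# memory planning for a CPU-serving ONNX Runtime session.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 8       # parallelism inside a single operator
opts.inter_op_num_threads = 1       # parallelism across independent operators
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.enable_mem_pattern = True      # pre-plan activation buffers across runs

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```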
### Precision Selection for Serving {#sec-model-serving-precision-selection-serving-55ba}
A team deploying ResNet-50 on V100 GPUs faces a concrete constraint: their 30-GPU cluster costs \$90/hour, and business growth requires 3$\times$ more throughput without expanding the fleet. Switching from FP32 to INT8 inference achieves exactly this—the same model on the same hardware serves 3$\times$ more requests per second, reducing the effective cost per inference by two-thirds, at a cost of less than 0.4 percentage points of accuracy. This example illustrates the direct connection between numerical precision and infrastructure economics. Precision selection connects to the quantization techniques covered in @sec-model-compression. For the foundational comparison of numerical formats (FP32, FP16, BF16, FP8, INT8) and their precision-range trade-offs, see @sec-machine-foundations-numerical-representations-c889; for the mechanics of symmetric and asymmetric integer quantization, see @sec-machine-foundations-integer-quantization-5442. While @sec-model-compression focuses on training-time quantization, serving introduces additional considerations including calibration requirements, layer sensitivity, and dynamic precision selection.
#### Precision-Throughput Relationship {#sec-model-serving-precisionthroughput-relationship-b503}
For\index{Precision!throughput tradeoff}\index{Quantization!serving precision} memory-bandwidth-bound operations\index{Memory-Bandwidth Bound!precision impact}, reducing precision proportionally increases throughput by reducing data movement. @eq-precision-throughput quantifies the theoretical maximum speedup from precision reduction:
$$
\frac{\text{Throughput}_{\text{INT8}}}{\text{Throughput}_{\text{FP32}}} = \frac{32}{8} = 4\times \text{ (theoretical maximum)}
$$ {#eq-precision-throughput}
In practice, GPU compute pipelines and Tensor Core alignment requirements limit achieved speedup to 2.5--3.5$\times$ for INT8 versus FP32. Tensor Cores\index{Tensor Cores!alignment requirements} require specific alignment: INT8 operations need tensor dimensions divisible by 16, while FP16 requires divisibility by 8. @sec-hardware-acceleration provides the detailed Tensor Core architecture that explains these alignment constraints. The *precision tradeoffs* for a standard vision model illustrate how these theoretical limits manifest in practice.
```{python}
#| label: precision-tradeoff-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PRECISION TRADEOFFS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Precision Tradeoffs on V100" — precision-
# │ throughput relationship section
# │
# │ Goal: Quantify the three-way trade-off between latency, memory, and accuracy.
# │ Show: That FP16 is a "free lunch" (2× speedup) while INT8 trades marginal accuracy for 3× gains.
# │ How: Contrast ResNet-50 metrics across FP32, FP16, and INT8 precisions.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: pt_fp32_ms_str, pt_fp32_mem_mb_str, pt_fp32_acc_str,
# │ pt_fp16_ms_str, pt_fp16_mem_mb_str, pt_fp16_acc_str,
# │ pt_fp16_util_str, pt_int8_ms_str, pt_int8_mem_mb_str,
# │ pt_int8_ptq_acc_str, pt_int8_qat_acc_str, pt_int8_util_str,
# │ pt_int8_speedup_str, pt_fp16_speedup_str, pt_int8_acc_loss_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class PrecisionTradeoffCalc:
"""ResNet-50 precision tradeoffs: FP16 gives 2× free speedup; INT8 gives 3× with <0.4pp accuracy loss."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
pt_fp32_ms_value = 2.8
pt_fp32_mem_mb_value = 98
pt_fp32_acc_value = 76.13
pt_fp16_ms_value = 1.4
pt_fp16_mem_mb_value = 49
pt_fp16_acc_value = 76.13
pt_fp16_util_value = 85
pt_int8_ms_value = 0.9
pt_int8_mem_mb_value = 25
pt_int8_ptq_acc_value = 75.80
pt_int8_qat_acc_value = 76.05
pt_int8_util_value = 92
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
pt_int8_speedup_value = pt_fp32_ms_value / pt_int8_ms_value
pt_fp16_speedup_value = pt_fp32_ms_value / pt_fp16_ms_value
pt_int8_acc_loss_value = pt_fp32_acc_value - pt_int8_ptq_acc_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
pt_fp32_ms_str = f"{pt_fp32_ms_value:.1f}"
pt_fp32_mem_mb_str = f"{pt_fp32_mem_mb_value}"
pt_fp32_acc_str = f"{pt_fp32_acc_value:.2f}"
pt_fp16_ms_str = f"{pt_fp16_ms_value:.1f}"
pt_fp16_mem_mb_str = f"{pt_fp16_mem_mb_value}"
pt_fp16_acc_str = f"{pt_fp16_acc_value:.2f}"
pt_fp16_util_str = f"{pt_fp16_util_value}"
pt_int8_ms_str = f"{pt_int8_ms_value:.1f}"
pt_int8_mem_mb_str = f"{pt_int8_mem_mb_value}"
pt_int8_ptq_acc_str = f"{pt_int8_ptq_acc_value:.2f}"
pt_int8_qat_acc_str = f"{pt_int8_qat_acc_value:.2f}"
pt_int8_util_str = f"{pt_int8_util_value}"
pt_int8_speedup_str = fmt(pt_int8_speedup_value, precision=1, commas=False)
pt_fp16_speedup_str = fmt(pt_fp16_speedup_value, precision=0, commas=False)
pt_int8_acc_loss_str = f"{pt_int8_acc_loss_value:.2f}"
```
::: {.callout-notebook title="ResNet-50: Precision Tradeoffs on V100"}
| **Precision** | **Latency** | **Memory** | **Accuracy** | **Tensor Core Util.** | **Calibration** |
|:---------------|--------------------------------------------------:|------------------------------------------------------:|------------------------------------------------------:|---------------------------------------------------:|:----------------|
| **FP32** | `{python} PrecisionTradeoffCalc.pt_fp32_ms_str`ms | `{python} PrecisionTradeoffCalc.pt_fp32_mem_mb_str`MB | `{python} PrecisionTradeoffCalc.pt_fp32_acc_str`% | 0% | None |
| **FP16** | `{python} PrecisionTradeoffCalc.pt_fp16_ms_str`ms | `{python} PrecisionTradeoffCalc.pt_fp16_mem_mb_str`MB | `{python} PrecisionTradeoffCalc.pt_fp16_acc_str`% | `{python} PrecisionTradeoffCalc.pt_fp16_util_str`% | None |
| **INT8 (PTQ)** | `{python} PrecisionTradeoffCalc.pt_int8_ms_str`ms | `{python} PrecisionTradeoffCalc.pt_int8_mem_mb_str`MB | `{python} PrecisionTradeoffCalc.pt_int8_ptq_acc_str`% | `{python} PrecisionTradeoffCalc.pt_int8_util_str`% | 1,000 samples |
| **INT8 (QAT)** | `{python} PrecisionTradeoffCalc.pt_int8_ms_str`ms | `{python} PrecisionTradeoffCalc.pt_int8_mem_mb_str`MB | `{python} PrecisionTradeoffCalc.pt_int8_qat_acc_str`% | `{python} PrecisionTradeoffCalc.pt_int8_util_str`% | Full retraining |
**Key observations:**
- INT8 achieves `{python} PrecisionTradeoffCalc.pt_int8_speedup_str`$\times$ speedup but loses `{python} PrecisionTradeoffCalc.pt_int8_acc_loss_str`% accuracy with post-training quantization (PTQ)
- Quantization-aware training (QAT) recovers most accuracy but requires retraining
- FP16 provides `{python} PrecisionTradeoffCalc.pt_fp16_speedup_str`$\times$ speedup with no accuracy loss for most models
:::
#### Layer Sensitivity {#sec-model-serving-layer-sensitivity-6a31}
Not\index{Quantization!layer sensitivity}\index{Layer Sensitivity!precision tolerance} all layers tolerate reduced precision equally. Empirically, quantization error for a layer scales with weight magnitude and gradient sensitivity, captured by the following proportionality in @eq-quant-error:
$$\epsilon_{\text{quant}} \propto \alpha \cdot \|W\|_2 \cdot 2^{-b}$$ {#eq-quant-error}
where $\alpha$ is a layer-specific sensitivity coefficient (determined empirically or via Fisher information), $\|W\|_2$ is the weight L2 norm, and $b$ is the bit width. This explains observed patterns where first convolutional layers with high gradients and large sensitivity coefficients are precision-sensitive and often kept at FP16, middle layers with stable gradients and low sensitivity coefficients tolerate INT8 well, and final classification layers with small weights but high task sensitivity benefit from FP16 or higher precision.
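In practice this proxy is used to rank layers before deciding a mixed-precision layout. A sketch follows, with placeholder $\alpha$ values supplied by the caller (real deployments estimate them empirically or from Fisher information).

```python
# Rank layers by the quantization-error proxy alpha * ||W||_2 * 2^(-b)
# from @eq-quant-error. The alpha values passed in are placeholders.
import torch

def quant_error_proxy(weight: torch.Tensor, alpha: float, bits: int) -> float:
    return alpha * weight.norm(p=2).item() * 2 ** (-bits)

def rank_layers(model: torch.nn.Module, alphas: dict, bits: int = 8):
    scores = {name: quant_error_proxy(param, alphas.get(name, 1.0), bits)
              for name, param in model.named_parameters() if param.dim() > 1}
    # Highest projected error first: candidates to keep at FP16 rather than INT8.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```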
#### Calibration Requirements {#sec-model-serving-calibration-requirements-06b0}
Post-training\index{Calibration!INT8 quantization}\index{Post-Training Quantization!calibration}\index{Calibration Dataset!representative traffic} quantization requires a calibration dataset to determine optimal scale factors for INT8 conversion. Production experience shows that calibration data must be representative of actual serving traffic, not just training data. Using ImageNet validation images to calibrate a model serving wildlife camera images resulted in 3.2% accuracy degradation in one production system.
#### Dynamic Precision Selection {#sec-model-serving-dynamic-precision-selection-dc60}
Advanced\index{Dynamic Precision!adaptive quality} serving systems select precision per request based on runtime conditions. If the system is ahead of latency SLO, it uses higher precision for better accuracy. For low-confidence INT8 results, it recomputes at FP16. Different customer tiers may receive different precision levels. This pattern enables adaptive quality-latency tradeoffs while maximizing throughput during normal operation.
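The decision logic amounts to a few comparisons per request. A sketch with hypothetical thresholds, tier names, and engine handles (`run_int8` and `run_fp16` stand in for two compiled variants of the same model):

```python
# Per-request precision selection sketch. All thresholds and request fields
# (elapsed_ms, tier, confidence) are illustrative assumptions.
def serve(request, run_int8, run_fp16, slo_ms=50.0, confidence_floor=0.7):
    budget_ms = slo_ms - request.elapsed_ms        # budget left after queueing/preprocessing
    if request.tier == "premium" or budget_ms > 30.0:
        return run_fp16(request)                   # spare budget: favor accuracy
    result = run_int8(request)                     # default path: maximize throughput
    if result.confidence < confidence_floor and budget_ms > 15.0:
        return run_fp16(request)                   # escalate low-confidence answers
    return result
```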
The precision decision has direct infrastructure consequences: INT8 inference achieves roughly 3$\times$ higher throughput than FP32, meaning a workload requiring 30 GPUs at FP32 needs only 10 at INT8. This 3$\times$ reduction in hardware translates directly to a 3$\times$ reduction in operating costs. The connection between model-level optimization and infrastructure economics is why precision selection cannot be treated as purely a model concern.
Runtime selection and precision tuning operate at the model level: they determine *what* computation runs and at *what* numerical format. Between the model and the silicon, however, lies another optimization layer encompassing the mechanics of graph compilation to kernels, byte movement from disk to memory, and CPU-GPU coordination. These node-level techniques often yield the final 2--5$\times$ that separates a functional prototype from a production-grade serving node.
## Node-Level Optimization {#sec-model-serving-nodelevel-optimization-3d9d}
Runtime selection and precision tuning establish the software foundation for serving. Achieving peak efficiency requires going deeper, to the boundary between software and silicon: compiling the computation graph\index{Graph Compilation!optimization} into fused kernels, exploiting CPU capabilities when GPUs are absent, minimizing the time to move model bytes from disk into memory, and understanding exactly *where* every microsecond goes.
### Runtime Graph Compilation {#sec-model-serving-runtime-graph-compilation-7a7e}
Inference engines like TensorRT were introduced in @sec-model-serving-inference-runtime-selection-5eef. These engines achieve 2--5$\times$ speedups through **Graph Compilation**. Training computation graphs are dynamic and mutable, whereas serving graphs are static. This static nature allows compilers to perform aggressive optimizations that would be unsafe or too slow during training.
#### Operator Fusion {#sec-model-serving-operator-fusion-f8d2}
The most potent graph-level optimization is operator fusion\index{Operator Fusion!graph compilation}. Memory bandwidth often limits performance more than compute (@sec-hardware-acceleration). Fusion collapses multiple operations (e.g., `Conv2D` → `BiasAdd` → `ReLU`) into a single kernel launch. This keeps intermediate data in the GPU's fast L1/L2 cache or registers, avoiding round-trips to global memory (VRAM).
#### Constant Folding {#sec-model-serving-constant-folding-c652}
Parts of the graph that depend only on model weights\index{Constant Folding!compile-time optimization}\index{Compile-Time Optimization!constant folding}, which are constant during serving, can be pre-computed at compile time. For example, if a model contains `x * (sqrt(2) / 2)`, the compiler replaces the division and square root with a single multiplication by `0.707...`.
#### Memory Planning {#sec-model-serving-memory-planning-4cef}
Since the graph structure is known, the compiler can pre-calculate the exact memory offsets for every tensor\index{Memory Planning!tensor allocation}. Whether this compilation work happens before deployment or at the first request is the central architectural choice of *JIT vs. AOT compilation*.
::: {.callout-notebook title="JIT vs. AOT Compilation"}
* **Just-In-Time (JIT)**\index{JIT Compilation!runtime optimization}: Compiles the graph the first time it is run (e.g., `torch.compile`).
* *Pros*: Optimizes for the specific input shapes seen at runtime.
* *Cons*: First request pays a "compilation penalty" (latency spike).
* **Ahead-of-Time (AOT)**\index{AOT Compilation!pre-deployment}: Compiles the graph before deployment (e.g., `torch.export`, TensorRT `trtexec`).
* *Pros*: Zero compilation latency at startup; guarantees a fixed graph.
* *Cons*: Must handle all dynamic shapes explicitly or compile multiple profiles.
:::
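In PyTorch terms, the two paths look like the sketch below (recent PyTorch APIs; the warm-up request and example shape are illustrative).

```python
# JIT vs. AOT in PyTorch: torch.compile pays its cost on the first call,
# torch.export captures a fixed graph before deployment.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# JIT: the first invocation triggers compilation (a latency spike), so serving
# stacks issue a warm-up request before accepting traffic.
jit_model = torch.compile(model)
with torch.no_grad():
    jit_model(example)

# AOT: capture a fixed graph ahead of deployment and ship the artifact.
exported_program = torch.export.export(model, (example,))
```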
### CPU Inference Optimization {#sec-model-serving-cpu-inference-optimization-ae86}
GPUs dominate the narrative, yet CPUs\index{CPU Inference!when to use} remain the workhorse for a vast number of inference workloads, particularly for smaller models, latency-insensitive batch jobs, or cost-constrained environments. Optimizing for the CPU requires a different mindset.
#### SIMD and Vectorization {#sec-model-serving-simd-vectorization-6086}
Modern CPUs[^fn-simd-cpu-serving]\index{SIMD!CPU vectorization}\index{AVX-512!vector instructions} (Intel Xeon, AMD EPYC) pack powerful vector units (AVX-512, AMX). Standard Python loops cannot use these. Specialized runtimes like **OpenVINO** or **Intel Extension for PyTorch (IPEX)** map neural network operators directly to these vector instructions, achieving order-of-magnitude speedups over vanilla implementations.
[^fn-simd-cpu-serving]: **SIMD (Single Instruction, Multiple Data)**: From Michael Flynn's 1966 taxonomy of computer architectures, SIMD enables one instruction to operate on multiple data elements simultaneously. Intel's AVX-512 (2016) processes 512 bits (16 floats) per instruction; AMX (2023) extends this to matrix tile operations. For CPU inference, SIMD exploitation is the primary optimization lever: naive scalar matrix multiplication achieves ~1% of theoretical peak, while SIMD-optimized kernels approach 80--90% utilization---a gap that determines whether CPU-only serving is economically viable. \index{SIMD!CPU inference}
#### Thread Pinning and NUMA {#sec-model-serving-thread-pinning-numa-dc2b}
On multi-socket servers[^fn-numa-cpu-serving]\index{NUMA!memory locality}\index{Thread Pinning!CPU affinity}, accessing memory attached to a different CPU socket (NUMA) adds significant latency. Inference servers must be "NUMA-aware," pinning threads to specific cores and ensuring that memory allocations remain local to those cores.
[^fn-numa-cpu-serving]: **NUMA (Non-Uniform Memory Access)**: Accessing memory local to a CPU socket is faster than accessing memory attached to a different socket. Pinning an inference thread to a core is insufficient if its required memory is allocated remotely, forcing every weight access across the slower inter-socket link. This failure to co-locate threads and data imposes a ~60% latency overhead, as remote access takes ~130ns versus ~80ns for local. \index{NUMA!inference latency}
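A minimal, Linux-only sketch of the pinning half of this: restrict the process to the cores of one socket and size the thread pools to match. The core IDs below are an assumption (check `lscpu` or `numactl --hardware` for the real topology), and memory placement is usually handled at launch time, for example with `numactl --cpunodebind=0 --membind=0`.

```python
# NUMA-aware thread pinning sketch (Linux only). Core IDs are assumptions.
import os
import torch

SOCKET0_CORES = set(range(16))        # assumption: cores 0-15 live on NUMA node 0

# Pin this process (pid 0 = self) to socket 0's cores so the OS scheduler
# cannot migrate inference threads to the other socket.
os.sched_setaffinity(0, SOCKET0_CORES)

# Match the intra-op thread pool to the pinned cores and avoid oversubscription
# from a second, inter-op pool.
torch.set_num_threads(len(SOCKET0_CORES))
torch.set_num_interop_threads(1)

print("allowed cores:", sorted(os.sched_getaffinity(0)))
```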
#### Small Batch Advantage {#sec-model-serving-small-batch-advantage-c91c}
CPUs often outperform GPUs at batch size 1 for small models\index{Small Batch!CPU advantage}. The overhead of launching a GPU kernel (~10 $\mu$s) and transferring data (~50 $\mu$s) can exceed the compute time for a tiny dense layer. For models under 50 MB serving single requests, a well-optimized CPU runtime often delivers lower latency than a GPU.
### Model Serialization and Fast Loading {#sec-model-serving-fast-model-loading-1109}
In autoscaling systems, the time to spin up a new node is critical. A major component of "Cold Start" (@sec-model-serving-model-loading-initialization-cc5a) is simply reading the model weights from disk into memory. The choice of serialization format determines how quickly this loading can occur.
The standard PyTorch `torch.load()` uses Python's `pickle` format\index{Pickle!loading overhead}. This approach is inefficient because it requires the CPU to unpickle objects one by one, copy them into memory, and then often copy them *again* to the GPU. A faster alternative is memory mapping\index{mmap!zero-copy loading}, which allows the OS to map a file directly into the process's virtual address space. The data is effectively "loaded" only when accessed, and the OS handles the transfer from disk to RAM efficiently.
Building on this zero-copy principle, Safetensors[^fn-safetensors-loading]\index{Safetensors!zero-copy loading} is a modern format designed specifically for fast loading. It stores tensors as raw bytes with a minimal JSON header. This enables zero-copy\index{Zero-Copy!model loading} loading: the raw bytes on disk are mapped directly into the tensor's memory buffer.
::: {.callout-example title="Loading Speed: Safetensors vs. Pickle"}
Loading a 5 GB Stable Diffusion model:
* **Pickle (`torch.load`)**: ~15 seconds. High CPU usage.
* **Safetensors**: ~0.5 seconds. Near-zero CPU usage.
By using `mmap` and formats like `safetensors`, loading speed becomes limited only by the disk's read speed (e.g., 3 GB/s for NVMe), rather than CPU parsing overhead.
:::
[^fn-safetensors-loading]: **Safetensors**: Created by Hugging Face (released 2022), the name emphasizes safety: unlike Python's pickle format, safetensors cannot execute arbitrary code during deserialization, eliminating a class of security vulnerabilities where malicious model files could compromise a serving system. The format stores tensors as contiguous raw bytes with a minimal JSON header, enabling memory-mapped loading that achieves 30--100$\times$ faster loading than pickle. For autoscaling serving fleets, this loading speed directly reduces cold start latency---the difference between a 15-second and 0.5-second model load determines whether new replicas can absorb traffic spikes before SLOs are violated. \index{Safetensors!fast loading}
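The two loading paths compare directly in a few lines. This is a minimal sketch assuming the `safetensors` package and a checkpoint saved in both formats from the same model; the file names are placeholders.

```python
# Pickle-based vs. memory-mapped loading -- a sketch with placeholder paths.
import time
import torch
from safetensors.torch import load_file

t0 = time.perf_counter()
state_pickle = torch.load("model.pt", map_location="cpu")   # unpickle + copy every tensor
pickle_s = time.perf_counter() - t0

t0 = time.perf_counter()
state_mmap = load_file("model.safetensors", device="cpu")   # mmap the raw bytes, near zero-copy
mmap_s = time.perf_counter() - t0

print(f"torch.load: {pickle_s:.2f} s, safetensors: {mmap_s:.2f} s")
assert all(torch.equal(state_pickle[k], state_mmap[k]) for k in state_mmap)
```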
### Profiling the Serving Node {#sec-model-serving-profiling-serving-node-1e99}
Optimization without measurement is guesswork. The system efficiency metric defined in @eq-system-efficiency provides the target: maximizing the fraction of wall-clock time the accelerator spends on useful computation. Achieving that target requires visualizing the execution flow to find where time is lost.
#### The Timeline View {#sec-model-serving-timeline-view-6159}
Tools\index{Timeline Profiling!serving optimization} like **PyTorch Profiler**\index{Framework Profiler!timeline analysis} or NVIDIA **Nsight Systems (nsys)**\index{GPU Profiler!timeline analysis} generate a timeline trace. This visualization reveals the exact sequence of events on the CPU and GPU. When examining a trace, look for:
1. **Gaps in the GPU Timeline**: If the GPU bar has empty spaces, the GPU is idle. This usually means the GPU is waiting for the CPU (preprocessing bottleneck) or disk (data loading).
2. **Kernel Launch Overhead**: Thousands of tiny slivers on the GPU timeline indicate the model is launching too many small kernels. This is a prime candidate for **Operator Fusion**.
3. **Host-to-Device Transfers**: Look for `MemcpyHtoD` (Host to Device) blocks. Determine whether they overlap with computation or block it.
::: {.callout-example title="The Profiling Loop"}
1. **Capture**: Run a warmup, then capture a trace of 10--50 requests (see the capture sketch after this callout).
2. **Visualize**: Open the trace in a viewer (Chrome Tracing, Nsight).
3. **Identify**: Find the largest gap or the longest block.
4. **Optimize**: Apply a specific fix (e.g., fusion, pinning).
5. **Verify**: Re-capture and confirm the gap is gone.
:::
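A minimal sketch of the **Capture** step using PyTorch's built-in profiler; the model and input are placeholders, and the resulting `trace.json` opens in a Chrome-tracing or Perfetto viewer.

```python
# Capturing a timeline trace with torch.profiler -- placeholder model and input.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
x = torch.randn(32, 1024)

with torch.no_grad():
    for _ in range(5):              # warmup: keep one-time costs out of the trace
        model(x)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(20):             # capture a small window of "requests"
        model(x)

prof.export_chrome_trace("trace.json")   # open in chrome://tracing or Perfetto
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```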
#### Optimization Technique Impact Matrix {#sec-model-serving-optimization-technique-impact-matrix-7c1e}
To guide optimization efforts, @tbl-optimization-impact summarizes the key techniques available at the node level, their primary targets, and expected returns.
| **Technique** | **Target Metric** | **Typical Gain** | **Implement. Cost** | **Best For** |
|:----------------------|:---------------------|-----------------:|:--------------------|:-------------------------|
| **Operator Fusion** | Latency & Throughput | 2--5$\times$ | Medium (Compiler) | Memory-bound layers |
| **INT8 Quantization** | Throughput | 3--4$\times$ | High (Calibration) | Inference-heavy nodes |
| **Graph Compilation** | Latency | 1.5--3$\times$ | Low (One-line) | Static graph models |
| **Zero-Copy Loading** | Startup Time | 10--50$\times$ | Low (File format) | Autoscaling / Cold Start |
| **CPU Pinning** | Tail Latency (P99) | 20--50% reduction | Low (Config) | Latency-critical apps |
: **Node-Level Optimization Impact**: A decision matrix for selecting optimization techniques. High-impact techniques like quantization often carry higher implementation costs (calibration data requirements), while architectural changes like zero-copy loading offer dramatic gains for specific metrics (startup time) with low effort. {#tbl-optimization-impact}
This hierarchy of impact guides where to invest engineering effort. The following checklist prioritizes the optimization strategy by layer.
::: {.callout-checkpoint title="The Optimization Hierarchy"}
Optimizing inference requires a layered approach.
**The Stack**
- [ ] **System Level**: Have you minimized network round trips and serialization overhead? (gRPC, persistent connections).
- [ ] **Application Level**: Are you batching requests effectively? (Dynamic batching).
- [ ] **Model Level**: Is the model compiled for the target hardware? (TensorRT, ONNX Runtime).
- [ ] **Kernel Level**: Are operations fused to minimize memory bandwidth?
:::
The optimization techniques examined so far (batching, runtime selection, precision tuning, graph compilation) collectively determine how much useful work a single serving node extracts from its hardware. The natural question that follows is economic: determining *how much* infrastructure is required and at *what* total cost.
## Economics and Planning {#sec-model-serving-economics-capacity-planning-3e7e}
Every optimization technique examined so far (batching, precision tuning, operator fusion, graph compilation) reduces a single number: the cost of one inference on one machine. Production deployment, however, requires answering a different question: how many machines, of what type, at what total cost. A team that achieves 1,200 images/second on a V100 still needs to know whether 8 V100s at \$3/hour each or 24 T4s at \$0.53/hour each yields lower total cost of ownership for their 5,000 QPS target. Serving costs\index{Serving Economics!infrastructure costs}\index{Serving Costs!request volume scaling} scale with request volume, unlike training costs that scale with dataset size and model complexity [@zhang2019mark]. The intelligence deflation trend shown in @fig-intelligence-deflation intensifies this pressure: as per-token prices collapse by orders of magnitude, the margin on each inference shrinks, making infrastructure efficiency the primary lever for economic viability.
### Cost Per Inference {#sec-model-serving-cost-per-inference-27fc}
Total\index{Cost Per Inference!serving economics} serving cost decomposes into four components: compute time (GPU or CPU cycles consumed per inference), memory (accelerator memory required to hold model weights and activations), data transfer (network bandwidth for request and response payloads), and orchestration overhead (container runtime, load balancing, and monitoring). For GPU inference, the dominant cost component shifts with utilization. At high utilization, compute time dominates because the GPU stays busy processing requests. At low utilization, memory cost dominates\index{GPU Utilization!cost economics} because the GPU is reserved and billed even while idle. This distinction matters for cost optimization: improving throughput reduces compute cost per inference, while improving utilization reduces the memory waste of idle hardware. We can apply this framework to a *ResNet-50 cost analysis*.
```{python}
#| label: cost-analysis-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COST ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Cost Analysis" — cost per inference section
# │
# │ Goal: Contrast the cost-per-million inferences across hardware tiers.
# │ Show: That expensive GPUs (V100) can be cheaper per-inference than T4 or CPU due to high throughput.
# │ How: Calculate unit costs using AWS hourly rates and measured images-per-second.
# │
# │ Imports: mlsysim.core.constants (SEC_PER_HOUR, MILLION), mlsysim.book (fmt)
# │ Exports: ca_cpu_cost_str, ca_cpu_throughput_str, ca_cpu_cpm_str,
# │ ca_t4_cost_str, ca_t4_throughput_str, ca_t4_cpm_str,
# │ ca_v100_cost_str, ca_v100_throughput_str, ca_v100_cpm_str,
# │ ca_v100_price_increase_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import SEC_PER_HOUR, MILLION
from mlsysim.fmt import fmt
class CostAnalysisCalc:
"""ResNet-50 cost analysis: T4 achieves lowest cost-per-image despite higher hourly rate than CPU."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
ca_cpu_cost_value = 0.17
ca_cpu_throughput_value = 50
ca_t4_cost_value = 0.53
ca_t4_throughput_value = 400
ca_v100_cost_value = 3.06
ca_v100_throughput_value = 1200
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
ca_cpu_cpm_value = ca_cpu_cost_value / (ca_cpu_throughput_value * SEC_PER_HOUR / MILLION)
ca_t4_cpm_value = ca_t4_cost_value / (ca_t4_throughput_value * SEC_PER_HOUR / MILLION)
ca_v100_cpm_value = ca_v100_cost_value / (ca_v100_throughput_value * SEC_PER_HOUR / MILLION)
ca_v100_price_increase_value = ca_v100_cost_value / ca_t4_cost_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
ca_cpu_cost_str = fmt(ca_cpu_cost_value, precision=2, commas=False)
ca_cpu_throughput_str = f"{ca_cpu_throughput_value}"
ca_cpu_cpm_str = fmt(ca_cpu_cpm_value, precision=2, commas=False)
ca_t4_cost_str = fmt(ca_t4_cost_value, precision=2, commas=False)
ca_t4_throughput_str = f"{ca_t4_throughput_value}"
ca_t4_cpm_str = fmt(ca_t4_cpm_value, precision=2, commas=False)
ca_v100_cost_str = fmt(ca_v100_cost_value, precision=2, commas=False)
ca_v100_throughput_str = f"{ca_v100_throughput_value:,}"
ca_v100_cpm_str = fmt(ca_v100_cpm_value, precision=2, commas=False)
ca_v100_price_increase_str = fmt(ca_v100_price_increase_value, precision=0, commas=False)
```
::: {.callout-notebook title="ResNet-50: Cost Analysis"}
Consider serving ResNet-50 on AWS infrastructure (US-East region, on-demand pricing as of this writing):
| **Instance Type** | **Cost/Hour** | **Throughput** | **Cost per 1M Images** |
|:--------------------------|-----------------------------------------------:|---------------------------------------------------------:|----------------------------------------------:|
| **c5.xlarge (CPU)** | \$`{python} CostAnalysisCalc.ca_cpu_cost_str` | `{python} CostAnalysisCalc.ca_cpu_throughput_str` img/s | \$`{python} CostAnalysisCalc.ca_cpu_cpm_str` |
| **g4dn.xlarge (T4 GPU)** | \$`{python} CostAnalysisCalc.ca_t4_cost_str` | `{python} CostAnalysisCalc.ca_t4_throughput_str` img/s | \$`{python} CostAnalysisCalc.ca_t4_cpm_str` |
| **p3.2xlarge (V100 GPU)** | \$`{python} CostAnalysisCalc.ca_v100_cost_str` | `{python} CostAnalysisCalc.ca_v100_throughput_str` img/s | \$`{python} CostAnalysisCalc.ca_v100_cpm_str` |
**Key insight**: The T4 GPU instance achieves the lowest cost per inference despite higher hourly cost, because GPU throughput dramatically exceeds CPU throughput. The V100 is only cost-effective at very high sustained traffic where its higher throughput justifies the `{python} CostAnalysisCalc.ca_v100_price_increase_str`$\times$ price increase. Cloud pricing varies by region and changes over time; consult current pricing for production planning.
:::
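The unit-cost arithmetic behind this table is worth internalizing; the sketch below reproduces it with the same illustrative on-demand rates and measured throughputs, both of which change over time.

```python
# Cost per million inferences = hourly rate / (throughput * 3600 / 1e6).
# Rates and throughputs are the illustrative figures from the table above.
instances = {
    "c5.xlarge (CPU)":   {"usd_per_hour": 0.17, "images_per_sec": 50},
    "g4dn.xlarge (T4)":  {"usd_per_hour": 0.53, "images_per_sec": 400},
    "p3.2xlarge (V100)": {"usd_per_hour": 3.06, "images_per_sec": 1200},
}

for name, spec in instances.items():
    images_per_hour = spec["images_per_sec"] * 3600
    cost_per_million = spec["usd_per_hour"] / (images_per_hour / 1e6)
    print(f"{name:20s} ${cost_per_million:.2f} per 1M images")
```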
### GPU vs CPU Economics {#sec-model-serving-gpu-vs-cpu-economics-eb06}
GPUs provide significant speedup for parallel operations but cost more per hour\index{GPU vs CPU!economics} [@wu2019machine]. The crossover point depends on model characteristics and latency requirements.
CPU inference makes economic sense for small models with few parameters and simple operations, when latency requirements are relaxed (hundreds of milliseconds acceptable), when request volume is low or highly variable (making GPU reservation wasteful), or when the model's operations do not parallelize well. GPU inference dominates when models are large with parallel-friendly operations, latency requirements are strict (tens of milliseconds), request volume is high and consistent enough to sustain utilization, and batching can amortize the per-inference overhead of GPU kernel launches.
Beyond\index{Autoscaling!startup latency} steady-state costs, startup time affects scaling economics. CPU instances typically start in 30--60 seconds while GPU instances take 2--5 minutes including driver initialization, model loading, and warmup. For variable traffic patterns, this startup latency can be more important than cost per inference. If traffic spikes arrive faster than GPU instances can scale, latency SLOs will be violated despite having sufficient eventual capacity.
This asymmetry suggests different scaling strategies where CPU instances enable reactive scaling\index{Reactive Scaling!CPU instances} by responding to current demand while GPU instances often require predictive scaling\index{Predictive Scaling!GPU instances} by provisioning based on anticipated demand. For bursty workloads, a hybrid approach\index{Hybrid Scaling!GPU+CPU} uses always-on GPU capacity for baseline load plus CPU overflow capacity for spikes, trading higher per-inference cost during spikes for better responsiveness. This GPU+CPU hybrid is one instance of the broader *hybrid architecture* patterns cataloged in @sec-ml-systems-hybrid-architectures-combining-paradigms-7cdd, where the train-serve split and hierarchical processing patterns also combine paradigms to balance cost, latency, and capability.
### Capacity Planning {#sec-model-serving-capacity-planning-96a3}
The GPU versus CPU decision establishes the cost per inference, but determining how much infrastructure to provision requires combining cost analysis with the queuing theory foundations from @sec-model-serving-queuing-theory-tail-latency-29a6. Capacity planning\index{Capacity Planning!infrastructure sizing} translates three inputs into infrastructure specifications: traffic patterns (peak request rate, daily/weekly cycles, growth projections), latency SLOs (p50, p95, p99 targets), and model characteristics (inference time distribution at various batch sizes) [@harchol2013performance].
The worked example in @sec-model-serving-queuing-theory-tail-latency-29a6 demonstrates the complete workflow: starting from a 50 ms p99 SLO and 5,000 QPS target, deriving the safe utilization threshold of `{python} CapacityPlanningCalc.cp_rho_safe_pct_str` percent from @eq-p99-latency, and determining GPU count with headroom of `{python} CapacityPlanningCalc.cp_final_ceil_str` V100s. Production systems typically provision for peak load plus 30 percent headroom, using auto-scaling to reduce costs during low-traffic periods while meeting latency objectives during peaks. The key insight from capacity planning is that throughput numbers are meaningful only when coupled with latency guarantees: a system achieving 10,000 QPS but violating the p99 SLO on 5 percent of requests is actually serving 9,500 valid QPS and failing on the rest.
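A minimal sketch of the provisioning arithmetic follows, using illustrative numbers rather than the chapter's worked example: a QPS target, a per-replica service time, a utilization ceiling chosen from the latency SLO, and a headroom factor.

```python
# Capacity planning sketch: how many replicas for a QPS target under an
# SLO-safe utilization ceiling? All numbers here are illustrative.
import math

target_qps      = 5_000
service_time_s  = 0.010      # effective per-request service time on one replica
rho_safe        = 0.70       # utilization ceiling that keeps p99 within the SLO
headroom        = 1.30       # extra margin for traffic spikes and failures

per_replica_qps = 1.0 / service_time_s               # capacity at 100% utilization
replicas_at_slo = target_qps / (per_replica_qps * rho_safe)
provisioned     = math.ceil(replicas_at_slo * headroom)

print(f"replicas needed at rho <= {rho_safe:.0%}: {replicas_at_slo:.1f}")
print(f"provisioned with {headroom - 1:.0%} headroom: {provisioned}")
```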
### Production Case Study: Serving Llama-3-8B {#sec-model-serving-production-case-study-serving-llama38b-0499}
To synthesize the principles of latency budgeting, memory management, and hardware efficiency, we analyze a complete production profile for a modern Large Language Model (LLM) serving workload\index{LLM Serving!8B parameter case study}\index{LLM Serving!production case study}. This case study demonstrates how physical constraints (memory bandwidth and capacity) translate directly into service-level metrics and unit economics.
We begin with the bottleneck that dominates LLM serving costs: KV cache memory. The memory curves in @fig-kv-cache-growth climb steeply, especially for larger batch sizes, illustrating *why* long-context serving is memory-bound even on H100s using typical 70B-class assumptions.
::: {#fig-kv-cache-growth fig-env="figure" fig-pos="htb" fig-cap="**The KV-Cache Explosion**: Memory usage vs. Context Length for a 70B-class model. Assumes 80 layers, d_model=8192, FP16 KV cache, GQA (8x). The linear growth of the Key-Value cache (storing attention history) quickly consumes available GPU memory (red dashed line). For batch size 32 (purple), the system hits the 'OOM Zone' at just 8k context length, forcing a trade-off between batch size (throughput) and context window (capability)." fig-alt="Line chart showing memory usage increasing linearly with context length. Multiple lines for different batch sizes. Red dashed horizontal line marks GPU memory limit. Purple line for batch 32 crosses into OOM zone at 8k context."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ KV-CACHE GROWTH FIGURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-kv-cache-growth — introduces the KV-cache memory bottleneck
# │ for LLM serving with long contexts and large batch sizes
# │
# │ Goal: Demonstrate that KV-cache memory grows linearly with context length
# │ and batch size, hitting OOM at practical serving configurations.
# │ Show: KV-cache size (GB) vs. context length (tokens) for batch sizes 1, 4,
# │ 16, 32; A100/H100 80 GB limit annotated; OOM zone shaded.
# │ How: Compute KV cache bytes = 2 × layers × d_model × seq × batch × bytes
# │ ÷ gqa_ratio; convert to GB; plot per-batch-size using mlsysim.core.viz palette.
# │
# │ Imports: numpy (np), mlsysim.core.viz (viz), mlsysim.constants (BYTES_FP16, byte, GB)
# │ Exports: (figure only, no prose variables)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsysim import viz
from mlsysim.core.constants import BYTES_FP16, byte, GB
fig, ax, COLORS, plt = viz.setup_plot()
# =============================================================================
# PLOT: The KV-Cache Explosion
# =============================================================================
seq_len = np.linspace(0, 32000, 100)
layers, d_model, bytes_per_param = 80, 8192, BYTES_FP16.m_as(byte) # 70B model params, FP16
gqa_ratio = 8 # Grouped Query Attention (8x reduction)
def get_kv_gb(batch, seq):
# KV cache size = 2 * layers * d_model * seq * batch * bytes_per_param / gqa_ratio
bytes_total = (2 * layers * d_model * seq * batch * bytes_per_param) / gqa_ratio
return (bytes_total * byte).m_as(GB)
batches = [1, 4, 16, 32]
colors = [COLORS['BlueLine'], COLORS['GreenLine'], COLORS['OrangeLine'], COLORS['VioletLine']]
for b, c in zip(batches, colors):
gb = get_kv_gb(b, seq_len)
ax.plot(seq_len, gb, label=f'Batch Size {b}', color=c, linewidth=2)
limit_gb = 80
ax.axhline(limit_gb, color=COLORS['RedLine'], linestyle='--', linewidth=2)
ax.text(1000, limit_gb + 2, "A100/H100 Capacity (80GB)", color=COLORS['RedLine'], fontweight='bold', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.axhspan(limit_gb, 140, color=COLORS['RedL'], alpha=0.2)
ax.text(16000, 100, "Out of Memory (OOM) Zone", color=COLORS['RedLine'], ha='center', fontsize=10, fontweight='bold', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.annotate("Linear Growth", xy=(15000, get_kv_gb(4, 15000)), xytext=(20000, 30),
arrowprops=dict(facecolor=COLORS['primary'], arrowstyle='->'), fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.set_xlabel('Context Length (Tokens)')
ax.set_ylabel('KV Cache Size (GB) [FP16]')
ax.set_xlim(0, 32000)
ax.set_ylim(0, 120)
ax.set_xticks([0, 8000, 16000, 24000, 32000])
ax.set_xticklabels(['0', '8k', '16k', '24k', '32k'])
ax.legend(loc='lower right', fontsize=8)
plt.show()
```
:::
The linear growth of the KV cache with sequence length forces a hard trade-off: supporting longer contexts (32k+) requires shrinking the batch size, which in turn erodes throughput efficiency.
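The arithmetic behind the figure is short enough to sketch directly. The model parameters below mirror the 70B-class assumptions above; the weight footprint is an additional illustrative assumption.

```python
# Batch-size vs. context-length trade-off under a fixed KV-cache budget.
# Model parameters mirror the figure; the weight footprint is illustrative.
LAYERS, D_MODEL  = 80, 8192
BYTES_PER_ELEM   = 2            # FP16 KV cache
GQA_RATIO        = 8            # grouped-query attention (8x fewer KV heads)
GPU_CAPACITY_GB  = 80
WEIGHTS_GB       = 40           # assumption: ~70B params at roughly 4-5 bits

kv_bytes_per_token = 2 * LAYERS * D_MODEL * BYTES_PER_ELEM / GQA_RATIO  # K and V
free_bytes = (GPU_CAPACITY_GB - WEIGHTS_GB) * 1e9

for context_len in (2_048, 8_192, 32_768):
    max_batch = int(free_bytes / (kv_bytes_per_token * context_len))
    print(f"context {context_len:>6}: ~{max_batch} concurrent sequences fit")
```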
```{python}
#| label: llm-case-study-hw-specs
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LLM CASE STUDY HARDWARE SPECIFICATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Workload Profile (~5 lines below: LlmCaseStudyHwSpecs.h100_mem),
# │ "Physics of Token Generation" callout (~50 lines below:
# │ LlmCaseStudyHwSpecs.a100_bw_tbs)
# │
# │ Goal: Provide H100 memory capacity and A100 bandwidth for the Llama-3-8B
# │ case study's workload profile and token generation analysis.
# │ Show: The memory ceiling and bandwidth of H100 and A100 GPUs.
# │ How: Retrieve constants from Hardware Digital Twins; extend GpuSpecs class.
# │
# │ Imports: mlsysim (Hardware), mlsysim.constants (TB, second, GiB)
# │ Exports: LlmCaseStudyHwSpecs.h100_mem, LlmCaseStudyHwSpecs.a100_bw_tbs
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.constants import TB, second, GiB
class LlmCaseStudyHwSpecs:
"""H100 memory and A100 bandwidth for the Llama-3-8B case study."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
h_a100 = Hardware.Cloud.A100
h_h100 = Hardware.Cloud.H100
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
a100_bw_tbs_value = h_a100.memory_bw.m_as(TB / second)
h100_mem_value = h_h100.memory_capacity.m_as(GiB)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
a100_bw_tbs = f"{a100_bw_tbs_value:.1f}" # e.g. "2.0" TB/s
h100_mem = f"{h100_mem_value:.0f}" # e.g. "80" GB
```
#### Workload Profile {#sec-model-serving-workload-profile-a380}
* **Model**: Llama-3-8B (quantized to 4-bit AWQ\index{AWQ!4-bit quantization}; see @sec-model-compression for quantization techniques).
* **Hardware**: 1$\times$ NVIDIA H100 SXM5 GPU (`{python} LlmCaseStudyHwSpecs.h100_mem` GB HBM3, `{python} GpuSpecs.h100_bw_tbs` TB/s bandwidth).
* **Request Characteristics**: 1,000-token input prompt (Prefill), 256-token generated response (Decode).
* **Target SLOs**: TTFT $<$ 200 ms, TPOT $<$ 20 ms.
#### Latency Deconstruction {#sec-model-serving-latency-deconstruction-217e}
The end-to-end request latency is governed by the two-phase execution model of autoregressive transformers, applying the TTFT and TPOT metrics defined in @sec-model-serving-performance-metrics-ttft-tpot-b009.
##### Prefill Phase (Time to First Token) {.unnumbered}
The model processes the 1,000-token prompt in parallel\index{Prefill Phase!compute-bound}\index{Prefill Phase!parallel processing}. On an H100, this compute-bound operation achieves approximately 10,000 tokens per second: $T_{\text{prefill}} = 1000 \text{ tokens} / 10{,}000 \text{ tokens/s} = 100 \text{ ms}$. Accounting for 20 ms of system overhead (network ingress, tokenization), the **TTFT is 120 ms**, comfortably within the 200 ms SLO.
##### Decode Phase (Time Per Output Token) {.unnumbered}
The model generates 256 tokens sequentially. This phase is memory-bandwidth bound\index{Decode Phase!memory-bandwidth bound}\index{Memory Bandwidth!LLM bottleneck}\index{Decode Phase!sequential generation}—the same IO-bound pattern seen in the DLRM embedding lookups (@sec-model-serving-latency-distribution-analysis-b0f8), but at a larger scale: the system must read the entire 3.5 GB weight tensor from VRAM to generate a single token.
::: {.callout-perspective title="The Physics of Token Generation"}
Recall the **Energy-Movement Invariant** from @sec-data-engineering: moving a bit is 100--1,000$\times$ more expensive than computing on it. In the **Decode Phase**, this law determines the physical "cost per word."
**The Memory Wall for Generative AI**: Because the decode phase has an arithmetic intensity of $\approx 1$ FLOP/byte (we must read every weight just to generate one token), performance is strictly limited by memory bandwidth ($BW$), not compute. This relationship is captured in @eq-token-generation-time:
$$ T_{\text{token}} \approx \frac{\text{Model Size (Bytes)}}{\text{Memory Bandwidth (Bytes/s)}} $$ {#eq-token-generation-time}
**The Engineering Implication**:
Every token generation pays a massive "energy tax" to move the model's logic from HBM into compute registers. For Llama-3-8B (3.5 GB int4), an A100 80 GB (`{python} LlmCaseStudyHwSpecs.a100_bw_tbs` TB/s HBM2e) generates tokens at $\approx 1.7$ ms/token. Adding more *compute cores* yields **zero** latency improvement; only faster memory (Physics) or smaller models (Algorithm) can speed up generation.
:::
```{python}
#| label: llm-serving-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LLM SERVING ECONOMICS (LLAMA-3-8B CASE STUDY)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Llama-3-8B serving case study — token latency, throughput, and
# │ unit economics on H100 with 4-bit quantization
# │
# │ Goal: Connect memory bandwidth, KV cache capacity, and serving economics.
# │ Show: That memory capacity bounds throughput while bandwidth bounds latency.
# │ How: Calculate TPOT, concurrent batch size, and $/M tokens for Llama-3-8B on H100.
# │
# │ Imports: mlsysim.core.constants (H100_MEM_BW, TB, second, SEC_PER_HOUR, MILLION),
# │ mlsysim.book (fmt)
# │ Exports: model_weight_gb_str, h100_bw_tb_str,
# │ token_time_theoretical_ms_str, realized_tpot_ms_str,
# │ decode_tokens_str, total_decode_s_str, kv_cache_gb_str,
# │ kv_per_token_mb_str, kv_capacity_tokens_str,
# │ tokens_per_req_str, concurrent_batch_str,
# │ req_time_s_str, hourly_cost_str, tokens_per_hour_m_str,
# │ cost_per_m_tokens_str, remaining_vram_gb_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import H100_MEM_BW, TB, second, SEC_PER_HOUR, MILLION
from mlsysim.fmt import fmt
class LlmServingCalc:
"""Llama-3-8B serving economics: memory capacity bounds throughput, bandwidth bounds latency."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
model_weight_gb_value = 3.5
realized_tpot_ms_value = 10 # conservative production target (theoretical min ~1-2ms)
decode_tokens_value = 256
kv_cache_gb_value = 72
kv_per_token_mb_value = 0.5
tokens_per_req_value = 1256
ttft_s_value = 0.12
hourly_cost_value = 3.00
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
_h100_bw_tb = H100_MEM_BW.m_as(TB / second)
token_time_theoretical_ms_value = model_weight_gb_value / (_h100_bw_tb * 1000) * 1000
total_decode_s_value = decode_tokens_value * realized_tpot_ms_value / 1000
kv_capacity_tokens_value = int(kv_cache_gb_value * 1000 / kv_per_token_mb_value)
concurrent_batch_value = int(kv_capacity_tokens_value / tokens_per_req_value)
req_time_s_value = ttft_s_value + decode_tokens_value * realized_tpot_ms_value / 1000
tokens_per_hour_value = concurrent_batch_value * (SEC_PER_HOUR / req_time_s_value) * tokens_per_req_value
cost_per_m_tokens_value = hourly_cost_value / (tokens_per_hour_value / MILLION)
remaining_vram_gb_value = int(80 - model_weight_gb_value)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
model_weight_gb_str = f"{model_weight_gb_value:.1f}"
h100_bw_tb_str = fmt(_h100_bw_tb, precision=1, commas=False)
token_time_theoretical_ms_str = fmt(token_time_theoretical_ms_value, precision=0, commas=False)
realized_tpot_ms_str = f"{realized_tpot_ms_value}"
decode_tokens_str = f"{decode_tokens_value}"
total_decode_s_str = fmt(total_decode_s_value, precision=2, commas=False)
kv_cache_gb_str = f"{kv_cache_gb_value}"
kv_per_token_mb_str = f"{kv_per_token_mb_value:.1f}"
kv_capacity_tokens_str = f"{kv_capacity_tokens_value:,}"
tokens_per_req_str = f"{tokens_per_req_value:,}"
concurrent_batch_str = f"{concurrent_batch_value}"
req_time_s_str = fmt(req_time_s_value, precision=2, commas=False)
hourly_cost_str = fmt(hourly_cost_value, precision=2, commas=False)
tokens_per_hour_m_str = fmt(tokens_per_hour_value / MILLION, precision=0, commas=False)
cost_per_m_tokens_str = fmt(cost_per_m_tokens_value, precision=3, commas=False)
remaining_vram_gb_str = f"{remaining_vram_gb_value}"
```
* $T_{\text{token}}$ ≈ `{python} LlmServingCalc.model_weight_gb_str` GB / `{python} GpuSpecs.h100_bw_tbs` TB/s ≈ `{python} LlmServingCalc.token_time_theoretical_ms_str` ms (theoretical limit).
* Accounting for kernel launch overhead, attention computation, and a conservative production safety margin, realized $T_{\text{token}}$ is approximately `{python} LlmServingCalc.realized_tpot_ms_str` ms.
* Total decode time: `{python} LlmServingCalc.decode_tokens_str` tokens $\times$ `{python} LlmServingCalc.realized_tpot_ms_str` ms/token = `{python} LlmServingCalc.total_decode_s_str` seconds.
* **TPOT is `{python} LlmServingCalc.realized_tpot_ms_str` ms**, well within the 20 ms "fluidity" SLO.
#### Memory & Throughput {#sec-model-serving-memory-throughput-63dd}
With 4-bit weights occupying `{python} LlmServingCalc.model_weight_gb_str` GB, the remaining ~`{python} LlmServingCalc.remaining_vram_gb_str` GB of VRAM is available for the **KV Cache**. Using **PagedAttention**, we can allocate this memory with near-zero fragmentation.
* Each token requires approximately `{python} LlmServingCalc.kv_per_token_mb_str` MB of KV cache (32 layers $\times$ 4096 dim $\times$ 2 vectors $\times$ 2-byte precision, assuming standard multi-head attention; models with Grouped Query Attention use fewer KV heads, reducing this by up to 4$\times$).
* Total cache capacity ≈ `{python} LlmServingCalc.kv_cache_gb_str` GB / `{python} LlmServingCalc.kv_per_token_mb_str` MB/token ≈ `{python} LlmServingCalc.kv_capacity_tokens_str` tokens.
* At `{python} LlmServingCalc.tokens_per_req_str` tokens per request (input + output), the GPU can handle a **concurrent batch size of ~`{python} LlmServingCalc.concurrent_batch_str` requests**.
#### Unit Economics {#sec-model-serving-unit-economics-b685}
For an H100 SXM5 instance at approximately USD `{python} LlmServingCalc.hourly_cost_str` per hour (specialized cloud providers; hyperscaler rates vary from USD 2--13 per hour as of this writing):
* Total tokens per hour: `{python} LlmServingCalc.concurrent_batch_str` batch $\times$ (3,600 s/hr / `{python} LlmServingCalc.req_time_s_str` s/req) $\times$ `{python} LlmServingCalc.tokens_per_req_str` tokens/req ≈ `{python} LlmServingCalc.tokens_per_hour_m_str` million tokens/hour.
* **Cost per million tokens**: USD `{python} LlmServingCalc.hourly_cost_str` / `{python} LlmServingCalc.tokens_per_hour_m_str` ≈ **USD `{python} LlmServingCalc.cost_per_m_tokens_str`**.
This analysis highlights that for LLMs, **memory capacity**\index{Memory Capacity!LLM throughput}\index{KV Cache!memory capacity} (the size of the KV cache) is the primary determinant of throughput and cost, while **memory bandwidth**\index{Memory Bandwidth!LLM latency}\index{HBM!memory bandwidth} is the primary determinant of latency.
This case study applies the core principles developed throughout this chapter: latency budgets decompose into prefill and decode phases, queuing theory governs batch sizing and capacity planning, and hardware constraints in the form of memory bandwidth and capacity determine achievable performance and cost. The quantitative framework established here enables principled engineering decisions, but only when applied correctly. Common misconceptions cause even experienced engineers to misapply these principles in practice.
## Fallacies and Pitfalls {#sec-model-serving-fallacies-pitfalls-336b}
Serving inverts training priorities in ways that violate intuitions from batch processing. The nonlinear relationship between utilization and latency, the hidden costs of preprocessing, and the silent failure modes of training-serving skew cause violated SLOs, wasted optimization effort, and accuracy degradation invisible to standard monitoring.
**Fallacy:** *Reducing model inference latency proportionally reduces user-perceived latency.*
```{python}
#| label: fallacy-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: INFERENCE LATENCY ≠ USER LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Reducing model inference latency proportionally reduces
# │ user-perceived latency"
# │
# │ Goal: Demonstrate the nonlinear interaction between inference speed and queuing.
# │ Show: That collapsing queuing wait yields system-level speedups far exceeding model-level speedups.
# │ How: Model M/M/1 queuing wait times for 5ms vs. 2ms inference latencies.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: fl_utilization_high_pct_str, fl_service_slow_ms_str,
# │ fl_wait_slow_ms_str, fl_service_fast_ms_str,
# │ fl_utilization_new_pct_str, fl_wait_fast_ms_str,
# │ fl_inference_gain_ms_str, fl_queuing_improvement_str,
# │ fl_total_slow_ms_str, fl_total_fast_ms_str,
# │ fl_system_speedup_str, fl_model_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacyLatencyCalc:
"""Shows system-level speedup far exceeds model-level speedup due to nonlinear queuing dynamics."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
fl_utilization_high_value = 0.8
fl_service_slow_ms_value = 5
fl_service_fast_ms_value = 2
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# M/M/1: wait = service * rho / (1 - rho)
fl_wait_slow_ms_value = fl_service_slow_ms_value * fl_utilization_high_value / (1 - fl_utilization_high_value)
fl_total_slow_ms_value = fl_wait_slow_ms_value + fl_service_slow_ms_value
# New utilization: rho_new = rho_old * (service_new / service_old)
fl_utilization_new_value = fl_utilization_high_value * (fl_service_fast_ms_value / fl_service_slow_ms_value)
fl_wait_fast_ms_value = fl_service_fast_ms_value * fl_utilization_new_value / (1 - fl_utilization_new_value)
fl_total_fast_ms_value = fl_wait_fast_ms_value + fl_service_fast_ms_value
fl_model_speedup_value = fl_service_slow_ms_value / fl_service_fast_ms_value
fl_system_speedup_value = fl_total_slow_ms_value / fl_total_fast_ms_value
fl_queuing_improvement_value = fl_wait_slow_ms_value / fl_wait_fast_ms_value
fl_inference_gain_ms_value = fl_service_slow_ms_value - fl_service_fast_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fl_utilization_high_pct_str = f"{fl_utilization_high_value * 100:.0f}"
fl_service_slow_ms_str = f"{fl_service_slow_ms_value}"
fl_wait_slow_ms_str = f"{fl_wait_slow_ms_value:.0f}"
fl_service_fast_ms_str = f"{fl_service_fast_ms_value}"
fl_utilization_new_pct_str = f"{fl_utilization_new_value * 100:.0f}"
fl_wait_fast_ms_str = f"{fl_wait_fast_ms_value:.1f}"
fl_inference_gain_ms_str = f"{fl_inference_gain_ms_value}"
fl_queuing_improvement_str = f"{fl_queuing_improvement_value:.0f}"
fl_total_slow_ms_str = f"{fl_total_slow_ms_value:.0f}"
fl_total_fast_ms_str = f"{fl_total_fast_ms_value:.1f}"
fl_system_speedup_str = f"{fl_system_speedup_value:.1f}"
fl_model_speedup_str = f"{fl_model_speedup_value:.1f}"
```
Engineers who optimize model inference expect proportional improvement in user-perceived latency\index{Latency!inference vs user-perceived}, but serving systems introduce latency sources absent from offline benchmarks. Under load, queuing delay dominates: @eq-mm1-wait shows that at `{python} FallacyLatencyCalc.fl_utilization_high_pct_str` percent utilization with `{python} FallacyLatencyCalc.fl_service_slow_ms_str`ms service time, average wait time is `{python} FallacyLatencyCalc.fl_wait_slow_ms_str`ms before inference even begins. Reducing inference from `{python} FallacyLatencyCalc.fl_service_slow_ms_str`ms to `{python} FallacyLatencyCalc.fl_service_fast_ms_str`ms changes service time but also shifts utilization from `{python} FallacyLatencyCalc.fl_utilization_high_pct_str` percent to `{python} FallacyLatencyCalc.fl_utilization_new_pct_str` percent, reducing queuing wait from `{python} FallacyLatencyCalc.fl_wait_slow_ms_str`ms to `{python} FallacyLatencyCalc.fl_wait_fast_ms_str`ms, a `{python} FallacyLatencyCalc.fl_queuing_improvement_str`$\times$ queuing improvement that dwarfs the `{python} FallacyLatencyCalc.fl_inference_gain_ms_str`ms inference gain. This nonlinear interaction between inference speed and queuing behavior means the *system-level* speedup (`{python} FallacyLatencyCalc.fl_total_slow_ms_str`ms → `{python} FallacyLatencyCalc.fl_total_fast_ms_str`ms, or `{python} FallacyLatencyCalc.fl_system_speedup_str`$\times$) far exceeds the *model-level* speedup (`{python} FallacyLatencyCalc.fl_service_slow_ms_str`ms → `{python} FallacyLatencyCalc.fl_service_fast_ms_str`ms, or `{python} FallacyLatencyCalc.fl_model_speedup_str`$\times$). Conversely, teams that reduce inference by only 20 percent at high utilization see negligible user-facing improvement because queuing still dominates. Serving optimization requires analyzing the complete latency budget, including serialization, queuing, preprocessing, and postprocessing, under realistic load conditions rather than profiling inference latency in isolation.
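The M/M/1 relationships behind these numbers take only a few lines to reproduce; this is a minimal sketch using the same illustrative service times and starting utilization as above.

```python
# M/M/1 sketch: why a 2.5x faster model yields a much larger system speedup
# at high utilization. Numbers match the illustrative scenario above.
def mm1_total_time_ms(service_ms: float, rho: float) -> float:
    """Average time in system (queuing wait + service) for an M/M/1 queue."""
    wait_ms = service_ms * rho / (1.0 - rho)
    return wait_ms + service_ms

arrival_per_ms = 0.8 / 5.0   # arrival rate implied by 80% utilization at 5 ms service

slow_ms = mm1_total_time_ms(5.0, rho=arrival_per_ms * 5.0)   # rho = 0.80
fast_ms = mm1_total_time_ms(2.0, rho=arrival_per_ms * 2.0)   # rho = 0.32 after the speedup

print(f"5 ms model: {slow_ms:.1f} ms in system; 2 ms model: {fast_ms:.1f} ms in system")
print(f"model speedup: {5.0 / 2.0:.1f}x, system speedup: {slow_ms / fast_ms:.1f}x")
```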
**Pitfall:** *Running serving infrastructure at high utilization to maximize cost efficiency.*
```{python}
#| label: fallacy-utilization-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: HIGH UTILIZATION LATENCY DEGRADATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Running serving infrastructure at high utilization to
# │ maximize cost efficiency"
# │
# │ Goal: Demonstrate the nonlinear latency explosion near system capacity.
# │ Show: That increasing utilization from 70% to 90% triples average latency.
# │ How: Model M/M/1 wait times to identify the practical utilization ceiling.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: fu_util_high_pct_str, fu_util_mod_pct_str,
# │ fu_wait_high_factor_str, fu_total_high_factor_str,
# │ fu_cost_reduction_str, fu_latency_increase_str,
# │ fu_service_ms_str, fu_p99_mod_ms_str, fu_p99_high_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacyUtilizationCalc:
"""Moving from 70% to 90% utilization cuts costs 22% but triples average latency."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
fu_util_high_value = 0.9
fu_util_mod_value = 0.7
fu_service_ms_value = 5
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# M/M/1: wait factor = rho / (1 - rho); total time = service / (1 - rho)
fu_wait_high_factor_value = fu_util_high_value / (1 - fu_util_high_value)
fu_cost_reduction_pct_value = (1 - fu_util_mod_value / fu_util_high_value) * 100
fu_avg_latency_mod_value = fu_service_ms_value / (1 - fu_util_mod_value)
fu_avg_latency_high_value = fu_service_ms_value / (1 - fu_util_high_value)
fu_latency_increase_factor_value = fu_avg_latency_high_value / fu_avg_latency_mod_value
fu_p99_mod_value = 4.6 * fu_service_ms_value / (1 - fu_util_mod_value)
fu_p99_high_value = 4.6 * fu_service_ms_value / (1 - fu_util_high_value)
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fu_util_high_pct_str = f"{fu_util_high_value * 100:.0f}"
fu_util_mod_pct_str = f"{fu_util_mod_value * 100:.0f}"
fu_wait_high_factor_str = f"{fu_wait_high_factor_value:.0f}"
fu_total_high_factor_str = f"{1/(1-fu_util_high_value):.0f}"
fu_cost_reduction_str = f"{fu_cost_reduction_pct_value:.0f}"
fu_latency_increase_str = f"{fu_latency_increase_factor_value:.0f}"
fu_service_ms_str = f"{fu_service_ms_value}"
fu_p99_mod_ms_str = f"{fu_p99_mod_value:.0f}"
fu_p99_high_ms_str = f"{fu_p99_high_value:.0f}"
```
Teams target `{python} FallacyUtilizationCalc.fu_util_high_pct_str` percent utilization\index{Utilization!high utilization pitfall} to minimize idle capacity. In production, latency degrades nonlinearly as utilization approaches capacity. @eq-mm1-wait shows that at `{python} FallacyUtilizationCalc.fu_util_high_pct_str` percent utilization, average time in system reaches `{python} FallacyUtilizationCalc.fu_total_high_factor_str`$\times$ service time. Moving from `{python} FallacyUtilizationCalc.fu_util_mod_pct_str` percent to `{python} FallacyUtilizationCalc.fu_util_high_pct_str` percent utilization cuts infrastructure costs by `{python} FallacyUtilizationCalc.fu_cost_reduction_str` percent but triples average latency. For a `{python} FallacyUtilizationCalc.fu_service_ms_str`ms inference service, p99 latency jumps from ~`{python} FallacyUtilizationCalc.fu_p99_mod_ms_str`ms to ~`{python} FallacyUtilizationCalc.fu_p99_high_ms_str`ms (M/M/1 model). Systems provisioned for average load violate SLOs precisely when traffic increases during business-critical periods. Production systems targeting 60 to 70 percent utilization at peak load maintain the latency headroom needed to absorb traffic spikes.
**Fallacy:** *Training accuracy guarantees serving accuracy.*
```{python}
#| label: fallacy-skew-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: TRAINING-SERVING SKEW
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Training accuracy guarantees serving accuracy"
# │
# │ Goal: Quantify the silent accuracy degradation from preprocessing mismatches.
# │ Show: A model at 95% validation accuracy drops to 90% in production from
# │ resize interpolation differences between training and serving pipelines.
# │ How: Model validation vs. production accuracy for a mismatched vision pipeline.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: fs_val_acc_str, fs_prod_acc_str, fs_acc_drop_str,
# │ fs_resize_drop_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacySkewCalc:
"""Training-serving skew: 95% validation accuracy drops to 90% from preprocessing mismatches."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
fs_val_acc_value = 95.0
fs_prod_acc_value = 90.0
fs_resize_drop_min_value = 0.5
fs_resize_drop_max_value = 1.0
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
fs_acc_drop_value = fs_val_acc_value - fs_prod_acc_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fs_val_acc_str = f"{fs_val_acc_value:.0f}"
fs_prod_acc_str = f"{fs_prod_acc_value:.0f}"
fs_acc_drop_str = f"{fs_acc_drop_value:.0f}"
fs_resize_drop_str = f"{fs_resize_drop_min_value}-{fs_resize_drop_max_value}"
```
Engineers assume identical model weights preserve validation set performance\index{Training-Serving Skew!silent accuracy degradation}. In production, preprocessing differences silently shift inputs outside the training distribution. @sec-model-serving-trainingserving-skew-7b99 shows *how* training-serving skew causes accuracy degradation despite identical weights: PIL versus OpenCV resize interpolation alone can shift accuracy by `{python} FallacySkewCalc.fs_resize_drop_str` percent, float64 versus float32 normalization produces different values, and feature computation timing can change between pipelines. A model achieving `{python} FallacySkewCalc.fs_val_acc_str` percent validation accuracy drops to `{python} FallacySkewCalc.fs_prod_acc_str` percent in production from these preprocessing mismatches, a `{python} FallacySkewCalc.fs_acc_drop_str` percentage point loss invisible to latency monitoring. Standard monitoring that checks exceptions and latency violations cannot detect this silent degradation. Production systems require either identical preprocessing code for training and serving, or statistical monitoring comparing input distributions to catch drift before accuracy degrades.
**Pitfall:** *Using average latency to evaluate serving system performance.*
```{python}
#| label: tail-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: TAIL LATENCY AMPLIFICATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Using average latency to evaluate serving system
# │ performance"
# │
# │ Goal: Demonstrate the significant gap between mean and p99 latency in queuing systems.
# │ Show: At 70% utilization, average latency is ~17ms but p99 reaches ~77ms —
# │ a 4.6× gap invisible to mean-based monitoring.
# │ How: M/M/1 mean vs. p99 approximation at typical serving utilization.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: tl_util_pct_str, tl_service_ms_str, tl_avg_ms_str,
# │ tl_p99_ms_str, tl_gap_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class TailLatencyCalc:
"""At 70% utilization, p99 latency is 4.6× the mean — invisible to average-based monitoring."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
tl_util_value = 0.7
tl_service_ms_value = 5
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
# M/M/1: avg time in system = service / (1 - rho)
tl_avg_ms_value = tl_service_ms_value / (1 - tl_util_value)
# M/M/1 p99 approximation: 4.6 * avg
tl_p99_ms_value = 4.6 * tl_avg_ms_value
tl_gap_value = tl_p99_ms_value / tl_avg_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
tl_util_pct_str = f"{tl_util_value * 100:.0f}"
tl_service_ms_str = f"{tl_service_ms_value}"
tl_avg_ms_str = f"{tl_avg_ms_value:.0f}"
tl_p99_ms_str = f"{tl_p99_ms_value:.0f}"
tl_gap_str = f"{tl_gap_value:.1f}"
```
Engineers monitor average latency\index{Mean Latency!monitoring pitfall} because it trends smoothly and is simple to compute. In production, averages hide the slowest requests that determine user satisfaction. As @sec-model-serving-tail-latency-5376 demonstrates, at `{python} TailLatencyCalc.tl_util_pct_str` percent utilization with `{python} TailLatencyCalc.tl_service_ms_str`ms service time, average latency is `{python} TailLatencyCalc.tl_avg_ms_str`ms while p99 reaches `{python} TailLatencyCalc.tl_p99_ms_str`ms, a `{python} TailLatencyCalc.tl_gap_str`$\times$ gap invisible to mean-based monitoring. Teams optimizing average latency miss the tail that determines user satisfaction: the 1 percent of users experiencing `{python} TailLatencyCalc.tl_p99_ms_str`ms delays often generate the most valuable transactions. Production SLOs specify percentile targets (p95, p99) precisely because averages mask tail behavior.
**Fallacy:** *Larger serving batches always improve throughput without affecting latency SLOs.*
```{python}
#| label: fallacy-batching-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: LARGER BATCHES ALWAYS IMPROVE THROUGHPUT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Larger serving batches always improve throughput without
# │ affecting latency SLOs"
# │
# │ Goal: Demonstrate the diminishing returns of increasing serving batch sizes.
# │ Show: Batch size 16→32 gains only ~12% throughput while nearly doubling
# │ inference time (14ms→25ms); padding wastes 15-30% of compute.
# │ How: Compare throughput and inference time across two batch sizes.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: fb_batch_small_str, fb_batch_large_str, fb_throughput_gain_str,
# │ fb_inf_small_ms_str, fb_inf_large_ms_str,
# │ fb_padding_waste_min_str, fb_padding_waste_max_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacyBatchingCalc:
"""Batch-16 to batch-32 yields only ~12% more throughput while nearly doubling inference time."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
fb_batch_small_value = 16
fb_batch_large_value = 32
fb_throughput_small_value = 1143 # from earlier table
fb_throughput_large_value = 1280 # from earlier table
fb_inf_small_ms_value = 14
fb_inf_large_ms_value = 25
fb_padding_waste_min_pct_value = 15
fb_padding_waste_max_pct_value = 30
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
fb_throughput_gain_pct_value = (fb_throughput_large_value / fb_throughput_small_value - 1) * 100
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fb_batch_small_str = f"{fb_batch_small_value}"
fb_batch_large_str = f"{fb_batch_large_value}"
fb_throughput_gain_str = f"{fb_throughput_gain_pct_value:.0f}"
fb_inf_small_ms_str = f"{fb_inf_small_ms_value}"
fb_inf_large_ms_str = f"{fb_inf_large_ms_value}"
fb_padding_waste_min_str = f"{fb_padding_waste_min_pct_value}"
fb_padding_waste_max_str = f"{fb_padding_waste_max_pct_value}"
```
Engineers maximize batch size\index{Batching!fallacy of larger batches} assuming GPU saturation improves cost efficiency under production load. In serving systems, however, batching introduces a latency-throughput tradeoff\index{Latency-Throughput Tradeoff!batching} governed by queuing dynamics absent from offline benchmarks. Accumulating requests into larger batches increases wait time for early arrivals: a batch window of 10 ms means the first request waits 10 ms before inference begins, directly adding to p99 latency. For ResNet-50 on V100, increasing batch size from `{python} FallacyBatchingCalc.fb_batch_small_str` to `{python} FallacyBatchingCalc.fb_batch_large_str` improves throughput only `{python} FallacyBatchingCalc.fb_throughput_gain_str` percent while nearly doubling per-batch inference time from `{python} FallacyBatchingCalc.fb_inf_small_ms_str` ms to `{python} FallacyBatchingCalc.fb_inf_large_ms_str` ms, and variable input sizes within a batch create padding overhead that wastes `{python} FallacyBatchingCalc.fb_padding_waste_min_str` to `{python} FallacyBatchingCalc.fb_padding_waste_max_str` percent of compute on padding tokens. @sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d shows that for 50 ms p99 targets, batch sizes above 32 routinely violate SLOs because batch formation delay plus increased per-batch inference time exceeds the latency budget. Serving batch optimization requires jointly tuning batch size, batch timeout, and concurrency against latency SLOs under realistic traffic patterns, not maximizing throughput in isolation.
**Pitfall:** *Calibrating quantized models with training data rather than production traffic.*
```{python}
#| label: fallacy-calibration-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: CALIBRATION DATA MISMATCH
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Calibrating quantized models with training data rather
# │ than production traffic"
# │
# │ Goal: Quantify the accuracy loss from calibrating INT8 on mismatched data.
# │ Show: ResNet-50 INT8 drops 3.2pp (76.1%→72.9%) when calibrated on ImageNet
# │ but served on wildlife camera images — invisible to latency monitoring.
# │ How: Model validation vs. OOD accuracy for a calibration-mismatched system.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: fc_acc_loss_str, fc_imagenet_acc_str, fc_ood_acc_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacyCalibrationCalc:
"""INT8 model calibrated on ImageNet drops 3.2pp when serving out-of-distribution wildlife images."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
fc_acc_loss_pct_value = 3.2
fc_imagenet_acc_value = 76.1 # ResNet-50 INT8 on ImageNet
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
fc_ood_acc_value = fc_imagenet_acc_value - fc_acc_loss_pct_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
fc_acc_loss_str = f"{fc_acc_loss_pct_value}"
fc_imagenet_acc_str = f"{fc_imagenet_acc_value:.1f}"
fc_ood_acc_str = f"{fc_ood_acc_value:.1f}"
```
Teams calibrate with training data\index{Calibration!production traffic mismatch} because it is readily available and already produced the expected validation accuracy. In production, the traffic distribution often differs from the training data, making the calibration scale factors suboptimal. Post-training quantization determines INT8 scale factors by measuring activation ranges on calibration data, but this assumes production inputs match the calibration distribution. One production system achieving `{python} FallacyCalibrationCalc.fc_imagenet_acc_str` percent accuracy on ImageNet-calibrated INT8 dropped to `{python} FallacyCalibrationCalc.fc_ood_acc_str` percent, a `{python} FallacyCalibrationCalc.fc_acc_loss_str` percentage point loss, when serving wildlife camera images with different lighting and backgrounds. @sec-model-compression shows quantization error scales with activation range: miscalibration amplifies errors precisely on out-of-distribution inputs where activations exceed calibrated ranges. Effective quantization requires calibrating with representative samples of actual serving traffic, not convenience data.
**Pitfall:** *Cold start latency only matters for the first request.*
```{python}
#| label: fallacy-coldstart-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: COLD START COMPOUNDING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Cold start latency only matters for the first request"
# │
# │ Goal: Demonstrate how cold starts compound during traffic spikes.
# │ Show: 10 new instances × 30s TensorRT compilation = 300s aggregate delay;
# │ cold requests hit 100× latency vs. steady-state.
# │ How: Calculate aggregate cold-start delay and per-request latency multiplier.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: cs_new_instances_str, cs_compile_time_str, cs_aggregate_cold_str,
# │ cs_steady_latency_str, cs_cold_latency_str, cs_cold_mult_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt
class FallacyColdstartCalc:
"""Cold starts compound: 10 new instances at 30s compile time = 300s aggregate user-facing delay."""
# ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
cs_new_instances_value = 10
cs_compile_time_s_value = 30 # TensorRT compilation per instance
cs_steady_latency_ms_value = 5
cs_cold_latency_ms_value = 500 # first request during cold start
# ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
cs_aggregate_cold_s_value = cs_new_instances_value * cs_compile_time_s_value
cs_cold_multiplier_value = cs_cold_latency_ms_value / cs_steady_latency_ms_value
# ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
cs_new_instances_str = f"{cs_new_instances_value}"
cs_compile_time_str = f"{cs_compile_time_s_value}"
cs_aggregate_cold_str = f"{cs_aggregate_cold_s_value}"
cs_steady_latency_str = f"{cs_steady_latency_ms_value}"
cs_cold_latency_str = f"{cs_cold_latency_ms_value}"
cs_cold_mult_str = f"{cs_cold_multiplier_value:.0f}"
```
Engineers optimize steady-state latency\index{Cold Start!bursty traffic impact} assuming most requests hit warm instances. In production, cold starts compound during the events that matter most: traffic spikes requiring scale-up, deployments rolling out new versions, and recovery from instance failures. @sec-model-serving-model-loading-initialization-cc5a details the anatomy of cold start: TensorRT compilation alone takes `{python} FallacyColdstartCalc.cs_compile_time_str` seconds per instance. During a traffic spike requiring `{python} FallacyColdstartCalc.cs_new_instances_str` new instances, aggregate cold start latency reaches `{python} FallacyColdstartCalc.cs_aggregate_cold_str` seconds of user-facing delay before new capacity becomes useful. Worse, requests hitting cold instances experience `{python} FallacyColdstartCalc.cs_cold_latency_str` ms latency versus `{python} FallacyColdstartCalc.cs_steady_latency_str` ms steady-state, a `{python} FallacyColdstartCalc.cs_cold_mult_str`$\times$ degradation that violates SLOs precisely when traffic is highest. Systems ignoring cold start meet SLOs during steady state but fail during scale-up events and deployment windows when reliability matters most.
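A common mitigation is to absorb the cold-start cost during scale-up rather than on user requests: run a few dummy inferences and only then report the replica as ready to the load balancer. The sketch below illustrates that gate under hypothetical names (`warm_up`, `Replica`, `DummyModel`) with a simulated one-time compilation delay; it is an illustrative pattern, not the chapter's reference implementation.

```python
import time

def warm_up(model, example_input, iterations: int = 5) -> float:
    """Run dummy inferences so one-time compilation and cache population
    happen before the replica serves real traffic."""
    start = time.perf_counter()
    for _ in range(iterations):
        model(example_input)  # the first call pays the compile/allocation cost
    return time.perf_counter() - start

class Replica:
    """Minimal replica wrapper: readiness is gated on warm-up completion."""
    def __init__(self, model, example_input):
        self.ready = False
        self.warmup_seconds = warm_up(model, example_input)
        self.ready = True  # only now should the load balancer route traffic here

    def readiness_probe(self) -> bool:
        return self.ready

class DummyModel:
    """Stand-in model whose first call simulates a one-time compilation delay."""
    def __init__(self):
        self.compiled = False
    def __call__(self, x):
        if not self.compiled:
            time.sleep(0.5)  # simulated engine build (seconds to minutes in practice)
            self.compiled = True
        return x

replica = Replica(DummyModel(), example_input=[0.0])
print(f"warm-up took {replica.warmup_seconds:.2f} s; ready = {replica.readiness_probe()}")
```

Note that this gate removes the per-request latency spike but not the scale-up delay itself: the new capacity is still unavailable while warm-up runs, so it must be combined with earlier scale-out triggers or pre-warmed capacity when traffic spikes are steep.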
## Summary {#sec-model-serving-summary-9635}
Serving marks the transition from model development to production deployment, where the optimization priorities that governed training must be inverted. The shift from throughput maximization to latency minimization transforms every system design decision. The queuing theory foundations\index{Queuing Theory!serving foundations} established here reveal *why* this inversion is not merely a change in metrics but a change in the governing mathematics. The nonlinear relationship between utilization and latency means that systems behaving well at moderate load can suddenly violate SLOs when traffic increases modestly. Little's Law and the M/M/1 wait time equations provide the quantitative foundation for capacity planning, replacing intuition-based provisioning with engineering rigor.
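As a worked example of that arithmetic, the short sketch below evaluates the M/M/1 mean response time $W = S/(1-\rho)$ and the corresponding in-flight request count from Little's Law ($L = \lambda W$) at several utilization levels. The 5 ms service time is an illustrative assumption.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in an M/M/1 system: W = S / (1 - rho)."""
    assert 0.0 <= utilization < 1.0, "queue is unstable at rho >= 1"
    return service_time_s / (1.0 - utilization)

service_time = 0.005  # illustrative 5 ms mean service time per request
for rho in (0.5, 0.8, 0.9, 0.95):
    w = mm1_response_time(service_time, rho)
    arrival_rate = rho / service_time   # requests/s sustaining this utilization
    in_flight = arrival_rate * w        # Little's Law: L = lambda * W
    print(f"rho={rho:.2f}  W={w * 1000:6.1f} ms  ({w / service_time:4.1f}x service)"
          f"  in-flight={in_flight:5.1f}")
```

Pushing utilization from 80% to 90% doubles the mean response time, which is the nonlinearity that makes intuition-based provisioning fail near saturation.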
Effective serving optimization requires understanding the complete request path rather than focusing exclusively on model inference. Interface protocols like gRPC and efficient serialization formats minimize the "tax" of data movement, while preprocessing often consumes 45 to 70 percent of total latency when inference runs on optimized accelerators. The microsecond-scale overheads identified by Barroso, Patterson, and colleagues explain *why* serving latency often exceeds the sum of its measured parts, and *why* system-level optimization matters as much as model optimization. Training-serving skew represents another dimension of this complexity, silently degrading accuracy when preprocessing logic differs between training and production environments in ways that traditional testing cannot detect.
The traffic pattern analysis reveals *how* the deployment paradigm selected in @sec-ml-systems shapes every serving decision downstream. Server workloads with Poisson arrivals optimize dynamic batching windows, autonomous vehicles with streaming sensor data require synchronized batch formation, and mobile applications with single-user patterns eliminate batching entirely. Each pattern is a direct consequence of the physical constraints (power wall, memory wall, light barrier) that created the four paradigms in the first place. The MLPerf scenarios codify these patterns for standardized benchmarking, connecting the serving principles established here to the measurement frameworks explored in @sec-benchmarking. Node-level optimization techniques (graph compilation, operator fusion, and systematic profiling) bridge the gap between model-level decisions and hardware execution, often yielding 2--5$\times$ additional speedup through better utilization of the accelerator's duty cycle. Precision selection and runtime optimization extend the quantization techniques from @sec-model-compression and Tensor Core capabilities from @sec-hardware-acceleration into the serving domain. The translation of these technical metrics into unit economics, as shown by the Llama-3 case study, demonstrates *how* engineering decisions regarding batching, precision, and hardware selection directly determine the financial viability of deployment, a pressure intensified by the intelligence deflation trend (@fig-intelligence-deflation) that continually compresses per-inference margins.
::: {.callout-takeaways title="Inverting Every Training Priority"}
* **Serving inverts training priorities**: Training optimizes throughput (samples/hour); serving optimizes latency (ms/request). Different objectives require different system designs.
* **Queuing theory governs capacity planning**: At 80% utilization, wait time is 5$\times$ service time; at 90%, it reaches 10$\times$. Small load increases cause disproportionate latency spikes.
* **Preprocessing dominates optimized systems**: When model inference is fast (5 ms), preprocessing (image decode, tokenization) consumes 45--70% of total latency. Optimize the pipeline, not just the model.
* **Batching strategy depends on traffic pattern**: Poisson arrivals (web APIs) use dynamic batching; streaming sensors use synchronized batches; mobile apps eliminate batching entirely.
* **Training-serving skew can degrade accuracy undetected**: Different preprocessing between training and serving (e.g., resize interpolation, normalization order) shifts inputs outside the training distribution, causing accuracy degradation that conventional monitoring cannot detect. Use identical code paths.
* **LLM serving is memory-bandwidth bound**: Token generation reads the entire model from VRAM per token, making decode latency strictly limited by memory bandwidth rather than compute. KV cache management via PagedAttention and continuous batching is the primary throughput lever, achieving 2--4$\times$ improvement over naive serving.
* **Precision and runtime selection directly determine infrastructure cost**: INT8 inference achieves ~3$\times$ higher throughput than FP32, translating directly to proportionally fewer GPUs. Runtime optimization (TensorRT, ONNX Runtime) provides an additional 2--5$\times$ speedup over framework-native serving, making these choices as impactful as model architecture decisions.
:::
The serving principles established here (queuing theory for capacity planning, preprocessing optimization, batching strategy selection, and training-serving skew prevention) form the foundation for building production ML systems that meet real-world SLAs. Whether deploying a recommendation system serving millions of users or a medical AI where every millisecond affects patient outcomes, these principles translate mathematical understanding into engineering decisions that determine whether systems succeed or fail under load.
::: {.callout-chapter-connection title="From Node to Factory"}
This chapter engineered the single serving node: latency budgets decomposed each request, queuing theory sized the hardware, batching strategies maximized throughput, and runtime optimization extracted every available microsecond. A single node, however, is fragile. Models drift as the world changes. Deployments must roll out without downtime. Monitoring must detect the silent accuracy degradation that training-serving skew causes. Scaling events demand orchestration across dozens or hundreds of replicas. In @sec-ml-operations, we scale our perspective from the single request to the full system lifecycle—building the automated machinery (CI/CD pipelines, feature stores, model registries, and observability platforms) that keeps production ML systems running reliably through crashes, model drift, and continuous updates.
:::
<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::