---
quiz: serving_quizzes.json
concepts: serving_concepts.yml
glossary: serving_glossary.json
engine: jupyter
---

# Model Serving {#sec-model-serving}

```{python}
#| echo: false
#| label: chapter-start
from mlsys.registry import start_chapter

start_chapter("vol1:serving")
```

::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::

\noindent

:::

## Purpose {.unnumbered}

\begin{marginfigure}
\mlsysstack{35}{30}{25}{15}{90}{40}{20}{20}
\end{marginfigure}

_Why does serving invert every optimization priority that made training successful?_

Training and serving demand opposite physics. Training maximizes throughput (\(T\), in samples per second): large batches and long epochs where latency spikes get absorbed invisibly. Serving minimizes latency (\(L_{lat}\), in milliseconds per request): individual requests answered fast enough that a single slow response is a *broken product*. Training amortizes hardware costs across billions of examples; serving pays a tax on every request, where small inefficiencies compound into operational debt. This inversion is why models that train beautifully often serve poorly: the batch-heavy architectures and memory-intensive optimizations designed to saturate accelerators during training are fundamentally ill-suited for the bursty, latency-critical, cost-sensitive reality of production traffic. But serving is more than a latency problem. A serving system must handle traffic that varies by orders of magnitude between peak and trough, route requests across model versions during progressive rollouts, degrade gracefully when upstream dependencies fail, and do all of this continuously—not for the duration of a training run but for the lifetime of the product. Every model that proved its value during training and survived compression and benchmarking eventually arrives at the serving layer—the deployment and integration stage of the ML lifecycle—where the question shifts from "does it work?" to "does it work *reliably, at scale, under production conditions, every second of every day*?" The serving infrastructure is where ML systems finally meet users, and the engineering that sustains that meeting is qualitatively different from the engineering that created the model.

::: {.content-visible when-format="pdf"}
\newpage
:::

::: {.callout-tip title="Learning Objectives"}

- Explain the inversion from throughput optimization to latency minimization that distinguishes serving from training
- Decompose request latency into preprocessing, inference, and postprocessing phases to identify bottlenecks using the **latency budget framework**
- Apply **queuing theory** (**Little's Law**, **M/M/1 models**) and capacity planning to meet percentile latency SLOs
- Identify sources of **training-serving skew** and **cold start latency**, and select appropriate prevention and mitigation strategies
- Select batching and runtime strategies based on traffic patterns (**Server**, **SingleStream**, **MultiStream**, **Offline**), latency constraints, and cost requirements
- Evaluate the memory-bandwidth and **KV-cache** constraints unique to LLM serving, including **TTFT**/**TPOT** metrics, **continuous batching**, and **PagedAttention**
- Evaluate deployment tradeoffs across precision, runtime selection, and infrastructure cost

:::

```{python}
#| label: gpu-specs
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ GPU SPECIFICATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Multiple sections including @tbl-resolution-bottleneck, LLM serving
# │          case study, and "Carbon Cost of Chat" callout
# │
# │ Goal: Provide hardware specifications for V100, A100, and H100 GPUs.
# │ Show: The memory bandwidth and compute ceiling for each generation.
# │ How: Retrieve constants from Hardware Digital Twins.
# │
# │ Imports: mlsys (Hardware), mlsys.constants (units), mlsys.formatting (fmt)
# │ Exports: v100_tflops_fp32, v100_bw, a100_bw_tbs, h100_bw_tbs, h100_mem,
# │          h100_tdp
# └─────────────────────────────────────────────────────────────────────────────
from mlsys import Hardware
from mlsys.constants import (
    TFLOPs, second, GB, TB, GiB, watt,
    H100_MEM_BW,
)
from mlsys.formatting import fmt

# --- Hardware Twins ---
h_v100 = Hardware.Cloud.V100
h_a100 = Hardware.Cloud.A100
h_h100 = Hardware.Cloud.H100

# --- Outputs (formatted strings for prose) ---
v100_tflops_fp32_value = h_v100.peak_flops_fp32.to(TFLOPs / second).magnitude
v100_tflops_fp32 = f"{v100_tflops_fp32_value:.1f}"                         # e.g. "14.1" TFLOPS

v100_bw_value = h_v100.memory_bw.to(GB / second).magnitude
v100_bw = f"{v100_bw_value:.0f}"                                           # e.g. "900" GB/s

a100_bw_tbs_value = h_a100.memory_bw.to(TB / second).magnitude
a100_bw_tbs = f"{a100_bw_tbs_value:.1f}"                                   # e.g. "2.0" TB/s

h100_bw_tbs_value = h_h100.memory_bw.to(TB / second).magnitude
h100_bw_tbs = f"{h100_bw_tbs_value:.2f}"                                   # e.g. "3.35" TB/s

h100_mem_value = h_h100.memory_capacity.to(GiB).magnitude
h100_mem = f"{h100_mem_value:.0f}"                                         # e.g. "80" GB

h100_tdp_value = h_h100.tdp.to(watt).magnitude
h100_tdp = f"{h100_tdp_value:.0f}"                                         # e.g. "700" W
```

## Serving Paradigm {#sec-model-serving-serving-paradigm-9634}

Serving\index{Serving!production deployment}\index{Model Serving!paradigm shift} marks the transition from model development to production deployment. The four deployment paradigms introduced in @sec-ml-systems (Cloud, Edge, Mobile, and TinyML) each impose distinct serving challenges, but all share a common inversion: the throughput-to-latency shift introduced in the Purpose. This inversion has concrete engineering implications that ripple through every technique established in prior chapters. The Iron Law of ML Systems (@sec-introduction-iron-law-ml-systems-c32a) undergoes a decisive shift: the latency term\index{Latency!serving constraint} ($L_{lat}$), representing the irreducible overhead of request scheduling, network round-trips, and system orchestration, becomes the dominant constraint rather than a rounding error. @sec-benchmarking measured performance under controlled conditions, but serving faces traffic patterns that no benchmark could anticipate; @sec-model-compression provided quantization methods that reduced model size, but serving must confirm those optimizations preserve accuracy under real traffic distributions. Together, the change in priority and the revalidation it forces define the *serving inversion*\index{Serving Inversion!throughput to latency}.

::: {.callout-perspective title="The Serving Inversion"}
Applying the **D·A·M taxonomy** reveals how deployment inverts your engineering priorities:

*   **Data (Information)**: In training, you maximize **Volume** (shuffling billions of samples). In serving, you maximize **Freshness** (processing one request *right now*).
*   **Algorithm (Logic)**: In training, the math is **Mutable** (updating weights via backprop). In serving, the math is **Frozen** (fixed weights, forward pass only).
*   **Machine (Physics)**: In training, you maximize **Utilization** (keeping GPUs at 100% to saturate throughput). In serving, you maximize **Headroom** (keeping GPUs at 40–60% to absorb traffic spikes before tail latency explodes; observe the steep, nonlinear rise in @fig-tail-latency-explosion).
:::

```{python}
#| label: fig-tail-latency-explosion
#| echo: false
#| fig-cap: "**The Tail Latency Explosion**: Request Latency vs. System Utilization ($\\rho$). While mean latency (Blue) remains moderate, tail latency (Red, p99) explodes once utilization passes the 'Knee' at ~70%. This uses a simple M/M/1 approximation (p99 ≈ 4.6× mean), so the curve is illustrative rather than workload-specific."
#| fig-alt: "Line plot showing latency growing with utilization. Blue line (Mean) rises gradually then steeply. Red line (Tail, p99) curves upward sharply at 70% utilization. Shaded regions indicate 'Safe Zone' and 'Danger Zone'."

import numpy as np
from mlsys import viz

fig, ax, COLORS, plt = viz.setup_plot()

# =============================================================================
# PLOT: The Tail Latency Explosion
# =============================================================================
utilization = np.linspace(0, 0.95, 100)
mean_latency = 1 / (1 - utilization)
p99_latency = mean_latency * 4.6

ax.plot(utilization, mean_latency, '--', color=COLORS['BlueLine'], label='Mean Latency', linewidth=2)
ax.plot(utilization, p99_latency, '-', color=COLORS['RedLine'], label='Tail Latency (p99)', linewidth=2.5)

ax.set_xlabel('System Utilization (%)')
ax.set_ylabel('Request Latency (normalized to service time)')
ax.set_xlim(0, 1.0)
ax.set_ylim(0, 50)

ax.axvspan(0, 0.5, color=COLORS['GreenL'], alpha=0.2)
ax.text(0.25, 5, "Safe Zone", color=COLORS['GreenLine'], fontweight='bold', ha='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.axvspan(0.7, 1.0, color=COLORS['RedL'], alpha=0.2)
ax.text(0.85, 40, "Danger Zone\n(Queue Explosion)", color=COLORS['RedLine'], fontweight='bold', ha='center', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))

ax.annotate("The Knee", xy=(0.7, 15), xytext=(0.5, 25),
            arrowprops=dict(facecolor=COLORS['primary'], arrowstyle='->', lw=1.5), fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))

ax.set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_xticklabels(['0%', '20%', '40%', '60%', '80%', '100%'])
ax.legend(loc='upper left', fontsize=8)
plt.show()
```

The consequences of ignoring this inversion become apparent during a *traffic spike* that pushes the system beyond what it was designed to handle.

```{python}
#| label: black-friday-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BLACK FRIDAY TRAFFIC SPIKE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The 'Black Friday' Traffic Spike"
# │
# │ Goal: Demonstrate the nonlinear failure mode of serving systems under load.
# │ Show: That a 10× traffic spike causes system collapse, not just 10× latency.
# │ How: Model queue explosion using Little's Law parameters.
# │
# │ Imports: (none)
# │ Exports: bf_latency_ms_str, bf_qps_normal_str, bf_qps_spike_str,
# │          bf_spike_factor_str, bf_collapse_latency_s_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (traffic spike scenario) ---
bf_latency_ms_value = 50              # normal operation latency (ms)
bf_qps_normal_value = 1000            # normal queries per second
bf_qps_spike_value = 10000            # Black Friday peak QPS
bf_spike_factor_value = 10            # spike multiplier (10x)
bf_collapse_latency_s_value = 10      # latency during collapse (seconds)

# --- Outputs (formatted strings for prose) ---
bf_latency_ms_str = f"{bf_latency_ms_value}"              # e.g. "50" ms
bf_qps_normal_str = f"{bf_qps_normal_value:,}"            # e.g. "1,000" QPS
bf_qps_spike_str = f"{bf_qps_spike_value:,}"              # e.g. "10,000" QPS
bf_spike_factor_str = f"{bf_spike_factor_value}"          # e.g. "10" x
bf_collapse_latency_s_str = f"{bf_collapse_latency_s_value}"  # e.g. "10" seconds
```

::: {.callout-example title="The 'Black Friday' Traffic Spike"}

**The Scenario**: An e-commerce recommendation system runs comfortably at `{python} bf_latency_ms_str` ms latency with `{python} bf_qps_normal_str` queries per second (QPS).

**The Event**: On Black Friday, traffic spikes `{python} bf_spike_factor_str` $\times$ to `{python} bf_qps_spike_str` QPS.

**The Failure**: The system does not just slow down `{python} bf_spike_factor_str` $\times$. It **collapses**. Latency hits `{python} bf_collapse_latency_s_str` seconds, then requests start timing out. The servers are 100% utilized, but *useful* throughput drops to near zero because most completed requests have already timed out from the client's perspective.

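**The Physics**: This is Little's Law and queueing theory in action. As utilization approaches 100%, queue length grows nonlinearly: in a simple M/M/1 model it scales as $\rho/(1-\rho)$, so each additional point of utilization costs more than the last. Once arrivals exceed capacity, the queue grows without bound, and the system can spend more time managing the queue (context switching, thrashing) than doing useful work.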

**The Fix**:

1.  **Load Shedding**: Reject excess requests immediately to keep the queue short.
2.  **Autoscaling**: Spin up more replicas *before* utilization hits the "knee" of the curve.
3.  **Degradation**: Serve cached/dumber recommendations to reduce compute cost per query.
:::
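
The load-shedding fix listed above can be sketched as a bounded admission queue. This is a minimal illustration rather than a production design: the class name `BoundedAdmissionQueue` and the `max_queue_depth` parameter are hypothetical, and the depth is sized from Little's Law (in-flight requests = arrival rate × latency) using the scenario's normal operating point, with a 2× allowance that is purely an assumption.

```python
from collections import deque


class BoundedAdmissionQueue:
    """Hypothetical admission controller: shed work once the queue is full."""

    def __init__(self, max_queue_depth: int):
        self.max_queue_depth = max_queue_depth
        self.queue = deque()
        self.rejected = 0

    def offer(self, request) -> bool:
        """Enqueue the request, or reject it immediately if the queue is full."""
        if len(self.queue) >= self.max_queue_depth:
            self.rejected += 1
            return False  # caller would return an error or a cached fallback
        self.queue.append(request)
        return True


# Little's Law: average in-flight requests = arrival rate × latency.
normal_qps = 1_000
normal_latency_s = 0.050
in_flight_normal = normal_qps * normal_latency_s  # about 50 requests

# Assumption for illustration: allow 2× the normal in-flight count before shedding.
admission = BoundedAdmissionQueue(max_queue_depth=round(2 * in_flight_normal))

# Simulate one burst of 10,000 arrivals with no completions in between.
accepted = sum(admission.offer(i) for i in range(10_000))
print(f"accepted={accepted}, rejected={admission.rejected}")
```

Rejecting early keeps the queue short, so the requests that are admitted still complete within their deadline instead of timing out after consuming compute.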

Trace the curve in @fig-tail-latency-explosion and notice how latency remains manageable until utilization crosses roughly 70%, then explodes. This is *why* production systems must run at relatively low utilization (40–60%) to guarantee stable tail latency\index{Tail Latency!utilization threshold} (p99). For a mathematical treatment of long-tailed distributions and why p99 latency becomes the *median* user experience at scale, see @sec-data-foundations-distributions-long-tail-901f. The curve is a simple queueing approximation intended for intuition rather than a model of any specific workload.
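
The shape of the curve follows directly from the M/M/1 model used in the figure. As a rough sketch under that model's assumptions (Poisson arrivals, exponential service times, a single server), mean response time is the service time divided by $(1 - \rho)$, and the p99 is $\ln(100) \approx 4.6$ times the mean. The 5 ms service time below is an arbitrary illustrative value.

```python
import math


def mm1_latency_ms(service_time_ms: float, utilization: float) -> tuple[float, float]:
    """Return (mean, p99) response time in ms for an M/M/1 queue.

    Response time in M/M/1 is exponentially distributed with mean
    service_time / (1 - utilization), so p99 = ln(100) * mean.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    mean = service_time_ms / (1 - utilization)
    p99 = math.log(100) * mean
    return mean, p99


service_time_ms = 5.0  # illustrative value, not taken from a real workload
for rho in (0.5, 0.7, 0.9, 0.95):
    mean, p99 = mm1_latency_ms(service_time_ms, rho)
    print(f"utilization={rho:.0%}  mean={mean:6.1f} ms  p99={p99:6.1f} ms")
```

Under this model, moving from 50% to 90% utilization multiplies both mean and p99 latency by five, and moving from 90% to 95% doubles them again.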

Beyond the technical limits of latency, the economics of serving have undergone a radical transformation. As models become more efficient and hardware becomes more specialized, the cost of "intelligence" is collapsing. To grasp the speed of this collapse, examine the log-scale price trajectory in @fig-intelligence-deflation, which tracks public API list prices as a market proxy.

```{python}
#| label: fig-intelligence-deflation
#| echo: false
#| fig-cap: "**Intelligence Deflation**: Cost per 1M input tokens (USD) over time (Log Scale). Prices are based on public API list prices (2020–2025) and are intended as a market trend indicator, not a controlled comparison. The cost of token generation has collapsed by multiple orders of magnitude, transforming the economics of automated AI workflows."
#| fig-alt: "Line plot showing token pricing collapsing from $20/M tokens in 2020 to <$0.10/M tokens in 2025. Log scale highlights the deflationary trend with models from OpenAI, Anthropic, Google, and DeepSeek."

import numpy as np
from mlsys import viz

fig, ax, COLORS, plt = viz.setup_plot()

# =============================================================================
# DATA: Token pricing over time
# =============================================================================
data = [
    (2020.5, 20.0, "GPT-3 (Davinci)"), (2023.1, 2.0, "GPT-3.5 Turbo"),
    (2023.2, 30.0, "GPT-4 (Original)"), (2024.2, 15.0, "Claude 3 Opus"),
    (2024.2, 0.25, "Claude 3 Haiku"), (2024.3, 5.0, "GPT-4o"),
    (2024.4, 0.075, "Gemini 1.5 Flash"), (2024.6, 0.15, "GPT-4o-mini"),
    (2024.9, 0.27, "DeepSeek-V3")
]
data.sort(key=lambda x: x[0])
years = np.array([d[0] for d in data])
prices = np.array([d[1] for d in data])
labels = [d[2] for d in data]

# =============================================================================
# PLOT: Intelligence Deflation
# =============================================================================
trend_years = np.array([2020.5, 2023.1, 2024.2, 2024.4, 2024.9])
trend_prices = np.array([20.0, 2.0, 0.25, 0.075, 0.27])
slope, intercept = np.polyfit(trend_years, np.log10(trend_prices), 1)
line_years = np.linspace(2020, 2025.5, 100)
line_prices = 10**(slope * line_years + intercept)

ax.plot(line_years, line_prices, '--', color=COLORS['grid'], linewidth=1.5, label='Deflation Trend', zorder=1)
ax.scatter(years, prices, color=COLORS['GreenLine'], s=50, zorder=3, edgecolors='white', linewidth=1.5)

for y, p, l in zip(years, prices, labels):
    off_x, off_y, ha, va = 5, 5, 'left', 'bottom'
    if "Haiku" in l: off_x, off_y, ha = -8, 8, 'right'
    elif "Flash" in l: off_x, off_y, ha, va = -8, -15, 'right', 'top'
    ax.annotate(l, (y, p), xytext=(off_x, off_y), textcoords='offset points',
                fontsize=8, fontweight='bold', ha=ha, va=va, color=COLORS['primary'],
                bbox=dict(facecolor='white', alpha=0.7, edgecolor='none', pad=1))

ax.set_yscale('log')
ax.set_yticks([100, 10, 1, 0.1, 0.01])
ax.set_yticklabels(['$100', '$10', '$1', '$0.10', '$0.01'])
ax.set_xlabel('Year')
ax.set_ylabel('Price per 1M Tokens (USD)')
ax.set_ylim(0.01, 500)
ax.set_xlim(2020, 2025.5)
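months_per_10x = 12 / abs(slope)  # months per 10× price drop along the fitted trend line
ax.text(2021, 0.05, f"Trend: ~10× Cheaper\nEvery {months_per_10x:.0f} Months", color=COLORS['grid'], fontsize=9, style='italic', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))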
plt.show()
```

These priorities motivate a formal definition of model serving.

::: {.callout-definition title="Model Serving"}

***Model Serving***\index{Model Serving} is the operational phase that inverts the throughput priority of training into a latency constraint. It requires a distinct architectural stack designed to minimize the **tail latency**\index{Tail Latency!SLO constraint} of individual inferences under stochastic load, bounded by the **Service Level Objective (SLO)**\index{SLO (Service Level Objective)!latency targets}.

:::

The SLO[^fn-slo-sla] defines the latency target that shapes every architectural decision in the serving stack.
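
Checking an SLO amounts to comparing a latency percentile against the target. The sketch below assumes latencies have already been collected as a list of milliseconds; the helper names, the nearest-rank percentile method, and the synthetic samples are illustrative choices rather than part of any particular monitoring stack.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(samples: list[float], slo_ms: float, pct: float = 99.0) -> bool:
    """True if the pct-th percentile latency is within the SLO target."""
    return percentile(samples, pct) <= slo_ms


# Synthetic data: 97 fast requests and 3 slow ones.
latencies_ms = [20.0] * 97 + [120.0] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.1f} ms, p99={percentile(latencies_ms, 99):.1f} ms")
print("meets 50 ms p99 SLO:", meets_slo(latencies_ms, slo_ms=50.0))
```

The mean of this synthetic sample sits well under 50 ms while the p99 does not, which is why SLOs are stated in percentiles rather than averages.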

[^fn-slo-sla]: **Service Level Objective (SLO) vs. Service Level Agreement (SLA)**\index{SLO (Service Level Objective)!vs. SLA}\index{SLA (Service Level Agreement)!vs. SLO}: Formalized in Google's Site Reliability Engineering practice [@beyer2016sre]. An SLO is an *internal* target (e.g., "p99 latency under 50 ms"); an SLA is an *external* contractual commitment with penalties for violation. SLOs are set tighter than SLAs to provide a safety margin. For ML systems, model accuracy and inference latency both contribute to SLOs, creating multi-dimensional optimization targets that traditional SRE did not face.

Serving systems must execute a complete inference pipeline under latency constraints, not just the neural network computation. A common misconception is that "inference time" equals "serving time"—in reality, the neural network is just one stage in a longer pipeline. Follow the stages in @fig-serving-inference-pipeline from left to right: raw inputs pass through preprocessing (traditional computing), neural network inference (deep learning), and postprocessing (traditional computing) before producing final outputs. Notice that any of these stages—not just the neural network—can become the latency bottleneck. @sec-model-serving-latency-budget-ef40 quantifies exactly where time goes, revealing a counterintuitive result about which stages dominate.

::: {#fig-serving-inference-pipeline fig-env="figure" fig-pos="htb" fig-cap="**The Inference Pipeline**: ML serving systems transform raw inputs into final outputs through sequential stages: preprocessing, neural network computation, and postprocessing. The neural network represents just one component; preprocessing and postprocessing rely on traditional computing and often dominate total latency in optimized systems." fig-alt="Flow diagram showing six connected boxes: Raw Input, Preprocessing, Neural Network, Raw Output, Postprocessing, Final Output. Preprocessing and postprocessing are labeled Traditional Computing; neural network is labeled Deep Learning."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n},line width=0.75pt]
\tikzset{%
  Line/.style={line width=1.0pt,black!50,text=black},
  Box/.style={inner xsep=3pt,
    node distance=0.6,
    draw=GreenLine, line width=0.75pt,
    fill=GreenL,
    align=flush center,
    minimum width=15mm,
    minimum height=10mm
  },
}
%
\node[Box](B1){Raw\\ Input};
\node[Box,right=of B1](B2){Pre-processing};
\node[Box,node distance=1, right=of B2,fill=BlueL,draw=BlueLine](B3){Neural\\ Network};
\node[Box,node distance=1, right=of B3,fill=VioletL2,draw=VioletLine2](B4){Raw\\ Output};
\node[Box,right=of B4,fill=VioletL2,draw=VioletLine2](B5){Post-processing};
\node[Box, right=of B5,fill=VioletL2,draw=VioletLine2](B6){Final\\ Output};
%
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)--(B3);
\draw[Line,-latex](B3)--(B4);
\draw[Line,-latex](B4)--(B5);
\draw[Line,-latex](B5)--(B6);
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=5mm,yshift=2mm,
            fill=BackColor,fit=(B1)(B2),line width=0.75pt](BB){};
\node[below=3pt of  BB.north,anchor=north]{Traditional Computing};
%
\scoped[on background layer]
\node[draw=OrangeLine,inner xsep=4mm,inner ysep=5mm,yshift=2mm,
            fill=OrangeL!70!red!10,fit=(B3),line width=0.75pt](BB){};
\node[below=3pt of  BB.north,anchor=north]{Deep Learning};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=5mm,yshift=2mm,
            fill=BackColor,fit=(B4)(B6),line width=0.75pt](BB){};
\node[below=3pt of  BB.north,anchor=north]{Traditional Computing};
\end{tikzpicture}
```
:::
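
One way to find the bottleneck stage is to time each stage separately. The sketch below is a minimal illustration: the three stage functions are hypothetical placeholders that do trivial work rather than real decoding or inference, and `time.perf_counter` measures wall-clock time per stage.

```python
import time
from typing import Any, Callable


def preprocess(raw: bytes) -> list[float]:
    """Placeholder for decoding, resizing, and normalization."""
    return [b / 255.0 for b in raw]


def infer(features: list[float]) -> list[float]:
    """Placeholder for the neural network forward pass."""
    total = sum(features)
    return [total, -total]


def postprocess(logits: list[float]) -> int:
    """Placeholder for argmax, label mapping, and response formatting."""
    return max(range(len(logits)), key=logits.__getitem__)


def timed_pipeline(raw: bytes) -> tuple[Any, dict[str, float]]:
    """Run the three stages in order and record per-stage latency in milliseconds."""
    stages: list[tuple[str, Callable[[Any], Any]]] = [
        ("preprocess", preprocess),
        ("inference", infer),
        ("postprocess", postprocess),
    ]
    timings_ms: dict[str, float] = {}
    value: Any = raw
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings_ms[name] = (time.perf_counter() - start) * 1000
    return value, timings_ms


result, timings_ms = timed_pipeline(bytes(range(256)))
bottleneck = max(timings_ms, key=timings_ms.get)
print(f"result={result}, bottleneck={bottleneck}")
for name, ms in timings_ms.items():
    print(f"  {name:<12}{ms:.3f} ms")
```

In a real system the same wrapper would surround actual decode, model, and formatting calls, and the per-stage timings would be aggregated across many requests before drawing conclusions.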

This chapter develops the engineering principles needed to orchestrate this pipeline under production constraints. It first establishes the system fundamentals (serving architectures, server anatomy, and the protocols connecting clients to models), then traces the request lifecycle to reveal where latency accumulates, and finally turns to the optimization strategies that maximize throughput under these constraints.

### Static vs Dynamic Inference {#sec-model-serving-static-vs-dynamic-inference-e864}

The preceding examples explain *why* serving systems must maintain capacity headroom. But before diving into *how* to optimize inference latency, we must address a prior question: *when* should predictions be computed at all? The first architectural decision in any serving system is whether predictions happen before or during user requests [@google2024staticdynamic]. This choice shapes system design, cost structure, and capability boundaries.

#### Static Inference {#sec-model-serving-static-inference-35f4}

Static inference\index{Static Inference!pre-computed predictions} (also called offline or batch inference) pre-computes predictions for anticipated inputs and stores them for retrieval. Consider a recommendation system that generates predictions for all user-item pairs nightly. When a user requests recommendations, the system retrieves pre-computed results from a lookup table rather than running inference. This approach removes model inference from the request path (only the lookup cost remains), enables quality verification before results are served, and reduces serving costs. However, static inference cannot handle novel inputs that were not anticipated during the batch computation, and its predictions go stale: a model update reaches users only after the next batch run, hours or days later.

#### Dynamic Inference {#sec-model-serving-dynamic-inference-d2d5}

Dynamic inference\index{Dynamic Inference!real-time prediction} (also called online or real-time inference) computes predictions on demand when requests arrive. This handles any input, including rare edge cases and novel combinations, and immediately reflects model updates. The cost is strict latency requirements that constrain model complexity and demand robust monitoring infrastructure.

```{python}
#| label: static-batch-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ STATIC VS DYNAMIC INFERENCE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Static vs Dynamic inference narrative (photo organization example)
# │
# │ Goal: Contrast the economics of static vs. dynamic inference.
# │ Show: That static pre-computation is superior for predictable inputs.
# │ How: Compare total batch time to per-request latency for photo classification.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: n_photos_str, inference_ms_str, batch_total_s_str,
# │          dynamic_latency_budget_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (photo classification scenario) ---
n_photos_value = 10_000               # photos in user library
inference_ms_value = 5                # ResNet-50 inference time (ms)
dynamic_latency_budget_ms_value = 100 # real-time latency budget (ms)

# --- Process (batch total time) ---
batch_total_s_value = n_photos_value * inference_ms_value / 1000

# --- Outputs (formatted strings for prose) ---
n_photos_str = f"{n_photos_value:,}"                                       # e.g. "10,000" photos
inference_ms_str = f"{inference_ms_value}"                                 # e.g. "5" ms
batch_total_s_str = fmt(batch_total_s_value, precision=0, commas=False)    # e.g. "50" seconds
dynamic_latency_budget_ms_str = f"{dynamic_latency_budget_ms_value}"       # e.g. "100" ms
```

For our ResNet-50 image classifier, consider two deployment scenarios. A **static approach** suits a photo organization app that pre-classifies all images in a user's library overnight. With `{python} n_photos_str` photos and `{python} inference_ms_str` ms inference each, batch processing takes ~`{python} batch_total_s_str` seconds total, and users see instant classification when browsing. A **dynamic approach** suits a content moderation API that must classify user-uploaded images in real time, with each image requiring the full preprocessing→inference→postprocessing pipeline within a `{python} dynamic_latency_budget_ms_str` ms latency budget. Most production image classification systems use a **hybrid approach**: frequently requested images (popular products, known memes) are pre-classified and cached, while novel uploads trigger dynamic inference.

The choice between static and dynamic serving has direct economic implications. Stricter latency requirements directly translate into higher infrastructure costs, and quantifying *the cost of latency* in dollar terms reveals how much infrastructure premium each millisecond of latency reduction demands.

```{python}
#| label: cost-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COST OF LATENCY CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Cost of Latency" (Serving Paradigm section)
# │
# │ Goal: Quantify the economic tradeoff between response time and hardware bill.
# │ Show: That reducing latency by 50% can increase costs by 4x.
# │ How: Calculate cost per million queries across different batch sizes.
# │
# │ Imports: mlsys.constants, mlsys.formatting
# │ Exports: gpu_cost_per_hour_str, latency_a_ms_str, throughput_a_rps_str,
# │          latency_b_ms_str, throughput_b_rps_str, cost_a_str, cost_b_str,
# │          cost_increase_str, cost_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import SEC_PER_HOUR, MILLION
from mlsys.formatting import fmt, check

# --- Inputs (GPU rental and batching scenarios) ---
gpu_cost_per_hour_value = 4.0         # GPU rental cost ($/hour)
latency_a_ms_value = 5                # Scenario A: low latency (ms)
throughput_a_rps_value = 200          # Scenario A: throughput (req/s)
latency_b_ms_value = 10               # Scenario B: higher latency (ms)
throughput_b_rps_value = 800          # Scenario B: throughput (req/s)

# --- Process (cost per million queries) ---
queries_per_hour_a_value = throughput_a_rps_value * SEC_PER_HOUR
cost_per_million_a_value = gpu_cost_per_hour_value / (queries_per_hour_a_value / MILLION)
queries_per_hour_b_value = throughput_b_rps_value * SEC_PER_HOUR
cost_per_million_b_value = gpu_cost_per_hour_value / (queries_per_hour_b_value / MILLION)
cost_increase_pct_value = (cost_per_million_a_value / cost_per_million_b_value - 1) * 100
cost_ratio_value = cost_per_million_a_value / cost_per_million_b_value

# --- Outputs (formatted strings for prose) ---
gpu_cost_per_hour_str = fmt(gpu_cost_per_hour_value, precision=0, commas=False)
latency_a_ms_str = f"{latency_a_ms_value}"
throughput_a_rps_str = f"{throughput_a_rps_value}"
latency_b_ms_str = f"{latency_b_ms_value}"
throughput_b_rps_str = f"{throughput_b_rps_value}"
cost_a_str = fmt(cost_per_million_a_value, precision=2, commas=False)
cost_b_str = fmt(cost_per_million_b_value, precision=2, commas=False)
cost_increase_str = fmt(cost_increase_pct_value, precision=0, commas=False)
cost_ratio_str = fmt(cost_ratio_value, precision=0, commas=False)
```

::: {.callout-notebook #notebook-cost-latency title="The Cost of Latency"}

Latency constraints directly dictate infrastructure costs. Consider a GPU server renting for USD `{python} gpu_cost_per_hour_str`/hour.

**Scenario A (Low Latency):** Batch size 1.

*   Latency: `{python} latency_a_ms_str` ms.
*   Throughput: `{python} throughput_a_rps_str` req/s.
*   Cost per million queries: **USD `{python} cost_a_str`**.

**Scenario B (High Throughput):** Batch size 8.

*   Latency: `{python} latency_b_ms_str` ms (doubled due to batching overhead).
*   Throughput: `{python} throughput_b_rps_str` req/s (quadrupled due to parallel efficiency).
*   Cost per million queries: **USD `{python} cost_b_str`**.

**The Trade-off:** Reducing latency from `{python} latency_b_ms_str` ms to `{python} latency_a_ms_str` ms increases the hardware bill by **`{python} cost_increase_str`%**. Engineers must quantify whether that `{python} latency_a_ms_str` ms speedup generates enough business value to justify the `{python} cost_ratio_str` $\times$ cost increase.

:::
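
The arithmetic behind these figures is a one-line formula: dollars per hour divided by queries per hour, scaled to one million queries. The sketch below reuses the callout's inputs; the function name is illustrative, and the calculation assumes the server is kept busy at the stated throughput for the whole hour.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_queries(gpu_cost_per_hour: float, throughput_rps: float) -> float:
    """Cost in USD to serve one million queries on one fully loaded GPU server."""
    queries_per_hour = throughput_rps * SECONDS_PER_HOUR
    return gpu_cost_per_hour / queries_per_hour * 1_000_000


cost_a = cost_per_million_queries(gpu_cost_per_hour=4.0, throughput_rps=200)  # batch size 1
cost_b = cost_per_million_queries(gpu_cost_per_hour=4.0, throughput_rps=800)  # batch size 8

print(f"Scenario A: ${cost_a:.2f} per million queries")
print(f"Scenario B: ${cost_b:.2f} per million queries")
print(f"A costs {cost_a / cost_b:.0f}x as much as B")
```

Because the hourly price is fixed, cost per query is inversely proportional to throughput: any latency target that forces a smaller batch raises the bill by the same factor that throughput falls.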

Most production systems combine both approaches. Common queries hit a cache populated by batch inference while uncommon requests trigger dynamic computation. Understanding this spectrum matters because it determines which subsequent optimization strategies apply. Static inference optimizes for throughput during batch computation and storage efficiency for serving. Dynamic inference optimizes for per-request latency under concurrent load, which requires understanding *where* time goes within each request.
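
The hybrid pattern can be sketched as a lookup with a fallback. This is a simplified illustration: `precomputed` stands in for a key-value store populated by a batch job, and `run_model` is a hypothetical placeholder for the dynamic inference path.

```python
from typing import Callable


class HybridPredictor:
    """Serve pre-computed predictions when available; otherwise run inference on demand."""

    def __init__(self, precomputed: dict[str, str], run_model: Callable[[str], str]):
        self.precomputed = precomputed  # filled offline by a batch job
        self.run_model = run_model      # dynamic path, subject to the latency budget
        self.static_hits = 0
        self.dynamic_calls = 0

    def predict(self, key: str) -> str:
        cached = self.precomputed.get(key)
        if cached is not None:
            self.static_hits += 1
            return cached
        self.dynamic_calls += 1
        return self.run_model(key)


def run_model(image_id: str) -> str:
    """Hypothetical stand-in for preprocessing, inference, and postprocessing."""
    return f"label-for-{image_id}"


predictor = HybridPredictor(
    precomputed={"popular-product-1": "shoe", "known-meme-7": "cat"},
    run_model=run_model,
)

for image_id in ["popular-product-1", "new-upload-42", "known-meme-7"]:
    print(image_id, "->", predictor.predict(image_id))

print(f"static hits={predictor.static_hits}, dynamic calls={predictor.dynamic_calls}")
```

Whether dynamic results are written back into the store is a separate design choice; this sketch leaves the pre-computed table read-only so the batch job remains its only writer.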

The static-versus-dynamic decision is the first of several architectural choices that shape serving system design. Equally important is *where* the model executes, since deployment context constrains every subsequent optimization.

::: {.callout-perspective title="Looking Ahead: The Rise of Inference-Time Compute (System 2)"}
Traditional serving optimizes for minimal latency ($L_{lat} \to 0$). Emerging "Reasoning Models" (like OpenAI o1) invert this goal, deliberately spending more compute cycles ("thinking") to improve answer quality. Individual token generation remains memory-bandwidth-bound, but these models generate far more tokens per request (often 10–100 $\times$ more internal reasoning tokens), dramatically increasing the total compute and energy spent per query. The aggregate effect brings "Training-like" compute budgets into the Serving phase, even though each token is still governed by the memory wall.
:::

### The Spectrum of Serving Architectures {#sec-model-serving-spectrum-serving-architectures-8966}

Although "serving" often implies a networked server processing API requests, the architectural pattern varies drastically by deployment environment. @sec-ml-systems-deployment-paradigm-framework-0d25 introduced the four deployment paradigms—Cloud, Edge, Mobile, and TinyML—and the physical constraints (the light barrier, the power wall, and the memory wall) that give rise to them. Those constraints do not disappear at serving time; they *intensify*, because serving adds latency SLOs and cost pressure on top of the hardware limits that training could absorb through patience. The same model may require radically different serving strategies depending on *where* it executes.

#### Networked Serving (Cloud/Datacenter) {#sec-model-serving-networked-serving-clouddatacenter-0328}

The model\index{Serving!cloud/datacenter}\index{Microservice!model serving} runs as a standalone service (microservice), the deployment paradigm @sec-ml-systems-cloud-ml-maximizing-computational-power-a338 characterized as trading latency for virtually unlimited compute. The primary interface is the network (HTTP/gRPC). Optimization focuses on **throughput** (batching) and **concurrency**.

*   *Key Constraint:* Network bandwidth and serialization cost.
*   *Typical Hardware:* NVIDIA GPUs (V100, A100, H100), Google TPUs, AWS Inferentia.
*   *Cold Start:*\index{Cold Start!cloud serving} Seconds to minutes (container startup, model loading, warmup).

#### Application-Embedded Serving (Mobile/Edge) {#sec-model-serving-applicationembedded-serving-mobileedge-8bd1}

The model\index{Serving!mobile/edge}\index{Edge Inference!embedded serving} runs within the user application process (e.g., a smartphone app using CoreML or TensorFlow Lite), the embedded paradigm @sec-ml-systems-edge-ml-reducing-latency-privacy-risk-2625 and @sec-ml-systems-mobile-ml-personal-offline-intelligence-0983 analyzed for its latency, privacy, and offline advantages. There is no "server." The interface is a function call. Optimization focuses on **energy** and **responsiveness** (SingleStream).

*   *Key Advantage:* **Zero-Copy Inference**\index{Zero-Copy Inference!mobile optimization}. When data moves through a system, each copy consumes CPU cycles and memory bandwidth. In cloud serving, a camera frame might be copied four times: from network buffer to application memory, then to a preprocessing buffer, then to GPU-accessible memory, and finally to GPU VRAM. Mobile NPUs can eliminate most of these copies by sharing memory directly with the camera hardware. The camera writes pixels into a buffer that the NPU reads directly, avoiding the CPU entirely. This reduces both latency (no copy operations) and energy (memory copies consume significant power). The mechanism requires hardware support: the camera, CPU, and NPU must share a unified memory architecture, which modern mobile SoCs such as Apple's A-series and M-series and Qualcomm's Snapdragon provide.
*   *Typical Hardware:* Mobile NPUs (Apple Neural Engine, Qualcomm Hexagon), embedded GPUs (Jetson).
*   *Cold Start:* Milliseconds (model already in app memory); first inference may trigger JIT compilation (100–500 ms).
*   *Power Budget:* 1–5 W sustained, with thermal throttling after prolonged inference.

#### Bare-Metal Serving (TinyML) {#sec-model-serving-baremetal-serving-tinyml-28cf}

The model\index{Serving!TinyML}\index{TinyML!bare-metal serving} is compiled into the firmware of a microcontroller, the extreme end of the deployment spectrum @sec-ml-systems-tinyml-ubiquitous-sensing-scale-a67b introduced as ubiquitous sensing at microwatt power budgets. There is no operating system or dynamic memory allocator. "Serving" is a tight loop reading sensors and invoking the interpreter. Optimization focuses on **static memory usage** (fitting in SRAM).

*   *Key Difference:* All memory is pre-allocated (Tensor Arena)\index{Tensor Arena!TinyML memory}. Dynamic batching is impossible.
*   *Typical Hardware:* ARM Cortex-M series, ESP32, specialized TinyML accelerators.
*   *Cold Start:* Microseconds (model weights in flash, tensor arena pre-allocated).
*   *Power Budget:* Microwatts to milliwatts; battery operation for months or years.

@tbl-serving-spectrum summarizes *how* these deployment contexts shape serving system design:

| **Characteristic**   | **Cloud/Datacenter** | **Mobile/Edge**      | **TinyML**       |
|:---------------------|:---------------------|:---------------------|:-----------------|
| **Latency Target**   | 10–100 ms            | 20–50 ms             | 1–100 ms         |
| **Batch Size**       | 1–128 (dynamic)      | 1 (fixed)            | 1 (fixed)        |
| **Memory**           | 16–80 GB VRAM        | 2–8 GB shared        | 256 KB–2 MB SRAM |
| **Power**            | 300–700 W            | 1–10 W               | 1–100 mW         |
| **Update Mechanism** | Container deploy     | App store update     | Firmware OTA     |
| **Failure Mode**     | Retry/failover       | Graceful degradation | Silent or reset  |
| **Monitoring**       | Full telemetry       | Limited analytics    | Heartbeat only   |

: **Serving Architecture Spectrum**: The deployment paradigm selected in @sec-ml-systems-comparative-analysis-paradigm-selection-bf66 shapes every aspect of serving system design. Cloud systems optimize for throughput with dynamic batching; mobile systems optimize for energy with fixed batch-1; TinyML systems operate under extreme memory and power constraints with no dynamic allocation. The physical walls (light, power, memory) that created these paradigms now dictate the serving constraints each must satisfy. {#tbl-serving-spectrum}

To make these architectural differences concrete, consider *how* a single model must adapt to each deployment context:

```{python}
#| label: resnet-spectrum-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 ACROSS THE SERVING SPECTRUM
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Across the Serving Spectrum"
# │
# │ Goal: Contrast serving requirements across Cloud, Mobile, and TinyML.
# │ Show: That the same model requires different formats and architectures.
# │ How: Calculate model sizes and compare NPU vs. CPU efficiency.
# │
# │ Imports: mlsys, mlsys.constants, mlsys.formatting
# │ Exports: cloud_*, mobile_*, tiny_* formatted strings
# └─────────────────────────────────────────────────────────────────────────────
from mlsys import Models, Tiers
from mlsys.constants import BYTES_FP16, BYTES_INT8
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class ResNetServingSpectrum:
    """
    Namespace for ResNet-50 Serving Spectrum comparison.
    Scenario: Mapping the same architecture (or alternatives) to Cloud, Mobile, TinyML.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    m_resnet = Models.ResNet50
    m_mobilenet = Models.MobileNetV2

    t_cloud = Tiers.Cloud
    t_mobile = Tiers.Mobile
    t_tiny = Tiers.Tiny

    # Cloud (V100)
    cloud_inf_b1_ms = 1.4
    cloud_inf_b16_ms = 14.0
    cloud_throughput = 1143
    cloud_vram_gb = 2

    # Mobile (Pixel 6)
    mobile_inf_npu_ms = 12.0
    mobile_inf_cpu_ms = 45.0
    mobile_throughput = 80
    mobile_energy_npu_mj = 0.8
    mobile_energy_cpu_mj = 4.2

    # TinyML (Cortex-M7)
    tiny_inf_ms = 120.0
    tiny_energy_mj = 12.0

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Calculate sizes using the Digital Twins
    cloud_size_mb = m_resnet.size_in_bytes(BYTES_FP16).to('MB').magnitude
    mobile_size_mb = m_resnet.size_in_bytes(BYTES_INT8).to('MB').magnitude
    tiny_original_mb = m_resnet.size_in_bytes(BYTES_INT8).to('MB').magnitude
    tiny_alt_mb = m_mobilenet.size_in_bytes(BYTES_INT8).to('MB').magnitude

    tiny_limit_mb = t_tiny.storage.to('MB').magnitude
    tiny_feasibility = tiny_original_mb < tiny_limit_mb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(not tiny_feasibility,
          f"ResNet-50 ({tiny_original_mb:.1f}MB) should NOT fit on TinyML (<{tiny_limit_mb:.1f}MB).")
    check(mobile_energy_cpu_mj >= mobile_energy_npu_mj * 3,
          "NPU should be significantly more energy efficient than CPU.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    cloud_model_mb_str = fmt(cloud_size_mb, precision=0)
    cloud_inf_b1_ms_str = f"{cloud_inf_b1_ms}"
    cloud_inf_b16_ms_str = f"{cloud_inf_b16_ms}"
    cloud_throughput_str = f"{cloud_throughput:,}"
    cloud_vram_gb_str = f"{cloud_vram_gb}"

    mobile_model_mb_str = fmt(mobile_size_mb, precision=0)
    mobile_inf_npu_ms_str = f"{mobile_inf_npu_ms}"
    mobile_inf_cpu_ms_str = f"{mobile_inf_cpu_ms}"
    mobile_throughput_str = f"{mobile_throughput}"
    mobile_energy_npu_mj_str = f"{mobile_energy_npu_mj}"
    mobile_energy_cpu_mj_str = f"{mobile_energy_cpu_mj}"
    mobile_mem_mb_str = "150"

    tiny_model_mb_str = fmt(tiny_original_mb, precision=0)
    tiny_alt_mb_str = fmt(tiny_alt_mb, precision=1)
    tiny_inf_ms_str = f"{tiny_inf_ms}"
    tiny_throughput_str = "8"
    tiny_arena_kb_str = "320"
    tiny_sram_kb_str = "512"
    tiny_energy_mj_str = f"{tiny_energy_mj}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
cloud_model_mb_str = ResNetServingSpectrum.cloud_model_mb_str
cloud_inf_b1_ms_str = ResNetServingSpectrum.cloud_inf_b1_ms_str
cloud_inf_b16_ms_str = ResNetServingSpectrum.cloud_inf_b16_ms_str
cloud_throughput_str = ResNetServingSpectrum.cloud_throughput_str
cloud_vram_gb_str = ResNetServingSpectrum.cloud_vram_gb_str
mobile_model_mb_str = ResNetServingSpectrum.mobile_model_mb_str
mobile_inf_npu_ms_str = ResNetServingSpectrum.mobile_inf_npu_ms_str
mobile_inf_cpu_ms_str = ResNetServingSpectrum.mobile_inf_cpu_ms_str
mobile_throughput_str = ResNetServingSpectrum.mobile_throughput_str
mobile_energy_npu_mj_str = ResNetServingSpectrum.mobile_energy_npu_mj_str
mobile_energy_cpu_mj_str = ResNetServingSpectrum.mobile_energy_cpu_mj_str
mobile_mem_mb_str = ResNetServingSpectrum.mobile_mem_mb_str
tiny_model_mb_str = ResNetServingSpectrum.tiny_model_mb_str
tiny_alt_mb_str = ResNetServingSpectrum.tiny_alt_mb_str
tiny_inf_ms_str = ResNetServingSpectrum.tiny_inf_ms_str
tiny_throughput_str = ResNetServingSpectrum.tiny_throughput_str
tiny_arena_kb_str = ResNetServingSpectrum.tiny_arena_kb_str
tiny_sram_kb_str = ResNetServingSpectrum.tiny_sram_kb_str
tiny_energy_mj_str = ResNetServingSpectrum.tiny_energy_mj_str
```

::: {.callout-perspective #perspective-resnet-serving title="ResNet-50 Across the Serving Spectrum"}

The same ResNet-50 architecture requires dramatically different serving strategies across deployment contexts:

**Cloud (V100 GPU):**

- Model format: TensorRT FP16 engine (`{python} cloud_model_mb_str` MB)
- Inference: `{python} cloud_inf_b1_ms_str` ms at batch-1, `{python} cloud_inf_b16_ms_str` ms at batch-16
- Throughput: `{python} cloud_throughput_str` images/second (batched)
- Memory: `{python} cloud_vram_gb_str` GB VRAM (model + activations for batch-32)

**Mobile (Pixel 6 NPU):**

- Model format: TensorFlow Lite INT8 (`{python} mobile_model_mb_str` MB)
- Inference: `{python} mobile_inf_npu_ms_str` ms at batch-1 (NPU), `{python} mobile_inf_cpu_ms_str` ms (CPU fallback)
- Throughput: ~`{python} mobile_throughput_str` images/second (single-stream)
- Memory: `{python} mobile_mem_mb_str` MB peak (shared with app)
- Energy: `{python} mobile_energy_npu_mj_str` mJ per inference (NPU), `{python} mobile_energy_cpu_mj_str` mJ (CPU)

**TinyML (Cortex-M7):**

- Model format: Not feasible; ResNet-50 requires `{python} tiny_model_mb_str` MB of weights
- Alternative: MobileNetV2-0.35 quantized to INT8 (`{python} tiny_alt_mb_str` MB)
- Inference: `{python} tiny_inf_ms_str` ms at batch-1
- Throughput: ~`{python} tiny_throughput_str` images/second
- Memory: `{python} tiny_arena_kb_str` KB tensor arena (fits in `{python} tiny_sram_kb_str` KB SRAM)
- Energy: `{python} tiny_energy_mj_str` mJ per inference

**Key insight**: The "same model" claim is misleading: each deployment requires not just different optimization but often different architectures entirely. TinyML serving cannot use ResNet-50; it requires architectures designed for the constraints from the start.

:::
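
The feasibility argument in the callout reduces to parameter count times bytes per parameter. The sketch below works through that arithmetic under stated assumptions: the figure of roughly 25.6 million parameters for ResNet-50 is the commonly cited count rather than a value taken from this chapter, the 2 MB budget is the upper end of the TinyML memory range in @tbl-serving-spectrum, and the function names are hypothetical. The estimate covers weights only and ignores activations and runtime overhead.

```python
# Back-of-envelope weight footprint: parameters x bytes per parameter.
# ASSUMPTION: ~25.6M parameters for ResNet-50 (commonly cited, not from the text).
RESNET50_PARAMS = 25_600_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_footprint_mb(num_params: int, precision: str) -> float:
    """Return weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


def fits(num_params: int, precision: str, budget_bytes: int) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return num_params * BYTES_PER_PARAM[precision] <= budget_bytes


# ASSUMPTION: 2 MB budget, the upper end of the TinyML range in the table.
TINY_BUDGET_BYTES = 2 * 1024 * 1024

for precision in ("fp16", "int8"):
    print(precision, weight_footprint_mb(RESNET50_PARAMS, precision), "MB")
print("INT8 ResNet-50 fits:", fits(RESNET50_PARAMS, "int8", TINY_BUDGET_BYTES))
```

Even at one byte per parameter, the weights exceed the budget by more than an order of magnitude, which is why quantization alone cannot bring ResNet-50 to a microcontroller.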

### The Load Balancer Layer {#sec-model-serving-load-balancer-layer-9c4d}

The preceding spectrum focused on *how* deployment context shapes serving constraints, from datacenter GPUs to microcontroller SRAM. What happens when traffic exceeds what a single machine can handle? Cloud and datacenter deployments, where multiple replicas serve the same model, need an additional infrastructure layer: the load balancer. Production serving systems place load balancers\index{Load Balancer!serving infrastructure} between clients and model servers to provide three functions.

Request distribution, the first function, routes incoming requests to available model replicas using algorithms like round-robin or least-connections. For latency-sensitive ML serving, algorithms that route away from slow or overloaded replicas improve tail latency. The second, health monitoring\index{Health Monitoring!replica readiness}, continuously verifies that replicas are ready to serve, routing traffic away from unhealthy instances. For ML systems, health checks must verify not just process liveness but model readiness, confirming that weights are loaded and warmup is complete. The third, deployment support, enables safe model updates by gradually shifting traffic between versions. @sec-ml-operations examines deployment strategies including canary testing, blue-green deployments, and shadow mode validation.
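
The first two functions can be illustrated together. The following is a minimal sketch, not the algorithm of any particular load balancer: it combines least-connections routing with a readiness check that goes beyond process liveness. The `Replica` class, its fields, and `pick_replica` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    alive: bool = True          # process answers a liveness probe
    model_loaded: bool = False  # weights are resident in memory
    warmed_up: bool = False     # warmup inferences have completed
    in_flight: int = 0          # requests currently being served

    def ready(self) -> bool:
        # Readiness is stricter than liveness: all three must hold.
        return self.alive and self.model_loaded and self.warmed_up


def pick_replica(replicas):
    """Least-connections routing restricted to ready replicas."""
    ready = [r for r in replicas if r.ready()]
    if not ready:
        raise RuntimeError("no ready replicas")
    return min(ready, key=lambda r: r.in_flight)


replicas = [
    Replica("a", model_loaded=True, warmed_up=True, in_flight=7),
    Replica("b", model_loaded=True, warmed_up=True, in_flight=2),
    Replica("c", model_loaded=True, warmed_up=False, in_flight=0),  # alive, not ready
]
chosen = pick_replica(replicas)
chosen.in_flight += 1
print(chosen.name)
```

Replica `c` has the fewest in-flight requests but is skipped because warmup has not finished. A liveness-only check would have routed traffic to it and exposed users to cold-start latency.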

For single-machine serving with multiple model instances, such as running several ONNX Runtime sessions, the framework and operating system handle request queuing. The full complexity of load balancing becomes necessary when scaling to distributed inference systems, where multiple machines serve the same model. The implementation details of request distribution algorithms and multi-replica architectures belong to that distributed context.

When capacity planning in this chapter refers to "the server," it means a single machine's model-serving capacity. The queuing dynamics analyzed in @sec-model-serving-queuing-theory-tail-latency-29a6 describe single-machine behavior and indicate when scaling to multiple machines becomes necessary.

While load balancers distribute requests across replicas, achieving predictable latency also requires controlling what happens *within* each machine. The operating system environment introduces its own sources of variability.

### Deterministic Latency and Resource Isolation {#sec-model-serving-deterministic-latency-resource-isolation-4d1c}

An inference server does not operate in isolation. On a single machine, the operating system manages multiple competing processes (logging agents, monitoring tools, and system interrupts) that can intermittently steal CPU cycles from the inference pipeline. These "noisy neighbors" are a primary source of **latency jitter**, where the time required to process identical requests varies significantly, causing the 99th percentile (P99) latency to spike even when the hardware is under-utilized. Recall the tail latency explosion from @fig-tail-latency-explosion—the same spike occurs here, but the trigger is resource contention rather than queuing.
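
A small synthetic example shows why rare interference is visible at P99 while leaving the median untouched. The numbers are illustrative assumptions, not measurements: 1,000 requests with a 5 ms base latency, of which every 50th (2%) is delayed by a 4 ms preemption. The `percentile` helper is a hypothetical nearest-rank implementation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


BASE_MS = 5.0    # assumed latency of an undisturbed request
STALL_MS = 4.0   # assumed penalty when a noisy neighbor preempts the server

isolated = [BASE_MS for _ in range(1000)]
noisy = [BASE_MS + (STALL_MS if i % 50 == 0 else 0.0) for i in range(1000)]

for label, sample in (("isolated", isolated), ("noisy", noisy)):
    print(label, "p50 =", percentile(sample, 50), "p99 =", percentile(sample, 99))
```

In this synthetic sample the median stays at 5 ms in both cases, while the P99 of the noisy run rises to 9 ms. Monitoring only mean or median latency would miss the interference entirely.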

Achieving deterministic performance\index{Latency!deterministic}\index{Resource Isolation!serving} on a single node requires isolating the inference process from the operating system's normal resource-sharing behavior. The most impactful technique is CPU affinity (pinning)\index{CPU Affinity!latency reduction}, which restricts the inference server's threads to specific physical cores. Without pinning, the OS freely migrates threads between cores, evicting warm cache lines and introducing 10–50 μs context-switch penalties that appear as latency jitter. Pinning eliminates this migration, ensuring that preprocessing always has immediate access to computational resources and that the CPU cache remains warm between requests.
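
On Linux, pinning can be requested from Python's standard library. The sketch below is one possible way to do it, assuming a Linux host; `os.sched_setaffinity` and `os.sched_getaffinity` are not available on other platforms, so the helper (a hypothetical name) degrades to a no-op there. Production servers more commonly set affinity through their launcher or container runtime.

```python
import os


def pin_to_cores(cores):
    """Restrict the current process to `cores`; return the resulting CPU set.

    Returns None on platforms without os.sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs we are currently allowed to use
    target = set(cores) & allowed       # never request a CPU outside that set
    if not target:
        return allowed
    try:
        os.sched_setaffinity(0, target)  # pid 0 means the current process
    except OSError:
        pass  # restricted environments may refuse; keep the current affinity
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    first_core = min(os.sched_getaffinity(0))
    print(pin_to_cores({first_core}))
```

Intersecting the requested cores with the currently allowed set matters inside containers, where the runtime may already have limited the process to a subset of the machine's CPUs.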

Memory locking (`mlock`)\index{Memory Locking!mlock} addresses a related but distinct source of jitter. By default, the OS can page any memory region to disk under memory pressure. If the GPU's DMA engine begins reading model weights from a region that has been paged out, the transfer stalls until the data is faulted back into RAM—a penalty measured in milliseconds rather than microseconds. Locking model weights and KV caches in physical RAM guarantees consistent access times, though the trade-off is that pinned memory cannot be reclaimed by other processes.
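
Python has no direct wrapper for `mlock`, but the system call can be reached through `ctypes` on POSIX systems. The following is a sketch under that assumption, with a small buffer standing in for model weights; `lock_buffer` is a hypothetical helper. The call can legitimately fail, for example when the process's locked-memory limit (`RLIMIT_MEMLOCK`) is lower than the requested size, so the helper reports success rather than raising.

```python
import ctypes
import ctypes.util


def lock_buffer(buf) -> bool:
    """Try to lock a ctypes buffer into physical RAM with mlock(2).

    Returns True on success, False if mlock is unavailable or refused.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name, use_errno=True)
        mlock = libc.mlock
    except (OSError, AttributeError):
        return False
    mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    mlock.restype = ctypes.c_int
    # mlock locks every page that overlaps [address, address + length).
    return mlock(ctypes.addressof(buf), ctypes.sizeof(buf)) == 0


weights = ctypes.create_string_buffer(16 * 1024)  # stand-in for model weights
locked = lock_buffer(weights)
print("locked" if locked else "not locked (unsupported or limit too low)")
```

A serving process that depends on locked memory should check this result at startup and fail loudly, since silently running with pageable weights reintroduces the jitter the lock was meant to remove.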

The third technique, interrupt shielding\index{Interrupt Shielding!latency isolation}, completes the isolation picture. Network and storage interrupts routed to inference cores can preempt GPU command submission at unpredictable moments. Steering these interrupts to non-inference cores ensures that bursts of incoming traffic do not disrupt the GPU's command stream, which is particularly important for maintaining stable tail latency under load.

These isolation principles transform a simple "model script" into a **deterministic service**, a transition essential for safety-critical applications like autonomous driving or real-time industrial control. With the deployment spectrum, load balancing, and resource isolation established, we have defined *where* models serve and *what* infrastructure supports them. The next question is *how* the serving software itself is organized: what components comprise an inference server, and how do they coordinate to turn irregular user traffic into efficient hardware utilization?

## Serving System Architecture {#sec-model-serving-serving-system-architecture-4879}

The serving paradigm establishes *where* models execute; now we examine *how* the serving software itself is organized. A modern inference server must bridge the gap between irregular user traffic and the batch-oriented requirements of accelerators—a challenge that requires careful architectural decomposition.

### Internal Architecture and Request Flow {#sec-model-serving-anatomy-inference-server-f12e}

Model optimization focuses on the mathematical artifact, while model serving requires a specialized software architecture to manage high-frequency request streams and hardware utilization. An inference server\index{Inference Server!architecture}[^fn-inference-server] (such as NVIDIA Triton, TensorFlow Serving\index{TensorFlow Serving}, or TorchServe) is not a simple wrapper around a model script; it is a high-performance scheduler that manages concurrency, memory, and data movement.

[^fn-inference-server]: **Inference Server**: The concept emerged from Google's TensorFlow Serving [@olston2017tensorflow], open-sourced February 2016, which pioneered the separation of model logic from serving infrastructure. NVIDIA's Triton [@nvidia2024triton], originally TensorRT Inference Server with GA release in March 2019, extended this to multi-framework support. These servers implement dynamic batching that can improve GPU utilization by up to 70% compared to naive single-request serving. The architecture mirrors the separation of concerns in traditional web servers like Apache (1995) and nginx (2004), applying decades of distributed systems knowledge to ML deployment.

The internal anatomy of these servers reveals *how* they bridge the gap between irregular user traffic and the highly regular, batch-oriented requirements of accelerators. The core challenge is that user requests arrive unpredictably (one millisecond apart, then five seconds of silence), while GPUs perform best with steady streams of uniformly-sized batches.

Every request traverses a multi-stage pipeline designed to maximize hardware throughput while minimizing latency overhead. Walk through the six stages in @fig-server-anatomy to see how each component absorbs a different source of complexity.

::: {#fig-server-anatomy fig-env="figure" fig-pos="htb" fig-cap="**Inference Server Anatomy**: A modern inference server decouples network handling from accelerator execution through a staged pipeline. Each stage isolates a concern, from absorbing bursty traffic to forming efficient batches, so the hardware accelerator stays highly utilized despite irregular arrival patterns." fig-alt="Flowchart showing 6-stage inference server pipeline: Client to Network Ingress to Request Queue (cylinder) to Dynamic Batcher, then down to Inference Runner to Accelerator. Arrows connect stages sequentially."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, node distance=1.2cm]
  \tikzset{
    Box/.style={draw=BackLine, thick, rounded corners=2pt, align=center, minimum width=2.5cm, minimum height=1cm},
    Hardware/.style={Box, fill=GreenL, draw=GreenLine},
    Software/.style={Box, fill=BlueL, draw=BlueLine},
    Queue/.style={draw=OrangeLine, thick, shape=cylinder, shape border rotate=90, aspect=0.25, minimum width=1.5cm, minimum height=1.2cm, fill=OrangeL}
  }

  \node[Box, fill=white] (client) {Client\\(Request)};
  \node[Software, right=of client] (ingress) {Network Ingress\\(HTTP/gRPC)};
  \node[Queue, right=of ingress] (queue) {Request\\Queue};
  \node[Software, right=of queue] (scheduler) {Dynamic\\Batcher};
  \node[Software, below=1.5cm of scheduler] (runtime) {Inference Runner\\(TensorRT/ONNX)};
  \node[Hardware, below=1.0cm of runtime] (gpu) {Accelerator\\(GPU/TPU)};

  \draw[->, thick] (client) -- (ingress);
  \draw[->, thick] (ingress) -- (queue);
  \draw[->, thick] (queue) -- (scheduler);
  \draw[->, thick] (scheduler) -- (runtime);
  \draw[->, thick] (runtime) -- (gpu);

  % Labels
  \node[right=0.2cm of queue, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Request Buffering};
  \node[right=0.2cm of scheduler, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Throughput Opt.};
  \node[right=0.2cm of runtime, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Execution Opt.};

\end{tikzpicture}
```
:::

This architecture serves three functions. First, *concurrency management*: servers use asynchronous event loops or thread pools to handle thousands of concurrent client connections without blocking, ensuring that network I/O wait times do not idle the accelerator. Second, *request transformation*\index{Request Transformation!tensor formats}: the server converts network payloads (JSON/Protobuf) into the specific tensor formats required by the optimized model runtime. Image tensors, for example, can be stored as NCHW[^fn-nchw-nhwc]\index{NCHW!tensor layout} (batch, channels, height, width) or NHWC\index{NHWC!tensor layout} (batch, height, width, channels). PyTorch and TensorRT prefer NCHW because it places channel data contiguously, enabling efficient convolution on GPUs. TensorFlow defaults to NHWC, which is more efficient on CPUs.

[^fn-nchw-nhwc]: **NCHW and NHWC**: These acronyms encode the memory layout order of 4D image tensors: N (batch), C (channels), H (height), W (width). The layout determines which elements are contiguous in memory, with profound performance implications. NCHW places all values for one channel together, enabling vectorized convolution filters to read contiguous memory blocks on GPUs. NHWC interleaves channels at each spatial position, which aligns better with CPU SIMD instructions that process multiple channels simultaneously. The choice dates to framework design decisions: Caffe [@jia2014caffe] established NCHW as the GPU convention; TensorFlow (2015) chose NHWC for CPU compatibility. A format mismatch between client and server silently corrupts inference: the model interprets pixel rows as color channels, producing garbage outputs without raising errors.

Third, *model management*: inference servers manage the lifecycle of models, including loading weights into VRAM, managing versioning, and ensuring that warmup inferences are completed before exposing the model to live traffic.
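
The layout mismatch described under request transformation can be reproduced with plain index arithmetic. In this sketch (pure Python, hypothetical function names), a client writes a tiny 3-channel, 2×2 image in NHWC order, and a server then reads "channel 0" assuming NCHW. Each stored value is the channel it belongs to, so a correct read of channel 0 yields only zeros.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when memory is laid out as NCHW."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same logical element when memory is laid out as NHWC."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 2, 2  # tiny RGB image, batch of one

# Client fills a flat buffer in NHWC order; each value records its channel.
buffer = [None] * (C * H * W)
for c in range(C):
    for h in range(H):
        for w in range(W):
            buffer[offset_nhwc(0, c, h, w, C, H, W)] = c

# Server reads channel 0, once with the wrong layout and once with the right one.
misread = [buffer[offset_nchw(0, 0, h, w, C, H, W)] for h in range(H) for w in range(W)]
correct = [buffer[offset_nhwc(0, 0, h, w, C, H, W)] for h in range(H) for w in range(W)]
print("NCHW read of NHWC data:", misread)
print("NHWC read of NHWC data:", correct)
```

The misread channel mixes values from all three channels, yet every index is in bounds and no exception is raised. This is why layout must be validated explicitly at the request-transformation boundary.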

Of these components, the scheduler deserves special attention because it embodies the core serving tradeoff between throughput and latency.

### The Scheduler: Where Throughput Meets Latency {#sec-model-serving-scheduler-throughput-meets-latency-d022}

The **Scheduler**\index{Scheduler!inference server} is the "brain" of the inference server. It implements the dynamic batching logic discussed in @sec-model-serving-throughput-optimization-18d1. The scheduler must decide: "Should I run this one request now to minimize its latency, or wait 5 milliseconds for a second request to arrive and process them together to maximize throughput?"

Systems designers use the **Batching Window**\index{Batching Window!latency-throughput tradeoff} parameter to tune this trade-off. A window of 0 ms optimizes for pure latency (no batching), while a window of 10–50 ms is common for high-throughput cloud services. This decision determines the "duty cycle" of the GPU, the percentage of time the hardware is actually computing versus waiting for work.
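
The effect of the window can be seen in a small deterministic simulation. This is a simplified sketch, not the scheduling logic of any specific inference server: a batch opens when its first request arrives and is dispatched when the window expires or the batch is full. The function names, arrival times, and `max_batch` value are illustrative assumptions.

```python
def _close(batch, window_ms, max_batch):
    # A full batch leaves when its last member arrives; otherwise at window expiry.
    dispatch = batch[-1] if len(batch) == max_batch else batch[0] + window_ms
    return dispatch, list(batch)


def form_batches(arrivals_ms, window_ms, max_batch):
    """Group sorted arrival times into (dispatch_time_ms, [arrivals]) batches."""
    batches, current = [], []
    for t in arrivals_ms:
        window_expired = bool(current) and t > current[0] + window_ms
        if current and (window_expired or len(current) == max_batch):
            batches.append(_close(current, window_ms, max_batch))
            current = []
        current.append(t)
    if current:
        batches.append(_close(current, window_ms, max_batch))
    return batches


def mean_wait_ms(batches):
    """Average time requests spend waiting for their batch to dispatch."""
    waits = [dispatch - t for dispatch, reqs in batches for t in reqs]
    return sum(waits) / len(waits)


arrivals = [0, 1, 3, 12, 13, 30]  # request arrival times in ms (illustrative)
for window in (0, 5):
    batches = form_batches(arrivals, window, max_batch=4)
    print(f"window={window} ms: {len(batches)} batches, "
          f"mean added wait={mean_wait_ms(batches):.2f} ms")
```

With a 0 ms window, the six requests run as six separate batches with no added wait. With a 5 ms window, they collapse into three batches at the cost of a few milliseconds of queuing per request, which is the trade-off the window parameter exposes.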

### Interface Protocols and Serialization {#sec-model-serving-interface-protocols-serialization-5510}

The mechanism used to transport data between client and server directly affects the latency budget. Model inference is often highly optimized, yet the cost of moving data into the model (serialization and network protocol overhead) can become the dominant bottleneck, especially for lightweight models where inference time is small.

#### The Serialization Bottleneck {#sec-model-serving-serialization-bottleneck-aaa0}

Text-based\index{Serialization!overhead} formats like JSON are ubiquitous but computationally expensive. Parsing a JSON object requires reading every byte, validating syntax, and converting text representations into machine-native types. For high-throughput systems, this consumes CPU cycles that could otherwise be used for request handling or preprocessing.

\index{FlatBuffers!zero-copy serialization}
Binary formats like Protocol Buffers[^fn-protobuf] (Protobuf) or FlatBuffers[^fn-flatbuffers] reduce this overhead by encoding values in compact, machine-native representations instead of text. FlatBuffers goes a step further: its wire format maps directly to in-memory data structures, which enables "zero-copy" deserialization in optimal cases, where the network buffer can be used directly without allocating new memory.

[^fn-protobuf]: **Protocol Buffers (Protobuf)**: Google's language-neutral binary serialization format, first developed internally circa 2001 and open-sourced in 2008. "Protocol" refers to the message format specification (the `.proto` schema), while "Buffers" refers to the serialized byte buffers. Protobuf uses a schema-first design: message structures are defined in `.proto` files, then compiled to language-specific code. The binary encoding is 3--10 $\times$ more compact than JSON and 20--100 $\times$ faster to parse because the schema eliminates runtime type checking. For ML serving, Protobuf's fixed-width encoding of floating-point values and its packed repeated fields make tensor serialization efficient, though the format still requires a deserialization step (unlike FlatBuffers).

\index{FlatBuffers!etymology}

[^fn-flatbuffers]: **FlatBuffers**: Created by Wouter van Oortmerssen at Google and released in 2014. Originally designed for mobile game development where memory allocation and serialization overhead were unacceptable, FlatBuffers stores data in a format that can be accessed directly without parsing or unpacking. The "flat" in the name refers to the flat binary buffer that serves simultaneously as the serialized and in-memory representation. For ML inference, FlatBuffers enables true zero-copy access to tensor metadata: the serving system can read tensor shapes and data pointers directly from the network buffer without allocating new memory, reducing per-request overhead to near zero. TensorFlow Lite uses FlatBuffers as its model format for exactly this reason.
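
The difference between parsing text and reinterpreting a binary buffer can be sketched with the standard library alone. The example below does not use the Protobuf or FlatBuffers APIs; `struct` stands in for a fixed-width binary wire format, and `memoryview.cast` stands in for zero-copy access. The payload values are arbitrary, and the exact JSON size depends on how many digits each number prints with.

```python
import json
import struct

values = [i / 7 for i in range(1000)]  # 1,000 arbitrary floats

# Text encoding: every number becomes a decimal string.
json_bytes = json.dumps(values).encode("utf-8")

# Binary encoding: 1,000 float32 values in native byte order, 4 bytes each.
binary_bytes = struct.pack("=1000f", *values)

# Parsing JSON allocates a new list and converts every token to a float.
parsed = json.loads(json_bytes)

# A memoryview reinterprets the received bytes in place; nothing is copied.
view = memoryview(binary_bytes).cast("f")

print("JSON bytes:  ", len(json_bytes))
print("binary bytes:", len(binary_bytes))
print("elements visible through the view:", len(view))
```

The binary payload is exactly 4,000 bytes, and the view exposes all 1,000 elements without a parsing pass. The cost is reduced precision (float32) and the loss of human readability, which is why text formats remain attractive at public API boundaries.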

#### REST vs gRPC {#sec-model-serving-rest-vs-grpc-c7b7}

Two dominant paradigms define modern serving interfaces, each with distinct system characteristics. REST (Representational State Transfer)[^fn-rest]\index{REST!HTTP/1.1 protocol} typically uses HTTP/1.1 and JSON. It is universally supported, human-readable, and stateless, making it the default choice for public-facing APIs. However, an HTTP/1.1 connection carries only one request at a time: although connections are persistent by default, concurrent requests must either queue behind one another or open additional connections, each paying its own TCP handshake. JSON serialization adds further latency for numerical data like tensors.

In contrast, gRPC (gRPC Remote Procedure Call)\index{gRPC!inference protocol}[^fn-grpc] uses HTTP/2 and Protobuf\index{Protocol Buffers!serialization}. HTTP/2 enables multiplexing multiple requests over a single persistent TCP connection, eliminating handshake latency and allowing efficient binary streaming. Protobuf provides strict type safety and efficient binary serialization, making it the standard for internal service-to-service communication where latency is critical.

[^fn-rest]: **REST (Representational State Transfer)**\index{REST!etymology}: Defined by Roy Fielding [@fielding2000rest] in his 2000 PhD dissertation at UC Irvine. An *architectural style* rather than a protocol, REST distills the web's design principles into six constraints: client-server separation, statelessness, cacheability, uniform interface, layered system, and code on demand. Its simplicity made it the dominant paradigm for web APIs, though text-based HTTP/1.1 and JSON serialization create overhead for ML inference, where binary tensor data dominates the payload.

[^fn-grpc]: **gRPC**: Open-sourced by Google in February 2015, gRPC evolved from Stubby, Google's internal RPC framework that had been handling tens of billions of calls per second across their datacenters since approximately 2001. The combination of HTTP/2 multiplexing and Protocol Buffers binary serialization achieves roughly 10 $\times$ lower serialization overhead than REST/JSON, making it the de facto standard for latency-sensitive ML inference APIs.

The following example compares *JSON vs Protobuf serialization*.

```{python}
#| label: serialization-comparison-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ JSON VS PROTOBUF SERIALIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "JSON vs Protobuf Serialization"
# │
# │ Goal: Quantify the serialization tax in high-throughput inference.
# │ Show: The 10× efficiency gain of Protobuf over JSON for vector data.
# │ How: Calculate parsing overhead and wire size for a 1000-float payload.
# │
# │ Imports: mlsys.formatting (fmt, check)
# │ Exports: serial_floats_str, json_size_str, json_parse_str, protobuf_size_str,
# │          protobuf_parse_str, requests_per_sec_str, efficiency_gain_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class SerializationEfficiency:
    """
    Namespace for Serialization Efficiency calculation.
    Scenario: Comparing JSON vs Protobuf for a 1000-float payload.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    floats_count = 1000
    json_parse_us = 50.0
    proto_parse_us = 5.0

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    efficiency_gain = json_parse_us / proto_parse_us

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(efficiency_gain >= 5, f"Protobuf gain ({efficiency_gain:.1f}x) is too small to justify switching.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    serial_floats_str = f"{floats_count:,}"
    json_size_str = "9"
    json_parse_str = f"{int(json_parse_us)}"
    protobuf_size_str = "4"
    protobuf_parse_str = f"{int(proto_parse_us)}"
    requests_per_sec_str = "10,000"
    efficiency_gain_str = fmt(efficiency_gain, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
serial_floats_str = SerializationEfficiency.serial_floats_str
json_size_str = SerializationEfficiency.json_size_str
json_parse_str = SerializationEfficiency.json_parse_str
protobuf_size_str = SerializationEfficiency.protobuf_size_str
protobuf_parse_str = SerializationEfficiency.protobuf_parse_str
requests_per_sec_str = SerializationEfficiency.requests_per_sec_str
efficiency_gain_str = SerializationEfficiency.efficiency_gain_str
```

::: {.callout-notebook title="JSON vs Protobuf Serialization"}

Consider a request payload containing `{python} serial_floats_str` floating point numbers (e.g., an embedding vector).

*   **JSON**: Uses ~`{python} json_size_str` KB on the wire. Requires ~`{python} json_parse_str` μs to parse.
*   **Protobuf**: Uses ~`{python} protobuf_size_str` KB on the wire. Requires ~`{python} protobuf_parse_str` μs to parse.

For a system processing `{python} requests_per_sec_str` requests per second, switching to Protobuf saves nearly half a core of CPU time just in serialization overhead. This `{python} efficiency_gain_str` $\times$ efficiency gain makes gRPC essential for high-throughput internal microservices.

:::
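
The size gap in the callout can be reproduced with the standard library alone. The sketch below is illustrative rather than a benchmark: it uses `struct` as a stand-in for Protobuf's packed `repeated float` encoding (which stores the same 4 bytes per value, plus a few bytes of framing omitted here), and the helper `cores_saved` is a hypothetical name introduced for this example. JSON size depends on how many digits each value prints, so the exact byte count will differ from payload to payload.

```python
import json
import struct

# Hypothetical payload: 1,000 float32 values, e.g. an embedding vector.
values = [i / 1000 for i in range(1000)]

# Text encoding: every float becomes a decimal string plus a delimiter.
json_bytes = json.dumps(values).encode("utf-8")

# Packed binary encoding: 4 bytes per float32, little-endian.
# Stand-in for a Protobuf packed repeated float field (framing bytes omitted).
binary_bytes = struct.pack(f"<{len(values)}f", *values)


def cores_saved(json_parse_us, binary_parse_us, requests_per_sec):
    """CPU cores freed by the cheaper parse: saved microseconds per second / 1e6."""
    return (json_parse_us - binary_parse_us) * requests_per_sec / 1_000_000


print(f"JSON: {len(json_bytes)} bytes, packed binary: {len(binary_bytes)} bytes")
print(f"cores saved at 10,000 req/s: {cores_saved(50.0, 5.0, 10_000):.2f}")
```

The second line reproduces the "nearly half a core" figure: 45 μs saved per request, multiplied by 10,000 requests per second, is 0.45 seconds of CPU time every second.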

The system choice is clear: use REST for public APIs to maximize developer accessibility, and use gRPC for high-performance internal communication to minimize the serialization tax.

The architectural components and protocols examined so far describe *how* serving systems are built. Understanding *why* certain configurations perform better requires analyzing what happens to individual requests as they traverse these components.

## Request Lifecycle {#sec-model-serving-request-lifecycle-d9c6}

With the serving architecture established, we now trace *what* happens to a single request as it flows through the system. Understanding *where* time goes within each request is essential for effective optimization: one cannot improve what one does not measure.

### The Latency Budget {#sec-model-serving-latency-budget-ef40}

For dynamic inference systems\index{Latency Budget!optimization objectives}, the serving inversion established in @sec-model-serving-serving-paradigm-9634 has concrete implications for system design [@gujarati2020serving]. A serving system with 1000 ms per-request latency has failed, even if it achieves excellent throughput.

```{python}
#| label: tail-latency-ratio-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ TAIL LATENCY RATIO
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency Budget introduction paragraph
# │
# │ Goal: Demonstrate why mean latency is a misleading metric for user experience.
# │ Show: That p99 requests can take 40× longer than the mean.
# │ How: Calculate the ratio between mean and tail response times.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: tail_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (latency distribution example) ---
mean_latency_ms_value = 50            # mean latency (ms)
p99_latency_ms_value = 2000           # p99 latency (ms)

# --- Process (ratio calculation) ---
tail_ratio_value = p99_latency_ms_value / mean_latency_ms_value

# --- Output (formatted string for prose) ---
tail_ratio_str = fmt(tail_ratio_value, precision=0, commas=False)            # e.g. "40" times
```

The metrics that matter change from aggregate throughput to latency distributions. Mean latency tells you little about user experience; p50\index{p50 Latency!median response time}, p95\index{p95 Latency!percentile target}, and p99 latencies\index{Latency!percentiles (p50, p95, p99)} reveal *how* the system performs across the full range of requests. If your mean latency is 50 ms but p99 is 2 seconds, one in a hundred requests takes at least `{python} tail_ratio_str` times longer than average. For consumer-facing applications, these tail latencies often determine user satisfaction and retention.[^fn-tail-latency-impact]

[^fn-tail-latency-impact]: **Tail Latency Impact**: Research at Google and Amazon in the mid-2000s established that users are more sensitive to latency variance than mean latency. Industry experience suggests that latency increases of 100 ms can measurably impact user engagement and conversion rates for e-commerce applications, though the magnitude varies by context. This is why service level objectives (SLOs) typically specify percentile targets rather than averages.
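
A small synthetic distribution makes the gap between the mean and the tail concrete. The sketch below uses a nearest-rank percentile, one of several common definitions, and an invented sample of 980 fast requests and 20 slow ones, chosen only so that the mean lands near 50 ms while the p99 sits at 2 seconds.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Synthetic, illustrative distribution: 980 fast requests and 20 slow ones.
latencies_ms = [10] * 980 + [2000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

In this sample the median and even the p95 look healthy at 10 ms; only the p99 exposes the slow requests, which is why SLOs are written against high percentiles.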

Managing these percentile constraints requires decomposing the total allowed response time into a *latency budget*\index{Latency Budget!request lifecycle breakdown} that allocates time across each processing phase.

::: {.callout-definition title="Latency Budget"}

***Latency Budget***\index{Latency Budget!SLO constraints}\index{SLO (Service Level Objective)!latency budget} is the time capital allocated to a request, strictly bounded by the end-to-end SLO (the internal performance target; the contractual commitment is the SLA\index{SLA (Service Level Agreement)}). It acts as a zero-sum constraint system where any milliseconds consumed by serialization or network overhead directly reduce the computational budget available for model inference.

:::
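
The zero-sum nature of the budget is easy to express in code. The following is a minimal sketch with made-up numbers: the 10 ms SLO and the overhead values are assumptions for illustration, and `inference_budget_ms` is a hypothetical helper, not part of any serving framework.

```python
def inference_budget_ms(slo_ms, overheads_ms):
    """Time left for model inference after every non-inference phase is paid for."""
    remaining = slo_ms - sum(overheads_ms.values())
    if remaining <= 0:
        raise ValueError("overheads alone exceed the SLO; no time left for inference")
    return remaining


# Illustrative numbers only: a hypothetical 10 ms SLO and assumed overheads.
overheads = {
    "network": 2.0,
    "serialization": 0.5,
    "preprocessing": 3.0,
    "postprocessing": 0.5,
}
print(f"inference budget: {inference_budget_ms(10.0, overheads):.1f} ms")
```

Every millisecond added to any entry in `overheads` is a millisecond removed from the model, which is the constraint the definition describes.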

Before computing a full budget, we pose the foundational *latency analysis questions* that every serving engineer must answer.

::: {.callout-notebook title="ResNet-50: Latency Analysis Questions"}
Serving is about optimizing the **Tail Latency** under load.

**The Physics of Latency**

Consider these foundational questions:

1. **Queuing Theory**: Why do latency spikes occur non-linearly as utilization approaches 100%? The M/M/1 queue model explains this behavior.
2. **Batching Trade-offs**: Why does increasing batch size improve throughput (images/sec) yet degrade latency (ms/request)?

**Optimization Targets**

3. **The Bottleneck**: In a highly optimized inference server, why does **Preprocessing** often consume more time than the model itself?
:::
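
The first question has a compact quantitative answer. For an M/M/1 queue with mean service time $S$ and utilization $\rho$, the mean time a request spends in the system is $S / (1 - \rho)$. The sketch below evaluates that formula for an assumed 5 ms service time; real serving systems deviate from M/M/1 assumptions (batching, multiple workers, non-exponential service times), so treat the numbers as a qualitative guide.

```python
def mm1_mean_latency_ms(service_time_ms, utilization):
    """Mean time in system (queueing + service) for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_ms / (1 - utilization)


# Assumed 5 ms mean service time; utilization values are illustrative.
for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"rho={rho:.2f}  mean latency={mm1_mean_latency_ms(5.0, rho):.0f} ms")
```

Moving from 50% to 90% utilization multiplies mean latency by five, and the last nine points of utilization multiply it by ten again, which is why latency-sensitive services are provisioned well below full utilization.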

Every serving request decomposes into three phases that each consume part of the latency budget. Preprocessing\index{Preprocessing!latency impact} transforms raw input such as image bytes or text strings into model-ready tensors. Inference\index{Inference!pipeline phase} executes the model computation. Postprocessing\index{Postprocessing!response formatting} transforms model outputs into user-facing responses.

Faster hardware does not automatically mean faster serving\index{Amdahl's Law!preprocessing bottleneck}. In practice, preprocessing and postprocessing often dominate total latency. Studies of production systems show preprocessing consuming 60 to 70 percent of total request time when inference runs on optimized accelerators [@nvidia_triton]. Optimizing only the inference phase yields diminishing returns when the surrounding pipeline remains bottlenecked on CPU operations.

### Latency Distribution Analysis {#sec-model-serving-latency-distribution-analysis-b0f8}

Understanding *where* time goes requires instrumenting each phase independently. A *ResNet-50 latency budget breakdown* reveals exactly how each millisecond is spent when our classifier receives a JPEG image:

```{python}
#| label: latency-table-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LATENCY BUDGET BREAKDOWN TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Latency Budget Breakdown" table
# │
# │ Goal: Decompose the request lifecycle into processing phases.
# │ Show: That non-inference tasks (JPEG decode, resize) consume 50% of the latency budget.
# │ How: Sum millisecond-scale components for a standard vision inference request.
# │
# │ Imports: (none)
# │ Exports: l_jpeg_str, l_resize_str, l_norm_str, l_transfer_str, l_inf_str,
# │          l_post_str, l_total_str, p_jpeg_str, p_resize_str, p_norm_str,
# │          p_transfer_str, p_inf_str, p_post_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (latency per phase, ms) ---
l_jpeg_value = 3.0                    # JPEG decode
l_resize_value = 1.0                  # resize to 224×224
l_norm_value = 0.5                    # normalize (mean/std)
l_transfer_value = 0.5                # CPU→GPU transfer
l_inf_value = 5.0                     # ResNet-50 forward pass
l_post_value = 0.1                    # softmax + top-5

# --- Process (totals and percentages) ---
l_total_value = l_jpeg_value + l_resize_value + l_norm_value + l_transfer_value + l_inf_value + l_post_value
p_jpeg_value = l_jpeg_value / l_total_value * 100
p_resize_value = l_resize_value / l_total_value * 100
p_norm_value = l_norm_value / l_total_value * 100
p_transfer_value = l_transfer_value / l_total_value * 100
p_inf_value = l_inf_value / l_total_value * 100
p_post_value = l_post_value / l_total_value * 100

# --- Outputs (formatted strings for table) ---
l_jpeg_str = f"{l_jpeg_value:.1f}ms"                                         # e.g. "3.0ms"
l_resize_str = f"{l_resize_value:.1f}ms"                                     # e.g. "1.0ms"
l_norm_str = f"{l_norm_value:.1f}ms"                                         # e.g. "0.5ms"
l_transfer_str = f"{l_transfer_value:.1f}ms"                                 # e.g. "0.5ms"
l_inf_str = f"{l_inf_value:.1f}ms"                                           # e.g. "5.0ms"
l_post_str = f"{l_post_value:.1f}ms"                                         # e.g. "0.1ms"
l_total_str = f"{l_total_value:.1f}ms"                                       # e.g. "10.1ms"

p_jpeg_str = f"{p_jpeg_value:.0f}%"                                          # e.g. "30%"
p_resize_str = f"{p_resize_value:.0f}%"                                      # e.g. "10%"
p_norm_str = f"{p_norm_value:.0f}%"                                          # e.g. "5%"
p_transfer_str = f"{p_transfer_value:.0f}%"                                  # e.g. "5%"
p_inf_str = f"{p_inf_value:.0f}%"                                            # e.g. "50%"
p_post_str = f"~{p_post_value:.0f}%"                                         # e.g. "~1%"
```

```{python}
#| label: latency-budget-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ PREPROCESSING SHARE OF LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency distribution narrative (key insight paragraph)
# │
# │ Goal: Demonstrate the shifting bottleneck from inference to preprocessing.
# │ Show: That optimized inference (TensorRT) makes preprocessing 63% of total latency.
# │ How: Compare phase durations before and after model acceleration.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: preprocess_ms_str, cpu_gpu_ms_str, resnet_inference_ms_str,
# │          tensorrt_inference_ms_str, model_10x_ms_str, total_latency_str,
# │          preprocess_pct_str, tensorrt_preprocess_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (latency components, ms) ---
jpeg_decode_ms_value = 3.0            # JPEG decode
resize_ms_value = 1.0                 # resize
normalize_ms_value = 0.5              # normalize
cpu_gpu_ms_value = 0.5                # CPU→GPU transfer
resnet_inference_ms_value = 5.0       # PyTorch inference
postprocess_ms_value = 0.1            # postprocessing
tensorrt_inference_ms_value = 2.0     # TensorRT optimized inference

# --- Process (preprocessing percentage) ---
preprocess_ms_value = jpeg_decode_ms_value + resize_ms_value + normalize_ms_value
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value
preprocess_pct_value = preprocess_ms_value / total_latency_ms_value * 100
tensorrt_total_ms_value = preprocess_ms_value + cpu_gpu_ms_value + tensorrt_inference_ms_value + postprocess_ms_value
tensorrt_preprocess_pct_value = preprocess_ms_value / tensorrt_total_ms_value * 100

# --- Outputs (formatted strings for prose) ---
preprocess_ms_str = fmt(preprocess_ms_value, precision=1, commas=False)                # e.g. "4.5" ms
cpu_gpu_ms_str = fmt(cpu_gpu_ms_value, precision=1, commas=False)                      # e.g. "0.5" ms
resnet_inference_ms_str = fmt(resnet_inference_ms_value, precision=0, commas=False)    # e.g. "5" ms
tensorrt_inference_ms_str = fmt(tensorrt_inference_ms_value, precision=0, commas=False)# e.g. "2" ms
model_10x_ms_str = fmt(resnet_inference_ms_value / 10, precision=1, commas=False)      # e.g. "0.5" ms
total_latency_str = fmt(total_latency_ms_value, precision=1, commas=False)             # e.g. "10.1" ms
preprocess_pct_str = fmt(preprocess_pct_value, precision=0, commas=False)              # e.g. "45" %
tensorrt_preprocess_pct_str = fmt(tensorrt_preprocess_pct_value, precision=0, commas=False)  # e.g. "63" %
```

::: {.callout-notebook title="ResNet-50: Latency Budget Breakdown"}

A typical serving request for our ResNet-50 classifier shows the following latency distribution:

| **Phase**          | **Operation**              | **Time**                   | **Percentage**            |
|:-------------------|:---------------------------|:---------------------------|:--------------------------|
| **Preprocessing**  | JPEG decode                | `{python} l_jpeg_str`      | `{python} p_jpeg_str`     |
| **Preprocessing**  | Resize to 224 $\times$ 224 | `{python} l_resize_str`    | `{python} p_resize_str`   |
| **Preprocessing**  | Normalize (mean/std)       | `{python} l_norm_str`      | `{python} p_norm_str`     |
| **Data Transfer**  | CPU→GPU copy               | `{python} l_transfer_str`  | `{python} p_transfer_str` |
| **Inference**      | **ResNet-50 forward pass** | **`{python} l_inf_str`**   | **`{python} p_inf_str`**  |
| **Postprocessing** | Softmax + top-5            | `{python} l_post_str`      | `{python} p_post_str`     |
| **Total**          |                            | **`{python} l_total_str`** | **100%**                  |

Key insight: preprocessing consumes `{python} preprocess_pct_str`% of latency despite model inference being the computationally intensive phase. With TensorRT optimization reducing inference to `{python} tensorrt_inference_ms_str` ms, preprocessing would dominate at `{python} tensorrt_preprocess_pct_str`%.

:::

The ResNet example represents compute-bound inference where math dominates. Recommendation systems exhibit a different bottleneck profile entirely.

```{python}
#| label: dlrm-latency-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DLRM SERVING LATENCY (IO-BOUND EXAMPLE)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Lighthouse example "DLRM Serving"
# │
# │ Goal: Contrast compute-bound and I/O-bound serving bottlenecks.
# │ Show: That embedding lookups consume 67% of recommendation latency.
# │ How: Model DLRM latency across parsing, embedding, and MLP phases.
# │
# │ Imports: (none)
# │ Exports: dlrm_input_str, dlrm_embed_str, dlrm_mlp_str, dlrm_post_str,
# │          dlrm_total_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (DLRM latency components, ms) ---
dlrm_input_ms_value = 0.5             # request parsing (CPU)
dlrm_embed_ms_value = 6.0             # embedding lookups (memory BW)
dlrm_mlp_ms_value = 1.5               # MLP forward pass (compute)
dlrm_post_ms_value = 1.0              # ranking & filtering (CPU)

# --- Process (total latency) ---
dlrm_total_ms_value = dlrm_input_ms_value + dlrm_embed_ms_value + dlrm_mlp_ms_value + dlrm_post_ms_value

# --- Outputs (formatted strings for table) ---
dlrm_input_str = f"{dlrm_input_ms_value}ms"                                  # e.g. "0.5ms"
dlrm_embed_str = f"{dlrm_embed_ms_value}ms"                                  # e.g. "6.0ms"
dlrm_mlp_str = f"{dlrm_mlp_ms_value}ms"                                      # e.g. "1.5ms"
dlrm_post_str = f"{dlrm_post_ms_value}ms"                                    # e.g. "1.0ms"
dlrm_total_str = f"{dlrm_total_ms_value}ms"                                  # e.g. "9.0ms"
```

::: {.callout-lighthouse title="Lighthouse Example: DLRM Serving"}

**The Scenario**: Serving a Recommendation System (DLRM) with a 10 ms P99 latency budget.

**The Contrast**: While ResNet-50 serving is limited by math (CNN ops), DLRM serving is strictly limited by I/O and memory capacity.

| **Phase**            | **Operation**                | **Time**                      | **Bottleneck** |
|:---------------------|:-----------------------------|:------------------------------|:---------------|
| **Input Parsing**    | Request parsing              | `{python} dlrm_input_str`     | CPU            |
| **Embedding Lookup** | **Fetch 100+ dense vectors** | **`{python} dlrm_embed_str`** | **Memory BW**  |
| **Inference**        | MLP forward pass             | `{python} dlrm_mlp_str`       | Compute        |
| **Postprocessing**   | Ranking & Filtering          | `{python} dlrm_post_str`      | CPU            |
| **Total**            |                              | **`{python} dlrm_total_str`** |                |

**Key Systems Insight**:
In DLRM, the "Inference" (MLP) is only ~17% of the latency. The majority of time is spent in embedding lookups, retrieving 128-dim vectors from terabyte-scale tables. This is an I/O-bound workload where adding more GPUs does not help unless memory bandwidth and capacity also increase.
:::
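
The per-phase shares behind the DLRM table can be computed directly from its millisecond values. The helper name `phase_shares` is introduced here for illustration only.

```python
def phase_shares(phase_ms):
    """Return each phase's percentage share of total request latency."""
    total = sum(phase_ms.values())
    return {name: 100 * ms / total for name, ms in phase_ms.items()}


# Phase times taken from the DLRM table above (milliseconds).
dlrm_ms = {
    "input_parsing": 0.5,
    "embedding_lookup": 6.0,
    "mlp_forward": 1.5,
    "postprocessing": 1.0,
}

shares = phase_shares(dlrm_ms)
for name, pct in shares.items():
    print(f"{name:<18}{pct:5.1f}%")
```

Embedding lookups account for about two-thirds of the request, so any optimization that leaves memory access untouched can recover at most the remaining third.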

These breakdowns reveal why straightforward optimization efforts often fail. Engineers focus on model optimization (quantization, pruning) because that is where ML expertise applies, but the actual bottleneck may be image decoding running on CPU or embedding lookups limited by memory bandwidth. Adopting *the quantitative approach to serving* exposes these hidden bottlenecks before engineering effort is misallocated.

```{python}
#| label: amdahl-serving-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL'S LAW IN SERVING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Quantitative Approach to Serving"
# │
# │ Goal: Demonstrate why model-only optimization yields diminishing returns.
# │ Show: That a 10× model speedup produces only 1.8× end-to-end improvement.
# │ How: Apply Amdahl's Law using the non-inference latency share.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: non_model_pct_str, optimized_total_str, amdahl_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (re-derived from latency-budget-calc constants) ---
preprocess_ms_value = 3.0 + 1.0 + 0.5      # JPEG decode + resize + normalize
cpu_gpu_ms_value = 0.5                       # CPU→GPU transfer
resnet_inference_ms_value = 5.0              # PyTorch inference
postprocess_ms_value = 0.1                   # postprocessing
total_latency_ms_value = preprocess_ms_value + cpu_gpu_ms_value + resnet_inference_ms_value + postprocess_ms_value

non_model_ms_value = preprocess_ms_value + cpu_gpu_ms_value

# --- Process (Amdahl's Law calculation) ---
non_model_pct_value = non_model_ms_value / total_latency_ms_value * 100
model_10x_ms_value = resnet_inference_ms_value / 10
optimized_total_ms_value = non_model_ms_value + model_10x_ms_value + postprocess_ms_value
amdahl_speedup_value = total_latency_ms_value / optimized_total_ms_value

# --- Outputs (formatted strings for prose) ---
non_model_pct_str = fmt(non_model_pct_value, precision=0, commas=False)       # e.g. "50" %
optimized_total_str = fmt(optimized_total_ms_value, precision=1, commas=False)# e.g. "5.6" ms
amdahl_speedup_str = fmt(amdahl_speedup_value, precision=1, commas=False)     # e.g. "1.8" x
```

::: {.callout-notebook title="The Quantitative Approach to Serving"}

**Amdahl's Law at Work** (see @sec-machine-foundations-amdahls-law-gustafsons-law-b741 for the formal derivation): preprocessing (`{python} preprocess_ms_str` ms) and data transfer (`{python} cpu_gpu_ms_str` ms) consume `{python} non_model_pct_str`% of total latency. Optimizing the model 10 $\times$ faster (`{python} resnet_inference_ms_str` ms → `{python} model_10x_ms_str` ms) yields only `{python} amdahl_speedup_str` $\times$ end-to-end speedup (from `{python} total_latency_str` ms to `{python} optimized_total_str` ms). This is why focusing exclusively on model optimization (quantization, pruning) often disappoints: the bottleneck is elsewhere.

**DSA Efficiency**: General-purpose CPUs achieve only 1–2% of peak performance at batch-1 because instruction overhead dominates. DSAs like TPUs and Tensor Cores replace complex logic with dense MAC arrays, achieving 10–100 $\times$ higher arithmetic intensity. This makes hardware acceleration a requirement for economically viable serving.

**Engineering Implication**: Profile before optimizing. If preprocessing dominates, GPU-accelerated pipelines (NVIDIA DALI) may outperform model quantization.
:::
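
The Amdahl's Law figure in the callout follows from the standard form $S = 1 / ((1 - p) + p / s)$, where $p$ is the fraction of time being accelerated and $s$ is the acceleration factor. The sketch below plugs in the ResNet-50 numbers from the latency table and sweeps a few model speedups; the function name is chosen for this example.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """End-to-end speedup when only `accelerated_fraction` of the time gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


total_ms = 10.1        # full ResNet-50 request from the latency table
inference_ms = 5.0     # the only phase that model optimization touches
p = inference_ms / total_ms

for factor in (2, 10, 100):
    print(f"model {factor}x faster -> request {amdahl_speedup(p, factor):.2f}x faster")

# Upper bound: even an infinitely fast model leaves the other phases untouched.
print(f"ceiling: {1.0 / (1.0 - p):.2f}x")
```

The ceiling is the useful number for planning: with roughly half the request outside the model, no amount of model optimization can deliver even a 2 $\times$ end-to-end improvement.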

Moving preprocessing to GPU[^fn-dali]\index{GPU Preprocessing!accelerated pipelines} can reduce total latency by up to 6 $\times$ in some pipelines by eliminating CPU-GPU data transfers between stages [@nvidia_triton].

[^fn-dali]: **NVIDIA DALI (Data Loading Library)**: Released by NVIDIA in 2018, DALI moves image preprocessing operations (decoding, resizing, color space conversion, normalization) from the CPU to the GPU. The name "DALI" evokes the surrealist painter Salvador Dalí, though NVIDIA uses the acronym for Data Loading Library. The key insight is that CPU-based preprocessing becomes the bottleneck when inference runs on optimized accelerators: a V100 can classify an image in under 1 ms, but CPU-based JPEG decoding and resizing may take 3--5 ms. DALI's GPU-accelerated pipeline processes these operations in parallel with inference, achieving 2--6 $\times$ end-to-end speedup by eliminating the CPU preprocessing bottleneck identified in Amdahl's Law analysis. For serving, DALI also reduces training-serving skew risk by providing identical preprocessing implementations across both pipelines.

Effective optimization targets the largest time consumers first.

#### The Serving Tax Bill {#sec-model-serving-serving-tax-bill-dc6c}

Beyond the model execution itself, every request pays a "tax" to the serving infrastructure. @tbl-serving-tax quantifies these overheads for a typical high-performance inference request (e.g., ResNet-50 classification).

| **Tax Component** |      **Typical Cost** | **Scaling Behavior** | **Tax Evasion Strategy**        |
|:------------------|----------------------:|:---------------------|:--------------------------------|
| **Network I/O**   |                1–5 ms | Linear with payload  | Compression, Region Colocation  |
| **Serialization** |  50–500 $\mu\text{s}$ | Linear with payload  | gRPC/Protobuf (vs JSON)         |
| **Queuing**       |             0.1–10 ms | Nonlinear w/ load    | Dynamic Batching, Autoscaling   |
| **Dispatch**      |   10–50 $\mu\text{s}$ | Constant per batch   | Kernel Fusion (reduce launches) |
| **Data Copy**     | 100–500 $\mu\text{s}$ | Linear with tensor   | Zero-Copy / Shared Memory       |

: **The Serving Tax Bill**: A breakdown of non-inference latency sources. While individual components like serialization seem small ($<1$ ms), they compound. In a 5 ms inference service, this "tax" can easily consume 50% of the latency budget. The primary engineering goal is to drive these costs toward zero through architectural choices like gRPC and Zero-Copy data paths. {#tbl-serving-tax}

#### The Killer Microseconds Problem {#sec-model-serving-killer-microseconds-problem-bc00}

Barroso, Patterson, and colleagues identified a critical gap in *how* systems handle latency at different time scales\index{Killer Microseconds!latency gap} [@barroso2017attack]. Operations in the microsecond range are too short for traditional OS scheduling (which operates at millisecond granularity) yet too long to simply spin-wait without wasting CPU cycles. This "killer microseconds" regime dominates modern serving workloads. Consider the compound effect visible in @tbl-serving-tax: serialization at 50–500 μs, dispatch at 10–50 μs, and data copy at 100–500 μs are each individually negligible, but together they add roughly 0.2–1 ms to every request. For a 5 ms inference service, that is up to a fifth of the inference time before network I/O and queuing are counted, and with those included the overhead can consume half the latency budget. No single overhead justifies optimization in isolation, yet together they determine whether the system meets its SLO.
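
Summing the rows of @tbl-serving-tax shows how quickly the individual costs accumulate. The sketch below transcribes the table's ranges into milliseconds and reports best-case and worst-case totals relative to a 5 ms inference; the grouping into "microsecond-scale" rows is a choice made for this example.

```python
# (low, high) cost ranges in milliseconds, transcribed from the serving tax table.
TAX_MS = {
    "network_io": (1.0, 5.0),
    "serialization": (0.05, 0.5),
    "queuing": (0.1, 10.0),
    "dispatch": (0.01, 0.05),
    "data_copy": (0.1, 0.5),
}
MICROSECOND_SCALE = ("serialization", "dispatch", "data_copy")


def tax_range_ms(components):
    """Sum the best-case and worst-case cost over the named components."""
    low = sum(TAX_MS[c][0] for c in components)
    high = sum(TAX_MS[c][1] for c in components)
    return low, high


inference_ms = 5.0
for label, parts in (("microsecond-scale", MICROSECOND_SCALE), ("all components", tuple(TAX_MS))):
    low, high = tax_range_ms(parts)
    print(f"{label}: {low:.2f}-{high:.2f} ms "
          f"({100 * low / inference_ms:.0f}%-{100 * high / inference_ms:.0f}% of a 5 ms inference)")
```

The spread between the best and worst case is dominated by queuing and network I/O, while the microsecond-scale rows set a floor that no amount of load management removes.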

The latency budget framework provides a systematic approach to this compound problem. Measurement comes first: without per-phase instrumentation, engineers cannot distinguish a preprocessing bottleneck from a serialization bottleneck, and optimization effort gets misallocated to the most visible component (the model) rather than the most expensive one. Once measurement reveals the true distribution of time, engineering effort should flow proportionally—a phase consuming 50% of latency deserves more attention than one consuming 5%, regardless of which feels more tractable. Architectural changes such as GPU-accelerated preprocessing or aggressive batching can shift work between phases entirely, sometimes eliminating a bottleneck rather than merely reducing it.
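
Per-phase instrumentation needs very little machinery. The sketch below is one possible shape for it, using only the standard library; `PhaseTimer` is a hypothetical class, and the `time.sleep` calls are stand-ins for real preprocessing, inference, and postprocessing work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase across requests."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raises.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000

    def ranked(self):
        """Phases sorted from most to least expensive."""
        return sorted(self.totals_ms.items(), key=lambda kv: kv[1], reverse=True)


timer = PhaseTimer()
with timer.phase("preprocess"):
    time.sleep(0.004)      # stand-in for JPEG decode, resize, normalize
with timer.phase("inference"):
    time.sleep(0.002)      # stand-in for the model forward pass
with timer.phase("postprocess"):
    time.sleep(0.0005)     # stand-in for softmax and top-5

for name, ms in timer.ranked():
    print(f"{name:<12}{ms:6.2f} ms")
```

One caveat when adapting this to accelerators: GPU kernel launches are asynchronous, so a wall-clock timer around an inference call measures only launch time unless the code synchronizes with the device before the phase ends.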

### Resolution and Input Size Tradeoffs {#sec-model-serving-resolution-input-size-tradeoffs-155d}

Input resolution affects both preprocessing and inference latency, but the relationship differs depending on whether the system is compute-bound\index{Compute-Bound!resolution scaling} (limited by arithmetic throughput) or memory-bound\index{Memory-Bound!activation tensors} (limited by data movement). A compute-bound system slows proportionally to increased computation; a memory-bound system may show minimal slowdown if activation tensors still fit in fast memory. @sec-hardware-acceleration covers this distinction in depth through roofline model analysis; understanding it is essential for making informed resolution decisions.

For compute-bound models, @eq-resolution-throughput formalizes how throughput ($X$) scales inversely with resolution squared:

$$\frac{X(r_2)}{X(r_1)} = \left(\frac{r_1}{r_2}\right)^2$$ {#eq-resolution-throughput}

```{python}
#| label: resolution-scaling-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESOLUTION SCALING SLOWDOWN
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Resolution and input size tradeoffs narrative
# │
# │ Goal: Quantify the relationship between input resolution and latency.
# │ Show: The quadratic relationship between resolution and computation time.
# │ How: Calculate theoretical slowdown for doubled resolution.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: r1_str, r2_str, theoretical_str, measured_slowdown_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (resolution comparison) ---
r1_value = 224                        # original resolution
r2_value = 448                        # doubled resolution
measured_slowdown_value = 3.6         # actual measured slowdown

# --- Process (theoretical slowdown from equation) ---
theoretical_slowdown_value = (r2_value / r1_value) ** 2

# --- Outputs (formatted strings for prose) ---
r1_str = f"{r1_value}"                                                       # e.g. "224"
r2_str = f"{r2_value}"                                                       # e.g. "448"
theoretical_str = fmt(theoretical_slowdown_value, precision=0, commas=False) # e.g. "4" x
measured_slowdown_str = f"{measured_slowdown_value}"                         # e.g. "3.6" x
```

Doubling resolution from `{python} r1_str` to `{python} r2_str` theoretically yields `{python} theoretical_str` $\times$ slowdown (measured: `{python} measured_slowdown_str` $\times$ due to fixed overhead amortization). However, at high resolutions, models transition from compute-bound to memory-bound as activation tensors exceed cache capacity. @tbl-resolution-bottleneck quantifies this transition for ResNet-50, showing how arithmetic intensity decreases with resolution.
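
The quadratic relationship in @eq-resolution-throughput is simple to tabulate. The sketch below applies it to a few common input sizes relative to a 224 $\times$ 224 baseline; it models only the compute-bound regime, so it does not capture fixed overheads or the memory-bound transition.

```python
def relative_throughput(r1, r2):
    """Throughput at resolution r2 relative to r1 for a compute-bound model: (r1 / r2) ** 2."""
    return (r1 / r2) ** 2


base = 224
for r in (224, 384, 448, 512, 640):
    ratio = relative_throughput(base, r)
    print(f"{r}x{r}: {ratio:.2f}x throughput, {1 / ratio:.1f}x latency")
```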

```{python}
#| label: resolution-bottleneck-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESOLUTION AND COMPUTE BOTTLENECK TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-resolution-bottleneck (Resolution and Compute Bottleneck)
# │
# │ Goal: Demonstrate the shift from compute-bound to memory-bound operation.
# │ Show: That increasing resolution decreases arithmetic intensity.
# │ How: Compare activation sizes and FLOPs per element against the ridge point.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: act_*_mb_str, ai_*_str, ridge_point_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (activation sizes per resolution, MB) ---
act_224_mb_value = 12.5               # 224×224 activation size
act_384_mb_value = 36.8               # 384×384 activation size
act_512_mb_value = 65.5               # 512×512 activation size
act_640_mb_value = 102.4              # 640×640 activation size

# --- Inputs (arithmetic intensity, FLOPs/byte) ---
ai_224_value = 85                     # 224×224 arithmetic intensity
ai_384_value = 49                     # 384×384 arithmetic intensity
ai_512_value = 28                     # 512×512 arithmetic intensity
ai_640_value = 18                     # 640×640 arithmetic intensity

ridge_point_value = 16                # V100 ridge point (FLOPs/byte)

# --- Outputs (formatted strings for table) ---
act_224_mb_str = f"{act_224_mb_value}"                                       # e.g. "12.5" MB
act_384_mb_str = f"{act_384_mb_value}"                                       # e.g. "36.8" MB
act_512_mb_str = f"{act_512_mb_value}"                                       # e.g. "65.5" MB
act_640_mb_str = f"{act_640_mb_value}"                                       # e.g. "102.4" MB

ai_224_str = f"{ai_224_value}"                                               # e.g. "85" FLOPs/byte
ai_384_str = f"{ai_384_value}"                                               # e.g. "49" FLOPs/byte
ai_512_str = f"{ai_512_value}"                                               # e.g. "28" FLOPs/byte
ai_640_str = f"{ai_640_value}"                                               # e.g. "18" FLOPs/byte

ridge_point_str = f"{ridge_point_value}"                                     # e.g. "16" FLOPs/byte
```

The resulting shift from compute-bound to memory-bound operation is evident in @tbl-resolution-bottleneck:

| **Resolution**       |          **Activation Size** |             **Arith. Intensity** | **Bottleneck** |
|:---------------------|-----------------------------:|---------------------------------:|:---------------|
| **224 $\times$ 224** | `{python} act_224_mb_str` MB | `{python} ai_224_str` FLOPs/byte | Compute        |
| **384 $\times$ 384** | `{python} act_384_mb_str` MB | `{python} ai_384_str` FLOPs/byte | Transitional   |
| **512 $\times$ 512** | `{python} act_512_mb_str` MB | `{python} ai_512_str` FLOPs/byte | Memory BW      |
| **640 $\times$ 640** | `{python} act_640_mb_str` MB | `{python} ai_640_str` FLOPs/byte | Memory BW      |

: **Resolution and Compute Bottleneck**: ResNet-50 arithmetic intensity decreases with resolution as activation sizes grow. For a V100 PCIe (14 TFLOPS FP32, 900 GB/s bandwidth), the ridge point is approximately 16 FLOPs/byte. At 224 $\times$ 224, compute dominates; by 512 $\times$ 512, memory bandwidth becomes the limiting factor. {#tbl-resolution-bottleneck}

#### Resolution Strategies in Production {#sec-model-serving-deploymentspecific-resolution-decisions-1d76}

Different deployment contexts impose distinct resolution requirements shaped by their dominant constraints. Mobile applications often accept lower resolution (224 $\times$ 224) for object detection in camera viewfinders, where latency and battery life outweigh marginal accuracy gains. Medical imaging sits at the opposite extreme, requiring 512 $\times$ 512 or higher for diagnostic accuracy, with relaxed latency requirements that permit the additional compute. Autonomous vehicles split the difference by using multiple resolutions for different tasks: low resolution for rapid detection across wide fields of view and high-resolution crops for fine-grained recognition of detected objects. Cloud APIs face yet another challenge—they typically receive images at whatever resolution the client uploads and must handle the resulting range gracefully. This variability makes cloud APIs ideal candidates for adaptive resolution strategies, where the system selects resolution dynamically based on content characteristics.

```{python}
#| label: adaptive-resolution-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ADAPTIVE RESOLUTION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Adaptive Resolution paragraph (content-based selection)
# │
# │ Goal: Demonstrate the throughput gain from content-aware resolution.
# │ Show: A 1.4× throughput improvement while maintaining high accuracy.
# │ How: List benchmark results for adaptive vs. static high resolution.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: adaptive_throughput_improvement_str, adaptive_accuracy_retention_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (adaptive resolution results) ---
adaptive_throughput_improvement_value = 1.4    # throughput gain factor
adaptive_accuracy_retention_value = 99.2       # accuracy retention (%)

# --- Outputs (formatted strings for prose) ---
adaptive_throughput_improvement_str = fmt(adaptive_throughput_improvement_value, precision=1, commas=False)  # e.g. "1.4" x
adaptive_accuracy_retention_str = fmt(adaptive_accuracy_retention_value, precision=1, commas=False)          # e.g. "99.2" %
```

#### Adaptive Resolution {#sec-model-serving-adaptive-resolution-cb4e}

Production systems\index{Adaptive Resolution!content-based selection} can select resolution dynamically based on content. One approach runs a lightweight classifier at 128 $\times$ 128 to categorize content type, then selects a task-appropriate resolution for the main model: 512 $\times$ 512 for documents, 384 $\times$ 384 for faces, and 224 $\times$ 224 for landscapes. This approach achieves a `{python} adaptive_throughput_improvement_str` $\times$ throughput improvement with `{python} adaptive_accuracy_retention_str` percent accuracy retention versus fixed high resolution. The pattern trades a small preprocessing cost (running the lightweight classifier) for inference savings on the main model.
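
The routing step can be sketched as a lookup from predicted content type to input resolution. In the sketch below, `classify_content` is a hypothetical stand-in for the lightweight 128 $\times$ 128 classifier, and the fallback resolution for unrecognized content is an assumption that the example above does not specify.

```{.python}
from typing import Callable

# Resolutions taken from the example above
RESOLUTION_BY_CONTENT = {
    "document": 512,
    "face": 384,
    "landscape": 224,
}
FALLBACK_RESOLUTION = 384  # assumed default for unrecognized content


def select_resolution(
    thumbnail: object, classify_content: Callable[[object], str]
) -> int:
    """Pick the main-model input resolution from a cheap content guess."""
    content_type = classify_content(thumbnail)
    return RESOLUTION_BY_CONTENT.get(content_type, FALLBACK_RESOLUTION)


# Usage with stub classifiers standing in for the 128 x 128 model
print(select_resolution(None, lambda _: "document"))  # 512
print(select_resolution(None, lambda _: "unknown"))  # 384
```

Keeping the mapping in a table rather than in branching code makes it straightforward to retune resolutions per content type as accuracy measurements come in.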

The latency analysis so far has focused on sequential processing: one request completing before the next begins. The preprocessing, inference, and postprocessing stages use different hardware resources. This separation creates an opportunity to process multiple requests simultaneously.

### Hardware Utilization and Request Pipelining {#sec-model-serving-utilization-request-pipelining-c61c}

The preceding analysis examined where time goes within individual pipeline stages. Optimizing each stage in isolation, however, misses a critical opportunity: the stages use different hardware resources. The latency budget analysis in @sec-model-serving-latency-budget-ef40 reveals that model inference is only one component of the request lifecycle. From a hardware perspective, the primary goal of a serving system is to maximize the **duty cycle** of the accelerator, the percentage of time the GPU is performing useful computation.

In a serialized serving system, the hardware sits idle during network I/O and CPU-based preprocessing. High-performance serving systems use **Request Pipelining**\index{Request Pipelining!GPU utilization} to overlap these stages, ensuring the GPU is fed a continuous stream of tensors.

#### Overlapping I/O and Compute {#sec-model-serving-overlapping-io-compute-966c}

Compare\index{I/O Overlap!compute pipelining} the two timing diagrams in @fig-serving-pipeline-timing. In the serial case (A), each request must complete its entire lifecycle (Network $\rightarrow$ CPU Preprocessing $\rightarrow$ GPU Inference $\rightarrow$ Postprocessing) before the next request begins—notice the grey idle gaps that leave the GPU unused for more than 50% of the time. Now look at the pipelined case (B), where those gaps disappear.

::: {#fig-serving-pipeline-timing fig-env="figure" fig-pos="htb" fig-cap="**Request Pipelining**: Pipelining hides latency by overlapping independent operations across different hardware resources. In pipelined execution (B), the CPU processes the next request's data while the GPU executes the current request's inference. This increases the GPU duty cycle toward 100%, effectively doubling or tripling throughput on the same hardware without changing the model." fig-alt="Two timing diagrams. A (Serial): alternating CPU preprocessing, GPU inference, and idle blocks in sequence. B (Pipelined): two parallel rows where CPU preprocessing overlaps with GPU inference, eliminating idle time."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=0.8]
  \definecolor{CPUColor}{RGB}{173,216,230}
  \definecolor{GPUColor}{RGB}{144,238,144}
  \definecolor{WaitColor}{RGB}{240,240,240}

  % Serial Execution
  \node[anchor=west] at (0, 3.5) {\textbf{A. Serial Execution} (Low Utilization)};
  \draw[fill=CPUColor] (0, 2.5) rectangle (1.5, 3) node[midway] {Pre};
  \draw[fill=GPUColor] (1.5, 2.5) rectangle (3.0, 3) node[midway] {GPU};
  \draw[fill=WaitColor] (3.0, 2.5) rectangle (4.5, 3) node[midway, text=gray] {Idle};
  \draw[fill=CPUColor] (4.5, 2.5) rectangle (6.0, 3) node[midway] {Pre};
  \draw[fill=GPUColor] (6.0, 2.5) rectangle (7.5, 3) node[midway] {GPU};

  % Overlapped Execution
  \node[anchor=west] at (0, 1.5) {\textbf{B. Pipelined Execution} (High Utilization)};
  % CPU Row
  \draw[fill=CPUColor] (0, 0.5) rectangle (1.5, 1) node[midway] {Pre 1};
  \draw[fill=CPUColor] (1.5, 0.5) rectangle (3.0, 1) node[midway] {Pre 2};
  \draw[fill=CPUColor] (3.0, 0.5) rectangle (4.5, 1) node[midway] {Pre 3};
  \draw[fill=CPUColor] (4.5, 0.5) rectangle (6.0, 1) node[midway] {Pre 4};

  % GPU Row
  \draw[fill=GPUColor] (1.5, 0) rectangle (3.0, 0.5) node[midway] {GPU 1};
  \draw[fill=GPUColor] (3.0, 0) rectangle (4.5, 0.5) node[midway] {GPU 2};
  \draw[fill=GPUColor] (4.5, 0) rectangle (6.0, 0.5) node[midway] {GPU 3};
  \draw[fill=GPUColor] (6.0, 0) rectangle (7.5, 0.5) node[midway] {GPU 4};

\end{tikzpicture}
```
:::

Pipelining is enabled by **Asynchronous I/O**\index{Asynchronous I/O!pipelining} and **Concurrency Models**\index{Concurrency!serving models}. Instead of waiting for a GPU kernel to finish, the server's CPU thread submits the work to the GPU's command queue and immediately begins preprocessing the next incoming request.
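
A minimal sketch of this overlap uses a worker thread and a bounded queue from Python's standard library. The `time.sleep` calls are stand-ins for real preprocessing and GPU inference (an assumption for illustration), and the function names are hypothetical. Production servers typically rely on asynchronous GPU execution rather than a blocking worker thread, but the hand-off structure is similar.

```{.python}
import queue
import threading
import time


def preprocess(request_id: int) -> int:
    time.sleep(0.005)  # simulated 5 ms of CPU preprocessing
    return request_id


def infer(tensor: int) -> str:
    time.sleep(0.005)  # simulated 5 ms of GPU inference
    return f"result-{tensor}"


def serve_pipelined(request_ids):
    """Overlap preprocessing of the next request with current inference."""
    ready = queue.Queue(maxsize=2)  # bounded hand-off between stages
    results = []

    def gpu_worker():
        while True:
            tensor = ready.get()
            if tensor is None:  # sentinel: no more work
                break
            results.append(infer(tensor))

    worker = threading.Thread(target=gpu_worker)
    worker.start()
    for request_id in request_ids:
        ready.put(preprocess(request_id))  # CPU stage runs ahead of GPU
    ready.put(None)
    worker.join()
    return results


start = time.perf_counter()
outputs = serve_pipelined(range(20))
elapsed_ms = (time.perf_counter() - start) * 1000
print(len(outputs), f"{elapsed_ms:.0f} ms")
```

With 20 requests at 5 ms per stage, a serial loop would need about 200 ms, while the pipelined version should finish in roughly half that time because the two stages overlap. The bounded queue applies backpressure: if inference falls behind, preprocessing blocks instead of accumulating tensors in memory.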

#### The Systems Metric: Hardware Duty Cycle {#sec-model-serving-systems-metric-hardware-duty-cycle-7530}

In the "Quantitative Approach" to ML systems, we define the efficiency of a serving system by its ability to saturate the bottleneck resource. For most ML systems, this is the GPU's compute cores or memory bandwidth. We quantify this in @eq-system-efficiency:

$$\text{System Efficiency} = \frac{\sum T_{\text{compute}}}{\text{Wall Clock Time} \times \text{Resource Count}}$$ {#eq-system-efficiency}

If a ResNet-50 request takes 10 ms total (5 ms GPU, 5 ms CPU), a serial system achieves only 50% efficiency. By pipelining just two requests, efficiency approaches 100% (assuming the CPU can keep up with the GPU). If the CPU is too slow to feed the GPU, the system becomes CPU-bound, and further model optimization provides zero throughput gain—a direct application of Amdahl's Law (introduced in @sec-ml-systems) to serving: if preprocessing consumes 50% of latency, maximum speedup is 2 $\times$ regardless of how fast the model runs.
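
The arithmetic in this paragraph can be checked with a short sketch. It assumes a single GPU (resource count of 1 in @eq-system-efficiency) and steady-state operation; the function names are illustrative rather than part of any library.

```{.python}
def gpu_duty_cycle(cpu_ms: float, gpu_ms: float, pipelined: bool) -> float:
    """Steady-state fraction of wall-clock time the GPU is busy."""
    # Serial: stages run back to back. Pipelined: the slower stage
    # sets the interval between completed requests.
    interval_ms = max(cpu_ms, gpu_ms) if pipelined else cpu_ms + gpu_ms
    return gpu_ms / interval_ms


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the latency is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


print(gpu_duty_cycle(5, 5, pipelined=False))  # 0.5
print(gpu_duty_cycle(5, 5, pipelined=True))  # 1.0
print(gpu_duty_cycle(10, 5, pipelined=True))  # 0.5 (CPU-bound)
print(round(amdahl_speedup(0.5, 1e9), 3))  # 2.0
```

The third call shows the CPU-bound case: when preprocessing takes 10 ms, pipelining cannot lift the duty cycle above 50% no matter how fast the model runs.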

### Postprocessing {#sec-model-serving-postprocessing-3b24}

The request lifecycle concludes with postprocessing\index{Postprocessing!logits to predictions}, the phase that transforms model outputs into actionable results. A neural network produces raw tensors (floating-point arrays that carry no inherent meaning to applications or users). A 0.95 probability becomes a confident "dog" label only after postprocessing converts it; a sequence of token IDs becomes readable text; a bounding box tensor becomes a highlighted region in an image. Postprocessing significantly impacts both latency and the usefulness of predictions.

#### From Logits to Predictions {#sec-model-serving-logits-predictions-09df}

Classification models output logits\index{Logits!classification output} or probabilities across classes. Converting these raw outputs to predictions involves several steps. The simplest is argmax selection\index{Argmax!prediction selection}, which returns the highest-probability class. Thresholding applies a confidence cutoff, returning predictions only when the model is sufficiently certain. Top-k extraction returns multiple high-probability classes with their scores, useful when applications need ranked alternatives. Calibration adjusts raw probabilities to better reflect true likelihoods—a step that adds computation but is essential when downstream systems make decisions based on confidence scores.

For ResNet-50 image classification, typical postprocessing includes transforming logits to probabilities, extracting top predictions, and formatting responses. @lst-resnet-postprocessing shows a complete postprocessing pipeline with timing annotations, demonstrating each step from raw logits to API-ready response. Total postprocessing time is approximately 0.1ms, negligible compared to preprocessing and inference.

::: {#lst-resnet-postprocessing lst-cap="**ResNet-50 Postprocessing**: Transforms raw logits to probabilities, extracts top-k predictions, and formats the API response."}
```{.python}
import torch

# Transform raw logits to probabilities (softmax alone does not calibrate)
# Input: logits tensor of shape (batch_size, 1000) - one score per
# ImageNet class, produced by the model's forward pass
probs = torch.softmax(
    logits, dim=-1
)  # Normalize to sum=1; ~0.05ms on GPU

# Extract top-5 predictions for multi-class response
# topk returns (values, indices) sorted by probability,
# each of shape (batch_size, 5)
top5_probs, top5_indices = probs.topk(5, dim=-1)  # ~0.02ms; GPU operation

# Copy the first request's results to the CPU as plain Python lists
top5_probs = top5_probs[0].tolist()
top5_indices = top5_indices[0].tolist()

# Map class indices to human-readable labels
# IMAGENET_CLASSES: list of 1000 class names from synset mapping
labels = [
    IMAGENET_CLASSES[i] for i in top5_indices
]  # ~0.01ms; CPU lookup

# Format response with predictions and metadata for API contract
response = {
    "predictions": [
        {"label": label, "confidence": prob}
        for label, prob in zip(labels, top5_probs)
    ],
    "model_version": "resnet50-v2.1",  # Client-side version tracking
    "inference_time_ms": 5.2,  # Observability for latency monitoring
}
```
:::

Each step adds latency but improves response utility. Calibration in particular can add significant computation but is necessary when downstream systems make decisions based on confidence scores.
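
To make the calibration step concrete, the sketch below applies temperature scaling, one widely used calibration method: logits are divided by a scalar temperature before the softmax. The chapter does not name a specific method, so treating temperature scaling as the calibrator is an assumption, and the temperature value here is a placeholder; in practice it would be fit on held-out data.

```{.python}
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature must be positive)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
raw = softmax(logits)
calibrated = softmax(logits, temperature=2.0)  # placeholder, not fitted
print(round(max(raw), 3), round(max(calibrated), 3))
```

A temperature above 1 softens overconfident probabilities without changing the ranking of classes, so top-k results are unaffected while the reported confidence scores shift.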

#### Output Formatting {#sec-model-serving-output-formatting-753f}

Production systems rarely return raw predictions. Outputs must conform to API contracts that specify JSON serialization schemas, confidence score formatting, and thresholding rules. Error handling must address edge cases: what does the system return when no prediction exceeds the confidence threshold, or when the input appears out-of-distribution? Response metadata (model version, inference time, feature attributions) enables downstream monitoring and debugging.
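
One way to handle the low-confidence edge case is to return an explicit status rather than a misleading top prediction. The response schema, the `status` values, and the default threshold below are hypothetical choices for illustration, not a prescribed API contract.

```{.python}
import json
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_response(logits, class_names, k=3, threshold=0.5):
    """Build a JSON-ready response with top-k and a low-confidence flag."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    top = [
        {"label": class_names[i], "confidence": round(probs[i], 4)}
        for i in ranked[:k]
    ]
    confident = top[0]["confidence"] >= threshold
    return {
        "status": "ok" if confident else "low_confidence",
        "predictions": top if confident else [],
    }


names = ["cat", "dog", "fox"]
print(json.dumps(format_response([0.1, 4.0, 0.3], names)))
print(format_response([1.0, 1.1, 0.9], names)["status"])  # low_confidence
```

Returning an empty prediction list with a distinct status lets clients distinguish "the model is unsure" from "the request failed", which matters for downstream monitoring.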

The latency budget analysis reveals *where* time goes within a single request. Production systems, however, do not process requests in isolation: they must handle hundreds or thousands of concurrent requests competing for finite resources. Understanding this concurrency requires a different analytical framework.

## Queuing Theory {#sec-model-serving-queuing-theory-tail-latency-29a6}

The preceding lifecycle analysis assumed sequential processing. In production, concurrent requests compete for finite resources, and queuing theory\index{Queuing Theory!serving systems} predicts how this competition affects latency. These principles explain the counterintuitive behavior that causes well-provisioned systems to violate latency SLOs when load increases modestly.

### Queuing Fundamentals {#sec-model-serving-queuing-fundamentals-10d3}

Serving engineers routinely face a concrete question: given a latency SLO\index{SLO (Service Level Objective)!capacity planning} and an expected request rate, *how* many GPUs must be provisioned? Answering this question requires predicting *how* latency changes as load increases, which is precisely what queuing theory provides. Two mathematical foundations govern serving system behavior. Little's Law (@sec-machine-foundations-littles-law-21a3) relates queue depth to throughput. The M/M/1 model predicts how latency degrades under load. Together, they provide the quantitative framework for capacity planning.

### Little's Law {#sec-model-serving-littles-law-9352}

```{python}
#| label: littles-law-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LITTLE'S LAW CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Little's Law" (Capacity Planning section)
# │
# │ Goal: Connect observable request rates to hardware capacity requirements.
# │ Show: The required concurrency (memory) to sustain a 1000 QPS target.
# │ How: Apply Little's Law (L = λW) using throughput and latency SLO.
# │
# │ Imports: mlsys.constants, mlsys.formatting
# │ Exports: littles_lambda_str, littles_w_ms_str, littles_w_str, littles_l_str,
# │          serving_qps_str, serving_slo_str, serving_concurrency_slots_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import MS_PER_SEC
from mlsys.formatting import fmt, check

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class CapacityPlanningAnchor:
    """
    Namespace for serving capacity anchor.
    """
    qps_target = 1000
    slo_ms = 50
    concurrency_slots = int(qps_target * (slo_ms / 1000))

    qps_str = f"{qps_target}"
    slo_str = f"{slo_ms}ms"
    slots_str = f"{concurrency_slots}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
serving_qps_str = CapacityPlanningAnchor.qps_str
serving_slo_str = CapacityPlanningAnchor.slo_str
serving_concurrency_slots_str = CapacityPlanningAnchor.slots_str

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class CapacityPlanning:
    """
    Namespace for Little's Law Capacity calculation.
    Scenario: Determining concurrency requirements for a 1000 QPS target.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    lambda_qps = 1000.0
    latency_slo_s = 0.050 # 50ms

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # L = lambda * W
    concurrency = lambda_qps * latency_slo_s

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(concurrency == 50, f"Math broken: 1000 * 0.05 should be 50, got {concurrency}")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    littles_lambda_str = f"{lambda_qps:,.0f}"
    littles_w_ms_str = f"{int(latency_slo_s * MS_PER_SEC)}"
    littles_w_str = fmt(latency_slo_s, precision=2, commas=False)
    littles_l_str = fmt(concurrency, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
littles_lambda_str = CapacityPlanning.littles_lambda_str
littles_w_ms_str = CapacityPlanning.littles_w_ms_str
littles_w_str = CapacityPlanning.littles_w_str
littles_l_str = CapacityPlanning.littles_l_str
```

Serving engineers need a tool that connects observable metrics to capacity requirements. The most celebrated result in queuing theory, Little's Law,\index{Little's Law!concurrency calculation}\index{Little's Law!L=λW formula}[^fn-littles-law] [^fn-littles-law-intuition] provides one by relating three quantities in any stable system. Concretely, a server targeting `{python} serving_qps_str` QPS with a `{python} serving_slo_str` SLO requires `{python} serving_concurrency_slots_str` concurrent request slots, which sets the hard memory floor for activation storage on that node. @eq-littles-law states the relationship:

[^fn-littles-law]: **Little's Law**: Proven by John D.C. Little in 1961 [@little1961proof], this theorem establishes that $L = \lambda W$ holds for any stable queuing system regardless of arrival patterns, service distributions, or scheduling policies. The remarkable generality makes it one of the most useful results in operations research. For serving systems, it enables capacity planning from observable metrics: measuring queue depth and arrival rate directly yields average latency without instrumenting individual requests.

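[^fn-littles-law-intuition]: **Little's Law in the Coffee Shop**: Throughput ($\lambda$) is the rate at which customers arrive; latency ($W$) is the average time a customer spends in the shop, waiting plus drink preparation; and $L$ is the average number of customers in the shop. If customers arrive every 30 seconds ($\lambda = 2$ per minute) and each spends 1 minute in the shop ($W = 1$), then on average $L = 2$ customers are present. The law assumes a stable system: a single barista who needs a full minute per drink cannot keep up with two arrivals per minute, so the line grows indefinitely unless more baristas are added.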

$$L = \lambda \cdot W$$ {#eq-littles-law}

where $L$ is the average number of requests in the system, $\lambda$ is the arrival rate (requests per second), and $W$ is the average time each request spends in the system.

::: {.callout-perspective title="Notation Alert: L vs. Latency"}
In queuing theory, $L$ traditionally denotes the *length* of the queue (number of items in the system), and $W$ denotes *wait time* (time in system per request). Elsewhere in this book, we use $L_{\text{lat}}$ for latency with descriptive subscripts ($L_{\text{lat,wait}}$, $L_{\text{lat,compute}}$) to denote latency components. To preserve standard queuing notation, we retain $L$ for queue length and $W$ for time in system in this section. In the batching analysis that follows (@sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d), $L_{\text{lat,wait}}$ corresponds to the queueing wait component $W_q$, and $L_{\text{lat,compute}}$ includes inference time.
:::

This relationship holds regardless of arrival distribution, service time distribution, or scheduling policy. The following notebook quantifies this capacity relationship through a practical application of *Little's Law*.


::: {.callout-notebook #notebook-littles-law title="Little's Law"}

**The Capacity Physics**: How much memory do you need to serve 1,000 queries per second?

**The Law**: $L = \lambda W$ (Concurrency = Throughput $\times$ Latency) (see @sec-machine-foundations-littles-law-21a3 for the derivation).

**Scenario**:

*   **Throughput Target ($\lambda$)**: `{python} littles_lambda_str` requests/sec.
*   **Latency Target ($W$)**: `{python} littles_w_ms_str` ms (0.05 s).

**The Calculation**:
L = `{python} littles_lambda_str` $\times$ `{python} littles_w_str` = **`{python} littles_l_str` concurrent requests**

**The Constraint**: Your server *must* have enough RAM to hold `{python} littles_l_str` requests simultaneously (batch size + queue).

*   If your GPU runs out of memory at Batch Size 32, you physically **cannot** hit 1,000 QPS at 50ms latency.
*   You must either reduce latency ($W$) or buy more memory ($L$).
:::

Little's Law has immediate practical implications. If your inference service averages 10 ms per request ($W = 0.01$ s) and you observe 50 concurrent requests in the system on average ($L = 50$), then your arrival rate must be $\lambda = L/W = 5000$ requests per second. Conversely, if you need to limit concurrent requests to 10 (perhaps due to GPU memory constraints), and your service time is 10 ms, you can sustain at most 1000 requests per second.
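
These rearrangements of $L = \lambda W$ are simple enough to keep as helper functions in a capacity-planning script. The sketch below is illustrative (the function names are not from any library) and takes latency in milliseconds to match how SLOs are usually stated.

```{.python}
def required_concurrency(arrival_qps: float, latency_ms: float) -> float:
    """L = lambda * W: average number of requests in the system."""
    return arrival_qps * latency_ms / 1000.0


def max_arrival_rate(concurrency_limit: float, latency_ms: float) -> float:
    """lambda = L / W: highest rate a concurrency limit can sustain."""
    return concurrency_limit * 1000.0 / latency_ms


print(required_concurrency(1000, 50))  # 50.0 concurrent requests
print(max_arrival_rate(50, 10))  # 5000.0 requests/s
print(max_arrival_rate(10, 10))  # 1000.0 requests/s
```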

### The Utilization-Latency Relationship {#sec-model-serving-utilizationlatency-relationship-a2f0}

Little's Law describes average system behavior, but it does not reveal *how* latency changes as load approaches capacity. To answer the critical question of *how* much spare capacity a serving system needs, we turn to the M/M/1 queue model. For a system with Poisson arrivals\index{Poisson Arrivals!queuing model} and exponential service times, the average time in system follows:

$$W = \frac{1}{\mu - \lambda} = \frac{\text{service time}}{1 - \rho}$$ {#eq-mm1-wait}

where $\mu$ is the service rate (requests per second the server can handle), and $\rho = \lambda/\mu$ is the utilization\index{Utilization!latency relationship}\index{M/M/1 Queue!wait time formula} (fraction of time the server is busy).

This equation reveals why serving systems exhibit nonlinear behavior: small increases in load near capacity cause disproportionate latency increases. @tbl-utilization-latency quantifies this relationship, showing how average time in system grows rapidly as utilization approaches 100%.

The M/M/1 model assumes exponentially distributed service times, but ML inference typically has near-constant service time for fixed batch sizes, making the M/D/1\index{M/D/1 Queue!deterministic service} (deterministic service) model more accurate in practice. We use M/M/1 here because it yields closed-form solutions and produces conservative estimates. For M/D/1 queues, average wait time is approximately half of M/M/1 at the same utilization, which matters for capacity planning: M/M/1 analysis will slightly over-provision, erring on the side of meeting SLOs rather than violating them.[^fn-queuing-models]

[^fn-queuing-models]: **Kendall Notation**: The M/M/1 notation was introduced by British statistician David Kendall [@kendall1953stochastic] and follows the pattern A/S/c (Arrivals/Service/servers). "M" stands for "Markovian" (memoryless, meaning exponential distributions), honoring Russian mathematician Andrey Markov (1856-1922). "D" means deterministic. So M/M/1 describes a single server with Poisson arrivals (exponentially distributed interarrival times) and exponentially distributed service times, while M/D/1 has deterministic service. ML inference is closer to M/D/1 since inference time is nearly constant, but M/M/1 yields conservative estimates suitable for capacity planning. The mathematics underlying all queuing models traces to Agner Krarup Erlang, a Danish engineer at the Copenhagen Telephone Company who published the foundational formulas in 1909 to determine how many telephone circuits were needed to handle call traffic, the same capacity planning problem ML engineers face today with inference requests instead of phone calls.

| **Utilization ($\rho$)** | **Latency Multiple** | **Example (5ms service)** |
|:-------------------------|---------------------:|--------------------------:|
| 50%                      |         2.0 $\times$ |                      10ms |
| 70%                      |         3.3 $\times$ |                      17ms |
| 80%                      |         5.0 $\times$ |                      25ms |
| 90%                      |        10.0 $\times$ |                      50ms |
| 95%                      |        20.0 $\times$ |                     100ms |

: **Utilization-Latency Relationship**: Average **time in system** (wait + service) as a multiple of service time for an M/M/1 queue. At 50% utilization, time in system is 2 $\times$ service time; at 90%, it reaches 10 $\times$. This nonlinear growth explains why systems that perform well at moderate load suddenly violate SLOs when traffic increases: moving from 80% to 90% utilization doubles latency. {#tbl-utilization-latency}
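
The table values follow directly from @eq-mm1-wait, and the same loop can show the M/D/1 comparison. The sketch below uses the standard Pollaczek-Khinchine result for the M/D/1 queueing wait; the function names are illustrative.

```{.python}
def mm1_time_in_system(service_ms: float, rho: float) -> float:
    """Mean time in system for M/M/1: service time / (1 - rho)."""
    return service_ms / (1.0 - rho)


def md1_time_in_system(service_ms: float, rho: float) -> float:
    """Mean time in system for M/D/1 (Pollaczek-Khinchine formula)."""
    queue_wait_ms = service_ms * rho / (2.0 * (1.0 - rho))
    return service_ms + queue_wait_ms


for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    mm1 = mm1_time_in_system(5.0, rho)
    md1 = md1_time_in_system(5.0, rho)
    print(f"rho={rho:.2f}  M/M/1={mm1:6.1f} ms  M/D/1={md1:6.1f} ms")
```

The halving applies to the queueing wait rather than to total time in system: at 90% utilization with 5 ms service time, M/M/1 predicts 50 ms while M/D/1 predicts about 27.5 ms, because the 5 ms of service is paid in both models.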

### Multi-Server Considerations {#sec-model-serving-multiserver-considerations-00fc}

The preceding analysis focuses on a single ML node (one machine serving inference requests). This scope aligns with this book's focus on mastering the basic unit of ML systems. Single-node queuing dynamics are prerequisite to effective scaling. Engineers cannot optimize a distributed system without first understanding the behavior of its components.

#### When Single-Node Analysis Applies {#sec-model-serving-singlenode-analysis-applies-305d}

M/M/1 analysis remains the foundation for:

- **Right-sizing individual nodes**: Determining whether a single GPU can meet latency SLOs at expected traffic
- **Identifying the scaling trigger**: Calculating when traffic exceeds single-node capacity
- **Cost-effective provisioning**: Avoiding premature scale-out that wastes resources

For traffic exceeding single-node capacity, production systems deploy multiple replicas behind a load balancer. The M/M/c queuing model\index{M/M/c Queue!multi-server} extends M/M/1 to c parallel servers, showing that multiple replicas\index{Replica!tail latency improvement} dramatically improve tail latency: the probability of all servers being simultaneously slow drops exponentially with server count. At c=4 replicas and moderate utilization, p99 latency can be 3 $\times$ lower than the single-server case at the same total throughput. This chapter establishes single-node serving foundations; distributed inference systems (model sharding across GPUs, tensor parallelism, pipeline parallelism) introduce coordination overhead and consistency challenges that require advanced scaling principles beyond our scope here.
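
The pooling benefit can be illustrated with the Erlang C formula, which gives the probability that an arrival must queue in an M/M/c system. The sketch below compares mean queueing wait (not p99) for four replicas behind one shared queue against four independent single-server queues, each receiving a quarter of the traffic. The traffic numbers are assumptions chosen to give 70% utilization.

```{.python}
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability an arrival must queue in an M/M/c system.

    offered_load is lambda / mu, measured in busy servers.
    """
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    tail = (offered_load**servers / math.factorial(servers)) * (
        servers / (servers - offered_load)
    )
    head = sum(offered_load**k / math.factorial(k) for k in range(servers))
    return tail / (head + tail)


def mmc_mean_queue_wait_ms(servers, arrival_qps, service_ms):
    """Mean time spent waiting (excluding service) in M/M/c."""
    service_rate = 1000.0 / service_ms  # requests/s per server
    offered_load = arrival_qps / service_rate
    p_wait = erlang_c(servers, offered_load)
    return 1000.0 * p_wait / (servers * service_rate - arrival_qps)


# 560 QPS at 5 ms service time is 70% utilization on four replicas
pooled = mmc_mean_queue_wait_ms(4, 560, 5.0)
split = mmc_mean_queue_wait_ms(1, 140, 5.0)  # one of four separate queues
print(f"shared queue: {pooled:.2f} ms, separate queues: {split:.2f} ms")
```

Under these assumptions the shared queue waits a small fraction of the time the separate queues do, because a request is delayed only when all four replicas are busy at once.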

### Tail Latency {#sec-model-serving-tail-latency-5376}

Production SLOs\index{SLO (Service Level Objective)!percentile targets} typically specify percentile targets (p95, p99) rather than averages because tail latency determines user experience for the slowest requests [@dean2013tail]. For an M/M/1 queue, the p99 latency follows:

$$W_{p99} \approx \frac{\text{service time}}{1 - \rho} \cdot \ln\left(\frac{1}{1 - 0.99}\right) \approx \frac{4.6 \cdot \text{service time}}{1 - \rho}$$ {#eq-p99-latency}

At 70 percent utilization, p99 latency is approximately fifteen times the service time ($4.6 / 0.3 \approx 15.3$), while average latency is only 3.3 times. For the M/D/1 model (more representative of ML inference with near-constant service times), p99 values are roughly half these M/M/1 estimates. This explains *why* systems that seem healthy with low average latency can have unacceptable tail latency, since the average hides the experience of the unluckiest requests.
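
@eq-p99-latency can also be inverted to find the highest utilization that still meets a percentile SLO. The sketch below does both under the M/M/1 assumption; the function names are illustrative.

```{.python}
import math


def mm1_percentile_ms(service_ms, rho, percentile=0.99):
    """Latency percentile for M/M/1 (time in system is exponential)."""
    return service_ms / (1.0 - rho) * math.log(1.0 / (1.0 - percentile))


def mm1_max_utilization(service_ms, slo_ms, percentile=0.99):
    """Largest rho for which the percentile still meets the SLO."""
    factor = math.log(1.0 / (1.0 - percentile))
    return max(0.0, 1.0 - factor * service_ms / slo_ms)


print(round(mm1_percentile_ms(5.0, 0.7), 1))  # 76.8 (ms)
print(round(mm1_max_utilization(5.0, 50.0), 2))  # 0.54
```

With a 5 ms service time, a 50 ms p99 target caps M/M/1 utilization at roughly 54%, well below the level at which average latency alone would look comfortable.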

#### The Tail at Scale Problem {#sec-model-serving-tail-scale-problem-958d}

Dean and Barroso's analysis reveals *why* tail latency\index{Tail at Scale!fan-out amplification} becomes critical as systems scale beyond single machines [@dean2013tail]. When requests fan out to multiple servers, the probability of experiencing at least one slow response grows rapidly with server count. This "tail at scale" effect makes individual server tail latency critical for overall system performance.
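
The growth is easy to quantify if we assume servers are slow independently of one another, which is a simplification. If each server exceeds its latency target on 1% of requests, the chance that a fanned-out request hits at least one slow server is $1 - 0.99^N$:

```{.python}
def prob_any_slow(fan_out: int, per_server_slow: float = 0.01) -> float:
    """P(at least one of fan_out independent servers is slow)."""
    return 1.0 - (1.0 - per_server_slow) ** fan_out


for n in (1, 10, 100):
    print(n, f"{prob_any_slow(n):.1%}")
```

At a fan-out of 100, roughly 63% of requests encounter at least one slow server, so a per-server p99 event becomes the common case for the system as a whole.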

For single-machine serving, this principle has two implications. First, tail latency on individual machines matters because it will compound when systems eventually scale. Second, the tail-tolerant techniques described below (hedging, graceful degradation) provide value even on single machines and become indispensable at scale.

Request hedging sends a redundant request once the first has been outstanding longer than a timeout, then accepts whichever response arrives first. Load balancing away from slow servers addresses latency variance directly. Both techniques apply to single-machine serving with multiple GPU streams or model replicas, and become essential when scaling to distributed inference systems.

With the queuing model and tail latency analysis established, we can now apply these tools to a concrete capacity planning exercise.

```{python}
#| label: capacity-planning-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 CAPACITY PLANNING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Capacity Planning"
# │
# │ Goal: Demonstrate the complete capacity planning workflow.
# │ Show: That 10 V100 GPUs are needed for 5000 QPS at 50ms p99 with N+1 redundancy.
# │ How: Apply queuing theory to find safe utilization and required service rate.
# │
# │ Imports: math, mlsys.formatting (fmt)
# │ Exports: cp_* and mm1_* formatted strings
# └─────────────────────────────────────────────────────────────────────────────
import math
from mlsys.formatting import fmt

# --- Inputs (capacity planning requirements) ---
cp_peak_qps_value = 5000              # peak traffic (QPS)
cp_service_ms_value = 5               # TensorRT FP16 service time (ms)
cp_p99_target_ms_value = 50           # p99 latency SLO (ms)
cp_rho_safe_value = 0.72              # safe utilization (M/D/1 adjusted)
cp_v100_throughput_value = 1143       # V100 throughput at batch-16 (img/s)
cp_headroom_value = 1.3               # 30% headroom for variance

# --- Inputs (precision comparison) ---
cp_fp32_bits_value = 32               # FP32 bit width
cp_int8_bits_value = 8                # INT8 bit width

# --- Inputs (M/M/1 parameters) ---
mm1_p99_factor_value = 4.6            # p99 multiplier for M/M/1
mm1_rho_example_value = 0.7           # example utilization

# --- Process (capacity planning steps) ---
cp_mu_required_value = cp_peak_qps_value / cp_rho_safe_value
cp_gpus_raw_value = cp_mu_required_value / cp_v100_throughput_value
cp_gpus_ceil_value = math.ceil(cp_gpus_raw_value)
cp_final_raw_value = cp_gpus_ceil_value * cp_headroom_value
cp_final_ceil_value = math.ceil(cp_final_raw_value)
cp_gpus_after_fail_value = cp_final_ceil_value - 1
cp_util_after_fail_value = (cp_peak_qps_value / cp_v100_throughput_value) / cp_gpus_after_fail_value * 100
cp_precision_ratio_value = cp_fp32_bits_value // cp_int8_bits_value
mm1_wait_factor_value = mm1_p99_factor_value / (1 - mm1_rho_example_value)

# --- Outputs (formatted strings for prose) ---
cp_mu_required_str = fmt(cp_mu_required_value, precision=0, commas=False)    # e.g. "6944" req/s
cp_gpus_raw_str = fmt(cp_gpus_raw_value, precision=1, commas=False)          # e.g. "6.1" GPUs
cp_gpus_ceil_str = f"{cp_gpus_ceil_value}"                                   # e.g. "7" GPUs
cp_final_raw_str = fmt(cp_final_raw_value, precision=1, commas=False)        # e.g. "9.1" GPUs
cp_final_ceil_str = f"{cp_final_ceil_value}"                                 # e.g. "10" GPUs
cp_gpus_after_fail_str = f"{cp_gpus_after_fail_value}"                       # e.g. "9" GPUs
cp_util_after_fail_str = fmt(cp_util_after_fail_value, precision=1, commas=False)  # e.g. "48.6" %

cp_peak_qps_str = f"{cp_peak_qps_value:,}"                                   # e.g. "5,000" QPS
cp_v100_throughput_str = f"{cp_v100_throughput_value:,}"                     # e.g. "1,143" img/s
cp_rho_safe_str = fmt(cp_rho_safe_value, precision=2, commas=False)          # e.g. "0.72"
cp_rho_safe_pct_str = f"{cp_rho_safe_value * 100:.0f}"                       # e.g. "72" %
cp_mu_required_comma_str = f"{cp_mu_required_value:,.0f}"                    # e.g. "6,944" req/s
cp_precision_ratio_str = f"{cp_precision_ratio_value}"                       # e.g. "4" x

mm1_p99_factor_str = f"{mm1_p99_factor_value}"                               # e.g. "4.6"
mm1_wait_factor_str = fmt(mm1_wait_factor_value, precision=1, commas=False)  # e.g. "15.3" x
```

We can formalize this through *ResNet-50 capacity planning*.

::: {.callout-notebook title="ResNet-50 Capacity Planning"}

Consider designing a ResNet-50 serving system with these requirements:

- **Target p99 latency**: 50ms
- **Peak expected traffic**: `{python} cp_peak_qps_str` requests per second
- **Service time** (TensorRT FP16): 5ms

#### Step 1: Find Safe Utilization {.unnumbered}

From @eq-p99-latency, $W_{p99}$ ≈ `{python} mm1_p99_factor_str` $\times$ service time / (1 − $\rho$). Setting $W_{p99}$ ≤ 50ms with 5ms service time gives $\rho$ ≤ 1 − (`{python} mm1_p99_factor_str` $\times$ 5)/50 = 0.54. However, the M/M/1 model is conservative for ML inference, which has near-deterministic service times (closer to M/D/1). For M/D/1 queues, average wait is roughly half of M/M/1 at the same utilization, allowing a higher safe operating point. Using the M/D/1-adjusted threshold yields $\rho$ ≤ `{python} cp_rho_safe_str` (`{python} cp_rho_safe_pct_str`% maximum utilization).

#### Step 2: Calculate Required Service Rate {.unnumbered}

$\mu_{\text{required}}$ = `{python} cp_peak_qps_str` / `{python} cp_rho_safe_str` = `{python} cp_mu_required_comma_str` requests/second

#### Step 3: Determine GPU Count {.unnumbered}

Single V100 throughput at batch=16: `{python} cp_v100_throughput_str` images/second

GPUs needed = `{python} cp_mu_required_str` / `{python} cp_v100_throughput_str` = `{python} cp_gpus_raw_str` → `{python} cp_gpus_ceil_str` GPUs

#### Step 4: Add Headroom for Variance {.unnumbered}

Production systems add 30% headroom for traffic spikes and variance:

Final count = `{python} cp_gpus_ceil_str` $\times$ 1.3 = `{python} cp_final_raw_str` → `{python} cp_final_ceil_str` GPUs

#### Step 5: Verify Fault Tolerance {.unnumbered}

The 30% headroom addresses traffic variance, but production systems also need fault tolerance. With `{python} cp_final_ceil_str` GPUs, losing one leaves `{python} cp_gpus_after_fail_str` GPUs handling `{python} cp_peak_qps_str` QPS:

Utilization after failure = (`{python} cp_peak_qps_str` / `{python} cp_v100_throughput_str`) / `{python} cp_gpus_after_fail_str` = `{python} cp_util_after_fail_str`%

This remains well below the `{python} cp_rho_safe_pct_str`% safe utilization threshold, confirming N+1 redundancy is satisfied. For stricter fault tolerance requirements, N+2 redundancy (tolerating two simultaneous failures) would require 11–12 GPUs.

**Result**: Provision `{python} cp_final_ceil_str` V100 GPUs to serve `{python} cp_peak_qps_str` QPS at 50ms p99 latency with N+1 fault tolerance.

:::
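The five steps in the callout can be sketched as a short calculation. This is an illustrative sketch rather than the chapter's own calculation cell: the names `plan_capacity`, `mm1_safe_utilization`, and `CapacityPlan` are hypothetical, and the inputs (5,000 QPS peak, 1,143 images/second per V100, a safe utilization of 0.72, 30% headroom) are assumed example values. The M/D/1-adjusted utilization is passed in as an input because the adjustment itself is not given in closed form here.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    mu_required: float         # required aggregate service rate (req/s)
    gpus_base: int             # GPUs needed to stay under the safe utilization
    gpus_final: int            # GPUs after adding headroom
    util_after_failure: float  # utilization at peak with one GPU lost


def mm1_safe_utilization(p99_target_ms, service_ms, p99_factor=4.6):
    """Utilization bound from W_p99 ~ p99_factor * service / (1 - rho)."""
    return 1.0 - (p99_factor * service_ms) / p99_target_ms


def plan_capacity(peak_qps, rho_safe, gpu_throughput, headroom=1.3):
    """Steps 2-5: service rate, GPU count, headroom, N+1 check."""
    mu_required = peak_qps / rho_safe
    gpus_base = math.ceil(mu_required / gpu_throughput)
    # round() guards against float noise such as 13.000000000000002.
    gpus_final = math.ceil(round(gpus_base * headroom, 9))
    util_after_failure = (peak_qps / gpu_throughput) / (gpus_final - 1)
    return CapacityPlan(mu_required, gpus_base, gpus_final, util_after_failure)


# Assumed example inputs; substitute measured values for a real deployment.
plan = plan_capacity(peak_qps=5000, rho_safe=0.72, gpu_throughput=1143)
print(f"M/M/1 bound on rho: {mm1_safe_utilization(50, 5):.2f}")
print(f"Required service rate: {plan.mu_required:,.0f} req/s")
print(f"GPUs: {plan.gpus_base} before headroom, {plan.gpus_final} after")
print(f"Utilization after one failure: {plan.util_after_failure:.1%}")
```

Keeping the plan as a function makes it easy to re-run the analysis when the measured per-GPU throughput or the peak traffic estimate changes.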

The queuing analysis explains the capacity planning approach detailed in @sec-model-serving-capacity-planning-96a3 and connects directly to the MLPerf Server scenario. @sec-benchmarking explains how MLPerf measures throughput only for requests meeting the latency SLO: a system achieving 10,000 QPS but violating the SLO on 5% of requests reports only 9,500 valid QPS.

### Tail-Tolerant Techniques {#sec-model-serving-tailtolerant-techniques-066e}

Eliminating all sources of latency variability is often impractical. Production systems instead employ techniques that tolerate variability while still meeting SLOs [@dean2013tail; @dean2012rapid]. These techniques treat latency variance as a given and design around it.

#### Hedged Requests {#sec-model-serving-hedged-requests-b923}

When\index{Hedged Requests!tail tolerance} a request has not completed within the expected time, the system sends a duplicate request to another server.[^fn-hedging-etymology] The client uses whichever response arrives first and cancels the other. For ML serving, this means maintaining multiple model replicas and routing slow requests to alternative replicas. The overhead is modest: if you hedge at the 95th percentile, only 5% of requests generate duplicates, increasing load by just 5% while dramatically reducing tail latency.

[^fn-hedging-etymology]: **Hedging**: Borrowed from finance, where "hedging" means reducing risk by making offsetting bets. The term derives from the literal hedge (a boundary of shrubs) that protects a garden. Financial hedging dates to the 1600s Dutch tulip markets. Google's Jeff Dean introduced "hedged requests" in his influential "Tail at Scale" paper [@dean2013tail], applying the financial concept to distributed systems: send redundant requests to protect against the risk of slow responses.

CUDA kernels cannot be interrupted mid-execution, which complicates cancellation. When one copy of a hedged request completes, the duplicate should be cancelled, but a duplicate whose inference has already begun on the GPU cannot simply be stopped. Practical options include checking a cancellation flag before launching inference, accepting the wasted compute for the in-flight kernel, or using request prioritization to deprioritize the duplicate. Since hedging typically applies only to the slowest 5% of requests, the overhead from occasional wasted compute remains acceptable.
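A minimal sketch of the hedging pattern, using `asyncio` and two simulated replicas, is shown below. The names `hedged_call` and `make_replica` are hypothetical, the replicas are stand-ins for real inference clients, and the hedge threshold would in practice be derived from the observed 95th-percentile latency.

```python
import asyncio


async def hedged_call(replicas, request, hedge_after_s):
    """Send to the first replica; hedge to the second if it is slow.

    `replicas` is a list of async callables (hypothetical replica clients).
    Returns the first response to arrive and cancels the loser.
    """
    primary = asyncio.create_task(replicas[0](request))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()

    # Primary exceeded the hedge threshold: issue one duplicate.
    backup = asyncio.create_task(replicas[1](request))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # Best effort; work already on a GPU would still run.
    return done.pop().result()


def make_replica(name, delay_s):
    """Build a fake replica client that answers after `delay_s` seconds."""
    async def call(request):
        await asyncio.sleep(delay_s)
        return f"{name}:{request}"
    return call


replicas = [make_replica("slow", 0.5), make_replica("fast", 0.01)]
result = asyncio.run(hedged_call(replicas, "img-1", hedge_after_s=0.05))
print(result)
```

Cancelling the losing task only stops the client from waiting on it. As noted above, a real server needs its own cancellation check to avoid launching the duplicate inference.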

#### Tied Requests {#sec-model-serving-tied-requests-961c}

Tied requests\index{Tied Requests!latency reduction} send the request to multiple servers simultaneously, but include a tag allowing servers to cancel execution once another server begins processing. This eliminates the delay of waiting to detect a slow response before hedging. For inference servers with significant startup overhead from model loading and memory allocation, tied requests ensure at least one server begins immediately.

#### Canary Requests {#sec-model-serving-canary-requests-83b2}

For\index{Canary Requests!fan-out protection} requests that fan out to many backends, first send the request to a small subset of 1 to 2 servers.[^fn-canary-etymology] If these return within expected time, send to the remainder. If the canary is slow, the system can take corrective action by retrying elsewhere or using cached results before committing to the full fan-out. This prevents a single slow backend from stalling an entire distributed inference request.

\index{Canary!etymology}

[^fn-canary-etymology]: **Canary**: From the practice of using canary birds in coal mines from the early 1900s through the 1980s. Miners brought caged canaries underground because the birds' high metabolic rate made them sensitive to carbon monoxide and methane, dying before gas concentrations became lethal to humans. In software, "canary" describes any small-scale test that detects problems before they affect the full system, whether canary deployments, canary requests, or canary tests.

#### Graceful Degradation {#sec-model-serving-graceful-degradation-d1d8}

When\index{Graceful Degradation!overload handling} load exceeds capacity, return approximate results rather than timing out. For classification, return cached predictions for similar inputs. For generative models, return shorter outputs. For ensemble systems, return predictions from a subset of models. This maintains responsiveness at the cost of some accuracy, which users often prefer to outright failures.

#### Admission Control {#sec-model-serving-admission-control-c852}

When\index{Admission Control!queue depth threshold} traffic exceeds capacity, accepting all requests can trigger widespread SLO violations. Admission control proactively rejects requests when queue depth exceeds a threshold, returning immediate 503 responses rather than accepting requests that are likely to time out. This sacrifices throughput to protect latency for admitted requests.

A practical starting point for setting the threshold is 2 to 3 times the service time (in milliseconds) multiplied by the number of workers. For a system with 4 workers and 10 ms service time, this yields a queue depth threshold of 80 to 120 requests. Adaptive admission control adjusts thresholds based on observed p99 latency, tightening when latency increases above target and relaxing when latency remains healthy.
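The threshold heuristic and its adaptive variant can be sketched as a small controller. The class name `AdmissionController`, the 10% adjustment step, and the choice to clamp the threshold between the worker count and the upper end of the heuristic range are all illustrative assumptions.

```python
class AdmissionController:
    """Reject requests once queue depth exceeds an adaptive threshold."""

    def __init__(self, workers, service_ms, p99_target_ms, factor=2.0):
        # Starting threshold from the heuristic in the text:
        # factor x service time (ms) x workers, with factor in [2, 3].
        self.min_depth = workers
        self.max_depth = round(3.0 * service_ms * workers)
        self.threshold = round(factor * service_ms * workers)
        self.p99_target_ms = p99_target_ms

    def admit(self, queue_depth):
        """Return an HTTP-style status: 200 to enqueue, 503 to shed."""
        return 200 if queue_depth < self.threshold else 503

    def update(self, observed_p99_ms, step=0.1):
        """Tighten when p99 is over target, relax when it is healthy."""
        if observed_p99_ms > self.p99_target_ms:
            proposed = round(self.threshold * (1 - step))
        else:
            proposed = round(self.threshold * (1 + step))
        self.threshold = max(self.min_depth, min(self.max_depth, proposed))


controller = AdmissionController(workers=4, service_ms=10, p99_target_ms=50)
print(controller.threshold, controller.admit(79), controller.admit(80))
controller.update(observed_p99_ms=70)  # latency over target: tighten
print(controller.threshold)
```

A production controller would typically smooth the p99 signal over a window before adjusting, so that a single slow request does not cause the threshold to oscillate.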

#### Retry Storm Prevention {#sec-model-serving-retry-storm-prevention-4bf0}

A subtle\index{Retry Storm!load shedding coordination} failure mode occurs when all replicas are overloaded simultaneously. If the load balancer retries rejected requests at other replicas that are also overloaded, retry traffic amplifies the overload. Coordinated load shedding addresses this by sharing load information across replicas, enabling system-wide decisions about which requests to accept. When global load exceeds capacity, replicas collectively reject the same fraction of requests rather than each rejecting independently and triggering retries.
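One way to make replicas reject the same fraction of traffic is sketched below. It assumes that every replica sees the same global load estimate and that requests carry a stable identifier. Hashing that identifier, so a retried request receives the same decision at every replica, is an illustrative design choice rather than a technique prescribed by the text. The function names are hypothetical.

```python
import zlib


def shed_fraction(global_load_qps, global_capacity_qps):
    """Fraction of requests to reject so admitted load equals capacity."""
    if global_load_qps <= global_capacity_qps:
        return 0.0
    return 1.0 - global_capacity_qps / global_load_qps


def should_reject(request_id, fraction):
    """Deterministic decision: the same request id maps to the same bucket
    on every replica, so a retry elsewhere is rejected again."""
    bucket = zlib.crc32(request_id.encode()) % 10_000
    return bucket < fraction * 10_000


fraction = shed_fraction(global_load_qps=12_000, global_capacity_qps=10_000)
rejected = sum(should_reject(f"req-{i}", fraction) for i in range(10_000))
print(f"target fraction: {fraction:.3f}, observed: {rejected / 10_000:.3f}")
```

Because the decision depends only on the request identifier and the shared fraction, retrying a rejected request at another replica does not add admitted load.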

These techniques become essential at scale when fan-out amplification makes individual server tail latency visible to users. Single-machine serving systems can implement hedged and tied requests across GPU streams or model replicas. The queuing analysis here assumes FIFO processing, but production systems often implement priority scheduling such as deadline-aware or shortest-job-first approaches to further reduce tail latency for heterogeneous workloads [@harchol2013performance].

The tail-tolerant techniques examined in this section optimize the flow of requests through a functioning serving system. The queuing analysis, however, assumes two critical preconditions: that models are loaded and ready to process requests, and that predictions match what was validated during development. In production, these assumptions fail regularly: during deployments, new instances must load models from scratch; during scaling events, cold start latency affects the first requests to new replicas; and when preprocessing pipelines diverge from training, accuracy silently degrades. The next section examines these lifecycle challenges that must be solved before queuing optimization becomes relevant.

:::: {.callout-checkpoint title="Queuing and SLO Headroom" collapse="false"}
Latency SLOs are not enforced by "fast inference" alone; they are enforced by *headroom*.

- [ ] **Little's Law**: Can you use \(L = \lambda W\) to explain why rising queue depth implies rising latency even if per-request compute time is unchanged?
- [ ] **Utilization cliff**: Can you explain why latency grows non-linearly as utilization \(\rho\) approaches 1, and why production systems target a conservative \(\rho\) rather than "100% busy"?
- [ ] **Wait vs. compute**: Given an end-to-end latency budget, can you separate \(L_{\text{lat,compute}}\) from \(L_{\text{lat,wait}}\) and explain which one queuing theory primarily predicts?
- [ ] **Capacity planning**: Can you explain why a throughput number is only "real" if requests still meet the percentile latency SLO under load?
::::

## Model Lifecycle Management {#sec-model-serving-model-lifecycle-management-ff2e}

Queuing theory and tail-tolerant techniques optimize the steady-state flow of requests, but they cannot help if the system never reaches steady state. A newly deployed replica that takes 35 seconds to compile its TensorRT engine violates every SLO during that window. A model whose OpenCV-based serving pipeline resizes images differently than the PIL-based training pipeline silently drops 5 percentage points of accuracy—a degradation invisible to latency dashboards. These lifecycle failures are not edge cases; they occur at every deployment, every scaling event, and every framework migration. Addressing them requires engineering discipline in two areas: getting models ready to serve (cold start and initialization) and keeping predictions faithful to what was validated (training-serving skew).

### Training-Serving Skew {#sec-model-serving-trainingserving-skew-7b99}

A model that performed well during validation may silently degrade when deployed. This phenomenon, known as **training-serving skew**\index{Training-Serving Skew!silent accuracy degradation}, represents one of the most subtle failure modes in production ML because it is invisible to latency monitoring and exception tracking.

::: {.callout-definition title="Training-Serving Skew"}

***Training-Serving Skew***\index{Training-Serving Skew!definition} is the **Distributional Divergence** between the training and inference environments. It arises when the function $f_{train}(x)$ differs from $f_{serve}(x)$ due to inconsistent preprocessing logic or environmental state, violating the **Consistency Imperative** and causing silent accuracy degradation.

:::

@sec-ml-operations provides comprehensive coverage of skew diagnosis, monitoring, and organizational prevention strategies. Here we focus on the *serving-specific* manifestation: **preprocessing divergence**\index{Preprocessing Divergence!training vs serving}. This occurs when the real-time inference pipeline processes raw data differently than the batch training pipeline, a common failure mode when training uses Python/Pandas while serving uses C++/Java or optimized inference servers. Unlike data drift (which @sec-ml-operations addresses through monitoring), preprocessing divergence is deterministic and preventable through careful engineering.

::: {.callout-example title="ResNet-50: Image Preprocessing Skew"}

For ResNet-50 serving, common sources of skew include:

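**Resize interpolation**\index{Resize Interpolation!skew source}: Training pipelines often resize with PIL's `BILINEAR` filter, while OpenCV-based serving code defaults to `cv2.INTER_LINEAR`. Although both are nominally bilinear, the implementations differ, producing pixel-level differences that can shift accuracy by 0.5–1%.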

**Color space handling**: JPEG loading in different libraries may produce BGR vs RGB ordering. If the model trained on RGB but serves BGR inputs, predictions are essentially random.

**Normalization constants**: ImageNet normalization uses specific mean/std values. Using `mean=[0.5, 0.5, 0.5]` instead of `mean=[0.485, 0.456, 0.406]` shifts inputs out of the training distribution.

**Prevention**: The safest approach is to export the exact preprocessing code used during training and run it identically in serving, or use a framework like NVIDIA DALI that can help standardize preprocessing across training and serving environments.

:::
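The effect of two of these skew sources can be illustrated on a single pixel. The sketch below uses only the standard library. The mean values come from the example above, the standard deviation values are the commonly used ImageNet constants (an addition not stated in the callout), and the helper names and probe pixel are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)  # commonly used values, assumed here


def normalize(pixel_rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale an 8-bit pixel to [0, 1], then standardize per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(pixel_rgb, mean, std))


def max_abs_diff(a, b):
    """Largest per-channel difference between two preprocessed pixels."""
    return max(abs(x - y) for x, y in zip(a, b))


pixel = (200, 120, 40)  # arbitrary RGB probe value

train_view = normalize(pixel)
bgr_view = normalize(pixel[::-1])                          # channels swapped
wrong_mean_view = normalize(pixel, mean=(0.5, 0.5, 0.5))   # wrong constants

print("BGR/RGB swap:", round(max_abs_diff(train_view, bgr_view), 3))
print("wrong mean:  ", round(max_abs_diff(train_view, wrong_mean_view), 3))
```

The same `max_abs_diff` comparison, applied to full tensors from the training and serving pipelines on a fixed set of probe images, makes a simple parity test that can run in continuous integration.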

### Cold Start and Initialization Dynamics {#sec-model-serving-model-loading-initialization-cc5a}

With preprocessing pipelines designed to avoid training-serving skew, the next challenge is getting models ready to serve. Before processing any request, models must load from storage into memory and prepare for inference [@romero2021infaas]. This initialization latency, known as **cold start**\index{Cold Start!scaling events}, affects system responsiveness during deployments, scaling events, and recovery from failures.

::: {.callout-definition title="Cold Start"}

***Cold Start***\index{Cold Start!initialization latency}—a metaphor borrowed from automotive engineering where engines operate inefficiently until reaching thermal equilibrium—is the **Initialization Latency** incurred when instantiating a new model replica. It represents the fixed cost of **State Hydration** (loading weights, compiling graphs) that effectively blocks the system's ability to scale elastically in response to traffic bursts.
:::

Cold start dynamics determine whether systems meet latency requirements from the moment they begin serving traffic. A *cold start timeline* for a representative model reveals where each phase contributes to total initialization latency.

Cold start\index{Cold Start!anatomy} latency compounds from multiple sources, each adding to the time between deployment and serving readiness. Weight loading\index{Weight Loading!cold start} reads model parameters from disk or network storage. Graph compilation\index{Graph Compilation!JIT overhead} performs just-in-time compilation of operations for the specific hardware. Memory allocation reserves GPU memory for activations and intermediate values. Warmup\index{Warmup!cache population}[^fn-warmup-etymology] execution performs initial inferences that populate caches and trigger lazy initialization.

[^fn-warmup-etymology]: **Warmup**: The computing metaphor derives from physical warming, where engines and machines perform better after reaching operating temperature. In JIT-compiled systems like the JVM (1990s), "warmup" specifically refers to the period when the runtime gathers profiling data and compiles hot paths. For ML serving, warmup serves a dual purpose: triggering lazy memory allocation and populating CPU/GPU caches with frequently-accessed data, ensuring the first real user request does not pay these one-time costs.

```{python}
#| label: cold-start-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COLD START TIMELINE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Cold Start Timeline"
# │
# │ Goal: Decompose the cold start latency of serverless inference.
# │ Show: That pre-compiling models reduces cold start from ~35s to ~1.5s.
# │ How: Sum phase durations for data transfer, CUDA initialization, and model loading.
# │
# │ Imports: (none)
# │ Exports: cs_*_str formatted strings for timeline table
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (cold start phases, seconds) ---
cs_ssd_value = 0.5                    # weight loading from SSD
cs_s3_value = 4.0                     # weight loading from S3
cs_cuda_value = 0.4                   # CUDA context initialization
cs_compile_value = 30.0               # TensorRT compilation
cs_warmup_value = 0.2                 # warmup inferences
cs_runtime_overhead_value = 0.4       # runtime overhead

# --- Process (total cold start times) ---
cs_local_total_value = cs_ssd_value + cs_cuda_value + cs_warmup_value + cs_runtime_overhead_value
cs_cloud_total_value = cs_s3_value + cs_cuda_value + cs_compile_value + cs_warmup_value

# --- Outputs (formatted strings for table) ---
cs_local_str = f"~{cs_local_total_value:.1f}s"                               # e.g. "~1.5s"
cs_cloud_str = f"~{cs_cloud_total_value:.0f}s"                               # e.g. "~35s"
cs_ssd_str = f"{cs_ssd_value}s"                                              # e.g. "0.5s"
cs_s3_str = "3-5s"                                                           # range for S3
cs_cuda_str = "0.3-0.5s"                                                     # range for CUDA
cs_compile_str = "15-30s"                                                    # range for TensorRT
cs_warmup_str = f"{cs_warmup_value}s"                                        # e.g. "0.2s"
```

::: {.callout-notebook title="ResNet-50: Cold Start Timeline"}

| **Phase**                       | **Duration**                | **Notes**                                         |
|:--------------------------------|:----------------------------|:--------------------------------------------------|
| **Weight loading (SSD)**        | `{python} cs_ssd_str`       | 98MB FP32 weights from local storage              |
| **Weight loading (S3)**         | `{python} cs_s3_str`        | Network latency dominates for cloud storage       |
| **CUDA context**                | `{python} cs_cuda_str`      | GPU driver initialization and memory setup        |
| **TensorRT compilation**        | `{python} cs_compile_str`   | Converts PyTorch model to optimized engine        |
| **Warmup (10 inferences)**      | `{python} cs_warmup_str`    | Triggers remaining lazy initialization            |
| **Total (local, optimized)**    | **`{python} cs_local_str`** | With pre-compiled TensorRT engine, warm container |
| **Total (cloud, first deploy)** | **`{python} cs_cloud_str`** | Including compilation from cold state             |

**Key insight**: Pre-compiling models and storing the optimized engine eliminates the 30-second compilation phase on subsequent deployments.

:::

The CUDA context[^fn-cuda-serving]\index{CUDA Context!initialization overhead} is the first cost in the cold start timeline. Before any GPU operation, the CUDA runtime must establish a *context*: a data structure that tracks memory allocations, loaded kernels, and device state. Creating a context requires communicating with the GPU driver and allocating GPU memory for internal bookkeeping. This one-time cost (0.3–0.5 s) affects every new process that uses the GPU. CUDA 11.7 introduced lazy module loading, which defers loading kernels until first use, reducing apparent startup time but shifting cost to the first inference.

CUDA MPS (Multi-Process Service)[^fn-cuda-mps]\index{CUDA MPS!context sharing} addresses the context overhead for multi-model deployments. Normally, each process creates its own CUDA context, and the GPU time-slices between contexts. MPS allows multiple processes to share a single context, eliminating redundant initialization and enabling concurrent kernel execution. For serving systems running multiple model replicas, MPS can reduce aggregate cold start time and improve GPU utilization. The trade-off is reduced isolation: a crash in one process can affect others sharing the MPS server.

[^fn-cuda-serving]: **CUDA (Compute Unified Device Architecture)**: NVIDIA's parallel computing platform, first released in June 2007. The name reflects its original goal: unifying the diverse shader programming models of GPUs into a single general-purpose computing architecture. Before CUDA, GPU programming required disguising computations as graphics operations (rendering triangles to perform matrix math). CUDA exposed the GPU's thousands of cores through a C-like programming model, enabling the direct parallel computation that made deep learning on GPUs practical. The CUDA context—the data structure tracking memory allocations, loaded kernels, and device state—is the runtime's per-process gateway to GPU resources.

[^fn-cuda-mps]: **CUDA MPS (Multi-Process Service)**: Introduced in CUDA 5.0 (2012) and substantially improved in CUDA 11.4 (2021). MPS creates a daemon process that mediates GPU access for multiple client processes through a shared CUDA context, enabling true concurrent kernel execution rather than the time-sliced scheduling that occurs with separate contexts. For serving workloads, MPS enables multiple model replicas to share a GPU efficiently—each replica submits kernels independently, and the GPU schedules them across its streaming multiprocessors. The primary limitation is fault isolation: because all clients share a context, a segfault or illegal memory access in one process can corrupt the GPU state for all others.

Without warmup, the first real request triggers compilation and memory allocation mid-inference, often causing timeout failures. A request that normally takes 5 ms might require 500 ms during cold start, violating SLOs and degrading user experience.

### Loading Strategies {#sec-model-serving-loading-strategies-eb38}

Different loading strategies trade off cold start duration against serving performance and memory efficiency. The simplest approach, *full loading*\index{Model Loading!full loading}, reads the entire model into memory before serving begins. This maximizes inference speed since all weights are immediately available, but extends cold start duration and limits model size to available memory. The approach is appropriate when cold start latency is acceptable and models comfortably fit in memory.

When models are too large for immediate full loading, *memory mapping*\index{Memory Mapping!on-demand loading}\index{mmap!model loading} offers an alternative by mapping model files directly into the address space and loading pages on demand as accessed. This reduces cold start time since inference can begin before the full model loads, but causes unpredictable latency as pages fault in during initial requests. Memory mapping works well for infrequently accessed model components but can cause latency spikes if critical weights are not preloaded.
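The on-demand behavior can be sketched with Python's standard `mmap` module. The file below is a stand-in for a weights file: its flat little-endian float32 layout and the file name are assumptions for illustration, not a real checkpoint format.

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in "weights file": 1,000 float32 values (hypothetical layout).
values = [float(i) for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}f", *values))

with open(path, "rb") as f:
    # Mapping is typically cheap: pages are read only when first touched.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reading one element faults in only the page that backs it.
    offset = 500 * 4  # element 500, 4 bytes per float32
    (w500,) = struct.unpack_from("<f", mapped, offset)
    mapped.close()

print(w500)
```

To avoid first-request page faults on critical weights, a serving process can touch those regions during startup, which trades some cold start time for more predictable latency.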

A third strategy, *lazy initialization*\index{Lazy Initialization!deferred compilation}, defers compilation and allocation until first use. This minimizes startup time but shifts latency to the first request. Production systems often combine lazy initialization with synthetic warmup requests to trigger initialization before real traffic arrives.
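The combination of lazy initialization and synthetic warmup can be sketched as follows. `LazyModel` is a hypothetical stand-in whose first call pays a simulated one-time setup cost; a real model would compile kernels and allocate memory at that point.

```python
import time


class LazyModel:
    """Stand-in for a model whose first call pays a one-time setup cost."""

    def __init__(self, init_cost_s=0.05):
        self._init_cost_s = init_cost_s
        self._ready = False

    def predict(self, x):
        if not self._ready:
            time.sleep(self._init_cost_s)  # simulated compilation + allocation
            self._ready = True
        return x * 2  # placeholder for real inference


def warm_up(model, sample_input, n_requests=10):
    """Run synthetic requests and return per-request latency in ms."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        model.predict(sample_input)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


model = LazyModel()
latencies = warm_up(model, sample_input=1.0)
print(f"first: {latencies[0]:.1f} ms, last: {latencies[-1]:.3f} ms")
```

In a deployment, the replica would report itself ready to the load balancer only after `warm_up` returns, so the one-time cost is paid by synthetic traffic rather than by a user request.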

### Model Caching Infrastructure {#sec-model-serving-model-caching-infrastructure-4f1a}

Production systems cache model weights at the infrastructure level to reduce cold start for common deployment scenarios. One approach, *container image embedding*\index{Model Caching!container embedding}, bundles model weights directly in the container image. This produces a single deployment artifact and eliminates network fetches at startup, but creates large images (often 10-50 GB) that slow container pulls and consume registry storage. This approach works best for models that rarely update.

For organizations with many models and frequent updates, a *shared filesystem* (EFS, GCS FUSE) containing model weights provides a more flexible alternative. Multiple replicas share cached weights, and updates propagate immediately without redeployment. The tradeoff is that network latency affects cold start, and filesystem availability becomes a critical dependency.

When cold start latency is critical for high-traffic models, *node-local SSD caching*\index{SSD Cache!model loading} pre-populates local SSDs on inference nodes with frequently-used models. This approach provides fast loading (500MB/s+ for NVMe) without network dependency, but requires cache management to handle model updates and capacity limits. The choice among these strategies depends on model update frequency: infrequent updates favor container embedding, frequent updates favor shared filesystem, and performance-critical deployments benefit from local caching with background refresh.

### Multi-Model Serving {#sec-model-serving-multimodel-serving-a9c1}

Production systems often serve multiple models from a single machine\index{Multi-Model Serving!GPU memory management}, whether different model versions for A/B testing, ensemble components, or entirely different models sharing infrastructure. GPU memory becomes the limiting resource, requiring careful management strategies.

Three strategies address multi-model memory management. Time-multiplexing\index{Time-Multiplexing!model swapping} loads one model at a time and swaps based on request routing—simple but introduces swap latency. Memory sharing\index{Memory Sharing!GPU multi-model} partitions GPU memory among models, limiting concurrent execution count but enabling more models to remain resident. Model virtualization, as implemented by frameworks like Triton, manages model lifecycle automatically, loading and unloading models based on traffic patterns [@nvidia2024triton]. The choice depends on request patterns: if models receive traffic evenly, concurrent loading works; if traffic is bursty and model-specific, time-multiplexing with intelligent preloading reduces average latency while maximizing GPU utilization.
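A simple version of traffic-driven loading and unloading is a least-recently-used cache with a memory budget. The sketch below is illustrative only: `ModelCache` and its `loader` callback are hypothetical names, and it does not describe how Triton or any other framework implements model management.

```python
from collections import OrderedDict


class ModelCache:
    """Keep recently used models resident within a GPU memory budget."""

    def __init__(self, budget_gb, loader):
        self.budget_gb = budget_gb
        self.loader = loader           # hypothetical: name -> (model, size_gb)
        self.resident = OrderedDict()  # name -> (model, size_gb), LRU first
        self.swaps = 0

    def _used_gb(self):
        return sum(size for _, size in self.resident.values())

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name][0]
        model, size_gb = self.loader(name)
        # Evict least recently used models until the new one fits.
        while self.resident and self._used_gb() + size_gb > self.budget_gb:
            self.resident.popitem(last=False)
        self.resident[name] = (model, size_gb)
        self.swaps += 1  # each load implies a host-to-device transfer
        return model


SIZES_GB = {"a": 6, "b": 6, "c": 4}  # assumed model sizes
cache = ModelCache(budget_gb=12, loader=lambda n: (f"model-{n}", SIZES_GB[n]))
for name in ["a", "b", "c", "a"]:
    cache.get(name)
print(list(cache.resident), cache.swaps)
```

The swap counter is the quantity to watch: every miss implies a model load, so a routing policy that keeps requests for the same model on the same GPU reduces it.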

#### Multi-Stream Execution {#sec-model-serving-multistream-execution-1b1f}

When multiple models or multiple instances of the same model must run concurrently on a single GPU, the hardware must partition resources between them. NVIDIA's Multi-Instance GPU[^fn-mig]\index{MIG (Multi-Instance GPU)!hardware isolation} technology enables hardware-level isolation, dividing an A100 into up to 7 independent GPU instances, each with dedicated memory and compute resources. MIG is available on A100, A30 (up to 4 instances), H100, H200, and newer data center GPUs. For older GPUs such as V100 or T4, CUDA stream scheduling provides time-multiplexed sharing without hardware isolation.

[^fn-mig]: **MIG (Multi-Instance GPU)**: Introduced with NVIDIA's A100 GPU (Ampere architecture, 2020). MIG partitions a single physical GPU into up to seven independent instances, each with dedicated streaming multiprocessors, memory controllers, and L2 cache. Unlike software-based sharing (MPS or time-slicing), MIG provides hardware-level isolation: a runaway kernel in one partition cannot affect another's performance or memory. The trade-off is granularity—partitions must follow fixed profiles (e.g., 1g.5gb, 2g.10gb, 3g.20gb on A100), so resources cannot be divided arbitrarily. For multi-model serving, MIG enables running different models on the same GPU with guaranteed SLOs per model, eliminating the "noisy neighbor" problem that plagues time-shared GPU access.

The choice depends on whether consistent latency with MIG or maximum utilization with shared streams is the priority.

#### Model Swapping and Host Memory {#sec-model-serving-model-swapping-host-memory-c54f}

When the aggregate size of all models exceeds GPU memory capacity, the serving system must swap models between host memory (DRAM)\index{DRAM!host memory} and device memory (VRAM)\index{VRAM!device memory} on demand. This introduces a new latency component determined by the PCIe bus bandwidth\index{PCIe Bandwidth!model swapping}.

```{python}
#| label: model-swap-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MODEL SWAP TIME
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Model swapping and host memory discussion
# │
# │ Goal: Quantify the latency cost of model swapping.
# │ Show: That loading a 10 GB model over PCIe takes 300 ms, exceeding most SLOs.
# │ How: Calculate transfer duration using PCIe Gen4 bandwidth constants.
# │
# │ Imports: mlsys.constants (PCIE_GEN4_BW, GB, second), mlsys.formatting (fmt)
# │ Exports: model_size_gb_str, pcie_bw_gbs_str, model_swap_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import PCIE_GEN4_BW, GB, second
from mlsys.formatting import fmt

# --- Inputs (model and PCIe specs) ---
model_size_gb_value = 10              # model size (GB)
pcie_bw_gbs_value = PCIE_GEN4_BW.to(GB / second).magnitude  # PCIe Gen4 x16 bandwidth

# --- Process (swap time calculation) ---
model_swap_ms_value = model_size_gb_value / pcie_bw_gbs_value * 1000

# --- Outputs (formatted strings for prose) ---
model_size_gb_str = f"{model_size_gb_value}"                                 # e.g. "10" GB
pcie_bw_gbs_str = fmt(pcie_bw_gbs_value, precision=0, commas=False)          # e.g. "32" GB/s
model_swap_ms_str = fmt(model_swap_ms_value, precision=0, commas=False)      # e.g. "312" ms
```

For a `{python} model_size_gb_str` GB model on PCIe Gen4 x16 (`{python} pcie_bw_gbs_str` GB/s theoretical bandwidth), loading takes at least:

$T_{\text{load}}$ = `{python} model_size_gb_str` GB / `{python} pcie_bw_gbs_str` GB/s ≈ `{python} model_swap_ms_str` ms

To mitigate this, systems use *pinned memory*\index{Pinned Memory!DMA transfer} (page-locked host memory). By default, the operating system can move ("page") any memory region to disk when RAM is under pressure. This creates a problem for GPU transfers: if the GPU's DMA (Direct Memory Access) engine begins reading a memory region that gets paged out mid-transfer, the transfer fails or stalls. To avoid this, transfers from ordinary pageable memory are staged: the CPU first copies the data into a temporary pinned buffer that the GPU can safely read, adding both latency and CPU overhead.

Pinning memory instructs the OS to keep that region permanently in physical RAM. The GPU's DMA engine can then transfer data directly from the pinned region at full PCIe bandwidth without CPU involvement. The trade-off is that pinned memory reduces the RAM available for other processes and cannot be reclaimed under memory pressure. For model serving, the performance gain (2–3 $\times$ faster transfers) typically justifies pinning model weights and frequently-used input buffers, while leaving less critical memory pageable.
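The swap-time arithmetic, extended to the pageable case, can be sketched as below. The 32 GB/s figure is the assumed theoretical PCIe Gen4 x16 bandwidth, and the pageable estimate simply applies the 2–3 $\times$ factor cited above. Real transfer times depend on the platform and should be measured.

```python
def transfer_ms(size_gb, bandwidth_gbs):
    """Lower-bound transfer time for `size_gb` at `bandwidth_gbs`."""
    return size_gb / bandwidth_gbs * 1000.0


PCIE_GEN4_X16_GBS = 32.0  # assumed theoretical bandwidth

pinned_ms = transfer_ms(10, PCIE_GEN4_X16_GBS)
# Pinned transfers are cited as 2-3x faster, so pageable transfers
# are modeled here as 2-3x slower than the pinned lower bound.
pageable_ms = (pinned_ms * 2, pinned_ms * 3)

print(f"pinned:   {pinned_ms:.1f} ms")
print(f"pageable: {pageable_ms[0]:.1f} to {pageable_ms[1]:.1f} ms")
```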

The lifecycle management strategies examined so far ensure models are ready to serve: loaded into memory, warmed up, and producing predictions consistent with training. With these prerequisites satisfied, the queuing dynamics from @sec-model-serving-queuing-theory-tail-latency-29a6 become relevant. The next optimization opportunity lies in how requests are grouped for processing, which directly affects both the throughput and latency terms in our queuing equations.

## Throughput Optimization {#sec-model-serving-throughput-optimization-18d1}

Consider a ResNet-50 classifier running on a V100 GPU at batch size 1: the GPU processes one image, then sits idle while the CPU fetches and preprocesses the next—achieving only 15% hardware utilization and 200 images per second. The same GPU processing 32 images at once reaches 95% utilization and 1,280 images per second, a 6.4 $\times$ throughput improvement on identical hardware. The difference is batching, the core lever for improving serving economics. Batching\index{Batching!training vs serving}\index{Batching!throughput optimization}[^fn-batch-etymology] differs sharply between training and serving [@crankshaw2017clipper]. Training batches maximize throughput by processing hundreds or thousands of samples together with no concern for individual sample latency. Serving batches must balance throughput against individual request latency, typically processing single digits of requests together while ensuring no request waits too long. This adaptive approach is called **dynamic batching** because the system adjusts batch composition in real time based on arriving requests.

[^fn-batch-etymology]: **Batch**: From Middle English "bache" (a quantity baked at one time), derived from Old English "bacan" (to bake), the term entered computing in the 1950s to describe jobs processed together without human interaction, as contrasted with interactive computing. IBM's batch processing systems of the 1960s would collect punch cards overnight and process them sequentially. The ML usage preserves this core meaning: group samples together for efficient processing, trading individual response time for aggregate throughput.

::: {.callout-definition title="Dynamic Batching"}

***Dynamic Batching***\index{Dynamic Batching!definition} is the runtime optimization of trading latency for throughput under stochastic arrival patterns. By buffering requests into a batching window, the scheduler amortizes fixed overheads (kernel launch, weight I/O) across multiple inputs, pushing the system away from the memory-bound regime.

:::

### Why Batching Helps {#sec-model-serving-batching-helps-f1dc}

Modern accelerators achieve peak efficiency only at sufficient batch sizes\index{GPU Utilization!batch size dependency} [@shen2019nexus]. A single inference request leaves most compute units idle because GPUs are designed for parallel execution across thousands of threads. Batching amortizes fixed costs across multiple requests and enables parallel execution across the batch dimension.

Two fixed costs dominate at small batch sizes. **Kernel launch overhead**\index{Kernel Launch Overhead!fixed cost}[^fn-kernel-etymology-serving] is the time for the CPU to prepare and submit work to the GPU. Each layer in a neural network typically requires a separate kernel launch: the CPU must assemble kernel parameters, copy them to GPU-accessible memory, and signal the GPU to begin execution. This overhead is typically 5–20 μs per kernel, independent of batch size. ResNet-50 has approximately 50 layers, so kernel launch alone adds 250–1000 μs per inference. At batch size 1, this overhead may exceed the actual compute time; at batch size 32, the same overhead is amortized across 32 images. **Weight loading**\index{Weight Loading!memory efficiency} reads model parameters from GPU memory (VRAM) to the compute units. At batch size 1, the GPU reads all weights to process one image; at batch size 32, the same weight read processes 32 images, achieving 32 $\times$ better memory efficiency. Measuring *batching efficiency* on a concrete model quantifies how these fixed costs amortize in practice.

[^fn-kernel-etymology-serving]: **Kernel**: From Old English "cyrnel" meaning seed or grain, the essential core of something. In operating systems (1960s), the kernel is the core that manages hardware resources. CUDA borrowed this term around 2007 for GPU functions because they represent the computational "core" of parallel algorithms. Unlike OS kernels that run continuously, GPU kernels are discrete units of parallel work launched by the CPU and executed across thousands of GPU threads simultaneously.
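
The amortization argument can be checked with a few lines of arithmetic. The sketch below assumes the figures quoted above (roughly 50 kernel launches per ResNet-50 forward pass at 5 to 20 μs each) and treats launch overhead as independent of batch size. The function name is illustrative and not part of any serving library.

```python
def launch_overhead_per_image_us(num_kernels, launch_us, batch_size):
    """Kernel launch overhead attributed to each image in a batch (microseconds)."""
    total_launch_us = num_kernels * launch_us  # paid once per forward pass
    return total_launch_us / batch_size        # shared by every image in the batch


# Figures from the text: ~50 kernels, 5-20 us per launch.
for batch_size in (1, 8, 32):
    low = launch_overhead_per_image_us(50, 5, batch_size)
    high = launch_overhead_per_image_us(50, 20, batch_size)
    print(f"batch={batch_size:>2}: {low:.1f} to {high:.1f} us of launch overhead per image")
```

Under these assumptions, the per-image launch cost at batch size 32 is one thirty-second of its batch-1 value, which is the same amortization factor the text applies to weight reads.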

```{python}
#| label: batch-throughput-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING THROUGHPUT AND LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 Batching Efficiency" key insight
# │
# │ Goal: Quantify the throughput-latency trade-off of batching.
# │ Show: That batch-32 achieves 6.4× throughput at the cost of 7× higher latency.
# │ How: Contrast batch-1 and batch-32 performance including window wait times.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: throughput_ratio_str, batch_window_ms_str, batch32_inference_ms_str,
# │          batch32_total_str, batch1_inference_total_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (batching comparison) ---
batch1_throughput_value = 200         # batch-1 throughput (img/s)
batch32_throughput_value = 1280       # batch-32 throughput (img/s)
batch32_inference_ms_value = 25.0     # batch-32 inference time (ms)
batch_window_ms_value = 10.0          # batching window (ms)
batch1_inference_total_ms_value = 5.0 # batch-1 total latency (ms)

# --- Process (throughput ratio and total latency) ---
throughput_ratio_value = batch32_throughput_value / batch1_throughput_value
batch32_total_ms_value = batch_window_ms_value + batch32_inference_ms_value

# --- Outputs (formatted strings for prose) ---
throughput_ratio_str = fmt(throughput_ratio_value, precision=1, commas=False)         # e.g. "6.4" x
batch_window_ms_str = fmt(batch_window_ms_value, precision=0, commas=False)           # e.g. "10" ms
batch32_inference_ms_str = fmt(batch32_inference_ms_value, precision=0, commas=False) # e.g. "25" ms
batch32_total_str = fmt(batch32_total_ms_value, precision=0, commas=False)            # e.g. "35" ms
batch1_inference_total_ms_str = fmt(batch1_inference_total_ms_value, precision=0, commas=False)  # e.g. "5" ms
```

::: {.callout-notebook title="ResNet-50 Batching Efficiency"}

The throughput-latency tradeoff for ResNet-50 on a V100 GPU illustrates the power of batching:

| **Batch Size** | **Inference Time*** | **Per-Image Compute** | **Throughput** | **GPU Util.** |
|:---------------|--------------------:|----------------------:|---------------:|--------------:|
| 1              |               5.0ms |                 5.0ms |      200 img/s |           15% |
| 4              |               7.2ms |                 1.8ms |      556 img/s |           42% |
| 8              |               9.1ms |                 1.1ms |      879 img/s |           65% |
| 16             |              14.0ms |                 0.9ms |    1,143 img/s |           85% |
| 32             |              25.0ms |                 0.8ms |    1,280 img/s |           95% |

Note: Times shown are pure inference time, excluding queue wait. @sec-model-serving-traffic-patterns-batching-strategy-2e6b analyzes how user-perceived latency includes batching window wait.

**Key insight**: Batch size 32 achieves `{python} throughput_ratio_str` $\times$ higher throughput than batch size 1. However, user-perceived latency includes both queue wait and inference time. With a `{python} batch_window_ms_str` ms batching window and `{python} batch32_inference_ms_str` ms inference, total latency reaches `{python} batch32_total_str` ms versus `{python} batch1_inference_total_ms_str` ms at batch size 1.

:::

The table reveals the throughput-latency tradeoff in stark terms: larger batches dramatically improve hardware efficiency but increase per-request latency. In practice, the optimal batch size depends on both the latency Service Level Objective (SLO) and the arrival rate of requests. The question facing every serving engineer is therefore quantitative: given a specific latency budget, what is the largest batch size that still meets the SLO? The following analysis shows how to find *the batching sweet spot*.

```{python}
#| label: batching-sweetspot-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING SWEET SPOT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Batching Sweet Spot"
# │
# │ Goal: Demonstrate the economic "sweet spot" for batching.
# │ Show: That batch-8 yields 3× throughput gain while remaining within typical SLOs.
# │ How: Model throughput and latency for small batches (1 to 8).
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: batch1_ms_str, batch1_imgs_str, batch8_* strings, latency_increase_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (batching sweet spot scenario) ---
batch1_ms_value = 5.0                 # batch-1 inference (ms)
batch1_imgs_value = 200               # batch-1 throughput (img/s)
batch8_wait_ms_value = 5.0            # batch-8 wait time (ms)
batch8_inference_ms_value = 9.0       # batch-8 inference time (ms)

# --- Process (user latency and throughput) ---
batch8_user_latency_ms_value = batch8_wait_ms_value + batch8_inference_ms_value
batch8_throughput_value = 8 / (batch8_user_latency_ms_value / 1000)
latency_increase_value = batch8_user_latency_ms_value / batch1_ms_value

# --- Outputs (formatted strings for prose) ---
batch1_ms_str = fmt(batch1_ms_value, precision=0, commas=False)                       # e.g. "5" ms
batch1_imgs_str = f"{batch1_imgs_value}"                                              # e.g. "200" img/s
batch8_wait_ms_str = fmt(batch8_wait_ms_value, precision=0, commas=False)             # e.g. "5" ms
batch8_inference_ms_str = fmt(batch8_inference_ms_value, precision=0, commas=False)   # e.g. "9" ms
batch8_user_latency_str = fmt(batch8_user_latency_ms_value, precision=0, commas=False)# e.g. "14" ms
batch8_throughput_str = fmt(batch8_throughput_value, precision=0, commas=False)       # e.g. "571" img/s
latency_increase_str = fmt(latency_increase_value, precision=0, commas=False)         # e.g. "3" x
```

::: {.callout-notebook title="The Batching Sweet Spot"}

**Problem**: You are serving a ResNet-50 model. At batch=1, the GPU is mostly idle (15% utilization). You want to increase throughput to save money, but you have a **20 ms** latency budget.

**The Math**:

1.  **Baseline (Batch 1)**: Inference = **`{python} batch1_ms_str` ms**. Throughput = **`{python} batch1_imgs_str` img/s**.
2.  **Optimized (Batch 8)**:
    - **Wait Time**: You set a **`{python} batch8_wait_ms_str` ms** batching window to collect requests.
    - **Inference Time**: Batch 8 inference takes **`{python} batch8_inference_ms_str` ms**.
    - **User Latency**: `{python} batch8_wait_ms_str` ms (wait) + `{python} batch8_inference_ms_str` ms (compute) = **`{python} batch8_user_latency_str` ms**.
    - **Throughput**: 8 img / `{python} batch8_user_latency_str` ms ≈ **`{python} batch8_throughput_str` img/s**.

**The Systems Conclusion**: By accepting a **`{python} latency_increase_str` $\times$ increase in latency** (`{python} batch1_ms_str` ms → `{python} batch8_user_latency_str` ms), you have achieved nearly **`{python} latency_increase_str` $\times$ higher throughput** on the same hardware. As long as `{python} batch8_user_latency_str` ms is under your 20ms budget, this is "free" capacity. This trade-off is the primary lever of serving economics.
:::
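
The question "what is the largest batch size that still meets the SLO?" can be answered mechanically once inference times have been measured. The sketch below reuses the inference times from the ResNet-50 table and makes one conservative assumption that the callout does not state: a request may wait the full batching window before inference starts. The function name and dictionary are illustrative.

```python
# Measured inference times from the ResNet-50 table (batch size -> ms).
INFERENCE_MS = {1: 5.0, 4: 7.2, 8: 9.1, 16: 14.0, 32: 25.0}


def largest_batch_within_budget(budget_ms, window_ms, inference_ms=INFERENCE_MS):
    """Return the largest batch size whose worst-case latency fits the budget.

    Worst case assumes a request waits the full batching window before
    inference starts. Returns None if no batch size fits.
    """
    feasible = [
        b for b, infer in inference_ms.items()
        if window_ms + infer <= budget_ms
    ]
    return max(feasible) if feasible else None


best = largest_batch_within_budget(budget_ms=20.0, window_ms=5.0)
print(f"Largest batch under a 20 ms budget with a 5 ms window: {best}")
```

With these numbers batch 16 also fits (5 ms + 14 ms = 19 ms), but with only 1 ms of headroom. The batch-8 configuration in the callout leaves several milliseconds of slack for the other components of the latency budget, which is often the more prudent operating point.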

Look for the **"Knee"** in @fig-throughput-latency-knee, the point where the blue throughput curve begins to plateau just as the orange latency curve starts its sharp upward spike. This is the optimal operating point: push batch size beyond the knee and queuing delays dominate; stay below it and you leave hardware capacity on the table. The numbers are representative rather than tied to a single benchmark.

```{python}
#| label: fig-throughput-latency-knee
#| echo: false
#| fig-cap: "**The Throughput-Latency Knee.** Batch Size vs. Throughput (Blue) and Latency (Orange). Throughput increases with batch size as hardware utilization improves, but eventually saturates. Latency remains relatively flat until the 'Knee,' after which it spikes due to queuing. Values are representative and depend on model/hardware."
#| fig-alt: "Dual-axis line chart. Blue line (Throughput) rises and plateaus. Orange line (Latency) stays low then spikes upward. A vertical line marks the optimal point where throughput is high before latency explodes."

import pandas as pd
from mlsys import viz

fig, ax1, COLORS, plt = viz.setup_plot()

# =============================================================================
# DATA
# =============================================================================
BATCHING_DATA = [
    {'BatchSize': 1, 'Throughput': 64, 'Latency': 15.6},
    {'BatchSize': 2, 'Throughput': 120, 'Latency': 16.5},
    {'BatchSize': 4, 'Throughput': 230, 'Latency': 17.4},
    {'BatchSize': 8, 'Throughput': 404, 'Latency': 19.8},
    {'BatchSize': 16, 'Throughput': 650, 'Latency': 24.6},
    {'BatchSize': 32, 'Throughput': 935, 'Latency': 34.2},
    {'BatchSize': 64, 'Throughput': 1100, 'Latency': 60.0},
    {'BatchSize': 128, 'Throughput': 1143, 'Latency': 136.8},
    {'BatchSize': 256, 'Throughput': 1150, 'Latency': 300.0}
]
df = pd.DataFrame(BATCHING_DATA)

# =============================================================================
# PLOT: The Throughput-Latency Knee
# =============================================================================
color_tp, color_lat = COLORS['BlueLine'], COLORS['OrangeLine']

ax1.plot(df['BatchSize'], df['Throughput'], 'o-', color=color_tp, label='Throughput')
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Throughput (Requests/sec)', color=color_tp, fontweight='bold')
ax1.tick_params(axis='y', labelcolor=color_tp)
ax1.set_xscale('log', base=2)

ax2 = ax1.twinx()
ax2.plot(df['BatchSize'], df['Latency'], 's-', color=color_lat, label='Latency')
ax2.set_ylabel('Latency (ms)', color=color_lat, fontweight='bold', rotation=270, labelpad=15)
ax2.tick_params(axis='y', labelcolor=color_lat)
ax2.spines['right'].set_visible(True)
ax2.spines['top'].set_visible(False)

optimal_idx = 5
ax1.axvline(df['BatchSize'].iloc[optimal_idx], color='gray', linestyle='--', alpha=0.5)
ax1.text(df['BatchSize'].iloc[optimal_idx], 200, " Optimal\n Point", ha='right', color='gray', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
plt.show()
```

The efficiency gains from batching come at a cost: requests must wait for the batch to form. This creates a direct tension between throughput optimization (larger batches) and latency minimization (immediate processing). The different batching strategies and their tradeoffs govern how engineers tune this balance.

### Static vs Dynamic Batching {#sec-model-serving-static-vs-dynamic-batching-fd0a}

Static batching\index{Static Batching!fixed batch size} waits for a fixed batch size before processing. This approach is simple to implement but problematic in practice: during low traffic, requests wait indefinitely for a full batch, and during high traffic, large batches increase per-request latency.

Dynamic batching\index{Dynamic Batching!time window} addresses these limitations by collecting requests within a time window and processing whatever has arrived when the window closes [@olston2017tensorflow]. This bounds maximum wait time regardless of traffic level. The window size represents a direct tradeoff: shorter windows reduce latency but sacrifice throughput; longer windows improve throughput but increase latency.

Typical configurations use windows of 5–50 ms with maximum batch sizes of 8–32 for latency-sensitive applications. The optimal configuration depends on request arrival patterns, model characteristics, and latency requirements.
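
To make the window-plus-cap behavior concrete, the following sketch groups a list of arrival timestamps into batches. It is a simplified offline simulation, not a production scheduler, and it assumes one particular policy: the window opens when the first request of a batch arrives, and the batch is dispatched when either the window expires or the maximum batch size is reached. Real serving frameworks differ in these details, and all names here are hypothetical.

```python
def form_batches(arrival_times_ms, window_ms, max_batch):
    """Group sorted arrival timestamps into batches.

    A batch opens when its first request arrives and closes when either
    the window expires or max_batch requests have been collected.
    Returns a list of (dispatch_time_ms, [arrival_times]) tuples.
    """
    batches = []
    current = []
    for t in arrival_times_ms:
        # Close the open batch if this arrival falls outside its window.
        if current and t >= current[0] + window_ms:
            batches.append((current[0] + window_ms, current))
            current = []
        current.append(t)
        # Dispatch early once the batch is full.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:  # flush the final partial batch when its window expires
        batches.append((current[0] + window_ms, current))
    return batches


arrivals = [0, 1, 2, 3, 30, 31, 100]  # a burst, a pair, then a lone request
for dispatch, batch in form_batches(arrivals, window_ms=10, max_batch=3):
    waits = [dispatch - t for t in batch]
    print(f"dispatch at {dispatch:>3} ms, size {len(batch)}, max wait {max(waits)} ms")
```

The burst fills a batch and is dispatched early, while the lone request at 100 ms is still served after at most one window. That bounded wait is the property static batching lacks.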

### Dynamic Batching Latency-Throughput Trade-offs {#sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d}

Dynamic batching introduces a quantifiable tension between throughput optimization and latency constraints, revealing *why latency spikes under load* and enabling systematic configuration decisions rather than trial-and-error tuning.

::: {.callout-notebook title="Why Latency Spikes Under Load"}

**Recall** from @sec-model-serving-littles-law-9352: Little's Law ($L = \lambda W$) governs all stable queues. When hardware is saturated (throughput $\lambda$ is maxed out), any increase in traffic increases queue depth ($L$). Since $\lambda$ cannot grow, **latency ($W$) must grow linearly with queue depth**. This is why **admission control** (rejecting requests when $L$ exceeds a threshold) is the only way to preserve latency during overload.
:::
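
The relationship in the callout can be turned into a simple admission check. The sketch below rearranges Little's Law to estimate latency from queue depth at a saturated server; the 1,000 requests-per-second ceiling and the function names are hypothetical values chosen for illustration.

```python
def latency_from_queue_depth(queue_depth, throughput_rps):
    """Little's Law rearranged: W = L / lambda, returned in milliseconds."""
    return queue_depth * 1000.0 / throughput_rps


def admit(queue_depth, throughput_rps, slo_ms):
    """Accept a new request only if the implied latency stays within the SLO."""
    return latency_from_queue_depth(queue_depth + 1, throughput_rps) <= slo_ms


SATURATED_RPS = 1000  # hypothetical throughput ceiling (requests per second)
for depth in (10, 50, 100):
    latency = latency_from_queue_depth(depth, SATURATED_RPS)
    print(f"queue depth {depth:>3} -> {latency:.0f} ms implied latency")
```

Because throughput is fixed at saturation, implied latency scales one-for-one with queue depth, so a depth threshold is equivalent to a latency threshold.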

@eq-batching-latency decomposes the total user-perceived latency for a batched request into two components:

$$L_{\text{lat}} = L_{\text{lat,wait}} + L_{\text{lat,compute}}(b)$$ {#eq-batching-latency}

where $L_{\text{lat,wait}}$ is the time spent waiting in the batching queue (corresponding to $L_{queue}$ in the overall latency budget) and $L_{\text{lat,compute}}(b)$ is the inference time for batch size $b$ (encompassing $L_{infer}$ plus portions of $L_{pre}$ and $L_{post}$). The batching window $T$ bounds wait time ($L_{\text{lat,wait}} \leq T$), while batch size affects compute time through GPU utilization characteristics.

#### Quantitative Analysis of Batching {#sec-model-serving-queue-waiting-time-analysis-8d5c}

For Poisson arrivals with rate $\lambda$ and a fixed batching window $T$, each request's arrival time is uniformly distributed within the window. A request arriving at time $t$ after the window opens waits $T - t$ for the batch to close. Averaging over $t$, @eq-avg-wait shows that the expected wait is simply half the window:

$$E[L_{\text{lat,wait}}] = \frac{T}{2}$$ {#eq-avg-wait}

```{python}
#| label: batching-budget-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING WINDOW LATENCY BUDGET
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Batching window latency budget analysis
# │
# │ Goal: Demonstrate how batching windows consume the latency budget.
# │ Show: That a 20ms window consumes 20% of a 50ms SLO before computation.
# │ How: Calculate average wait time assuming uniform request arrival.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: avg_wait_str, budget_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (batching scenario) ---
batch_window_ms_value = 20            # batching window (ms)
slo_ms_value = 50                     # latency SLO (ms)
inference_ms_value = 5                # inference time (ms)

# --- Process (average wait and budget share) ---
avg_wait_ms_value = batch_window_ms_value / 2
budget_pct_value = avg_wait_ms_value / slo_ms_value * 100

# --- Outputs (formatted strings for prose) ---
avg_wait_str = fmt(avg_wait_ms_value, precision=0, commas=False)             # e.g. "10" ms
budget_pct_str = fmt(budget_pct_value, precision=0, commas=False)            # e.g. "20" %
```

This simple relationship has direct implications. A 20 ms batching window adds `{python} avg_wait_str` ms of average latency regardless of the batch size achieved. If the latency SLO is 50 ms and inference takes 5 ms, the average batching wait alone consumes `{python} budget_pct_str`% of the latency budget before any computation begins.
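
The half-window result can be verified empirically. The sketch below is a small Monte Carlo simulation that assumes fixed, back-to-back windows and Poisson arrivals generated from exponential inter-arrival gaps. The function name, arrival rate, and seed are illustrative choices.

```python
import random


def simulate_mean_wait(rate_per_s, window_ms, num_windows, seed=0):
    """Estimate the mean batching wait under Poisson arrivals and fixed windows."""
    rng = random.Random(seed)
    horizon_ms = window_ms * num_windows
    rate_per_ms = rate_per_s / 1000.0
    waits = []
    t = rng.expovariate(rate_per_ms)  # exponential gaps give Poisson arrivals
    while t < horizon_ms:
        window_close = (t // window_ms + 1) * window_ms  # end of the current window
        waits.append(window_close - t)
        t += rng.expovariate(rate_per_ms)
    return sum(waits) / len(waits)


estimate = simulate_mean_wait(rate_per_s=500, window_ms=20, num_windows=5000)
print(f"simulated mean wait: {estimate:.2f} ms (theory: {20 / 2:.1f} ms)")
```

The simulated mean should land close to 10 ms for a 20 ms window. Changing the arrival rate changes how many requests share each batch but not the average wait.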

#### Batch Size Distribution {#sec-model-serving-batch-size-distribution-b3d3}

The number of requests collected during window $T$ follows a Poisson distribution with mean $\lambda T$. @eq-batch-distribution formalizes this relationship:

$$P(\text{batch size} = k) = \frac{(\lambda T)^k e^{-\lambda T}}{k!}$$ {#eq-batch-distribution}

@tbl-batch-variability quantifies this variability, showing how batch size fluctuates for different traffic levels with a fixed 10 ms window:

| **Arrival Rate** | **Mean Batch** | **Std Dev** | **P(batch=0)** | **P(batch≥2 $\times$ mean)** |
|:-----------------|---------------:|------------:|---------------:|-----------------------------:|
| **50 QPS**       |            0.5 |         0.7 |            61% |                          39% |
| **200 QPS**      |            2.0 |         1.4 |            14% |                          14% |
| **500 QPS**      |            5.0 |         2.2 |           0.7% |                           3% |
| **1000 QPS**     |           10.0 |         3.2 |         0.005% |                         0.3% |

: **Batch Size Variability**: At low traffic, batching windows frequently contain zero requests (wasted GPU cycles). At moderate traffic, batch sizes fluctuate significantly around the mean. High traffic provides more stable batching, and the probability of batches exceeding twice the mean size decreases as traffic grows (from 39% at 50 QPS to 0.3% at 1000 QPS), reflecting the law of large numbers. {#tbl-batch-variability}
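
The entries in @tbl-batch-variability follow directly from @eq-batch-distribution. The sketch below recomputes them with the standard library only; the helper names are illustrative, and the "twice the mean" threshold is taken as the smallest integer batch size at or above $2\lambda T$.

```python
import math


def poisson_pmf(k, mean):
    """P(batch size = k) for a Poisson distribution with the given mean."""
    return mean**k * math.exp(-mean) / math.factorial(k)


def prob_at_least(threshold, mean):
    """P(batch size >= threshold), computed as 1 - P(batch size < threshold)."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(threshold))


WINDOW_MS = 10  # batching window used in the table
for qps in (50, 200, 500, 1000):
    mean = qps * WINDOW_MS / 1000           # expected requests per window
    threshold = math.ceil(2 * mean)         # smallest integer batch >= 2x mean
    print(
        f"{qps:>4} QPS: mean={mean:.1f}, std={math.sqrt(mean):.1f}, "
        f"P(0)={poisson_pmf(0, mean):.3%}, "
        f"P(>=2x mean)={prob_at_least(threshold, mean):.1%}"
    )
```

The standard deviation is $\sqrt{\lambda T}$, so relative variability (standard deviation divided by mean) shrinks as $1/\sqrt{\lambda T}$, which is why batches stabilize at higher traffic.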

#### Throughput Maximization Strategy {#sec-model-serving-throughput-maximization-strategy-27f5}

Throughput optimization requires maximizing the number of requests processed per unit time. For a system with service time $S(b)$ for batch size $b$, throughput follows @eq-batch-throughput:

$$\text{Throughput}(b) = \frac{b}{T + S(b)}$$ {#eq-batch-throughput}

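The numerator grows in proportion to batch size, while the denominator grows more slowly because the window $T$ and the fixed portion of $S(b)$ do not scale with $b$. Throughput therefore rises with batch size but with diminishing returns, and the best batch size is the point where further throughput gains no longer justify the added latency.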

For ResNet-50 on a V100 GPU, service time scales approximately as $S(b) = 5\,\text{ms} + 0.6\,\text{ms} \times b$ (5 ms fixed overhead plus 0.6 ms per image in the batch). This linear approximation captures the dominant trend; actual service times may deviate slightly due to memory hierarchy effects. With a $T = 10$ ms batching window:

```{python}
#| label: batching-analysis-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATCHING THROUGHPUT ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-batching-throughput and "Iron Law of Batching Efficiency" callout
# │
# │ Goal: Quantify the efficiency gains from high-batch serving.
# │ Show: That batch-32 improves utilization from 11% to 79% over batch-1.
# │ How: Contrast throughput and latency while applying Iron Law efficiency terms.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: throughput_*, b1_*, b32_*, il_*, T_window_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt

# --- Inputs (batching model parameters) ---
T_window_value = 10.0                 # batching window (ms)
fixed_overhead_ms_value = 5.0         # fixed overhead (ms)
per_image_ms_value = 0.6              # per-image compute (ms)

batch_sizes_value = [1, 4, 8, 16, 32]
il_overhead_ms_value = 5.0            # Iron Law overhead (ms)
il_compute_b1_ms_value = 0.6          # batch-1 compute (ms)
il_compute_b32_ms_value = 19.2        # batch-32 compute (ms)
il_threshold_pct_value = 10           # efficiency threshold (%)

# --- Process (throughput and efficiency calculations) ---
def service_time_value(b):
    return fixed_overhead_ms_value + per_image_ms_value * b
def total_latency_value(b):
    return T_window_value + service_time_value(b)
def throughput_value(b):
    return b / (total_latency_value(b) / 1000)

throughputs_value = {b: throughput_value(b) for b in batch_sizes_value}
latencies_value = {b: total_latency_value(b) for b in batch_sizes_value}
throughput_increase_value = throughputs_value[32] / throughputs_value[1]
# round() rather than int(): int() truncates 10.7% to 10, contradicting the 11% stated above
il_eff_b1_pct_value = round(il_compute_b1_ms_value / (il_overhead_ms_value + il_compute_b1_ms_value) * 100)
il_eff_b32_pct_value = round(il_compute_b32_ms_value / (il_overhead_ms_value + il_compute_b32_ms_value) * 100)

# --- Outputs (formatted strings for prose) ---
throughput_increase_str = fmt(throughput_increase_value, precision=1, commas=False)  # e.g. "14.6" x
b1_throughput_str = fmt(throughputs_value[1], precision=0, commas=False)             # e.g. "64" img/s
b32_throughput_str = fmt(throughputs_value[32], precision=0, commas=False)           # e.g. "935" img/s
b1_latency_str = fmt(latencies_value[1], precision=1, commas=False)                  # e.g. "15.6" ms
b32_latency_str = fmt(latencies_value[32], precision=1, commas=False)                # e.g. "34.2" ms

il_overhead_str = fmt(il_overhead_ms_value, precision=0, commas=False)               # e.g. "5" ms
il_compute_b1_str = f"{il_compute_b1_ms_value}"                                      # e.g. "0.6" ms
il_compute_b32_str = f"{il_compute_b32_ms_value}"                                    # e.g. "19.2" ms
il_eff_b1_str = f"{il_eff_b1_pct_value}"                                             # e.g. "11" %
il_eff_b32_str = f"{il_eff_b32_pct_value}"                                           # e.g. "79" %
il_threshold_str = fmt(il_threshold_pct_value, precision=0, commas=False)            # e.g. "10" %
T_window_str = fmt(T_window_value, precision=0, commas=False)                        # e.g. "10" ms
```

| **Batch Size** | **Service Time** | **Total Latency** | **Throughput** | **Efficiency** |
|:---------------|-----------------:|------------------:|---------------:|:---------------|
| 1              |            5.6ms |            15.6ms |       64 img/s | Low            |
| 4              |            7.4ms |            17.4ms |      230 img/s | Moderate       |
| 8              |            9.8ms |            19.8ms |      404 img/s | Good           |
| 16             |           14.6ms |            24.6ms |      650 img/s | High           |
| 32             |           24.2ms |            34.2ms |      935 img/s | Maximum        |

: **Batching Throughput Analysis**: ResNet-50 throughput on V100 with 10 ms batching window. Throughput increases 14.6 $\times$ from batch size 1 to 32 (64 to 935 img/s), but total latency more than doubles (15.6 ms to 34.2 ms). The optimal configuration depends on whether the latency SLO or throughput target is the binding constraint. {#tbl-batching-throughput}
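
The rows of @tbl-batching-throughput can be regenerated from @eq-batch-throughput and the linear service-time model. The sketch below does so and also reports how much each doubling of batch size multiplies throughput. The function names are illustrative, and the default parameters are the 5 ms overhead, 0.6 ms per image, and 10 ms window used in the text.

```python
def service_time_ms(batch_size, fixed_ms=5.0, per_image_ms=0.6):
    """Linear service-time model S(b): fixed overhead plus per-image cost."""
    return fixed_ms + per_image_ms * batch_size


def throughput_img_per_s(batch_size, window_ms=10.0):
    """Throughput(b) = b / (T + S(b)), converted from per-ms to per-second."""
    return batch_size / (window_ms + service_time_ms(batch_size)) * 1000.0


previous = None
for b in (1, 2, 4, 8, 16, 32):
    tput = throughput_img_per_s(b)
    note = "" if previous is None else f"  ({tput / previous:.2f}x the previous row)"
    print(f"batch {b:>2}: {tput:6.1f} img/s{note}")
    previous = tput
```

Each doubling yields less than a doubling of throughput, and the multiplier shrinks as batch size grows. This is the diminishing-returns pattern that makes the latency SLO, rather than raw throughput, the deciding factor at larger batch sizes.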

The throughput gains in @tbl-batching-throughput trace directly back to *the Iron Law of batching efficiency*, the framework established in @sec-model-training-iron-law-training-performance-a53f, where batching amortizes the fixed overhead term.

::: {.callout-notebook title="The Iron Law of Batching Efficiency"}

**The Iron Law Connection:**
In serving, we maximize throughput by amortizing the **Latency Term** ($L_{lat}$), as shown in @eq-compute-time:

$$ T = \frac{O}{R_{peak} \cdot \eta} + L_{lat} $$ {#eq-compute-time}

**Deriving the Sweet Spot:**

*   **Case 1 (Batch 1):** Overhead (`{python} il_overhead_str` ms) ≫ Compute (`{python} il_compute_b1_str` ms). Efficiency ≈ `{python} il_eff_b1_str`%. The GPU is mostly waiting.
*   **Case 2 (Batch 32):** Overhead (`{python} il_overhead_str` ms) ≪ Compute (`{python} il_compute_b32_str` ms). Efficiency ≈ `{python} il_eff_b32_str`%. The GPU is crunching numbers.

**The Golden Rule:** Increase batch size until the **Latency Term** becomes negligible (< `{python} il_threshold_str`% of total time). Beyond this point, you gain minimal throughput but pay a linear latency penalty.
:::
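
The efficiency figures in the callout, and the batch size implied by the 10% rule, can be computed from the same linear model. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the 5 ms overhead and 0.6 ms per image used throughout this section.

```python
def compute_efficiency(batch_size, overhead_ms=5.0, per_image_ms=0.6):
    """Fraction of service time spent on useful compute rather than fixed overhead."""
    compute_ms = per_image_ms * batch_size
    return compute_ms / (overhead_ms + compute_ms)


def min_batch_for_overhead_share(max_share, overhead_ms=5.0, per_image_ms=0.6, limit=4096):
    """Smallest batch size at which overhead is at most max_share of service time."""
    for b in range(1, limit + 1):
        share = overhead_ms / (overhead_ms + per_image_ms * b)
        if share <= max_share + 1e-12:  # small tolerance for floating-point error
            return b
    return None  # threshold not reachable within the search limit


for b in (1, 32):
    print(f"batch {b:>2}: efficiency = {compute_efficiency(b):.1%}")
print(f"batch size for overhead <= 10%: {min_batch_for_overhead_share(0.10)}")
```

Under this model the 10% threshold is reached only at a batch size of 75, well above the caps of 8 to 32 typical for latency-sensitive serving. In such deployments the latency SLO usually binds before the overhead term becomes negligible.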

#### Latency-Constrained Optimization {#sec-model-serving-latencyconstrained-optimization-8f66}

When latency SLOs provide the binding constraint, the optimization problem becomes finding the maximum batch size that meets the SLO. For a latency target $L_{\text{lat,target}}$ and average wait time $T/2$, @eq-latency-constrained-batch defines the maximum allowable batch size using a first-order **average** latency approximation:

$$b_{\text{max}} = \max\{b : \frac{T}{2} + S(b) \leq L_{\text{lat,target}}\}$$ {#eq-latency-constrained-batch}

Consider a 50ms p95 latency SLO for ResNet-50 serving (using this mean-based approximation as a starting point):

```{python}
#| label: latency-constrained-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ LATENCY-CONSTRAINED OPTIMIZATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Latency-constrained optimization narrative, comparing conservative
# │          vs aggressive batching window scenarios for a 50ms p95 SLO
# │
# │ Goal: Demonstrate diminishing returns for large batching windows.
# │ Show: That aggressive windows gain only ~12% throughput while adding 10ms wait.
# │ How: Contrast throughput and average wait for conservative vs. long windows.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: s1_window_ms_str, s1_wait_ms_str, s1_budget_ms_str,
# │          s1_max_batch_str, s1_batch_str, s1_throughput_str,
# │          s2_window_ms_str, s2_wait_ms_str, s2_budget_ms_str,
# │          s2_batch_str, s2_throughput_str,
# │          throughput_gain_pct_str, latency_avg_increase_ms_str,
# │          latency_p99_increase_ms_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class BatchingOptimization:
    """
    Namespace for Latency-Constrained Batching Optimization.
    Scenario: Comparing 5ms (Conservative) vs 25ms (Aggressive) batching windows.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # Scenario 1 (Conservative)
    s1_window = 5.0
    s1_batch = 32
    s1_tput = 1140.0

    # Scenario 2 (Aggressive)
    s2_window = 25.0
    s2_batch = 48
    s2_tput = 1280.0

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Avg wait = Window / 2
    s1_wait = s1_window / 2
    s2_wait = s2_window / 2

    # Budget (target 50ms)
    s1_budget = 50 - s1_wait
    s2_budget = 50 - s2_wait

    # Trade-off metrics
    tput_gain = ((s2_tput / s1_tput) - 1) * 100
    latency_increase = s2_wait - s1_wait

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(tput_gain <= 25, f"Aggressive batching gained too much throughput ({tput_gain:.1f}%). Diminishing returns not shown.")
    check(latency_increase >= 5, "Latency penalty is too small to be a concern.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    s1_window_ms_str = f"{int(s1_window)}"
    s1_wait_ms_str = f"{s1_wait}"
    s1_budget_ms_str = f"{s1_budget}"
    s1_max_batch_str = "70" # Theoretical ceiling
    s1_batch_str = f"{s1_batch}"
    s1_throughput_str = f"{int(s1_tput):,}"

    s2_window_ms_str = f"{int(s2_window)}"
    s2_wait_ms_str = f"{s2_wait}"
    s2_budget_ms_str = f"{s2_budget}"
    s2_batch_str = f"{s2_batch}"
    s2_throughput_str = f"{int(s2_tput):,}"

    throughput_gain_pct_str = f"{tput_gain:.0f}"
    latency_avg_increase_ms_str = f"{latency_increase:.0f}"
    # Simplified P99 increase for prose consistency
    latency_p99_increase_ms_str = f"{int(s2_window - s1_window)}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
s1_window_ms_str = BatchingOptimization.s1_window_ms_str
s1_wait_ms_str = BatchingOptimization.s1_wait_ms_str
s1_budget_ms_str = BatchingOptimization.s1_budget_ms_str
s1_max_batch_str = BatchingOptimization.s1_max_batch_str
s1_batch_str = BatchingOptimization.s1_batch_str
s1_throughput_str = BatchingOptimization.s1_throughput_str
s2_window_ms_str = BatchingOptimization.s2_window_ms_str
s2_wait_ms_str = BatchingOptimization.s2_wait_ms_str
s2_budget_ms_str = BatchingOptimization.s2_budget_ms_str
s2_batch_str = BatchingOptimization.s2_batch_str
s2_throughput_str = BatchingOptimization.s2_throughput_str
throughput_gain_pct_str = BatchingOptimization.throughput_gain_pct_str
latency_avg_increase_ms_str = BatchingOptimization.latency_avg_increase_ms_str
latency_p99_increase_ms_str = BatchingOptimization.latency_p99_increase_ms_str
```

**Scenario 1: Conservative window (T = `{python} s1_window_ms_str` ms)**

- Average wait: `{python} s1_wait_ms_str` ms
- Latency budget for inference: `{python} s1_budget_ms_str` ms
- Maximum batch size: `{python} s1_max_batch_str` (but typically capped at 32 for memory)
- Achieved throughput: ~`{python} s1_throughput_str` img/s (batch=`{python} s1_batch_str`)

**Scenario 2: Aggressive window (T = `{python} s2_window_ms_str` ms)**

- Average wait: `{python} s2_wait_ms_str` ms
- Latency budget for inference: `{python} s2_budget_ms_str` ms
- Maximum batch size: `{python} s2_batch_str`
- Achieved throughput: ~`{python} s2_throughput_str` img/s (batch=`{python} s2_batch_str`)

The aggressive window achieves only `{python} throughput_gain_pct_str`% higher throughput but increases average latency by `{python} latency_avg_increase_ms_str` ms and p99 latency by `{python} latency_p99_increase_ms_str` ms. As the diminishing returns in @tbl-batching-throughput suggest, the conservative window provides a better user experience for latency-sensitive applications at a modest throughput cost.
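
The theoretical ceiling quoted for the conservative scenario follows from @eq-latency-constrained-batch combined with the linear service-time model. The sketch below solves for it directly. The function name and the optional `cap` parameter (standing in for a memory-imposed limit) are illustrative assumptions.

```python
def max_batch_under_slo(target_ms, window_ms, fixed_ms=5.0, per_image_ms=0.6, cap=None):
    """Largest b with window/2 + S(b) <= target, using S(b) = fixed + per_image * b."""
    # Time left for per-image work after average wait and fixed overhead.
    budget_ms = target_ms - window_ms / 2 - fixed_ms
    if budget_ms < per_image_ms:
        return 0  # not even a single-image batch fits
    b = int(budget_ms / per_image_ms + 1e-9)  # floor, guarding against float error
    return min(b, cap) if cap is not None else b


print(max_batch_under_slo(target_ms=50, window_ms=5))          # ceiling from the model
print(max_batch_under_slo(target_ms=50, window_ms=5, cap=32))  # with a memory cap
```

Because this uses the average wait $T/2$, it is a first-order estimate. Substituting the full window $T$ for $T/2$ gives a more conservative bound for tail-latency targets.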

#### SLO Violation Analysis {#sec-model-serving-slo-violation-analysis-6ebf}

Batch size variability causes SLO violations even when mean latency appears safe. The p99 latency includes both worst-case wait time (full window) and worst-case batch size (governed by Poisson tail). @eq-p99-batch-latency captures this relationship:

$$L_{\text{lat,p99}} \approx T + S(b_{p99})$$ {#eq-p99-batch-latency}

```{python}
#| label: slo-violation-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ SLO VIOLATION ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: SLO violation analysis narrative, estimating p99 latency impact
# │          from Poisson-driven batch size variability
# │
# │ Goal: Demonstrate why provisioning on mean latency causes SLO violations.
# │ Show: That p99 latency can be 2.2× higher than the mean due to batch size variance.
# │ How: Model request arrival and batch assembly to compare mean vs. tail response times.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: qps_str, T_slo_str, mean_wait_str, mean_batch_str,
# │          mean_service_str, mean_latency_str, p99_service_str,
# │          p99_latency_str, p99_ratio_str, p99_batch_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (traffic and SLO parameters) ---
qps_value = 500
T_slo_value = 10.0
p99_batch_value = 11

# --- Process (mean vs p99 latency comparison) ---
mean_batch_value = qps_value * (T_slo_value / 1000)

mean_wait_value = T_slo_value / 2
mean_service_value = service_time_value(int(mean_batch_value))
mean_latency_value = mean_wait_value + mean_service_value

p99_service_value = service_time_value(p99_batch_value)
p99_latency_value = T_slo_value + p99_service_value

p99_to_mean_value = p99_latency_value / mean_latency_value

# --- Outputs (formatted strings for prose) ---
qps_str = f"{qps_value}"
T_slo_str = fmt(T_slo_value, precision=0, commas=False)
mean_wait_str = fmt(mean_wait_value, precision=0, commas=False)
mean_batch_str = fmt(mean_batch_value, precision=0, commas=False)
mean_service_str = fmt(mean_service_value, precision=1, commas=False)
mean_latency_str = fmt(mean_latency_value, precision=0, commas=False)
p99_service_str = fmt(p99_service_value, precision=1, commas=False)
p99_latency_str = fmt(p99_latency_value, precision=1, commas=False)
p99_ratio_str = fmt(p99_to_mean_value, precision=2, commas=False)
p99_batch_str = f"{p99_batch_value}"
```

where $T$ is the batching window, $S(b)$ is the service time at batch size $b$, and $b_{p99}$ is the 99th percentile batch size. For $\lambda$ = `{python} qps_str` QPS and $T$ = `{python} T_slo_str` ms:

- Mean batch size: `{python} mean_batch_str`
- p99 batch size: `{python} p99_batch_str` (from Poisson distribution)
- Mean latency: `{python} mean_wait_str` ms + `{python} mean_service_str` ms = `{python} mean_latency_str` ms
- p99 latency: `{python} T_slo_str` ms + `{python} p99_service_str` ms = `{python} p99_latency_str` ms

The p99 latency is `{python} p99_ratio_str` $\times$ the mean, reflecting both wait time variance and batch size variance. Systems that provision based on mean latency will experience SLO violations.
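
The Poisson tail behind these numbers can be checked directly. The sketch below computes the 99th percentile batch size from the Poisson CDF using only the standard library and plugs it into @eq-p99-batch-latency. The helper names and the linear service-time model (`base_ms`, `per_item_ms`) are illustrative assumptions, not the chapter's calibrated ResNet-50 measurements.

```python
import math


def poisson_quantile(mean, q):
    """Smallest k such that P(X <= k) >= q for X ~ Poisson(mean)."""
    k = 0
    pmf = math.exp(-mean)  # P(X = 0)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mean / k  # P(X = k) derived from P(X = k - 1)
        cdf += pmf
    return k


def service_time_ms(batch, base_ms=4.0, per_item_ms=0.5):
    """Hypothetical linear service-time model (assumed values)."""
    return base_ms + per_item_ms * batch


def p99_latency_ms(qps, window_ms):
    """Approximate p99 latency: full window plus p99-batch service time."""
    mean_batch = qps * window_ms / 1000.0
    b_p99 = poisson_quantile(mean_batch, 0.99)
    return b_p99, window_ms + service_time_ms(b_p99)


b_p99, latency = p99_latency_ms(qps=500, window_ms=10.0)
print(f"p99 batch size: {b_p99}, p99 latency: {latency:.1f} ms")
```

With a mean batch of 5, the quantile search lands on a p99 batch of 11, the value used in the analysis above. The latency figure depends entirely on the assumed service-time model.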

::: {.callout-perspective title="Practitioner's Perspective: The Latency-Throughput Trade-off" collapse="false"}
In systems engineering interviews and architecture reviews, the most common pitfall is discussing "inference speed" without specifying batch size.

*   **Batch-1 Regime**: Optimized for latency. Relevant for real-time interaction (e.g., typing helpers, robotics). The bottleneck is usually Python overhead or memory bandwidth.
*   **Batch-32 Regime**: Optimized for throughput. Relevant for offline processing or high-traffic services. The bottleneck is usually compute (FLOPS).

**The Professional Response**: When asked "how fast is this model?", always clarify: "Are we optimizing for single-stream latency (Batch 1) or maximum throughput (Batch N)?" This distinction demonstrates systems maturity.
:::

#### Adaptive Batching Windows {#sec-model-serving-adaptive-batching-windows-c404}

Fixed batching windows waste latency budget during high traffic when large batches form quickly. @lst-adaptive-batching demonstrates how adaptive strategies adjust the window based on queue depth.

::: {#lst-adaptive-batching lst-cap="**Adaptive Batching Window**: Dynamically adjusts batch timeout based on queue depth and arrival rate, reducing average latency by 27% compared to fixed windows while maintaining throughput."}
```{.python}
def adaptive_batching_window(queue_depth, arrival_rate, slo_ms):
    """Compute the batching window (ms) from current system state.

    Args:
        queue_depth: Requests already waiting in the queue.
        arrival_rate: Observed arrival rate in requests per second.
        slo_ms: End-to-end latency SLO in milliseconds.
    """
    target_batch_size = 16  # Optimal batch for GPU utilization

    # Fast path: batch ready, close immediately to minimize latency
    if queue_depth >= target_batch_size:
        return 0.0

    # Compute maximum allowable wait from SLO constraint
    # Reserve 30% of latency budget for batching,
    # remainder for inference
    max_wait_ms = slo_ms * 0.3

    # No observed traffic: use full budget to accumulate a batch
    if arrival_rate <= 0:
        return max_wait_ms

    # Estimate time (ms) to accumulate target batch at current rate;
    # arrival_rate is per second, so convert the result to milliseconds
    requests_needed = target_batch_size - queue_depth
    estimated_wait_ms = 1000.0 * requests_needed / arrival_rate

    # Return minimum of estimated wait and SLO-constrained maximum
    return min(estimated_wait_ms, max_wait_ms)
```
:::

This approach reduces average wait time during high traffic while maintaining batch sizes. For traffic varying between 200–1000 QPS:

- Fixed window (10 ms): Average latency 15 ms, throughput 650 img/s
- Adaptive window: Average latency 11 ms (27% reduction), throughput 680 img/s (5% improvement)

The interplay between window size and batch limits creates a space of possible configurations, each representing a different balance between throughput and latency.

The batching configuration space forms a Pareto frontier[^fn-pareto-frontier-batching]\index{Pareto Frontier!throughput-latency} where improving throughput requires accepting higher latency. @tbl-pareto-batching traces this frontier across five representative configurations:

| **Window (ms)** | **Max Batch** | **Avg Latency** | **p99 Latency** | **Throughput** | **Configuration**    |
|:----------------|--------------:|----------------:|----------------:|---------------:|:---------------------|
| 2               |            16 |             8ms |            18ms |      890 img/s | Ultra-low latency    |
| 5               |            32 |            10ms |            22ms |    1,140 img/s | Balanced             |
| 10              |            32 |            15ms |            35ms |    1,240 img/s | Moderate latency     |
| 20              |            64 |            23ms |            52ms |    1,310 img/s | Throughput-optimized |
| 50              |           128 |            38ms |            98ms |    1,350 img/s | Maximum throughput   |

: **Batching Pareto Frontier**: Each configuration represents a different point on the throughput-latency trade-off curve. Moving from 2ms to 50ms windows improves throughput by only 52% while increasing p99 latency by 5.4 $\times$. Diminishing returns make aggressive batching costly for latency-sensitive applications. {#tbl-pareto-batching}
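
The diminishing returns in @tbl-pareto-batching become concrete when expressed as throughput gained per additional millisecond of p99 latency. The sketch below uses only the table's values; the function name `marginal_gains` is an illustrative choice.

```python
# (window_ms, max_batch, avg_ms, p99_ms, throughput_img_s) from the table
configs = [
    (2, 16, 8, 18, 890),
    (5, 32, 10, 22, 1140),
    (10, 32, 15, 35, 1240),
    (20, 64, 23, 52, 1310),
    (50, 128, 38, 98, 1350),
]


def marginal_gains(rows):
    """Throughput gained per extra ms of p99 latency between neighbors."""
    gains = []
    for prev, curr in zip(rows, rows[1:]):
        d_tput = curr[4] - prev[4]
        d_p99 = curr[3] - prev[3]
        gains.append(d_tput / d_p99)
    return gains


gains = marginal_gains(configs)
tput_gain_pct = 100 * (configs[-1][4] / configs[0][4] - 1)
p99_ratio = configs[-1][3] / configs[0][3]

for (window, *_), g in zip(configs[1:], gains):
    print(f"up to {window:>2} ms window: {g:5.1f} img/s per ms of p99")
print(f"overall: +{tput_gain_pct:.0f}% throughput, {p99_ratio:.1f}x p99")
```

Each step along the frontier buys less throughput per millisecond of tail latency than the one before it, which is the pattern the caption summarizes.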

#### Practical Configuration Guidelines {#sec-model-serving-practical-configuration-guidelines-9791}

[^fn-pareto-frontier-batching]: **Pareto Frontier**: Named after Italian economist Vilfredo Pareto (1848--1923), who observed in 1896 that 80% of Italy's land was owned by 20% of the population. The "Pareto frontier" (also called "Pareto optimal" or "efficient frontier") describes the set of solutions where no objective can be improved without worsening another. In serving systems, the frontier maps the throughput-latency trade-off: each point represents a configuration where gaining throughput requires sacrificing latency, and vice versa. The concept originated in welfare economics but pervades engineering optimization, from portfolio theory [@markowitz1952portfolio] to multi-objective neural architecture search.

The Pareto frontier in @tbl-pareto-batching illustrates why these guidelines matter: moving from a 2ms to a 50ms window improves throughput by only 52% while increasing p99 latency by 5.4 $\times$. Principled batching configuration avoids this region of diminishing returns by working backward from the latency budget. Allocating 20 to 30 percent of the SLO to batching wait time leaves the remainder for inference and overhead, which bounds the maximum window at $T_{\text{max}} = 0.3 \times L_{\text{lat,SLO}}$. The traffic estimate that feeds this calculation should use the p95 arrival rate rather than the average, because batching windows tuned for average traffic produce oversized batches during spikes—precisely when SLO headroom matters most. GPU memory imposes a hard ceiling on batch size independent of the latency constraint, since activation memory scales linearly with the batch dimension. Finally, monitoring the actual batch size distribution in production reveals whether initial traffic assumptions hold; high variance signals that the window needs adaptive tuning rather than a fixed configuration.

For ResNet-50 with a 50ms SLO and 500 QPS of traffic:

```{python}
#| label: practical-config-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PRACTICAL BATCHING CONFIGURATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Practical configuration guidelines example — calculating optimal
# │          batch window and size for ResNet-50 with 50ms SLO at 500 QPS
# │
# │ Goal: Outline the systematic procedure for deriving a production batching config.
# │ Show: How allocating 30% of the SLO budget to batching yields a safe 12ms window.
# │ How: Calculate expected batch size from QPS and cap by memory limits.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: pc_slo_ms_str, pc_qps_str, pc_batch_budget_ms_str,
# │          pc_max_window_ms_str, pc_expected_batch_str,
# │          pc_mem_limit_batch_str, pc_config_window_ms_str,
# │          pc_config_batch_str, pc_predicted_p99_ms_str,
# │          pc_predicted_throughput_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (SLO, traffic, and memory constraints) ---
pc_slo_ms_value = 50
pc_qps_value = 500
pc_budget_pct_value = 0.3
pc_mem_limit_batch_value = 32

# --- Process (budget allocation and configuration) ---
pc_batch_budget_ms_value = pc_slo_ms_value * pc_budget_pct_value
pc_max_window_ms_value = pc_batch_budget_ms_value
pc_expected_batch_value = pc_qps_value * (pc_max_window_ms_value / 1000)
pc_config_window_ms_value = 12 # Tuned
pc_config_batch_value = 32
pc_predicted_p99_ms_value = 43
pc_predicted_throughput_value = 1180

# --- Outputs (formatted strings for prose) ---
pc_slo_ms_str = f"{pc_slo_ms_value}"
pc_qps_str = f"{pc_qps_value}"
pc_batch_budget_ms_str = f"{pc_batch_budget_ms_value:.0f}"
pc_max_window_ms_str = f"{pc_max_window_ms_value:.0f}"
pc_expected_batch_str = f"{pc_expected_batch_value}"
pc_mem_limit_batch_str = f"{pc_mem_limit_batch_value}"
pc_config_window_ms_str = f"{pc_config_window_ms_value}"
pc_config_batch_str = f"{pc_config_batch_value}"
pc_predicted_p99_ms_str = f"{pc_predicted_p99_ms_value}"
pc_predicted_throughput_str = f"{pc_predicted_throughput_value:,}"
```

- Latency budget for batching: `{python} pc_batch_budget_ms_str`ms
- Maximum window: `{python} pc_max_window_ms_str`ms
- Expected batch size: `{python} pc_expected_batch_str`
- Maximum batch size: `{python} pc_mem_limit_batch_str` (memory limit)
- Configuration: $T$ = `{python} pc_config_window_ms_str`ms, $b_{\text{max}}$ = `{python} pc_config_batch_str`
- Predicted p99 latency: `{python} pc_predicted_p99_ms_str`ms (within SLO)
- Predicted throughput: `{python} pc_predicted_throughput_str` img/s
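
The budget arithmetic behind the first four bullets can be wrapped in a small helper. This is a minimal sketch: `batching_config` and its field names are hypothetical, and it covers only the derived bounds. The tuned window and the predicted p99 latency and throughput come from measurement and are not reproduced here.

```python
def batching_config(slo_ms, qps, mem_limit_batch, budget_fraction=0.3):
    """Derive an upper bound on the window and a batch cap from the SLO."""
    max_window_ms = budget_fraction * slo_ms
    expected_batch = qps * max_window_ms / 1000.0
    return {
        "max_window_ms": max_window_ms,
        "expected_batch": expected_batch,
        # Memory caps batch size regardless of how many requests arrive.
        "max_batch": mem_limit_batch,
        "memory_bound": expected_batch > mem_limit_batch,
    }


config = batching_config(slo_ms=50, qps=500, mem_limit_batch=32)
print(config)
```

At this traffic level the expected batch stays well below the memory limit, so the latency budget is the binding constraint rather than GPU memory.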

### Continuous Batching {#sec-model-serving-continuous-batching-8bb6}

Autoregressive models\index{Autoregressive Models!token generation} like language models generate outputs token by token: each new token depends on all previously generated tokens, so generation is inherently sequential. The dynamic batching examined in @sec-model-serving-throughput-optimization-18d1 assumes fixed-length outputs. LLMs violate this assumption: if one sequence in a batch of 8 finishes after 10 tokens while the others need 100 tokens, its slot sits idle for the remaining 90 decoding steps, so 90 percent of the compute for that sequence slot is wasted\index{Sequence Length Variability!batch waste} [@yu2022orca].

Continuous batching[^fn-continuous-batching]\index{Continuous Batching!LLM serving} (also called iteration-level batching) addresses this waste by allowing new requests to join a batch between generation steps and completed sequences to exit [@kwon2023vllm]. The system manages batch composition dynamically at each decoding iteration rather than forming static batches that persist for the entire generation process.

The mechanism works as follows: when a sequence generates its end-of-sequence token, its slot becomes immediately available. A waiting request can fill that slot for the next iteration rather than waiting for the entire batch to complete. Similarly, the system can add new requests to available slots without interrupting ongoing generation.

[^fn-continuous-batching]: **Continuous Batching**: Introduced by the Orca system from Yu et al. at OSDI 2022 [@yu2022orca], which coined the term "iteration-level batching" to distinguish it from "request-level batching" (traditional dynamic batching). The key insight is scheduling granularity: traditional batching commits to a fixed batch for an entire generation sequence (potentially hundreds of iterations), while continuous batching makes scheduling decisions at every single token-generation step. NVIDIA adopted the term "in-flight batching" for the same concept in TensorRT-LLM. The technique draws an analogy to CPU process scheduling: just as modern operating systems preemptively schedule processes at each time slice rather than running each to completion, continuous batching reschedules the GPU's "process slots" at each decoding iteration.

This dynamic approach maintains high GPU utilization even when sequence lengths vary dramatically.
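
A toy simulation makes the difference in scheduling granularity visible. The sketch below counts decoding iterations for request-level (static) and iteration-level (continuous) batching on the same workload. The output lengths and slot count are assumed for illustration, and the model ignores prefill cost and KV cache limits.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Refill freed slots from the queue after every decoding step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One decoding iteration: every active sequence emits one token.
        active = [r - 1 for r in active]
        # Sequences that emitted their final token release their slots.
        active = [r for r in active if r > 0]
        steps += 1
    return steps


# Hypothetical output lengths in tokens (assumed workload).
lengths = [10, 100, 20, 100, 15, 30, 100, 25, 10, 40, 60, 12]
slots = 4
total_tokens = sum(lengths)

for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, slots)
    utilization = total_tokens / (steps * slots)
    print(f"{name:>10}: {steps} steps, slot utilization {utilization:.0%}")
```

Both schedulers generate the same tokens, but the continuous scheduler needs fewer iterations because short sequences no longer hold their slots until the longest sequence in the batch finishes.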

Systems implementing continuous batching, such as vLLM[^fn-vllm] and TensorRT-LLM[^fn-tensorrt-llm], achieve 2–4 $\times$ higher throughput than traditional static batching [@agrawal2024sarathi]. The improvement comes from two sources: eliminating wasted compute on completed sequences and reducing average wait time for new requests. For production language model serving where response lengths vary from single tokens to thousands, continuous batching has become essential for cost-effective deployment.

[^fn-vllm]: **vLLM**: Open-sourced by UC Berkeley researchers Woosuk Kwon, Zhuohan Li, and others in June 2023 alongside their SOSP paper introducing PagedAttention. The name stands for "virtual LLM," drawing a deliberate analogy to virtual memory in operating systems: just as virtual memory decouples logical addresses from physical RAM pages, vLLM decouples the logical KV cache from contiguous GPU memory blocks. This OS-inspired design solved the memory fragmentation problem that limited prior LLM serving systems to 50--60% memory utilization, enabling 2--4 $\times$ higher throughput at the same hardware cost.

[^fn-tensorrt-llm]: **TensorRT-LLM**: NVIDIA's open-source library (released October 2023) that extends TensorRT's inference optimization to large language models. Built on top of TensorRT's graph compilation and kernel fusion capabilities, it adds LLM-specific optimizations: in-flight batching (NVIDIA's term for continuous batching), paged KV cache management, multi-GPU tensor parallelism, and custom attention kernels. The library provides a Python API for defining models that compile to optimized TensorRT engines, bridging the gap between research model code and production-grade serving performance.

Memory management adds complexity to continuous batching. As sequences enter and exit the batch, the key-value cache that stores attention context must be dynamically allocated and freed. Consider what happens when sequences of varying lengths share GPU memory: a 100-token sequence completes and releases its cache, but a new 150-token sequence cannot use that space because it needs a larger contiguous block. Over time, small unusable gaps accumulate between allocated regions, eventually preventing new sequences from starting even when total free memory appears sufficient. This *memory fragmentation*\index{Memory Fragmentation!KV cache} can waste 40 to 50 percent of available memory in naive implementations, severely limiting the concurrent batch size that determines throughput.

#### PagedAttention {#sec-model-serving-pagedattention-b8d4}

PagedAttention\index{Continuous Batching!iteration-level}\index{PagedAttention!memory fragmentation solution},[^fn-pagedattention] introduced in vLLM, solves this fragmentation problem by applying operating system virtual memory concepts to GPU memory [@kwon2023vllm]. Instead of allocating one contiguous block per sequence, PagedAttention divides the KV cache into fixed-size *pages* (typically 16 tokens each). A sequence's cache consists of pointers to non-contiguous pages scattered across GPU memory. When a sequence completes, its pages return to a free list and can be reused by any new sequence, regardless of length. This approach achieves near-zero fragmentation: vLLM reports memory utilization above 95% compared to 50–60% for contiguous allocation schemes. The overhead is modest (one pointer lookup per page during attention computation), making PagedAttention the standard for production LLM serving.

[^fn-pagedattention]: **PagedAttention**: Introduced by Kwon et al. at SOSP 2023, this algorithm directly applies operating system virtual memory concepts to GPU memory management for LLMs. Before PagedAttention, researchers found that existing systems wasted 60–80% of KV cache memory due to fragmentation and over-reservation. By borrowing paging and copy-on-write mechanisms from OS design, PagedAttention reduces waste to under 4%, enabling 2–4 $\times$ higher throughput on the same hardware. This technique has become the de facto standard in production LLM serving systems.
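
The paging idea can be illustrated with a toy allocator that tracks only page ownership. This is a sketch of the bookkeeping, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the pool size are hypothetical, and no attention computation or GPU memory is involved.

```python
class PagedKVAllocator:
    """Toy allocator: fixed-size pages plus per-sequence page tables."""

    def __init__(self, num_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a page on demand."""
        length = self.lengths.get(seq_id, 0)
        table = self.page_tables.setdefault(seq_id, [])
        if length == len(table) * self.page_tokens:  # last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=20)  # 320 token slots in total
for _ in range(100):
    alloc.append_token("a")             # 100 tokens occupy 7 pages
for _ in range(100):
    alloc.append_token("b")             # 100 tokens occupy 7 pages
alloc.free("a")                         # 7 pages return to the free list
for _ in range(150):
    alloc.append_token("c")             # 150 tokens occupy 10 pages
print(alloc.page_tables["c"])
```

Sequence `c` ends up with pages on both sides of the region still held by `b`, which a contiguous allocator could not have used for a 150-token sequence. The only waste is the unfilled tail of each sequence's last page.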

The batching and memory techniques covered here establish the foundation for LLM serving, but several advanced topics warrant additional study:

::: {.callout-perspective title="LLM Serving: Beyond the Fundamentals"}

Language model serving introduces challenges beyond the batching and memory principles established here. The key-value cache that stores attention context scales with sequence length and batch size, often exceeding the model weights themselves in memory consumption. Techniques like speculative decoding\index{Speculative Decoding!latency reduction}\index{Speculative Decoding!draft model verification} use small draft models to propose multiple tokens that the target model verifies in parallel, achieving 2–3 $\times$ latency reduction for interactive applications. Weight-only quantization (INT4 weights with FP16 activations) proves more effective than activation quantization for memory-bandwidth-bound LLM inference.

These LLM-specific optimizations build directly on the foundations this chapter establishes: queuing theory governs request scheduling, batching tradeoffs determine throughput-latency curves, and precision selection follows the same accuracy-efficiency principles. The serving fundamentals apply universally; LLM serving adds domain-specific techniques atop this foundation. More advanced treatments cover KV cache optimization in detail, including techniques for multi-tenant serving and distributed inference.

:::

Continuous batching represents the state of the art for LLM serving, yet not all deployment scenarios benefit from batching. The sophisticated techniques examined so far (from dynamic batching windows to PagedAttention) optimize for high-throughput server workloads. These techniques introduce complexity and latency overhead that may not be justified for all deployment contexts. A practical question remains: *when* does batching hurt rather than help?

#### When Not to Batch {#sec-model-serving-batch-12a4}

Some\index{Batching!when to avoid} scenarios require single-request processing. Ultra-low latency requirements\index{Ultra-Low Latency!no batching}, where p99 latency must stay under 10 ms, make any batching delay unacceptable. Highly variable request sizes create padding overhead that wastes compute, since the smallest input in a batch must be padded to match the largest. And memory constraints become binding when models already consume most GPU memory, since batch activations scale linearly with batch size and can trigger out-of-memory errors.
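
The padding cost of mixing request sizes is easy to estimate. The sketch below assumes compute scales linearly with padded length, which is a simplification, and the two sets of input lengths are made up for illustration.

```python
def padding_waste(lengths):
    """Fraction of batch compute spent on padding tokens."""
    padded = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded


uniform = [120, 128, 125, 122]      # similar request sizes (assumed)
variable = [16, 32, 64, 512]        # highly variable sizes (assumed)
print(f"uniform:  {padding_waste(uniform):.0%} padding")
print(f"variable: {padding_waste(variable):.0%} padding")
```

When one long request sets the padded length for several short ones, most of the batch's compute goes to padding, and serving the requests individually or bucketing them by size becomes the better choice.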

### Session Affinity Constraints {#sec-model-serving-session-affinity-constraints-8b1f}

When requests from the same user or session should route to the same replica, batching becomes constrained. Session affinity, also called sticky sessions, matters for three main reasons.

**KV-Cache Reuse**\index{KV Cache!session reuse}\index{KV Cache!multi-turn conversations}: For conversational AI, the key-value cache from previous turns dramatically speeds up multi-turn conversations. Routing a follow-up request to a different replica forfeits this cached context, increasing latency by 2 to 5 times for long conversations.

**User-Specific Models**\index{Personalized Models!user adapters}: Some systems serve personalized models or adapters per user. Routing requests to the replica that has already loaded that user's adapter avoids repeated loading overhead.

**Stateful Preprocessing**: When preprocessing maintains state through tokenizer caches or session-specific normalization, routing to a different replica requires rebuilding this state.

The tension with batching is clear: strict affinity\index{Session Affinity!sticky sessions} constrains which requests can be batched together, potentially reducing batch sizes and GPU utilization. Production systems often implement soft affinity\index{Soft Affinity!load balancing}, where requests prefer their assigned replica but can overflow to others when that replica is overloaded. This preserves most affinity benefits while maintaining load balance.
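
One way to express soft affinity is a router that hashes each session to a home replica and falls back to the least-loaded replica when the home queue is too deep. The sketch below is a minimal illustration under assumed names (`preferred_replica`, `route`) and an assumed overflow threshold; production routers typically also account for cache state and replica health.

```python
import hashlib


def preferred_replica(session_id, num_replicas):
    """Stable hash so a session maps to the same replica every time."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


def route(session_id, queue_depths, overflow_threshold=8):
    """Prefer the session's replica; overflow to the least loaded one."""
    home = preferred_replica(session_id, len(queue_depths))
    if queue_depths[home] < overflow_threshold:
        return home
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)


depths = [2, 3, 1, 4]
home = preferred_replica("user-42", len(depths))
assert route("user-42", depths) == home      # affinity preserved

depths[home] = 20                            # home replica overloaded
print(route("user-42", depths))              # overflows elsewhere
```

The threshold controls the trade-off: a low value favors load balance and forfeits cached state more often, while a high value protects affinity at the cost of longer queues on hot replicas.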

### Traffic Patterns and Batching Strategy {#sec-model-serving-traffic-patterns-batching-strategy-2e6b}

The optimal batching strategy depends critically on how requests arrive. Different deployment contexts exhibit distinct arrival patterns, each requiring different batching approaches. The MLPerf inference benchmark codifies these patterns into four scenarios that directly map to real-world deployments, as @sec-benchmarking explains in detail.

#### Server Traffic (Poisson Arrivals) {#sec-model-serving-server-traffic-poisson-arrivals-5d26}

Cloud APIs\index{Server Traffic!Poisson process} and web services typically receive requests following a Poisson process,[^fn-poisson-process] where arrivals are independent and uniformly distributed over time. @eq-poisson-batch expresses the expected batch size for Poisson arrivals with rate $\lambda$ and batching window $T$:

[^fn-poisson-process]: **Poisson Process**: A stochastic model where events occur continuously and independently at a constant average rate. Named after French mathematician Siméon Denis Poisson (1781--1840), this model accurately describes many real-world arrival patterns including web requests and API calls. The key property for serving systems is that inter-arrival times are exponentially distributed, meaning the probability of long gaps between requests decays exponentially, which is why batching windows can be tuned probabilistically.

$$E[\text{batch size}] = \lambda \cdot T$$ {#eq-poisson-batch}

The variance equals the mean (a property of Poisson distributions), so batch sizes fluctuate significantly at moderate traffic. With $\lambda = 200$ requests/second and $T = 10$ms, the expected batch size is 2, but about 14% of windows ($e^{-2} \approx 0.135$) will have zero requests (wasted compute cycles) while others may have 4 or more.
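
The spread of batch sizes for this example can be computed from the Poisson probability mass function. The sketch below uses only the standard library; `poisson_pmf` is an illustrative helper name.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)


mean_batch = 200 * 0.010  # lambda * T = 2 requests per window
p_empty = poisson_pmf(0, mean_batch)
p_four_plus = 1 - sum(poisson_pmf(k, mean_batch) for k in range(4))
print(f"P(empty window) = {p_empty:.3f}")
print(f"P(batch >= 4)   = {p_four_plus:.3f}")
```

Empty windows and batches of twice the mean or more occur with comparable probability, which is why a window sized for the mean batch sees such uneven utilization.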

The optimal batching window balances waiting cost against throughput benefit. @eq-optimal-window defines this optimum:

$$T_{\text{optimal}} = \min\left(L_{\text{lat,SLO}} - S, \sqrt{\frac{S}{\lambda}}\right)$$ {#eq-optimal-window}

where $L_{\text{lat,SLO}}$ is the latency SLO and $S$ is the service time. A perhaps surprising result emerges from this equation: as traffic increases, the optimal window decreases while achieved batch sizes still grow. @tbl-traffic-adaptive demonstrates this phenomenon across four traffic levels.

| **Arrival Rate** | **Optimal Window** | **Avg Batch Size** | **p99 Latency** |
|:-----------------|-------------------:|-------------------:|----------------:|
| **100 QPS**      |               20ms |                2.0 |            45ms |
| **500 QPS**      |                8ms |                4.0 |            42ms |
| **1,000 QPS**    |                5ms |                5.0 |            38ms |
| **5,000 QPS**    |                2ms |               10.0 |            35ms |

: **Traffic-Adaptive Batching**: Higher traffic enables shorter windows while still achieving larger batches. The optimal window decreases even as batch sizes grow because more requests arrive per unit time. {#tbl-traffic-adaptive}
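
The trend in @tbl-traffic-adaptive follows directly from @eq-optimal-window. The sketch below evaluates the equation across the same arrival rates. The SLO and service time are assumed values chosen for illustration, so the output approximates rather than reproduces the table.

```python
import math


def optimal_window_ms(slo_ms, service_ms, qps):
    """Evaluate min(L_SLO - S, sqrt(S / lambda)) with times in ms."""
    service_s = service_ms / 1000.0
    sqrt_term_ms = 1000.0 * math.sqrt(service_s / qps)
    return min(slo_ms - service_ms, sqrt_term_ms)


slo_ms, service_ms = 50.0, 30.0  # assumed values for illustration
for qps in (100, 500, 1000, 5000):
    window = optimal_window_ms(slo_ms, service_ms, qps)
    batch = qps * window / 1000.0
    print(f"{qps:>5} QPS: window {window:4.1f} ms, mean batch {batch:4.1f}")
```

Because the window shrinks as $1/\sqrt{\lambda}$ while arrivals grow as $\lambda$, the mean batch $\lambda T$ grows as $\sqrt{\lambda}$: shorter windows and larger batches at the same time.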

#### Streaming Traffic (Correlated Arrivals) {#sec-model-serving-streaming-traffic-correlated-arrivals-32b6}

Autonomous vehicles\index{Streaming Traffic!sensor synchronization}, video analytics, and robotics systems receive inputs from multiple synchronized sensors. This scenario illustrates *multi-camera autonomous vehicle serving*.

::: {.callout-notebook title="Multi-Camera Autonomous Vehicle Serving"}

Consider a vehicle with 6 cameras capturing at 30 FPS, requiring spatial fusion:

**Timeline for processing frame set N:**

| **Time** | **Event**                         |
|:---------|:----------------------------------|
| T = 0ms  | Cameras begin capturing frame N   |
| T = 8ms  | Camera 1 frame arrives            |
| T = 10ms | Cameras 2-5 frames arrive         |
| T = 15ms | Camera 6 arrives (jitter)         |
| T = 15ms | Batch inference begins (6 images) |
| T = 25ms | Inference complete                |
| T = 32ms | Result ready for planning module  |

**Key constraints:**

- Hard deadline: 33ms per frame set (real-time requirement)
- Batch size: Fixed at 6 (one per camera)
- Synchronization budget: 12ms of 33ms total (36% for jitter tolerance)
- Timeout policy: If camera frame not received by T+20ms, use previous frame

Unlike Poisson traffic where dynamic batching optimizes throughput, streaming traffic requires synchronization policies that handle sensor jitter while meeting hard deadlines.

:::
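
The timeout policy above can be sketched as a function that assembles one batch per frame set. The function name, the data layout, and the frame labels are hypothetical; the 20ms timeout and the fall-back to the previous frame follow the constraints listed in the example.

```python
def assemble_frame_set(arrivals_ms, previous_frames, timeout_ms=20.0):
    """Build one batch per frame set, reusing stale frames on timeout.

    arrivals_ms maps camera id -> (arrival time in ms, frame), with times
    measured from the start of capture. Cameras missing from the mapping
    or arriving after the timeout fall back to their previous frame.
    """
    batch, stale, ready_ms = {}, [], 0.0
    for cam, prev in previous_frames.items():
        arrival = arrivals_ms.get(cam)
        if arrival is not None and arrival[0] <= timeout_ms:
            batch[cam] = arrival[1]
            ready_ms = max(ready_ms, arrival[0])
        else:
            batch[cam] = prev
            stale.append(cam)
    if stale:
        ready_ms = timeout_ms  # had to wait out the full timeout
    return batch, stale, ready_ms


previous = {cam: f"frame{cam}@N-1" for cam in range(1, 7)}
arrivals = {1: (8.0, "frame1@N"), 6: (15.0, "frame6@N")}
arrivals.update({cam: (10.0, f"frame{cam}@N") for cam in range(2, 6)})

batch, stale, ready_ms = assemble_frame_set(arrivals, previous)
print(ready_ms, stale)  # inference can start once the last frame arrives
```

The batch size is always six, so the policy trades freshness for a bounded start time: a late camera delays inference by at most the timeout, never by its full jitter.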

#### Single-User Traffic (Sequential Arrivals) {#sec-model-serving-singleuser-traffic-sequential-arrivals-78da}

Mobile\index{Single-User Traffic!mobile serving}\index{SingleStream!MLPerf scenario} and embedded applications serve one user at a time, with requests arriving only after the previous result is consumed. We can analyze these constraints in *ResNet-50 mobile serving*.

```{python}
#| label: mobile-serving-calc
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILE SERVING LATENCY AND ENERGY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Mobile Serving" — single-user traffic pattern
# │
# │ Goal: Contrast latency and energy costs for mobile inference.
# │ Show: That JPEG decode dominates the energy budget, exceeding NPU inference.
# │ How: Model latency and Joules per request for a complete vision pipeline.
# │
# │ Imports: (none)
# │ Exports: m_cam_ms_str, m_jpeg_ms_str, m_resize_ms_str, m_npu_ms_str,
# │          m_ui_ms_str, m_total_ms_str, m_cam_mj_str, m_jpeg_mj_str,
# │          m_resize_mj_str, m_npu_mj_str, m_ui_mj_str, m_total_mj_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (mobile latency and energy per phase) ---
m_cam_ms_value = 8
m_jpeg_ms_value = 15
m_resize_ms_value = 5
m_npu_ms_value = 12
m_ui_ms_value = 5

m_cam_mj_value = 0.08
m_jpeg_mj_value = 1.5
m_resize_mj_value = 0.4
m_npu_mj_value = 0.8
m_ui_mj_value = 0.2

# --- Process (total latency and energy) ---
m_total_ms_value = (
    m_cam_ms_value + m_jpeg_ms_value + m_resize_ms_value + m_npu_ms_value + m_ui_ms_value
)
m_total_mj_value = (
    m_cam_mj_value + m_jpeg_mj_value + m_resize_mj_value + m_npu_mj_value + m_ui_mj_value
)

# --- Outputs (formatted strings for table) ---
m_cam_ms_str = f"{m_cam_ms_value}ms"
m_jpeg_ms_str = f"{m_jpeg_ms_value}ms"
m_resize_ms_str = f"{m_resize_ms_value}ms"
m_npu_ms_str = f"{m_npu_ms_value}ms"
m_ui_ms_str = f"{m_ui_ms_value}ms"
m_total_ms_str = f"{m_total_ms_value}ms"

m_cam_mj_str = f"{m_cam_mj_value}mJ"
m_jpeg_mj_str = f"{m_jpeg_mj_value}mJ"
m_resize_mj_str = f"{m_resize_mj_value}mJ"
m_npu_mj_str = f"{m_npu_mj_value}mJ"
m_ui_mj_str = f"{m_ui_mj_value}mJ"
m_total_mj_str = f"{m_total_mj_value:.1f}mJ"
```

::: {.callout-notebook title="ResNet-50: Mobile Serving"}

| **Phase**              | **Duration**                  | **Energy**                    | **Notes**         |
|:-----------------------|:------------------------------|:------------------------------|:------------------|
| **Camera buffer read** | `{python} m_cam_ms_str`       | `{python} m_cam_mj_str`       | System API        |
| **JPEG decode (CPU)**  | `{python} m_jpeg_ms_str`      | `{python} m_jpeg_mj_str`      | Single-threaded   |
| **Resize + Normalize** | `{python} m_resize_ms_str`    | `{python} m_resize_mj_str`    | CPU preprocessing |
| **NPU inference**      | `{python} m_npu_ms_str`       | `{python} m_npu_mj_str`       | 82% utilization   |
| **Post-process + UI**  | `{python} m_ui_ms_str`        | `{python} m_ui_mj_str`        | Result rendering  |
| **Total**              | **`{python} m_total_ms_str`** | **`{python} m_total_mj_str`** | 22 FPS sustained  |

**Key metrics for ML node serving:**

- **Energy per inference**: 3.0mJ enables ~12 million inferences per 10Wh battery (typical smartphone)
- **Thermal budget**: At 3.0mJ/45ms = 67mW sustained, indefinite operation without throttling
- **NPU vs CPU tradeoff**: CPU fallback uses 4.2mJ (1.4 $\times$ energy) at 85ms (1.9 $\times$ latency)
- **Memory footprint**: 150MB peak (model + activations), competing with app memory

**Critical insight**: Even at batch size 1, the mobile NPU achieves 82% utilization because its compute capacity matches single-image workloads. This differs from datacenter GPUs, which achieve only 15% utilization at batch size 1 because their massive parallelism requires larger batches to saturate.

:::
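
The battery and power figures follow from the per-phase values in the table. The sketch below repeats that arithmetic using the unrounded energy total, so its results differ slightly from the rounded 3.0mJ figures quoted above. The dictionary keys are illustrative labels.

```python
phases_mj = {          # per-phase energy from the table, in millijoules
    "camera": 0.08, "jpeg": 1.5, "resize": 0.4, "npu": 0.8, "ui": 0.2,
}
phases_ms = {"camera": 8, "jpeg": 15, "resize": 5, "npu": 12, "ui": 5}

energy_mj = sum(phases_mj.values())
latency_ms = sum(phases_ms.values())

battery_j = 10 * 3600                         # 10 Wh expressed in joules
inferences = battery_j / (energy_mj / 1000)
sustained_mw = energy_mj / latency_ms * 1000  # mJ/ms = W, then to mW
fps = 1000 / latency_ms

print(f"{energy_mj:.2f} mJ, {latency_ms} ms per inference")
print(f"{inferences / 1e6:.1f} M inferences per 10 Wh, "
      f"{sustained_mw:.0f} mW sustained, {fps:.0f} FPS")
```

Dividing each phase by the total also shows where optimization effort pays off: JPEG decode accounts for about half of the energy, more than NPU inference itself.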

#### Mobile Serving Constraints {#sec-model-serving-mobile-serving-constraints-eb68}

Unlike cloud serving where cost dominates, mobile serving faces three related constraints that shape optimization strategy:

1. **Energy Budget**\index{Energy Budget!mobile inference}: Each inference depletes battery. A photo app running continuous inference at 22 FPS draws roughly 67mW for the pipeline above (3.0mJ per inference), acceptable for active use but problematic for background processing. The optimization target shifts from throughput to energy-per-inference.

2. **Thermal Throttling**\index{Thermal Throttling!mobile serving}: Sustained high-power operation triggers thermal management. When the SoC reaches thermal limits (typically 45°C junction), the OS reduces NPU frequency by 30–50%, degrading both latency and throughput. Bursty workloads that allow cooling between bursts outperform sustained maximum throughput.

3. **Memory Constraints**\index{Memory Constraints!mobile RAM}: Mobile devices share limited RAM between applications. A model consuming 500MB may be evicted during background operation, requiring reload (cold start) that adds 200–500 ms latency. Even a 150MB footprint becomes problematic when the model must coexist with other app components. Memory-efficient quantization directly improves user experience through faster model restoration, and memory-mapped model loading (@sec-model-serving-loading-strategies-eb38) helps further by loading pages on demand rather than requiring the full model in memory.

These constraints make mobile serving optimization qualitatively different from cloud optimization. The goal is not maximum throughput but **sustainable performance**, maintaining acceptable latency without thermal throttling or excessive battery drain.

@tbl-traffic-patterns-summary maps the four MLPerf scenarios to their deployment contexts and optimal batching strategies, providing a decision framework for serving system design.

| **Scenario**                                         | **Context**                         | **Strategy**                  | **Focus**                                |
|:-----------------------------------------------------|:------------------------------------|:------------------------------|:-----------------------------------------|
| **Server**\index{Server Scenario!MLPerf}             | Cloud APIs, web services            | Dynamic batching with timeout | Window tuning, utilization-latency curve |
| **MultiStream**\index{MultiStream!MLPerf scenario}   | Autonomous driving, video analytics | Synchronized sensor fusion    | Jitter handling, deadline guarantees     |
| **SingleStream**                                     | Mobile apps, embedded devices       | No batching (batch=1)         | Preprocessing, power efficiency          |
| **Offline**\index{Offline Inference!MLPerf scenario} | Batch processing, data pipelines    | Maximum batch size            | Throughput, hardware utilization         |

: **Traffic Patterns and Batching Strategies**: The four MLPerf inference scenarios map to distinct deployment contexts. Server traffic (cloud APIs) uses dynamic batching with timeout; MultiStream (autonomous driving) uses synchronized sensor fusion; SingleStream (mobile) processes requests individually; Offline (batch processing) maximizes batch size for throughput. {#tbl-traffic-patterns-summary}

:::: {.callout-checkpoint title="Batching and Traffic Patterns" collapse="false"}
Batching is the primary lever for serving economics, but the optimal strategy depends on context.

- [ ] **Throughput-latency tradeoff**: Can you explain why batch size 32 achieves 6 $\times$ higher throughput than batch size 1, yet a production system with a 20ms SLO might still choose batch size 8?
- [ ] **Dynamic vs. static batching**: Can you describe why static batching (waiting for a full batch) fails under variable traffic, and how dynamic batching with a time window solves this?
- [ ] **Traffic pattern matching**: Given a deployment scenario (e.g., cloud API, autonomous vehicle, mobile app), can you select the appropriate MLPerf scenario and explain why that batching strategy fits?
- [ ] **Adaptive windows**: Can you explain why the optimal batching window *decreases* as traffic *increases*, even though batch sizes grow?
::::

The batching strategies examined so far share a critical assumption: each request produces a single, fixed-size output---one classification label, one bounding box, one embedding vector. This assumption governs the queuing math, the Pareto frontier analysis, and the traffic-adaptive window tuning. But the fastest-growing category of serving workloads violates this assumption entirely. Large language models generate outputs token by token, with each token depending on every previous one. A single request may produce hundreds or thousands of tokens over seconds of elapsed time, yet must feel responsive from the first token onward. This fundamental shift from fixed-output to variable-length, streaming-output serving demands new metrics, new memory management strategies, and new batching techniques that build on---but substantially extend---the foundations established above.

## LLM Serving {#sec-model-serving-llm-serving-b8bf}

Large language models\index{LLM Serving!token generation} introduce three properties absent from traditional serving: *autoregressive generation*[^fn-autoregressive] (each token depends on all previous tokens, making output inherently sequential), *variable-length output* (response length is unknown at request time, invalidating fixed-batch assumptions), and *stateful memory* (the key-value cache grows with each generated token, creating dynamic memory pressure that traditional models never face). Together, these properties create a qualitatively different serving challenge. The p50, p95, and p99 metrics that govern classification serving still matter, but they apply to different *phases* of the request---the initial prompt processing and the subsequent token generation. The foundational principles of queuing theory, batching tradeoffs, and latency budgets apply universally; LLM serving adds domain-specific techniques atop this foundation.

[^fn-autoregressive]: **Autoregressive**: From Greek *auto-* (self) and Latin *regressus* (a going back). In statistics, an autoregressive model predicts the next value from its own previous values—the output "regresses" on itself. George Udny Yule introduced autoregressive models in 1927 for analyzing sunspot cycles. In language modeling, the term describes the sequential token generation process where each output token conditions on all previously generated tokens, creating a fundamental serial dependency that prevents the parallelism exploited during training's forward pass. This serial bottleneck explains why LLM serving is memory-bandwidth-bound rather than compute-bound: the model must be read from memory once per token, regardless of available compute capacity.

### Performance Metrics: TTFT and TPOT {#sec-model-serving-performance-metrics-ttft-tpot-b009}

Generative models produce a stream of tokens rather than a single output tensor. This streaming nature requires dedicated *LLM performance metrics* that reflect the internal state transition from "prefill" (processing input) to "decode" (generating output). The two key measures are *Time to First Token (TTFT)* and *Time Per Output Token (TPOT)*, which capture responsiveness and fluidity respectively.

::: {.callout-definition title="LLM Performance Metrics"}

***Time to First Token (TTFT)***\index{TTFT (Time to First Token)}\index{TPOT (Time Per Output Token)} measures latency from request to first output token, governed by the compute-bound **Prefill Phase** (processing the full prompt) [@pope2023efficiently]. ***Time Per Output Token (TPOT)*** measures latency of each subsequent token, governed by the memory-bandwidth-bound **Decode Phase** (autoregressive KV cache lookups). This decomposition isolates the distinct hardware bottlenecks (**Compute** versus **Memory Bandwidth**), enabling targeted optimization of each phase.

:::

These two metrics[^fn-ttft-tpot] capture distinct user experience aspects. A fast TTFT provides immediate responsiveness (the system starts answering quickly), while a fast TPOT provides fluid generation (the answer streams smoothly). Production systems must optimize both, typically with different techniques since they are governed by different hardware constraints. Translating these metrics into concrete *LLM serving latency targets* grounds the discussion in production reality.

[^fn-ttft-tpot]: **TTFT and TPOT**: These metrics emerged from the LLM serving community circa 2022--2023, formalized in Pope et al.'s analysis of efficient generative inference [@pope2023efficiently]. TTFT adapts the traditional "time to first byte" (TTFB) metric from web performance, coined in the early 2000s to measure server responsiveness, to token-based generation. TPOT has no direct predecessor; it captures the unique streaming characteristic of autoregressive models, where user-perceived quality depends not on total completion time but on the *rhythm* of token arrival. Together, they replace the single-number latency metric adequate for classification models with a two-dimensional characterization reflecting the prefill/decode phase split inherent to transformer generation.

::: {.callout-lighthouse title="LLM Serving Latency Targets"}

A production-grade LLM service typically targets the following SLOs:

- **TTFT**: < 500 ms (for a 1000-token prompt)
- **TPOT**: < 50 ms (equivalent to ~20 tokens/second, faster than human reading speed)
- **Throughput**: > 1000 tokens/second aggregate across all users

:::
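
Both targets can be checked from a per-request trace of token arrival times. The following is a minimal sketch, assuming timestamps in milliseconds; `LLMLatency`, `measure`, and `meets_slo` are hypothetical helper names, and the trace is synthetic rather than measured.

```python
from dataclasses import dataclass


@dataclass
class LLMLatency:
    ttft_ms: float   # time to first token
    tpot_ms: float   # mean time per output token after the first
    total_ms: float  # request arrival to last token


def measure(request_ms: float, token_ms: list[float]) -> LLMLatency:
    """Derive TTFT and TPOT from the request arrival time and the
    wall-clock arrival time of each output token (all in milliseconds)."""
    if not token_ms:
        raise ValueError("need at least one output token")
    ttft = token_ms[0] - request_ms
    gaps = len(token_ms) - 1
    tpot = (token_ms[-1] - token_ms[0]) / gaps if gaps else 0.0
    return LLMLatency(ttft, tpot, token_ms[-1] - request_ms)


def meets_slo(m: LLMLatency, ttft_slo_ms: float = 500.0,
              tpot_slo_ms: float = 50.0) -> bool:
    """Check a request against the TTFT and TPOT targets listed above."""
    return m.ttft_ms < ttft_slo_ms and m.tpot_ms < tpot_slo_ms


# Synthetic trace: first token at 400 ms, then one token every 40 ms.
trace = [400.0 + 40.0 * i for i in range(500)]
m = measure(request_ms=0.0, token_ms=trace)
print(m, meets_slo(m))
```

The trace passes both targets, yet the full 500-token response still takes about 20 seconds end to end, which is why streaming and the two-metric view matter more than total completion time.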

### Decoding Strategies {#sec-model-serving-decoding-strategies-afe8}

Generative models require decoding strategies that trade off quality, diversity, and latency. The choice of decoding strategy dramatically affects both output quality and computational cost.

The simplest approach, greedy decoding\index{Greedy Decoding!LLM generation}, selects the highest-probability token at each step. It is fast but often produces repetitive, low-quality outputs because it cannot recover from early mistakes. Beam search[^fn-beam-search]\index{Beam Search!decoding strategy}\index{Beam Search!candidate sequences} improves quality by maintaining multiple candidate sequences and selecting the highest-scoring complete sequence, though it multiplies computation by the beam width. Sampling\index{Sampling!temperature, top-k, top-p} with temperature, top-k, and top-p parameters introduces randomness for diversity [@holtzman2020curious]. Temperature divides the logits before softmax, so values below 1 sharpen the distribution and values above 1 flatten it. Top-k limits sampling to the k highest-probability tokens. Top-p, also called nucleus sampling[^fn-nucleus-sampling]\index{Nucleus Sampling!top-p}, limits sampling to the smallest set of tokens whose cumulative probability reaches p.

[^fn-beam-search]: **Beam Search**: A heuristic search algorithm that explores a graph by expanding only the most promising nodes at each level. The "beam" metaphor evokes a flashlight illuminating a narrow band of the search space—wider beams (larger beam widths) explore more possibilities but cost proportionally more compute and memory. Beam search originated in speech recognition in the 1970s (Raj Reddy's group at CMU) and was adopted for neural machine translation by Sutskever et al. [@sutskever2014sequence]. For LLM serving, beam width directly multiplies memory requirements because each beam maintains its own KV cache, making beam width a critical serving cost parameter.

[^fn-nucleus-sampling]: **Nucleus Sampling (Top-p)**: Introduced by Ari Holtzman et al. in "The Curious Case of Neural Text Degeneration" [@holtzman2020curious] (2019). The "nucleus" metaphor comes from the core of the probability distribution: rather than sampling from a fixed number of top tokens (top-k), nucleus sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold *p* (typically 0.9--0.95). This adapts to the shape of the distribution: when the model is confident, few tokens are sampled; when uncertain, more diversity is allowed. The approach elegantly solves the problem that top-k with fixed *k* is either too restrictive (cutting off valid tokens) or too permissive (including nonsensical ones) depending on context.

The choice presents latency tradeoffs [@meister2020beam]. Beam search with width 5 takes roughly 5 $\times$ the compute of greedy decoding. Sampling adds minimal overhead but requires careful parameter tuning to balance diversity and coherence.
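
The sampling controls described above compose in a fixed order: scale by temperature, apply softmax, then truncate by top-k and top-p before drawing. The NumPy sketch below illustrates that order under simplifying assumptions (a single logit vector, no batching); `sample_next_token` is a hypothetical helper, not the API of any serving framework.

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token id from raw logits using temperature, top-k, and top-p.
    temperature == 0 falls back to greedy decoding (argmax)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))

    # Temperature scales logits before softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()

    # Sort token ids by descending probability.
    order = np.argsort(-probs)
    sorted_probs = probs[order]

    keep = np.ones(len(sorted_probs), dtype=bool)
    if top_k > 0:
        keep[top_k:] = False  # top-k: keep the k most likely tokens
    if top_p < 1.0:
        # top-p: keep the smallest prefix whose cumulative mass reaches p.
        cumulative = np.cumsum(sorted_probs)
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[cutoff:] = False

    filtered = np.where(keep, sorted_probs, 0.0)
    filtered /= filtered.sum()  # renormalize over the surviving tokens
    return int(order[rng.choice(len(filtered), p=filtered)])


logits = [2.0, 1.0, 0.5, -1.0]
rng = np.random.default_rng(0)
greedy = sample_next_token(logits, temperature=0.0)
nucleus = sample_next_token(logits, temperature=0.8, top_p=0.9, rng=rng)
```

All of this work touches only the final logit vector, which is why sampling adds little latency compared with beam search, where every extra beam repeats the full forward pass.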

Production LLM systems\index{Streaming Responses!LLM serving}\index{Chunked HTTP!streaming tokens} return tokens as they are produced rather than waiting for complete generation. This transforms the user experience: a 2-second total generation feels responsive when tokens stream continuously, but feels broken when users stare at a blank screen for 2 seconds. Streaming requires infrastructure support for chunked HTTP responses and client-side incremental rendering. The latency profile shifts accordingly: TTFT determines when output starts appearing (responsiveness), while TPOT determines the perceived generation speed (fluidity).

### Memory and KV Cache {#sec-model-serving-memory-kv-cache-d1ea}

Generative inference requires managing the **KV Cache**[^fn-kv-cache]\index{KV Cache!LLM memory}\index{KV Cache!sequence length scaling}, a stateful memory structure that grows with sequence length. Unlike traditional models where memory usage is constant per batch, LLM memory usage is dynamic. Each generated token adds to the context window, consuming additional GPU memory through state accumulation, and variable-length sequences can lead to memory fragmentation if not managed explicitly.

[^fn-kv-cache]: **KV Cache (Key-Value Cache)**: In transformer attention (introduced by Vaswani et al. [@vaswani2017attention]), each layer computes Key and Value projections from input tokens. During autoregressive generation, previously computed K and V vectors remain valid for all future tokens—only the new token's Q (Query) vector changes. The KV cache stores these precomputed projections to avoid redundant recomputation, trading memory for compute. Without caching, generating the $n$-th token would require reprocessing all $n-1$ previous tokens, making generation cost quadratic in sequence length. With caching, each new token requires only one new K-V pair per layer, reducing cost to linear. The catch: for a 70B model with 80 layers and FP16 precision, the KV cache consumes ~1.3 MB per token per request, meaning a batch of 32 requests at 8,000 tokens each requires ~330 GB of KV cache alone—far exceeding the model weights themselves.

The continuous batching and PagedAttention techniques covered in @sec-model-serving-continuous-batching-8bb6 address these challenges. Advanced techniques such as prefix caching and speculative decoding are deferred to specialized treatments of large-scale systems.
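
The memory growth described in the KV cache footnote follows from a simple product. The sketch below is a back-of-the-envelope estimate, not a profiler: the 80 layers, FP16 precision, batch of 32, and 8,000 tokens come from the footnote, while the 32 KV heads of dimension 128 are an assumption chosen so the per-token result lands near the quoted ~1.3 MB. Real models differ (grouped-query attention, for instance, reduces the KV head count).

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache added per token for one request.
    The leading 2 counts the Key and the Value tensor in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(per_token_bytes: int, batch: int, seq_len: int) -> float:
    """Total KV cache in GB (10^9 bytes) for `batch` requests of `seq_len` tokens."""
    return per_token_bytes * batch * seq_len / 1e9


# 80 layers and FP16 (2 bytes per element) come from the text.
# ASSUMPTION: 32 KV heads of dimension 128 (hypothetical configuration).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=32, head_dim=128,
                               bytes_per_elem=2)
total_gb = kv_cache_gb(per_token, batch=32, seq_len=8000)

print(f"{per_token / 1e6:.2f} MB per token")
print(f"{total_gb:.0f} GB for 32 requests x 8,000 tokens")
```

Because the total is linear in both batch size and sequence length, the scheduler cannot admit requests on compute availability alone; it has to budget memory for tokens that have not been generated yet.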

The computational intensity of managing KV caches across concurrent requests raises a broader question: *what* is the energy cost of each token generated? Unlike classification models where energy per inference is constant, LLM energy consumption scales with response length—every generated token requires reading the entire model from memory. Quantifying *the carbon cost of a chat* translates these hardware demands into energy and carbon metrics that make the environmental impact concrete.

```{python}
#| label: carbon-cost-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CARBON COST OF CHAT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Carbon Cost of a Chat" - energy footprint of LLM serving
# │
# │ Goal: Quantify the energy cost per LLM token.
# │ Show: That poor utilization causes roughly 3× higher energy consumption per token.
# │ How: Calculate Joules per token based on TDP and concurrent request volume.
# │
# │ Imports: mlsys.constants (H100_TDP, energy comparisons), h100_tdp_value (from
# │          gpu-specs cell)
# │ Exports: cc_concurrent_str, cc_tokens_req_str, cc_total_tokens_str,
# │          cc_host_overhead_str, cc_total_power_str, cc_joules_token_str,
# │          cc_response_tokens_str, cc_response_joules_str, cc_smartphone_str,
# │          cc_boiling_str, cc_low_util_str, cc_idle_power_str, cc_low_util_joules_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import H100_TDP, ENERGY_SMARTPHONE_CHARGE_J, ENERGY_BOILING_WATER_J, watt, joule
from mlsys.formatting import fmt

# --- Inputs (LLM serving scenario assumptions) ---
cc_concurrent_req_value = 114                     # concurrent requests per H100
cc_tokens_per_sec_req_value = 7.5                 # tokens/sec per request (decode phase)
cc_host_overhead_w_value = 300                    # host server power overhead (W)
cc_response_tokens_value = 500                    # typical response length
cc_low_util_pct_value = 10                        # poor utilization scenario (%)
cc_idle_power_w_value = 300                       # GPU idle power (W)

# --- Process (energy calculations) ---
h100_tdp_value = H100_TDP.to(watt).magnitude
cc_total_tokens_sec_value = cc_concurrent_req_value * cc_tokens_per_sec_req_value
cc_total_power_w_value = h100_tdp_value + cc_host_overhead_w_value
cc_joules_per_token_value = cc_total_power_w_value / cc_total_tokens_sec_value
cc_response_joules_value = cc_joules_per_token_value * cc_response_tokens_value
cc_smartphone_joules_value = ENERGY_SMARTPHONE_CHARGE_J.to(joule).magnitude
cc_boiling_water_joules_value = ENERGY_BOILING_WATER_J.to(joule).magnitude
cc_low_util_tokens_sec_value = cc_total_tokens_sec_value * (cc_low_util_pct_value / 100)
cc_low_util_joules_value = cc_idle_power_w_value / cc_low_util_tokens_sec_value

# --- Outputs (formatted strings for prose) ---
cc_concurrent_str = fmt(cc_concurrent_req_value, precision=0, commas=False)           # e.g. "114" requests
cc_tokens_req_str = fmt(cc_tokens_per_sec_req_value, precision=1, commas=False)       # e.g. "7.5" tokens/sec
cc_total_tokens_str = fmt(cc_total_tokens_sec_value, precision=0, commas=False)       # e.g. "855" tokens/sec
cc_host_overhead_str = fmt(cc_host_overhead_w_value, precision=0, commas=False)       # e.g. "300" W
cc_total_power_str = fmt(cc_total_power_w_value, precision=0, commas=False)           # e.g. "1000" W
cc_joules_token_str = fmt(cc_joules_per_token_value, precision=2, commas=False)       # e.g. "1.17" J/token
cc_response_tokens_str = fmt(cc_response_tokens_value, precision=0, commas=False)     # e.g. "500" tokens
cc_response_joules_str = fmt(cc_response_joules_value, precision=0, commas=False)     # e.g. "585" J
cc_smartphone_str = f"{cc_smartphone_joules_value:,}"                                 # e.g. "50,400" J
cc_boiling_str = f"{cc_boiling_water_joules_value:,}"                                 # e.g. "420,000" J
cc_low_util_str = fmt(cc_low_util_pct_value, precision=0, commas=False)               # e.g. "10" %
cc_idle_power_str = fmt(cc_idle_power_w_value, precision=0, commas=False)             # e.g. "300" W
cc_low_util_joules_str = fmt(cc_low_util_joules_value, precision=1, commas=False)     # e.g. "3.5" J/token
```

::: {.callout-notebook #notebook-carbon-chat title="The Carbon Cost of a Chat"}

**Joules per Token: The Green Metric**:
As LLMs scale, energy efficiency becomes a first-class operational metric alongside latency. For an H100 GPU (`{python} h100_tdp` W TDP), we can quantify the energy footprint of serving:

1.  **Throughput**: `{python} cc_concurrent_str` concurrent requests $\times$ `{python} cc_tokens_req_str` tokens/sec/req ≈ **`{python} cc_total_tokens_str` tokens/sec**.
2.  **Power**: `{python} h100_tdp` W (GPU) + `{python} cc_host_overhead_str` W (Host/Overhead) = **`{python} cc_total_power_str` W**.
3.  **Energy per Token**:

    `{python} cc_total_power_str` Joules/sec / `{python} cc_total_tokens_str` tokens/sec ≈ **`{python} cc_joules_token_str` Joules/token**

**The Systems Conclusion**: A typical `{python} cc_response_tokens_str`-token response consumes ≈ **`{python} cc_response_joules_str` Joules**.

- For comparison, charging a smartphone consumes ≈ `{python} cc_smartphone_str` Joules.
- Boiling a cup of water consumes ≈ `{python} cc_boiling_str` Joules.

**The Engineering Lever**: The primary way to reduce Joules/Token is to **increase hardware utilization**. If the GPU sits at `{python} cc_low_util_str`% utilization due to poor batching, the "Idle Power" is still ~`{python} cc_idle_power_str` W, causing the energy-per-token to skyrocket to **>`{python} cc_low_util_joules_str` Joules**. MLOps is not just about speed; it is about sustainability through efficiency.
:::

:::: {.callout-checkpoint title="LLM Serving Fundamentals" collapse="false"}
LLM serving introduces constraints absent from traditional model serving.

- [ ] **TTFT vs. TPOT**: Can you explain why these two metrics capture different user experience aspects (responsiveness vs. fluidity) and why they are governed by different hardware bottlenecks (compute vs. memory bandwidth)?
- [ ] **Memory wall**: Can you explain why adding more compute cores yields zero latency improvement for token generation, and why only faster memory or smaller models help? (The Llama-3 case study in @sec-model-serving-production-case-study-serving-llama38b-0499 quantifies this relationship.)
- [ ] **Continuous batching**: Can you explain why traditional static batching wastes compute when sequence lengths vary, and how iteration-level batching solves this?
- [ ] **PagedAttention**: Can you explain the memory fragmentation problem in KV cache management and how borrowing virtual memory concepts from OS design achieves near-zero waste?
::::

## Inference Runtime Selection {#sec-model-serving-inference-runtime-selection-5eef}

The batching strategies and LLM-specific techniques examined in preceding sections determine *how* requests are grouped and processed. These strategies assume an underlying execution engine that actually runs the model computations—an assumption that matters enormously. The token generation time (@eq-token-generation-time) and the latency budgets established earlier are achievable only if the runtime efficiently maps operations to hardware. The inference runtime, the software layer that orchestrates tensor operations and manages hardware resources, can vary by an order of magnitude in performance for identical models. Choosing appropriately requires understanding the tradeoffs between framework-native serving, general-purpose optimization, and specialized inference engines.

### Runtime Ecosystem and Configuration {#sec-model-serving-frameworknative-serving-da62}

PyTorch and TensorFlow models can serve directly using their native runtimes. This approach maximizes compatibility (any model that trains will serve) and simplifies the deployment pipeline (no export or conversion step). The cost is performance: framework runtimes include training functionality that adds overhead, and default execution paths may not exploit hardware-specific optimizations.

TorchScript and TensorFlow SavedModel formats enable ahead-of-time compilation and graph optimization, improving over eager execution while maintaining framework compatibility. These formats represent the first step toward deployment optimization without abandoning the familiar framework ecosystem.

#### General-Purpose Optimization {#sec-model-serving-generalpurpose-optimization-9ec2}

ONNX Runtime[^fn-onnx-runtime]\index{ONNX Runtime!cross-platform inference} provides a hardware-agnostic optimization layer [@onnxruntime2024]. Models export to ONNX format, then ONNX Runtime applies graph optimizations and selects execution providers for the target hardware. This enables single-format deployment across CPUs, GPUs, and specialized accelerators.

[^fn-onnx-runtime]: **ONNX Runtime**: Microsoft's open-source inference engine, first released in December 2018. Built to execute models in the ONNX (Open Neural Network Exchange) format, it acts as a hardware abstraction layer: the same ONNX model can run on CPUs (via Intel MKL-DNN or ARM optimizations), NVIDIA GPUs (via CUDA or TensorRT), AMD GPUs (via ROCm), or custom accelerators through pluggable "execution providers." ONNX Runtime applies framework-agnostic graph optimizations—constant folding, redundant node elimination, operator fusion—that benefit all hardware targets. This cross-platform capability makes it the most common choice when models must deploy across heterogeneous infrastructure without maintaining separate optimization pipelines per hardware target.

#### Specialized Inference Engines {#sec-model-serving-specialized-inference-engines-475f}

TensorRT\index{Inference Engine!specialized}\index{TensorRT!GPU optimization}[^fn-tensorrt-serving] (NVIDIA GPUs), OpenVINO[^fn-openvino]\index{OpenVINO!Intel optimization} (Intel hardware), and similar engines optimize specifically for their target hardware [@nvidia2024tensorrt; @chen2018tvm]. They apply aggressive optimizations that framework-native runtimes cannot safely perform:

[^fn-openvino]: **OpenVINO (Open Visual Inference and Neural network Optimization)**: Intel's open-source toolkit, first released in May 2018. The name emphasizes its original focus on visual inference (computer vision on Intel hardware), though it now supports all neural network types. OpenVINO's core capability is mapping neural network operations onto Intel-specific instruction sets: AVX-512 and AMX (Advanced Matrix Extensions) on Xeon CPUs, integrated GPU execution via oneAPI, and specialized inference on Intel's Movidius VPUs (Vision Processing Units) for edge deployment. For organizations using Intel infrastructure, OpenVINO achieves 2--5 $\times$ speedup over framework-native CPU inference through hardware-specific kernel implementations and INT8 calibration.

[^fn-tensorrt-serving]: **TensorRT**: NVIDIA's inference optimization SDK that applies layer fusion, kernel auto-tuning, and precision calibration to neural networks. Unlike framework-native runtimes that preserve training-time graph structure, TensorRT rebuilds the computation graph for the specific target GPU during a build phase. This GPU-specific compilation means TensorRT engines are not portable across GPU architectures, requiring separate builds for V100, A100, and H100 deployments. The build phase can take minutes but produces engines that often achieve 2–5 $\times$ speedup over PyTorch.

Layer fusion\index{Layer Fusion!kernel optimization} combines multiple sequential operations into a single GPU kernel. Consider a common pattern: convolution → batch normalization → ReLU activation. Without fusion, this requires three kernel launches, three round-trips to GPU memory (each kernel reads its input and writes its output, so the convolution output is written and re-read for batchnorm, and the batchnorm output is written and re-read for ReLU), and two intermediate tensors that exist only to hand data from one kernel to the next. Fusion combines all three into one kernel that reads inputs once, computes the combined result in registers, and writes final outputs once. This eliminates kernel launch overhead (15–60 μs saved per fusion) and reduces memory traffic by 2–3 $\times$. TensorRT automatically detects and fuses common patterns; a typical ResNet-50 reduces from ~50 kernels to ~15 after fusion.
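
The algebra behind the convolution → batch normalization → ReLU fusion can be checked in a few lines. The NumPy sketch below folds the inference-time batch-norm scale and shift into the preceding layer's weights, using a 1×1 convolution written as a matrix multiply for brevity. It illustrates only the mathematical equivalence: NumPy still executes separate operations internally, and this is not a description of how TensorRT implements its fused kernels. All tensor shapes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, n = 8, 4, 16                # channels in/out, number of pixels
x = rng.standard_normal((n, c_in))
w = rng.standard_normal((c_in, c_out))   # 1x1 convolution expressed as a matmul
b = rng.standard_normal(c_out)
gamma, beta = rng.standard_normal(c_out), rng.standard_normal(c_out)
mean, var, eps = rng.standard_normal(c_out), rng.random(c_out) + 0.5, 1e-5


def unfused(x):
    """Three separate steps, each producing a full-size tensor."""
    y = x @ w + b                                       # convolution
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta  # batch norm (inference)
    return np.maximum(y, 0.0)                           # ReLU


# Fold the batch-norm scale and shift into the convolution parameters once,
# ahead of time, so inference needs a single pass over the data.
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale                   # scales each output channel's weights
b_fused = (b - mean) * scale + beta


def fused(x):
    """One step: matmul, bias, and ReLU with folded parameters."""
    return np.maximum(x @ w_fused + b_fused, 0.0)


assert np.allclose(unfused(x), fused(x))
```

The folding is possible only because batch-norm statistics are frozen at inference time, which is one reason such rewrites belong to a deployment-time build step rather than the training graph.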

Kernel auto-tuning\index{Kernel Auto-Tuning!algorithm selection}\index{Algorithm Selection!convolution implementations} selects the fastest algorithm for each operation on the specific GPU. A single convolution can be implemented using dozens of algorithms (direct, FFT-based, Winograd, various tiling strategies), each optimal for different input sizes and GPU architectures. Auto-tuning benchmarks each candidate and caches the winner, trading compilation time for runtime performance.

These optimizations typically achieve 2–5 $\times$ speedup over framework-native serving but require explicit export and may not support all operations. A *runtime comparison* on a standard model quantifies these gains across the optimization spectrum.

```{python}
#| label: runtime-comparison-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ RUNTIME COMPARISON
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Runtime Comparison" — specialized inference
# │          engines section
# │
# │ Goal: Demonstrate the speedup spectrum across inference runtimes.
# │ Show: That hardware-specific engines (TensorRT) yield up to 9× speedup over eager PyTorch.
# │ How: Compare benchmarked latencies for JIT, ONNX, and TensorRT at multiple precisions.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: rt_pytorch_ms_str, rt_torchscript_ms_str, rt_onnx_ms_str,
# │          rt_trt_fp32_ms_str, rt_trt_fp16_ms_str, rt_trt_int8_ms_str,
# │          rt_pytorch_speedup_str, rt_torchscript_speedup_str,
# │          rt_onnx_speedup_str, rt_trt_fp32_speedup_str,
# │          rt_trt_fp16_speedup_str, rt_trt_int8_speedup_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (latency per runtime, ms, ResNet-50 batch-1 on V100) ---
rt_pytorch_ms_value = 8.5
rt_torchscript_ms_value = 6.2
rt_onnx_ms_value = 5.1
rt_trt_fp32_ms_value = 2.8
rt_trt_fp16_ms_value = 1.4
rt_trt_int8_ms_value = 0.9

# --- Process (speedup relative to PyTorch eager baseline) ---
rt_pytorch_speedup_value = 1.0
rt_torchscript_speedup_value = rt_pytorch_ms_value / rt_torchscript_ms_value
rt_onnx_speedup_value = rt_pytorch_ms_value / rt_onnx_ms_value
rt_trt_fp32_speedup_value = rt_pytorch_ms_value / rt_trt_fp32_ms_value
rt_trt_fp16_speedup_value = rt_pytorch_ms_value / rt_trt_fp16_ms_value
rt_trt_int8_speedup_value = rt_pytorch_ms_value / rt_trt_int8_ms_value

# --- Outputs (formatted strings for table) ---
rt_pytorch_ms_str = f"{rt_pytorch_ms_value}"
rt_torchscript_ms_str = f"{rt_torchscript_ms_value}"
rt_onnx_ms_str = f"{rt_onnx_ms_value}"
rt_trt_fp32_ms_str = f"{rt_trt_fp32_ms_value}"
rt_trt_fp16_ms_str = f"{rt_trt_fp16_ms_value}"
rt_trt_int8_ms_str = f"{rt_trt_int8_ms_value}"

rt_pytorch_speedup_str = fmt(rt_pytorch_speedup_value, precision=1, commas=False)
rt_torchscript_speedup_str = fmt(rt_torchscript_speedup_value, precision=1, commas=False)
rt_onnx_speedup_str = fmt(rt_onnx_speedup_value, precision=1, commas=False)
rt_trt_fp32_speedup_str = fmt(rt_trt_fp32_speedup_value, precision=1, commas=False)
rt_trt_fp16_speedup_str = fmt(rt_trt_fp16_speedup_value, precision=1, commas=False)
rt_trt_int8_speedup_str = fmt(rt_trt_int8_speedup_value, precision=1, commas=False)
```

::: {.callout-notebook title="ResNet-50: Runtime Comparison"}

Performance comparison for ResNet-50 inference on V100 GPU (batch size 1):

| **Runtime**     |                         **Latency** |                                    **Speedup** | **Notes**                 |
|:----------------|------------------------------------:|-----------------------------------------------:|:--------------------------|
| PyTorch (eager) |     `{python} rt_pytorch_ms_str` ms |     `{python} rt_pytorch_speedup_str` $\times$ | Baseline, no optimization |
| TorchScript     | `{python} rt_torchscript_ms_str` ms | `{python} rt_torchscript_speedup_str` $\times$ | JIT compilation           |
| ONNX Runtime    |        `{python} rt_onnx_ms_str` ms |        `{python} rt_onnx_speedup_str` $\times$ | Cross-platform            |
| TensorRT FP32   |    `{python} rt_trt_fp32_ms_str` ms |    `{python} rt_trt_fp32_speedup_str` $\times$ | NVIDIA-specific           |
| TensorRT FP16   |    `{python} rt_trt_fp16_ms_str` ms |    `{python} rt_trt_fp16_speedup_str` $\times$ | Tensor Core acceleration  |
| TensorRT INT8   |    `{python} rt_trt_int8_ms_str` ms |    `{python} rt_trt_int8_speedup_str` $\times$ | Requires calibration      |

**Key insight**: The `{python} rt_trt_int8_speedup_str` $\times$ speedup from TensorRT INT8 comes at the cost of: (1) quantization calibration data, (2) potential accuracy loss (<1% for ResNet-50), and (3) NVIDIA-specific deployment.

:::

The optimization-compatibility tradeoff is inherent. More aggressive optimization yields better performance yet increases deployment complexity and may introduce numerical differences from training. The choice depends on latency requirements, deployment constraints, and available engineering resources.

#### Runtime Configuration {#sec-model-serving-runtime-configuration-492b}

Beyond runtime selection, configuration choices significantly impact serving performance. Thread pool sizing controls parallelism for CPU inference—too few threads leave cores idle, while too many cause contention. Memory allocation strategies (pre-allocated buffers versus dynamic allocation) trade startup cost against flexibility. Execution provider selection prioritizes which hardware backends handle each operation, and graph optimization level trades compilation time for runtime performance. Production deployments require systematic experimentation to find optimal configurations for specific models and hardware combinations, measuring their impact on latency distributions rather than relying on defaults.
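
A configuration sweep of the kind described above needs little more than a timing loop and percentile summary. The sketch below is a minimal harness under stated assumptions: `latency_percentiles` and `sweep` are hypothetical helpers, and the workload is a stand-in matrix multiply whose repetition count plays the role of a configuration knob. A real sweep would construct a runtime session per configuration instead.

```python
import time

import numpy as np


def latency_percentiles(run_once, warmup=5, trials=50):
    """Time `run_once` repeatedly and return p50/p95/p99 latency in ms."""
    for _ in range(warmup):           # discard cold-start effects
        run_once()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}


def sweep(configs, make_runner):
    """Measure every configuration; `make_runner(cfg)` returns a zero-argument
    callable that performs one inference under that configuration."""
    return {name: latency_percentiles(make_runner(cfg))
            for name, cfg in configs.items()}


# HYPOTHETICAL stand-in workload, not a real runtime configuration.
x = np.ones((256, 256))
configs = {"small": {"reps": 1}, "large": {"reps": 4}}


def make_runner(cfg):
    return lambda: [x @ x for _ in range(cfg["reps"])]


results = sweep(configs, make_runner)
for name, stats in results.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Reporting p95 and p99 alongside the median matters because some settings, such as oversubscribed thread pools, can leave the median unchanged while lengthening the tail.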

### Precision Selection for Serving {#sec-model-serving-precision-selection-serving-55ba}

A team deploying ResNet-50 on V100 GPUs faces a concrete constraint: their 30-GPU cluster costs \$90/hour, and business growth requires 3 $\times$ more throughput without expanding the fleet. Switching from FP32 to INT8 inference achieves exactly this—the same model on the same hardware serves 3 $\times$ more requests per second, reducing the effective cost per inference by two-thirds, at a cost of less than 0.4 percentage points of accuracy. This example illustrates the direct connection between numerical precision and infrastructure economics. Precision selection connects to the quantization techniques covered in @sec-model-compression. For the foundational comparison of numerical formats (FP32, FP16, BF16, FP8, INT8) and their precision-range trade-offs, see @sec-machine-foundations-numerical-representations-c889; for the mechanics of symmetric and asymmetric integer quantization, see @sec-machine-foundations-integer-quantization-5442. While @sec-model-compression focuses on training-time quantization, serving introduces additional considerations including calibration requirements, layer sensitivity, and dynamic precision selection.
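
The economics of the opening example reduce to a one-line formula. The sketch below uses the \$90/hour cluster cost, 30 GPUs, and 3 $\times$ throughput gain from the text; the baseline FP32 throughput of 1,000 requests per second per GPU is a hypothetical figure needed only to produce absolute dollar values, and the relative saving does not depend on it.

```python
def cost_per_million(cluster_cost_per_hour: float,
                     cluster_throughput_rps: float) -> float:
    """Dollar cost of serving one million requests at a steady throughput."""
    requests_per_hour = cluster_throughput_rps * 3600
    return cluster_cost_per_hour / requests_per_hour * 1e6


CLUSTER_COST = 90.0   # $/hour for the 30-GPU cluster (from the text)
NUM_GPUS = 30         # from the text
INT8_SPEEDUP = 3.0    # throughput gain from FP32 to INT8 (from the text)

# ASSUMPTION: hypothetical FP32 throughput of 1,000 requests/s per GPU.
fp32_rps = NUM_GPUS * 1000.0
int8_rps = fp32_rps * INT8_SPEEDUP

fp32_cost = cost_per_million(CLUSTER_COST, fp32_rps)
int8_cost = cost_per_million(CLUSTER_COST, int8_rps)
saving = 1 - int8_cost / fp32_cost   # independent of the assumed throughput

print(f"FP32: ${fp32_cost:.3f} per million requests")
print(f"INT8: ${int8_cost:.3f} per million requests")
print(f"Saving: {saving:.0%}")
```

Because the fleet cost is fixed, any throughput multiplier of $k$ cuts cost per inference by $1 - 1/k$, so the marginal benefit shrinks as $k$ grows: going from 1 $\times$ to 3 $\times$ saves two-thirds, while going from 3 $\times$ to 4 $\times$ saves only a further 8 percentage points.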

#### Precision-Throughput Relationship {#sec-model-serving-precisionthroughput-relationship-b503}

For\index{Precision!throughput tradeoff}\index{Quantization!serving precision} memory-bandwidth-bound operations\index{Memory-Bandwidth Bound!precision impact}, reducing precision proportionally increases throughput by reducing data movement. @eq-precision-throughput quantifies the theoretical maximum speedup from precision reduction:

$$
\frac{\text{Throughput}_{\text{INT8}}}{\text{Throughput}_{\text{FP32}}} = \frac{32}{8} = 4\times \text{ (theoretical maximum)}
$$ {#eq-precision-throughput}

In practice, GPU compute pipelines and Tensor Core alignment requirements limit achieved speedup to 2.5–3.5 $\times$ for INT8 versus FP32. Tensor Cores\index{Tensor Cores!alignment requirements} require specific alignment: INT8 operations need tensor dimensions divisible by 16, while FP16 requires divisibility by 8. @sec-hardware-acceleration provides the detailed Tensor Core architecture that explains these alignment constraints. The *precision tradeoffs* for a standard vision model illustrate how these theoretical limits manifest in practice.

```{python}
#| label: precision-tradeoff-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PRECISION TRADEOFFS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Precision Tradeoffs on V100" — precision-
# │          throughput relationship section
# │
# │ Goal: Quantify the three-way trade-off between latency, memory, and accuracy.
# │ Show: That FP16 is a "free lunch" (2× speedup) while INT8 trades marginal accuracy for 3× gains.
# │ How: Contrast ResNet-50 metrics across FP32, FP16, and INT8 precisions.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: pt_fp32_ms_str, pt_fp32_mem_mb_str, pt_fp32_acc_str,
# │          pt_fp16_ms_str, pt_fp16_mem_mb_str, pt_fp16_acc_str,
# │          pt_fp16_util_str, pt_int8_ms_str, pt_int8_mem_mb_str,
# │          pt_int8_ptq_acc_str, pt_int8_qat_acc_str, pt_int8_util_str,
# │          pt_int8_speedup_str, pt_fp16_speedup_str, pt_int8_acc_loss_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (precision metrics for ResNet-50 on V100) ---
pt_fp32_ms_value = 2.8
pt_fp32_mem_mb_value = 98
pt_fp32_acc_value = 76.13

pt_fp16_ms_value = 1.4
pt_fp16_mem_mb_value = 49
pt_fp16_acc_value = 76.13
pt_fp16_util_value = 85

pt_int8_ms_value = 0.9
pt_int8_mem_mb_value = 25
pt_int8_ptq_acc_value = 75.80
pt_int8_qat_acc_value = 76.05
pt_int8_util_value = 92

# --- Process (speedup ratios and accuracy loss) ---
pt_int8_speedup_value = pt_fp32_ms_value / pt_int8_ms_value
pt_fp16_speedup_value = pt_fp32_ms_value / pt_fp16_ms_value
pt_int8_acc_loss_value = pt_fp32_acc_value - pt_int8_ptq_acc_value

# --- Outputs (formatted strings for table) ---
pt_fp32_ms_str = f"{pt_fp32_ms_value}"
pt_fp32_mem_mb_str = f"{pt_fp32_mem_mb_value}"
pt_fp32_acc_str = f"{pt_fp32_acc_value}"

pt_fp16_ms_str = f"{pt_fp16_ms_value}"
pt_fp16_mem_mb_str = f"{pt_fp16_mem_mb_value}"
pt_fp16_acc_str = f"{pt_fp16_acc_value}"
pt_fp16_util_str = f"{pt_fp16_util_value}"

pt_int8_ms_str = f"{pt_int8_ms_value}"
pt_int8_mem_mb_str = f"{pt_int8_mem_mb_value}"
pt_int8_ptq_acc_str = f"{pt_int8_ptq_acc_value:.2f}"
pt_int8_qat_acc_str = f"{pt_int8_qat_acc_value:.2f}"
pt_int8_util_str = f"{pt_int8_util_value}"

pt_int8_speedup_str = fmt(pt_int8_speedup_value, precision=1, commas=False)
pt_fp16_speedup_str = fmt(pt_fp16_speedup_value, precision=0, commas=False)
pt_int8_acc_loss_str = f"{pt_int8_acc_loss_value:.2f}"
```

::: {.callout-notebook title="ResNet-50: Precision Tradeoffs on V100"}

| **Precision**  |                 **Latency** |                      **Memory** |                    **Accuracy** |        **Tensor Core Util.** | **Calibration** |
|:---------------|----------------------------:|--------------------------------:|--------------------------------:|-----------------------------:|:----------------|
| **FP32**       | `{python} pt_fp32_ms_str`ms | `{python} pt_fp32_mem_mb_str`MB |     `{python} pt_fp32_acc_str`% |                           0% | None            |
| **FP16**       | `{python} pt_fp16_ms_str`ms | `{python} pt_fp16_mem_mb_str`MB |     `{python} pt_fp16_acc_str`% | `{python} pt_fp16_util_str`% | None            |
| **INT8 (PTQ)** | `{python} pt_int8_ms_str`ms | `{python} pt_int8_mem_mb_str`MB | `{python} pt_int8_ptq_acc_str`% | `{python} pt_int8_util_str`% | 1,000 samples   |
| **INT8 (QAT)** | `{python} pt_int8_ms_str`ms | `{python} pt_int8_mem_mb_str`MB | `{python} pt_int8_qat_acc_str`% | `{python} pt_int8_util_str`% | Full retraining |

**Key observations:**

- INT8 achieves `{python} pt_int8_speedup_str` $\times$ speedup but loses `{python} pt_int8_acc_loss_str` percentage points of accuracy with post-training quantization (PTQ)
- Quantization-aware training (QAT) recovers most accuracy but requires retraining
- FP16 provides `{python} pt_fp16_speedup_str` $\times$ speedup with no accuracy loss for most models

:::
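
The bound in @eq-precision-throughput and the Tensor Core alignment rule can be checked with a few lines of arithmetic. The sketch below uses the bit widths and alignment multiples quoted in this section together with the FP32 and INT8 latencies from the table; the helper names are illustrative only.

```python
# Bit widths of the formats discussed in this section.
BITS = {"FP32": 32, "FP16": 16, "INT8": 8}

# Alignment multiples quoted in the text for Tensor Core eligibility.
ALIGNMENT = {"FP16": 8, "INT8": 16}

def theoretical_speedup(src: str, dst: str) -> float:
    """Upper bound for a bandwidth-bound kernel: ratio of bits moved."""
    return BITS[src] / BITS[dst]

def pad_to_alignment(dim: int, precision: str) -> int:
    """Round a tensor dimension up to the multiple required by `precision`."""
    multiple = ALIGNMENT[precision]
    return ((dim + multiple - 1) // multiple) * multiple

def is_aligned(shape, precision: str) -> bool:
    """True when every dimension satisfies the alignment rule."""
    return all(d % ALIGNMENT[precision] == 0 for d in shape)

bound = theoretical_speedup("FP32", "INT8")   # 4.0
achieved = 2.8 / 0.9                          # FP32 ms / INT8 ms from the table
print(f"bound {bound:.1f}x, achieved {achieved:.1f}x, "
      f"fraction of bound {achieved / bound:.0%}")

# A 1000-class output layer is FP16-aligned but not INT8-aligned.
print(is_aligned((64, 1000), "FP16"), is_aligned((64, 1000), "INT8"))
print(pad_to_alignment(1000, "INT8"))         # 1008
```

Padding a dimension such as the 1000-class output to 1008 is one common way to satisfy the INT8 rule, at the cost of a small amount of wasted compute.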

#### Layer Sensitivity {#sec-model-serving-layer-sensitivity-6a31}

Not\index{Quantization!layer sensitivity}\index{Layer Sensitivity!precision tolerance} all layers tolerate reduced precision equally. Empirically, quantization error for a layer scales with weight magnitude and gradient sensitivity, captured by the following proportionality in @eq-quant-error:

$$\epsilon_{\text{quant}} \propto \alpha \cdot \|W\|_2 \cdot 2^{-b}$$ {#eq-quant-error}

where $\alpha$ is a layer-specific sensitivity coefficient (determined empirically or via Fisher information), $\|W\|_2$ is the weight L2 norm, and $b$ is the bit width. This explains three observed patterns: first convolutional layers, with high gradients and large sensitivity coefficients, are precision-sensitive and often kept at FP16; middle layers, with stable gradients and low sensitivity coefficients, tolerate INT8 well; and final classification layers, with small weights but high task sensitivity, benefit from FP16 or higher precision.
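
One way to turn @eq-quant-error into a mixed-precision plan is to evaluate the right-hand side per layer and keep any layer above a chosen budget at FP16. The sketch below is illustrative only: the sensitivity coefficients, weight values, and error budget are invented for the example, and because the equation is a proportionality the resulting numbers are useful for ranking layers, not as absolute errors.

```python
import math

def quant_error_proxy(alpha: float, weights, bits: int) -> float:
    """Right-hand side of the proportionality: alpha * ||W||_2 * 2^(-b)."""
    l2_norm = math.sqrt(sum(w * w for w in weights))
    return alpha * l2_norm * 2.0 ** (-bits)

def assign_precision(layers: dict, budget: float) -> dict:
    """Use INT8 where the 8-bit proxy fits the budget, otherwise FP16."""
    plan = {}
    for name, (alpha, weights) in layers.items():
        fits = quant_error_proxy(alpha, weights, bits=8) <= budget
        plan[name] = "INT8" if fits else "FP16"
    return plan

# Hypothetical layers: (sensitivity coefficient, flattened weights).
layers = {
    "first_conv": (4.0, [0.5] * 64),     # large alpha, large weights
    "middle_block": (0.5, [0.2] * 256),  # low alpha
    "classifier": (6.0, [0.05] * 100),   # small weights, high alpha
}

plan = assign_precision(layers, budget=0.01)
print(plan)
```

With these made-up values the plan reproduces the pattern described above: the first and last layers stay at FP16 while the middle block drops to INT8.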

#### Calibration Requirements {#sec-model-serving-calibration-requirements-06b0}

Post-training\index{Calibration!INT8 quantization}\index{Post-Training Quantization!calibration}\index{Calibration Dataset!representative traffic} quantization requires a calibration dataset to determine optimal scale factors for INT8 conversion. Production experience shows that calibration data must be representative of actual serving traffic, not just training data. Using ImageNet validation images to calibrate a model serving wildlife camera images resulted in 3.2% accuracy degradation in one production system.
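
The effect of unrepresentative calibration data can be seen with a toy symmetric quantizer. The sketch below assumes max-absolute-value calibration, which is only one of several strategies, and uses synthetic activation ranges chosen for illustration: the "benchmark-like" samples span [-1, 1] while the "serving-like" samples span [-4, 4].

```python
def symmetric_scale(calibration_values, qmax: int = 127) -> float:
    """Scale that maps the largest observed magnitude to qmax."""
    return max(abs(v) for v in calibration_values) / qmax

def quantize_dequantize(x: float, scale: float, qmax: int = 127) -> float:
    """Round to the INT8 grid, clamp to the representable range, map back."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

def mean_abs_error(values, scale: float) -> float:
    """Average reconstruction error over a set of activations."""
    return sum(abs(v - quantize_dequantize(v, scale)) for v in values) / len(values)

# Synthetic activations (assumed ranges, for illustration only).
benchmark_like = [i / 100 for i in range(-100, 101)]  # spans [-1, 1]
serving_like = [i / 25 for i in range(-100, 101)]     # spans [-4, 4]

mismatched_scale = symmetric_scale(benchmark_like)
matched_scale = symmetric_scale(serving_like)

err_mismatched = mean_abs_error(serving_like, mismatched_scale)
err_matched = mean_abs_error(serving_like, matched_scale)
print(f"mismatched: {err_mismatched:.4f}  matched: {err_matched:.4f}")
```

The mismatched scale clips every serving activation beyond the calibrated range, so its error is dominated by saturation rather than rounding.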

#### Dynamic Precision Selection {#sec-model-serving-dynamic-precision-selection-dc60}

Advanced\index{Dynamic Precision!adaptive quality} serving systems select precision per request based on runtime conditions. If a request is ahead of its latency SLO, the system uses higher precision for better accuracy. When an INT8 result has low confidence, the system recomputes it at FP16. Different customer tiers may also receive different precision levels. This pattern enables adaptive quality-latency tradeoffs while maximizing throughput during normal operation.
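
A per-request policy of this kind might look like the following sketch. The latencies reuse the INT8 and FP16 figures from the ResNet-50 table; the tier names, the 2 $\times$ safety margin, and the 0.6 confidence threshold are assumptions made for the example, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    slo_ms: float       # end-to-end latency objective
    elapsed_ms: float   # time already spent (queueing, preprocessing)
    tier: str           # hypothetical customer tier: "premium" or "standard"

# Per-precision inference latency in ms (from the ResNet-50 table above).
LATENCY_MS = {"INT8": 0.9, "FP16": 1.4}

def initial_precision(ctx: RequestContext) -> str:
    """Prefer FP16 when there is latency slack; otherwise fall back to INT8."""
    slack = ctx.slo_ms - ctx.elapsed_ms
    if ctx.tier == "premium" and slack >= LATENCY_MS["FP16"]:
        return "FP16"
    # Standard tier needs a 2x safety margin before taking the slower path.
    if slack >= 2 * LATENCY_MS["FP16"]:
        return "FP16"
    return "INT8"

def should_recompute(precision: str, confidence: float,
                     ctx: RequestContext, threshold: float = 0.6) -> bool:
    """Re-run a low-confidence INT8 result at FP16 if the budget still allows."""
    slack_after_int8 = ctx.slo_ms - ctx.elapsed_ms - LATENCY_MS["INT8"]
    return (precision == "INT8"
            and confidence < threshold
            and slack_after_int8 >= LATENCY_MS["FP16"])

tight = RequestContext(slo_ms=50, elapsed_ms=47.5, tier="standard")
chosen = initial_precision(tight)
print(chosen, should_recompute(chosen, confidence=0.4, ctx=tight))
```

Keeping the policy as two pure functions makes it easy to unit-test the decision boundaries separately from the inference path.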

The precision decision has direct infrastructure consequences: INT8 inference achieves roughly 3 $\times$ higher throughput than FP32, meaning a workload requiring 30 GPUs at FP32 needs only 10 at INT8. This 3 $\times$ reduction in hardware translates directly to a 3 $\times$ reduction in operating costs. The connection between model-level optimization and infrastructure economics is why precision selection cannot be treated as purely a model concern.

Runtime selection and precision tuning operate at the model level: they determine *what* computation runs and at *what* numerical format. But between the model and the silicon lies another optimization layer—the mechanics of how computation graphs compile to kernels, how bytes move from disk to memory, and how the CPU and GPU coordinate their work. These node-level techniques often yield the final 2–5 $\times$ that separates a functional prototype from a production-grade serving node.

## Node-Level Optimization {#sec-model-serving-nodelevel-optimization-3d9d}

Runtime selection and precision tuning establish the software foundation for serving. Achieving peak efficiency requires going deeper: understanding *how* the hardware executes each operation and *where* every microsecond goes. This section explores optimizations that occur at the boundaries of software and silicon: compiling the computation graph\index{Graph Compilation!optimization}, exploiting CPU capabilities when GPUs are absent, minimizing the time to get bytes from disk to memory, and profiling execution to locate the idle time that remains.

### Runtime Graph Compilation {#sec-model-serving-runtime-graph-compilation-7a7e}

Inference engines like TensorRT were introduced in @sec-model-serving-inference-runtime-selection-5eef. How do they achieve 2–5 $\times$ speedups? The answer lies in **Graph Compilation**. Training computation graphs are dynamic and mutable, whereas serving graphs are static. This static nature allows compilers to perform aggressive optimizations that would be unsafe or too slow during training.

#### Operator Fusion {#sec-model-serving-operator-fusion-f8d2}

The most potent graph-level optimization is operator fusion\index{Operator Fusion!graph compilation}. As discussed in @sec-hardware-acceleration, memory bandwidth often limits performance more than compute. Fusion collapses multiple operations (e.g., `Conv2D` -> `BiasAdd` -> `ReLU`) into a single kernel launch. This keeps intermediate data in the GPU's fast L1/L2 cache or registers, avoiding round-trips to global memory (VRAM).
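
A rough count of global-memory traffic shows why fusion helps. The sketch below is a simplified model under stated assumptions: it counts only the bytes for the producer's output and the trailing elementwise operations (`BiasAdd`, `ReLU`), ignores the convolution's own input and weight reads, and uses an arbitrary 64 $\times$ 56 $\times$ 56 FP16 activation as the example shape.

```python
def activation_traffic_bytes(num_elements: int, bytes_per_element: int,
                             elementwise_ops: int, fused: bool) -> int:
    """Global-memory bytes for a producer's output plus trailing elementwise ops."""
    producer_writes = num_elements
    if fused:
        # Elementwise ops run on values still in registers; only the final
        # result is written back to global memory.
        return producer_writes * bytes_per_element
    # Unfused: each elementwise op reads its input and writes its output.
    round_trips = 2 * elementwise_ops * num_elements
    return (producer_writes + round_trips) * bytes_per_element

elements = 64 * 56 * 56  # one FP16 activation map, batch size 1 (assumed shape)
unfused = activation_traffic_bytes(elements, 2, elementwise_ops=2, fused=False)
fused = activation_traffic_bytes(elements, 2, elementwise_ops=2, fused=True)
print(unfused, fused, unfused / fused)
```

Under this model, fusing two elementwise operations into the producer cuts activation traffic by a factor of five, which is the mechanism behind the speedups attributed to fusion for memory-bound layers.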

#### Constant Folding {#sec-model-serving-constant-folding-c652}

Parts of the graph that depend only on model weights\index{Constant Folding!compile-time optimization}\index{Compile-Time Optimization!constant folding}—which are constant during serving—can be pre-computed at compile time. For example, if a model contains `x * (sqrt(2) / 2)`, the compiler replaces the division and square root with a single multiplication by `0.707...`.
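
The idea can be demonstrated on a Python expression tree instead of a model graph. The sketch below is a toy: it uses the standard-library `ast` module as a stand-in for a compiler IR, treats only `sqrt` as a known pure function, and folds any subtree whose inputs are all numeric literals. Production compilers apply the same bottom-up rule to tensor operations whose inputs are frozen weights.

```python
import ast
import math
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_PURE_FUNCS = {"sqrt": math.sqrt}  # functions assumed safe to evaluate early

def _is_number(node) -> bool:
    return isinstance(node, ast.Constant) and isinstance(node.value, (int, float))

class ConstantFolder(ast.NodeTransformer):
    """Replace subtrees whose inputs are all literals with their value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        fn = _BIN_OPS.get(type(node.op))
        if fn and _is_number(node.left) and _is_number(node.right):
            try:
                value = fn(node.left.value, node.right.value)
            except ZeroDivisionError:
                return node  # leave the error for runtime
            return ast.copy_location(ast.Constant(value), node)
        return node

    def visit_Call(self, node):
        self.generic_visit(node)
        if (isinstance(node.func, ast.Name) and node.func.id in _PURE_FUNCS
                and not node.keywords and all(_is_number(a) for a in node.args)):
            value = _PURE_FUNCS[node.func.id](*(a.value for a in node.args))
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold(expr: str) -> str:
    """Return `expr` with constant subexpressions pre-computed."""
    tree = ConstantFolder().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+

folded = fold("x * (sqrt(2) / 2)")
print(folded)
```

The variable `x` blocks folding of the outer multiplication, so only the constant factor is collapsed, mirroring the `0.707...` example above.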

#### Memory Planning {#sec-model-serving-memory-planning-4cef}

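Since the graph structure is known, the compiler can pre-calculate the exact memory offsets for every tensor\index{Memory Planning!tensor allocation}, so the runtime serves each request from a fixed, pre-sized arena instead of allocating dynamically. Deciding when this compilation work happens leads to the central architectural choice of *JIT vs. AOT compilation*.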

::: {.callout-notebook title="JIT vs. AOT Compilation"}
*   **Just-In-Time (JIT)**\index{JIT Compilation!runtime optimization}: Compiles the graph the first time it is run (e.g., `torch.compile`).
    *   *Pros*: Optimizes for the specific input shapes seen at runtime.
    *   *Cons*: First request pays a "compilation penalty" (latency spike).
*   **Ahead-of-Time (AOT)**\index{AOT Compilation!pre-deployment}: Compiles the graph before deployment (e.g., `torch.export`, TensorRT `trtexec`).
    *   *Pros*: Zero compilation latency at startup; guarantees a fixed graph.
    *   *Cons*: Must handle all dynamic shapes explicitly or compile multiple profiles.
:::
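
The latency consequences of the two strategies can be illustrated with a toy cache of shape-specialized graphs. No real compiler is invoked in this sketch: `CompiledGraphCache` is a hypothetical class, and the 30-second compile time and 3 ms run time are placeholder values chosen only to make the first-request penalty visible.

```python
class CompiledGraphCache:
    """Toy model of shape-specialized compilation (no real compiler involved)."""

    def __init__(self, compile_ms: float, run_ms: float, precompiled_shapes=()):
        self.compile_ms = compile_ms
        self.run_ms = run_ms
        # AOT: these shape profiles exist before the server accepts traffic.
        self.compiled = set(precompiled_shapes)

    def serve(self, shape: tuple, allow_jit: bool = True) -> float:
        """Return the request latency in ms for an input of `shape`."""
        if shape in self.compiled:
            return self.run_ms
        if not allow_jit:
            raise ValueError(f"no precompiled profile for shape {shape}")
        # JIT: the first request with a new shape pays the compile penalty.
        self.compiled.add(shape)
        return self.compile_ms + self.run_ms

single = (1, 3, 224, 224)
batch8 = (8, 3, 224, 224)

jit = CompiledGraphCache(compile_ms=30_000, run_ms=3)
jit_first = jit.serve(single)    # compile + run
jit_second = jit.serve(single)   # run only

aot = CompiledGraphCache(compile_ms=30_000, run_ms=3,
                         precompiled_shapes=[single, batch8])
aot_first = aot.serve(single, allow_jit=False)
print(jit_first, jit_second, aot_first)
```

An AOT deployment that receives a shape outside its compiled profiles fails (or falls back) instead of stalling, which is why the set of supported shapes has to be enumerated up front.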

### CPU Inference Optimization {#sec-model-serving-cpu-inference-optimization-ae86}

GPUs dominate the narrative, yet CPUs\index{CPU Inference!when to use} remain the workhorse for a vast number of inference workloads, particularly for smaller models, latency-insensitive batch jobs, or cost-constrained environments. Optimizing for the CPU requires a different mindset.

#### SIMD and Vectorization {#sec-model-serving-simd-vectorization-6086}

Modern CPUs[^fn-simd-serving]\index{SIMD!CPU vectorization}\index{AVX-512!vector instructions} (Intel Xeon, AMD EPYC) pack powerful vector units (AVX-512, AMX). Standard Python loops cannot use these. Specialized runtimes like **OpenVINO** or **Intel Extension for PyTorch (IPEX)** map neural network operators directly to these vector instructions, achieving order-of-magnitude speedups over vanilla implementations.

[^fn-simd-serving]: **SIMD (Single Instruction, Multiple Data)**: A parallel computing classification from Michael Flynn's 1966 taxonomy [@flynn1966very] of computer architectures. SIMD enables one instruction to operate on multiple data elements simultaneously—for example, adding eight pairs of floating-point numbers in a single clock cycle. Intel's AVX-512 (Advanced Vector Extensions, 2016) processes 512 bits (16 floats) per instruction; AMX (Advanced Matrix Extensions, 2023) extends this to matrix tile operations. For CPU inference, SIMD exploitation is the primary lever: a naively written matrix multiplication uses scalar operations at ~1% of theoretical peak, while SIMD-optimized kernels approach 80--90% utilization.

#### Thread Pinning and NUMA {#sec-model-serving-thread-pinning-numa-dc2b}

On multi-socket servers[^fn-numa-serving]\index{NUMA!memory locality}\index{Thread Pinning!CPU affinity}, accessing memory attached to a different CPU socket (NUMA) adds significant latency. Inference servers must be "NUMA-aware," pinning threads to specific cores and ensuring that memory allocations remain local to those cores.

[^fn-numa-serving]: **NUMA (Non-Uniform Memory Access)**: A memory architecture where access latency depends on the physical proximity of the processor to the memory bank. The term was coined in the 1990s as multi-processor systems grew beyond what a single shared memory bus could serve. In a two-socket server, each CPU has "local" memory (accessed in ~80 ns) and "remote" memory attached to the other socket (accessed in ~130 ns via the interconnect, a 60% penalty). For ML inference, NUMA-unaware memory allocation causes model weights to reside on the "wrong" socket, adding 40--60% latency overhead to every weight access. Linux's `numactl` tool and the `libnuma` library enable NUMA-aware process placement.
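
On Linux, CPU affinity can be set from Python with the standard-library call `os.sched_setaffinity`. The sketch below pins the current process to a chosen set of cores and then restores the original mask; `pin_to_cores` is a hypothetical helper, and which core IDs belong to which socket is machine-specific and not determined here. Pinning alone controls where threads run; keeping allocations on the same socket additionally relies on the kernel's memory policy or on tools such as `numactl`.

```python
import os

def pin_to_cores(cores):
    """Restrict the current process to `cores`; return the resulting mask.

    Returns None on platforms without sched_setaffinity (e.g. macOS, Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 means "this process"
    target = set(cores) & allowed
    if not target:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

pinned = None
if hasattr(os, "sched_getaffinity"):
    original = os.sched_getaffinity(0)
    # Pin to a single core that is known to be available, then restore.
    pinned = pin_to_cores([min(original)])
    os.sched_setaffinity(0, original)
print(pinned)
```

In a serving process the pinning would normally happen once at startup, before the model is loaded, so that weight buffers are first touched by threads on the intended socket.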

#### Small Batch Advantage {#sec-model-serving-small-batch-advantage-c91c}

CPUs often outperform GPUs at batch size 1 for small models\index{Small Batch!CPU advantage}. The overhead of launching a GPU kernel (~10 $\mu$s) and transferring data (~50 $\mu$s) can exceed the compute time for a tiny dense layer. For models under 50 MB serving single requests, a well-optimized CPU runtime often delivers lower latency than a GPU.
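
The break-even reasoning can be written down directly. The sketch below uses the launch and transfer overheads quoted above; the compute times and kernel counts passed in are invented inputs for illustration.

```python
GPU_LAUNCH_US = 10.0    # per-kernel launch overhead quoted in the text
GPU_TRANSFER_US = 50.0  # host-to-device transfer quoted in the text

def gpu_latency_us(num_kernels: int, gpu_compute_us: float) -> float:
    """Fixed overheads plus compute for one batch-1 request on the GPU."""
    return GPU_TRANSFER_US + num_kernels * GPU_LAUNCH_US + gpu_compute_us

def cpu_wins(cpu_compute_us: float, num_kernels: int,
             gpu_compute_us: float) -> bool:
    """True when the CPU path is faster end to end."""
    return cpu_compute_us < gpu_latency_us(num_kernels, gpu_compute_us)

# Tiny dense layer (hypothetical timings): overhead dominates on the GPU.
print(cpu_wins(cpu_compute_us=40, num_kernels=1, gpu_compute_us=5))

# Larger model (hypothetical timings): compute dominates and the GPU wins.
print(cpu_wins(cpu_compute_us=2000, num_kernels=20, gpu_compute_us=100))
```

Because the launch term scales with kernel count, operator fusion shifts this break-even point in the GPU's favor.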

### Model Serialization and Fast Loading {#sec-model-serving-fast-model-loading-1109}

In autoscaling systems, the time to spin up a new node is critical. A major component of "Cold Start" (@sec-model-serving-model-loading-initialization-cc5a) is simply reading the model weights from disk into memory. The choice of serialization format determines how quickly this loading can occur.

The standard PyTorch `torch.load()` uses Python's `pickle` format\index{Pickle!loading overhead}. This approach is inefficient because it requires the CPU to unpickle objects one by one, copy them into memory, and then often copy them *again* to the GPU. A faster alternative is memory mapping\index{mmap!zero-copy loading}, which allows the OS to map a file directly into the process's virtual address space. The data is effectively "loaded" only when accessed, and the OS handles the transfer from disk to RAM efficiently.

Building on this zero-copy principle, Safetensors[^fn-safetensors]\index{Safetensors!zero-copy loading} is a modern format designed specifically for fast loading. It stores tensors as raw bytes with a minimal JSON header. This enables zero-copy\index{Zero-Copy!model loading} loading: the raw bytes on disk are mapped directly into the tensor's memory buffer.

::: {.callout-example title="Loading Speed: Safetensors vs. Pickle"}
Loading a 5GB Stable Diffusion model:

*   **Pickle (`torch.load`)**: ~15 seconds. High CPU usage.
*   **Safetensors**: ~0.5 seconds. Near-zero CPU usage.

By using `mmap` and formats like `safetensors`, loading speed becomes limited by storage read bandwidth (e.g., 3 GB/s for NVMe, or memory speed when the file is already in the OS page cache) rather than by CPU parsing overhead.
:::
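
The file layout that makes this possible is simple enough to parse with the standard library: an 8-byte little-endian header length, a JSON header giving each tensor's `dtype`, `shape`, and `data_offsets`, and then the raw bytes. The sketch below writes a one-tensor file in that layout and reads it back through `mmap` without copying the payload. The helper names are hypothetical, the reader assumes a little-endian host, and production code should use the `safetensors` library, which also handles metadata and validation.

```python
import json
import mmap
import os
import struct
import tempfile

def write_minimal_safetensors(path, name, values):
    """Write one float32 vector in the safetensors layout (demo only)."""
    payload = struct.pack(f"<{len(values)}f", *values)
    header = json.dumps({
        name: {"dtype": "F32", "shape": [len(values)],
               "data_offsets": [0, len(payload)]}
    }).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))  # 8-byte header length
        f.write(header)
        f.write(payload)

def open_tensor(path, name):
    """Memory-map the file; return (mapping, float32 view of one tensor)."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (header_len,) = struct.unpack_from("<Q", mapped, 0)
    header = json.loads(mapped[8:8 + header_len])
    begin, end = header[name]["data_offsets"]  # relative to end of header
    data_start = 8 + header_len
    # The view aliases the mapped pages; no tensor bytes are copied here.
    view = memoryview(mapped)[data_start + begin:data_start + end].cast("f")
    return mapped, view

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.safetensors")
    write_minimal_safetensors(path, "w", [1.0, 2.0, 3.5])
    mapped, view = open_tensor(path, "w")
    loaded = view.tolist()
    view.release()   # release the buffer export before closing the mapping
    mapped.close()
print(loaded)
```

The only parsing work is the small JSON header; everything after it is addressed by offset, which is why CPU usage stays near zero regardless of model size.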

[^fn-safetensors]: **Safetensors**: Created by Hugging Face engineer Nicolas Patry and released in 2022. The name emphasizes safety: unlike Python's pickle format (used by `torch.save`), safetensors cannot execute arbitrary code during deserialization, eliminating a class of security vulnerabilities where malicious model files could compromise a serving system. The format stores tensors as contiguous raw bytes with a small JSON header describing shapes and data types, enabling memory-mapped loading where the OS maps the file directly into the process's address space. This design achieves 30--100 $\times$ faster loading than pickle because no parsing, object construction, or memory copying occurs.

### Profiling the Serving Node {#sec-model-serving-profiling-serving-node-1e99}

Optimization without measurement is guesswork. The system efficiency metric defined in @eq-system-efficiency provides the target—maximizing the fraction of wall-clock time the accelerator spends on useful computation—but achieving it requires visualizing the execution flow to find where time is lost.

#### The Timeline View {#sec-model-serving-timeline-view-6159}

Tools\index{Timeline Profiling!serving optimization} like **PyTorch Profiler**\index{Framework Profiler!timeline analysis} or NVIDIA **Nsight Systems (nsys)**\index{GPU Profiler!timeline analysis} generate a timeline trace. This visualization reveals the exact sequence of events on the CPU and GPU. When examining a trace, look for:

1.  **Gaps in the GPU Timeline**: If the GPU bar has empty spaces, the GPU is idle. This usually means the GPU is waiting for the CPU (preprocessing bottleneck) or disk (data loading).
2.  **Kernel Launch Overhead**: If you see thousands of tiny slivers on the GPU timeline, your model is launching too many small kernels. This is a prime candidate for **Operator Fusion**.
3.  **Host-to-Device Transfers**: Look for `MemcpyHtoD` (Host to Device) blocks. Are they overlapping with computation, or blocking it?

::: {.callout-example title="The Profiling Loop"}
1.  **Capture**: Run a warmup, then capture a trace of 10–50 requests.
2.  **Visualize**: Open the trace in a viewer (Chrome Tracing, Nsight).
3.  **Identify**: Find the largest gap or the longest block.
4.  **Optimize**: Apply a specific fix (e.g., fusion, pinning).
5.  **Verify**: Re-capture and confirm the gap is gone.
:::
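
The "Identify" step can also be automated once kernel intervals have been exported from a trace. The sketch below assumes the trace has already been reduced to a list of `(start_us, end_us)` pairs for GPU kernels; the export step and the example intervals are not taken from any particular profiler and are hypothetical.

```python
def find_gaps(kernels, min_gap_us: float = 0.0):
    """Return idle (start, end) intervals between kernels, largest first."""
    gaps = []
    busy_until = None
    for start, end in sorted(kernels):
        if busy_until is not None and start - busy_until > min_gap_us:
            gaps.append((busy_until, start))
        # Track the latest end time so overlapping kernels are merged.
        busy_until = end if busy_until is None else max(busy_until, end)
    return sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)

def busy_fraction(kernels) -> float:
    """Fraction of the traced window in which at least one kernel ran."""
    ordered = sorted(kernels)
    window = max(end for _, end in ordered) - ordered[0][0]
    idle = sum(end - start for start, end in find_gaps(ordered))
    return (window - idle) / window

# Hypothetical kernel intervals in microseconds.
trace = [(0, 100), (100, 180), (400, 450), (460, 520)]
gaps = find_gaps(trace)
print(gaps[0], f"{busy_fraction(trace):.0%} busy")
```

Sorting gaps by size points directly at the largest stall, which is usually the most profitable place to look first.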

#### Optimization Technique Impact Matrix {#sec-model-serving-optimization-technique-impact-matrix-7c1e}

To guide optimization efforts, @tbl-optimization-impact summarizes the key techniques available at the node level, their primary targets, and expected returns.

| **Technique**         | **Target Metric**    | **Typical Gain** | **Implement. Cost** | **Best For**             |
|:----------------------|:---------------------|-----------------:|:--------------------|:-------------------------|
| **Operator Fusion**   | Latency & Throughput |     2–5 $\times$ | Medium (Compiler)   | Memory-bound layers      |
| **INT8 Quantization** | Throughput           | 2.5–3.5 $\times$ | High (Calibration)  | Inference-heavy nodes    |
| **Graph Compilation** | Latency              |   1.5–3 $\times$ | Low (One-line)      | Static graph models      |
| **Zero-Copy Loading** | Startup Time         |   10–50 $\times$ | Low (File format)   | Autoscaling / Cold Start |
| **CPU Pinning**       | Tail Latency (P99)   | 20–50% reduction | Low (Config)        | Latency-critical apps    |

: **Node-Level Optimization Impact**: A decision matrix for selecting optimization techniques. High-impact techniques like quantization often carry higher implementation costs (calibration data requirements), while architectural changes like zero-copy loading offer dramatic gains for specific metrics (startup time) with low effort. {#tbl-optimization-impact}

This hierarchy of impact guides where to invest engineering effort. Use the following checklist to prioritize your optimization strategy.

::: {.callout-checkpoint title="The Optimization Hierarchy" collapse="false"}
Optimizing inference requires a layered approach.

**The Stack**

- [ ] **System Level**: Have you minimized network round trips and serialization overhead? (gRPC, persistent connections).
- [ ] **Application Level**: Are you batching requests effectively? (Dynamic batching).
- [ ] **Model Level**: Is the model compiled for the target hardware? (TensorRT, ONNX Runtime).
- [ ] **Kernel Level**: Are operations fused to minimize memory bandwidth?
:::

The optimization techniques examined so far—batching, runtime selection, precision tuning, graph compilation—collectively determine how much useful work a single serving node extracts from its hardware. The natural question that follows is economic: given these per-node capabilities, *how much* infrastructure is required, and *what does it cost*?

## Economics and Planning {#sec-model-serving-economics-capacity-planning-3e7e}

Every optimization technique examined so far—batching, precision tuning, operator fusion, graph compilation—reduces a single number: the cost of one inference on one machine. But production deployment requires answering a different question: how many machines, of what type, at what total cost? A team that achieves 1,200 images/second on a V100 still needs to know whether 8 V100s at \$3/hour each or 24 T4s at \$0.53/hour each yields lower total cost of ownership for their 5,000 QPS target. Serving costs\index{Serving Economics!infrastructure costs}\index{Serving Costs!request volume scaling} scale with request volume, unlike training costs that scale with dataset size and model complexity [@zhang2019mark]. The intelligence deflation trend shown in @fig-intelligence-deflation intensifies this pressure: as per-token prices collapse by orders of magnitude, the margin on each inference shrinks, making infrastructure efficiency the primary lever for economic viability.

### Cost Per Inference {#sec-model-serving-cost-per-inference-27fc}

Total\index{Cost Per Inference!serving economics} serving cost decomposes into four components: compute time (GPU or CPU cycles consumed per inference), memory (accelerator memory required to hold model weights and activations), data transfer (network bandwidth for request and response payloads), and orchestration overhead (container runtime, load balancing, and monitoring). For GPU inference, the dominant cost component shifts with utilization. At high utilization, compute time dominates because the GPU stays busy processing requests. At low utilization, memory cost dominates\index{GPU Utilization!cost economics} because the GPU is reserved—and billed—even while idle. This distinction matters for cost optimization: improving throughput reduces compute cost per inference, while improving utilization reduces the memory waste of idle hardware. We can apply this framework to a *ResNet-50 cost analysis*.

```{python}
#| label: cost-analysis-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ COST ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50: Cost Analysis" — cost per inference section
# │
# │ Goal: Contrast the cost-per-million inferences across hardware tiers.
# │ Show: That the T4 is cheapest per inference; the V100 beats the CPU but not the T4.
# │ How: Calculate unit costs using AWS hourly rates and measured images-per-second.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: ca_cpu_cost_str, ca_cpu_throughput_str, ca_cpu_cpm_str,
# │          ca_t4_cost_str, ca_t4_throughput_str, ca_t4_cpm_str,
# │          ca_v100_cost_str, ca_v100_throughput_str, ca_v100_cpm_str,
# │          ca_v100_price_increase_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (AWS on-demand pricing and throughput) ---
ca_cpu_cost_value = 0.17
ca_cpu_throughput_value = 50

ca_t4_cost_value = 0.53
ca_t4_throughput_value = 400

ca_v100_cost_value = 3.06
ca_v100_throughput_value = 1200

# --- Process (cost per million images) ---
ca_cpu_cpm_value = ca_cpu_cost_value / (ca_cpu_throughput_value * SEC_PER_HOUR / MILLION)
ca_t4_cpm_value = ca_t4_cost_value / (ca_t4_throughput_value * SEC_PER_HOUR / MILLION)
ca_v100_cpm_value = ca_v100_cost_value / (ca_v100_throughput_value * SEC_PER_HOUR / MILLION)

ca_v100_price_increase_value = ca_v100_cost_value / ca_t4_cost_value

# --- Outputs (formatted strings for table) ---
ca_cpu_cost_str = fmt(ca_cpu_cost_value, precision=2, commas=False)
ca_cpu_throughput_str = f"{ca_cpu_throughput_value}"
ca_cpu_cpm_str = fmt(ca_cpu_cpm_value, precision=2, commas=False)

ca_t4_cost_str = fmt(ca_t4_cost_value, precision=2, commas=False)
ca_t4_throughput_str = f"{ca_t4_throughput_value}"
ca_t4_cpm_str = fmt(ca_t4_cpm_value, precision=2, commas=False)

ca_v100_cost_str = fmt(ca_v100_cost_value, precision=2, commas=False)
ca_v100_throughput_str = f"{ca_v100_throughput_value:,}"
ca_v100_cpm_str = fmt(ca_v100_cpm_value, precision=2, commas=False)

ca_v100_price_increase_str = fmt(ca_v100_price_increase_value, precision=0, commas=False)
```

::: {.callout-notebook title="ResNet-50: Cost Analysis"}

Consider serving ResNet-50 on AWS infrastructure (US-East region, on-demand pricing as of this writing):

| **Instance Type**         |                 **Cost/Hour** |                          **Throughput** |       **Cost per 1M Images** |
|:--------------------------|------------------------------:|----------------------------------------:|-----------------------------:|
| **c5.xlarge (CPU)**       |  \$`{python} ca_cpu_cost_str` |  `{python} ca_cpu_throughput_str` img/s |  \$`{python} ca_cpu_cpm_str` |
| **g4dn.xlarge (T4 GPU)**  |   \$`{python} ca_t4_cost_str` |   `{python} ca_t4_throughput_str` img/s |   \$`{python} ca_t4_cpm_str` |
| **p3.2xlarge (V100 GPU)** | \$`{python} ca_v100_cost_str` | `{python} ca_v100_throughput_str` img/s | \$`{python} ca_v100_cpm_str` |

**Key insight**: The T4 GPU instance achieves the lowest cost per inference despite higher hourly cost, because GPU throughput dramatically exceeds CPU throughput. The V100 is only cost-effective at very high sustained traffic where its higher throughput justifies the `{python} ca_v100_price_increase_str` $\times$ price increase. Note that cloud pricing varies by region and changes over time; consult current pricing for production planning.

:::
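
The table's last column follows from dividing the hourly price by the number of images processed per hour. The sketch below reproduces that arithmetic with the hourly costs and throughputs listed above; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600

def cost_per_million(hourly_cost_usd: float, images_per_second: float) -> float:
    """Dollars per one million inferences at full utilization."""
    images_per_hour = images_per_second * SECONDS_PER_HOUR
    return hourly_cost_usd / images_per_hour * 1_000_000

# (cost per hour in USD, throughput in images per second) from the table.
instances = {
    "c5.xlarge (CPU)": (0.17, 50),
    "g4dn.xlarge (T4)": (0.53, 400),
    "p3.2xlarge (V100)": (3.06, 1200),
}

costs = {name: cost_per_million(*spec) for name, spec in instances.items()}
cheapest = min(costs, key=costs.get)
for name, usd in costs.items():
    print(f"{name:20s} ${usd:.2f} per 1M images")
print("cheapest:", cheapest)
```

The calculation assumes the instance is fully utilized; at lower utilization the effective cost per image rises in proportion to the idle time.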

### GPU vs CPU Economics {#sec-model-serving-gpu-vs-cpu-economics-eb06}

GPUs provide significant speedup for parallel operations but cost more per hour\index{GPU vs CPU!economics} [@wu2019machine]. The crossover point depends on model characteristics and latency requirements.

CPU inference makes economic sense for small models with few parameters and simple operations, when latency requirements are relaxed (hundreds of milliseconds acceptable), when request volume is low or highly variable (making GPU reservation wasteful), or when the model's operations do not parallelize well. GPU inference dominates when models are large with parallel-friendly operations, latency requirements are strict (tens of milliseconds), request volume is high and consistent enough to sustain utilization, and batching can amortize the per-inference overhead of GPU kernel launches.

Beyond\index{Autoscaling!startup latency} steady-state costs, startup time affects scaling economics. CPU instances typically start in 30–60 seconds while GPU instances take 2–5 minutes including driver initialization, model loading, and warmup. For variable traffic patterns, this startup latency can be more important than cost per inference. If traffic spikes arrive faster than GPU instances can scale, latency SLOs will be violated despite having sufficient eventual capacity.

This asymmetry suggests different scaling strategies where CPU instances enable reactive scaling\index{Reactive Scaling!CPU instances} by responding to current demand while GPU instances often require predictive scaling\index{Predictive Scaling!GPU instances} by provisioning based on anticipated demand. For bursty workloads, a hybrid approach\index{Hybrid Scaling!GPU+CPU} uses always-on GPU capacity for baseline load plus CPU overflow capacity for spikes, trading higher per-inference cost during spikes for better responsiveness. This GPU+CPU hybrid is one instance of the broader *hybrid architecture* patterns cataloged in @sec-ml-systems-hybrid-architectures-combining-paradigms-7cdd, where the train-serve split and hierarchical processing patterns also combine paradigms to balance cost, latency, and capability.

### Capacity Planning {#sec-model-serving-capacity-planning-96a3}

The GPU versus CPU decision establishes the cost per inference, but determining how much infrastructure to provision requires combining cost analysis with the queuing theory foundations from @sec-model-serving-queuing-theory-tail-latency-29a6. Capacity planning\index{Capacity Planning!infrastructure sizing} translates three inputs into infrastructure specifications: traffic patterns (peak request rate, daily/weekly cycles, growth projections), latency SLOs (p50, p95, p99 targets), and model characteristics (inference time distribution at various batch sizes) [@harchol2013performance].

The worked example in @sec-model-serving-queuing-theory-tail-latency-29a6 demonstrates the complete workflow: starting from a 50ms p99 SLO and 5,000 QPS target, deriving the safe utilization threshold of `{python} cp_rho_safe_pct_str` percent from @eq-p99-latency, and determining GPU count with headroom of `{python} cp_final_ceil_str` V100s. Production systems typically provision for peak load plus 30 percent headroom, using auto-scaling to reduce costs during low-traffic periods while meeting latency objectives during peaks. The key insight from capacity planning is that throughput numbers are meaningful only when coupled with latency guarantees: a system achieving 10,000 QPS but exceeding its latency target on 5 percent of requests is not actually serving 10,000 QPS; it is serving 9,500 valid QPS and failing on the rest.
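
The sizing arithmetic can be sketched in a few lines. The per-GPU throughput and 30 percent headroom come from this chapter; the 0.7 safe-utilization value is a placeholder assumption, not the threshold derived in the worked example, and the function names are illustrative.

```python
import math

def gpus_required(peak_qps: float, per_gpu_qps: float,
                  safe_utilization: float, headroom: float = 0.30) -> int:
    """GPUs needed to serve peak load plus headroom below a utilization cap."""
    usable_per_gpu = per_gpu_qps * safe_utilization
    return math.ceil(peak_qps * (1 + headroom) / usable_per_gpu)

def goodput_qps(offered_qps: float, slo_miss_fraction: float) -> float:
    """Requests per second that actually meet the latency objective."""
    return offered_qps * (1 - slo_miss_fraction)

# 5,000 QPS peak on V100s at 1,200 img/s each; 0.7 utilization is assumed.
fleet = gpus_required(peak_qps=5000, per_gpu_qps=1200, safe_utilization=0.7)
valid = goodput_qps(offered_qps=10_000, slo_miss_fraction=0.05)
print(fleet, valid)
```

Separating raw throughput from goodput keeps capacity reports honest: only requests that meet the SLO count toward what the fleet is delivering.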

### Production Case Study: Serving Llama-3-8B {#sec-model-serving-production-case-study-serving-llama38b-0499}

To synthesize the principles of latency budgeting, memory management, and hardware efficiency, we analyze a complete production profile for a modern Large Language Model (LLM) serving workload\index{LLM Serving!8B parameter case study}\index{LLM Serving!production case study}. This case study demonstrates how physical constraints—memory bandwidth and capacity—translate directly into service-level metrics and unit economics.

We begin with the bottleneck that dominates LLM serving costs: KV cache memory. Watch how steeply the memory curves climb in @fig-kv-cache-growth—especially for larger batch sizes—to see *why* long-context serving is memory-bound even on H100s, using typical 70B-class assumptions.

```{python}
#| label: fig-kv-cache-growth
#| echo: false
#| fig-cap: "**The KV-Cache Explosion**: Memory usage vs. Context Length for a 70B-class model. Assumes 80 layers, d_model=8192, FP16 KV cache, GQA (8x). The linear growth of the Key-Value cache (storing attention history) quickly consumes available GPU memory (red dashed line). For batch size 32 (purple), the system hits the 'OOM Zone' at just 8k context length, forcing a trade-off between batch size (throughput) and context window (capability)."
#| fig-alt: "Line chart showing memory usage increasing linearly with context length. Multiple lines for different batch sizes. Red dashed horizontal line marks GPU memory limit. Purple line for batch 32 crosses into OOM zone at 8k context."

import numpy as np
from mlsys import viz
from mlsys.constants import BYTES_FP16, byte, GB

fig, ax, COLORS, plt = viz.setup_plot()

# =============================================================================
# PLOT: The KV-Cache Explosion
# =============================================================================
seq_len = np.linspace(0, 32000, 100)
layers, d_model, bytes_per_param = 80, 8192, BYTES_FP16.magnitude  # 70B model params, FP16
gqa_ratio = 8  # Grouped Query Attention (8x reduction)

def get_kv_gb(batch, seq):
    # KV cache size = 2 * layers * d_model * seq * batch * bytes_per_param / gqa_ratio
    bytes_total = (2 * layers * d_model * seq * batch * bytes_per_param) / gqa_ratio
    return (bytes_total * byte).to(GB).magnitude

batches = [1, 4, 16, 32]
colors = [COLORS['BlueLine'], COLORS['GreenLine'], COLORS['OrangeLine'], COLORS['VioletLine']]

for b, c in zip(batches, colors):
    gb = get_kv_gb(b, seq_len)
    ax.plot(seq_len, gb, label=f'Batch Size {b}', color=c, linewidth=2)

limit_gb = 80
ax.axhline(limit_gb, color=COLORS['RedLine'], linestyle='--', linewidth=2)
ax.text(1000, limit_gb + 2, "A100/H100 Capacity (80GB)", color=COLORS['RedLine'], fontweight='bold', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
ax.axhspan(limit_gb, 140, color=COLORS['RedL'], alpha=0.2)
ax.text(16000, 100, "Out of Memory (OOM) Zone", color=COLORS['RedLine'], ha='center', fontsize=10, fontweight='bold', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))

ax.annotate("Linear Growth", xy=(15000, get_kv_gb(4, 15000)), xytext=(20000, 30),
            arrowprops=dict(facecolor=COLORS['primary'], arrowstyle='->'), fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))

ax.set_xlabel('Context Length (Tokens)')
ax.set_ylabel('KV Cache Size (GB) [FP16]')
ax.set_xlim(0, 32000)
ax.set_ylim(0, 120)
ax.set_xticks([0, 8000, 16000, 24000, 32000])
ax.set_xticklabels(['0', '8k', '16k', '24k', '32k'])
ax.legend(loc='lower right', fontsize=8)
plt.show()
```

The linear growth of the KV cache with sequence length forces a hard trade-off: to support longer contexts (32k+), we must reduce batch size, which in turn lowers throughput because fewer requests share each pass over the weights.
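
The trade-off can be made concrete by inverting the figure's formula to find the longest context that fits in a fixed memory budget. The following is a minimal sketch, not a sizing tool: the function names are illustrative, the architectural parameters are the same 70B-class assumptions used in the figure, and the 80 GB budget optimistically ignores model weights and activations.

```python
def kv_bytes_per_token(layers=80, d_model=8192, bytes_per_value=2, gqa_ratio=8):
    """KV cache bytes one token adds to one sequence (keys and values, all layers)."""
    return 2 * layers * d_model * bytes_per_value // gqa_ratio


def max_context_tokens(batch_size, budget_gb=80.0):
    """Longest per-sequence context whose KV cache fits in the budget (decimal GB)."""
    budget_bytes = budget_gb * 1e9
    return int(budget_bytes // (kv_bytes_per_token() * batch_size))


for batch in (1, 4, 16, 32):
    print(f"batch {batch:>2}: up to {max_context_tokens(batch):,} tokens per sequence")
```

Under these assumptions, batch size 32 runs out of room just under 8k tokens, which is where the batch-32 curve crosses the capacity line.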

#### Workload Profile {#sec-model-serving-workload-profile-a380}

*   **Model**: Llama-3-8B (quantized to 4-bit AWQ[^fn-awq]\index{AWQ!4-bit quantization}).

[^fn-awq]: **AWQ (Activation-Aware Weight Quantization)**: Introduced by Ji Lin et al. at MIT in 2023. AWQ observes that not all weights contribute equally to model quality: weights corresponding to channels with large activation magnitudes are disproportionately important. Rather than quantizing all weights identically, AWQ applies per-channel scaling that protects salient weights from quantization error while aggressively compressing less important ones. This activation-aware approach achieves 4-bit quantization with negligible quality loss (typically <0.5% accuracy degradation), enabling models to fit in roughly 4 $\times$ less GPU memory than their FP16 versions and achieving proportional memory bandwidth savings during the decode phase. For serving, AWQ's 4-bit compression directly translates to higher concurrent batch sizes and lower cost per token.

*   **Hardware**: 1 $\times$ NVIDIA H100 SXM5 GPU (`{python} h100_mem` GB HBM3, `{python} h100_bw_tbs` TB/s bandwidth).
*   **Request Characteristics**: 1,000-token input prompt (Prefill), 256-token generated response (Decode).
*   **Target SLOs**: TTFT $<$ 200 ms, TPOT $<$ 20 ms.

#### Latency Deconstruction {#sec-model-serving-latency-deconstruction-217e}

The end-to-end request latency is governed by the two-phase execution model of autoregressive transformers, applying the TTFT and TPOT metrics defined in @sec-model-serving-performance-metrics-ttft-tpot-b009.

##### Prefill Phase (Time to First Token) {.unnumbered}

The model processes the 1,000-token prompt in parallel\index{Prefill Phase!compute-bound}\index{Prefill Phase!parallel processing}. On an H100, this compute-bound operation achieves approximately 10,000 tokens per second: $T_{\text{prefill}} = 1000 \text{ tokens} / 10{,}000 \text{ tokens/s} = 100 \text{ ms}$. Accounting for 20 ms of system overhead (network ingress, tokenization), the **TTFT is 120 ms**, comfortably within the 200 ms SLO.

##### Decode Phase (Time Per Output Token) {.unnumbered}

The model generates 256 tokens sequentially. This phase is memory-bandwidth bound\index{Decode Phase!memory-bandwidth bound}\index{Memory Bandwidth!LLM bottleneck}\index{Decode Phase!sequential generation}—the same IO-bound pattern seen in the DLRM embedding lookups (@sec-model-serving-latency-distribution-analysis-b0f8), but at a larger scale: the system must read the entire 3.5 GB weight tensor from VRAM to generate a single token.

::: {.callout-perspective title="The Physics of Token Generation"}

Recall the **Energy-Movement Invariant** from @sec-data-engineering: moving a bit is 100–1,000 $\times$ more expensive than computing on it. In the **Decode Phase**, this law determines the physical "cost per word."

**The Memory Wall for Generative AI**: Because the decode phase has an arithmetic intensity of $\approx 1$ FLOP/byte (we must read every weight just to generate one token), performance is strictly limited by memory bandwidth ($BW$), not compute. This relationship is captured in @eq-token-generation-time:

$$ T_{\text{token}} \approx \frac{\text{Model Size (Bytes)}}{\text{Memory Bandwidth (Bytes/s)}} $$ {#eq-token-generation-time}

**The Engineering Implication**:
Every time you generate a token, you are paying a massive "energy tax" to move the model's weights from HBM into compute registers. For Llama-3-8B (3.5 GB int4), an A100 80GB (`{python} a100_bw_tbs` TB/s HBM2e) generates tokens at $\approx 1.7$ ms/token. Adding more *compute cores* yields **zero** latency improvement; only faster memory (Physics) or smaller models (Algorithm) can speed up generation.
:::
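
The bound in @eq-token-generation-time is simple enough to evaluate directly. The sketch below does so for two accelerators; the bandwidth figures are assumed peak datasheet values rather than measurements, and the function name is illustrative.

```python
def token_time_ms(model_bytes, bandwidth_bytes_per_s):
    """Lower bound on decode time per token when every weight is read once."""
    return model_bytes / bandwidth_bytes_per_s * 1e3


MODEL_BYTES = 3.5e9  # 4-bit weights, as stated in the text

# Assumed peak memory bandwidths (datasheet values, not measured throughput)
HARDWARE_BW = {"A100 80GB": 2.039e12, "H100 SXM5": 3.35e12}

for name, bandwidth in HARDWARE_BW.items():
    floor_ms = token_time_ms(MODEL_BYTES, bandwidth)
    print(f"{name}: at least {floor_ms:.2f} ms per token")
```

Because the bound depends only on bytes moved and bytes per second, halving the model size or doubling the bandwidth are the only two levers it exposes.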

```{python}
#| label: llm-serving-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ LLM SERVING ECONOMICS (LLAMA-3-8B CASE STUDY)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Llama-3-8B serving case study — token latency, throughput, and
# │          unit economics on H100 with 4-bit quantization
# │
# │ Goal: Connect memory bandwidth, KV cache capacity, and serving economics.
# │ Show: That memory capacity bounds throughput while bandwidth bounds latency.
# │ How: Calculate TPOT, concurrent batch size, and $/M tokens for Llama-3-8B on H100.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: model_weight_gb_str, h100_bw_tb_str,
# │          token_time_theoretical_ms_str, realized_tpot_ms_str,
# │          decode_tokens_str, total_decode_s_str, kv_cache_gb_str,
# │          kv_per_token_mb_str, kv_capacity_tokens_str,
# │          tokens_per_req_str, concurrent_batch_str,
# │          req_time_s_str, hourly_cost_str, tokens_per_hour_m_str,
# │          cost_per_m_tokens_str, remaining_vram_gb_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (model, hardware, and cost parameters) ---
model_weight_gb_value = 3.5
h100_bw_tb_value = H100_MEM_BW.to(TB / second).magnitude
realized_tpot_ms_value = 10 # Conservative production target (theoretical min ~1-2ms)
decode_tokens_value = 256

kv_cache_gb_value = 72
kv_per_token_mb_value = 0.5
tokens_per_req_value = 1256

ttft_s_value = 0.12
hourly_cost_value = 3.00

# --- Process (latency, capacity, and economics) ---
token_time_theoretical_ms_value = model_weight_gb_value / (h100_bw_tb_value * 1000) * 1000
total_decode_s_value = decode_tokens_value * realized_tpot_ms_value / 1000

kv_capacity_tokens_value = int(kv_cache_gb_value * 1000 / kv_per_token_mb_value)
concurrent_batch_value = int(kv_capacity_tokens_value / tokens_per_req_value)

req_time_s_value = ttft_s_value + decode_tokens_value * realized_tpot_ms_value / 1000
tokens_per_hour_value = concurrent_batch_value * (SEC_PER_HOUR / req_time_s_value) * tokens_per_req_value
cost_per_m_tokens_value = hourly_cost_value / (tokens_per_hour_value / MILLION)

remaining_vram_gb_value = int(80 - model_weight_gb_value)

# --- Outputs (formatted strings for prose) ---
model_weight_gb_str = f"{model_weight_gb_value}"
h100_bw_tb_str = fmt(h100_bw_tb_value, precision=1, commas=False)
token_time_theoretical_ms_str = fmt(token_time_theoretical_ms_value, precision=0, commas=False)
realized_tpot_ms_str = f"{realized_tpot_ms_value}"
decode_tokens_str = f"{decode_tokens_value}"
total_decode_s_str = fmt(total_decode_s_value, precision=2, commas=False)

kv_cache_gb_str = f"{kv_cache_gb_value}"
kv_per_token_mb_str = f"{kv_per_token_mb_value}"
kv_capacity_tokens_str = f"{kv_capacity_tokens_value:,}"
tokens_per_req_str = f"{tokens_per_req_value:,}"
concurrent_batch_str = f"{concurrent_batch_value}"

req_time_s_str = fmt(req_time_s_value, precision=2, commas=False)
hourly_cost_str = fmt(hourly_cost_value, precision=2, commas=False)
tokens_per_hour_m_str = fmt(tokens_per_hour_value / MILLION, precision=0, commas=False)
cost_per_m_tokens_str = fmt(cost_per_m_tokens_value, precision=3, commas=False)
remaining_vram_gb_str = f"{remaining_vram_gb_value}"
```

*   $T_{\text{token}}$ ≈ `{python} model_weight_gb_str` GB / `{python} h100_bw_tbs` TB/s ≈ `{python} token_time_theoretical_ms_str` ms (theoretical limit).
*   Accounting for kernel launch overhead, attention computation, and a conservative production safety margin, realized $T_{\text{token}}$ is approximately `{python} realized_tpot_ms_str` ms.
*   Total decode time: `{python} decode_tokens_str` tokens $\times$ `{python} realized_tpot_ms_str` ms/token = `{python} total_decode_s_str` seconds.
*   **TPOT is `{python} realized_tpot_ms_str` ms**, well within the 20 ms "fluidity" SLO.
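
The two phases combine into a per-request latency that can be checked against both SLOs. This sketch uses the workload numbers from this case study; the function name is hypothetical, and the 10 ms TPOT is the conservative production figure rather than the bandwidth floor.

```python
def request_latency_ms(prompt_tokens, output_tokens, prefill_tok_per_s,
                       overhead_ms, tpot_ms):
    """Return (ttft_ms, total_ms) for one request under the two-phase model."""
    ttft_ms = prompt_tokens / prefill_tok_per_s * 1e3 + overhead_ms
    total_ms = ttft_ms + output_tokens * tpot_ms
    return ttft_ms, total_ms


TTFT_SLO_MS, TPOT_SLO_MS = 200, 20
tpot_ms = 10  # realized production value from the text

ttft_ms, total_ms = request_latency_ms(
    prompt_tokens=1000, output_tokens=256,
    prefill_tok_per_s=10_000, overhead_ms=20, tpot_ms=tpot_ms)

print(f"TTFT {ttft_ms:.0f} ms, within SLO: {ttft_ms < TTFT_SLO_MS}")
print(f"TPOT {tpot_ms} ms, within SLO: {tpot_ms < TPOT_SLO_MS}")
print(f"End-to-end request time: {total_ms / 1e3:.2f} s")
```

Decode accounts for most of the end-to-end time even though each individual token is fast, which is why TPOT rather than TTFT drives how long a request occupies its KV cache slot.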

#### Memory & Throughput {#sec-model-serving-memory-throughput-63dd}

With 4-bit weights occupying `{python} model_weight_gb_str` GB, roughly `{python} remaining_vram_gb_str` GB of VRAM remains. After setting aside a few gigabytes of headroom for activations and runtime buffers, we budget `{python} kv_cache_gb_str` GB for the **KV Cache**. Using **PagedAttention**, we can allocate this memory with near-zero fragmentation.

*   Each token requires approximately `{python} kv_per_token_mb_str` MB of KV cache (32 layers $\times$ 4096 dim $\times$ 2 vectors $\times$ 2-byte precision, assuming standard multi-head attention; models with Grouped Query Attention use fewer KV heads, reducing this by up to 4 $\times$).
*   Total cache capacity ≈ `{python} kv_cache_gb_str` GB / `{python} kv_per_token_mb_str` MB/token ≈ `{python} kv_capacity_tokens_str` tokens.
*   At `{python} tokens_per_req_str` tokens per request (input + output), the GPU can handle a **concurrent batch size of ~`{python} concurrent_batch_str` requests**.
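
The capacity arithmetic above can be sketched as follows. The helper names are illustrative, the 72 GB cache budget and 1,256 tokens per request come from this case study, and the `kv_reduction` parameter is an assumption used to show the effect of the 4 $\times$ GQA reduction mentioned in the first bullet. The code keeps the text's rounding, which treats 0.5 MiB as 0.5 MB.

```python
def kv_mb_per_token(layers=32, hidden_dim=4096, bytes_per_value=2, kv_reduction=1):
    """KV cache per token in MiB; kv_reduction > 1 models GQA's fewer KV heads."""
    return 2 * layers * hidden_dim * bytes_per_value / kv_reduction / 2**20


def concurrent_requests(cache_gb, mb_per_token, tokens_per_request):
    """Return (token capacity, whole requests that fit), using the text's rounding."""
    capacity_tokens = int(cache_gb * 1000 / mb_per_token)
    return capacity_tokens, capacity_tokens // tokens_per_request


for label, reduction in (("multi-head attention", 1), ("GQA, 4x fewer KV heads", 4)):
    per_token = kv_mb_per_token(kv_reduction=reduction)
    capacity, batch = concurrent_requests(72, per_token, tokens_per_request=1256)
    print(f"{label}: {per_token} MB/token, {capacity:,} tokens, ~{batch} requests")
```

The second row shows why attention variants that shrink the KV cache matter for serving cost: the same GPU holds roughly four times as many concurrent requests.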

#### Unit Economics {#sec-model-serving-unit-economics-b685}

For an H100 SXM5 instance at approximately USD `{python} hourly_cost_str` per hour (specialized cloud providers; hyperscaler rates vary from USD 2-13 per hour as of this writing):

*   Total tokens per hour: `{python} concurrent_batch_str` batch $\times$ (3,600 s/hr / `{python} req_time_s_str` s/req) $\times$ `{python} tokens_per_req_str` tokens/req ≈ `{python} tokens_per_hour_m_str` million tokens/hour.
*   **Cost per million tokens**: USD `{python} hourly_cost_str` / `{python} tokens_per_hour_m_str` ≈ **USD `{python} cost_per_m_tokens_str`**.
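
The same calculation as a short sketch, with the inputs taken from the preceding sections. It should be read as an optimistic bound: it assumes every batch slot stays occupied for the full hour and counts prompt and generated tokens alike. The function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(batch, request_s, tokens_per_request, hourly_usd):
    """Return (tokens per hour, USD per million tokens) at full occupancy."""
    requests_per_slot = SECONDS_PER_HOUR / request_s
    tokens_per_hour = batch * requests_per_slot * tokens_per_request
    return tokens_per_hour, hourly_usd / (tokens_per_hour / 1e6)


tokens_per_hour, usd_per_million = cost_per_million_tokens(
    batch=114, request_s=2.68, tokens_per_request=1256, hourly_usd=3.00)

print(f"{tokens_per_hour / 1e6:.0f} million tokens/hour")
print(f"USD {usd_per_million:.3f} per million tokens")
```

Halving the batch size doubles the cost per token while leaving per-request latency unchanged, which is the capacity-versus-bandwidth split in numerical form.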

This analysis highlights that for LLMs, **memory capacity**\index{Memory Capacity!LLM throughput}\index{KV Cache!memory capacity} (the size of the KV cache) is the primary determinant of throughput and cost, while **memory bandwidth**\index{Memory Bandwidth!LLM latency}\index{HBM!memory bandwidth} is the primary determinant of latency.

This case study applies the core principles developed throughout this chapter: latency budgets decompose into prefill and decode phases, queuing theory governs batch sizing and capacity planning, and hardware constraints in the form of memory bandwidth and capacity determine achievable performance and cost. The quantitative framework established here enables principled engineering decisions, but only when applied correctly. Common misconceptions cause even experienced engineers to misapply these principles in practice.

## Fallacies and Pitfalls {#sec-model-serving-fallacies-pitfalls-336b}

Serving inverts training priorities in ways that violate intuitions from batch processing. The nonlinear relationship between utilization and latency, the hidden costs of preprocessing, and the silent failure modes of training-serving skew cause violated SLOs, wasted optimization effort, and accuracy degradation invisible to standard monitoring.

**Fallacy:** *Reducing model inference latency proportionally reduces user-perceived latency.*

```{python}
#| label: fallacy-latency-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: INFERENCE LATENCY ≠ USER LATENCY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Reducing model inference latency proportionally reduces
# │          user-perceived latency"
# │
# │ Goal: Demonstrate the nonlinear interaction between inference speed and queuing.
# │ Show: That collapsing queuing wait yields system-level speedups far exceeding model-level speedups.
# │ How: Model M/M/1 queuing wait times for 5ms vs. 2ms inference latencies.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: fl_utilization_high_pct_str, fl_service_slow_ms_str,
# │          fl_wait_slow_ms_str, fl_service_fast_ms_str,
# │          fl_utilization_new_pct_str, fl_wait_fast_ms_str,
# │          fl_inference_gain_ms_str, fl_queuing_improvement_str,
# │          fl_total_slow_ms_str, fl_total_fast_ms_str,
# │          fl_system_speedup_str, fl_model_speedup_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (utilization and service times) ---
fl_utilization_high_value = 0.8
fl_service_slow_ms_value = 5
fl_service_fast_ms_value = 2

# --- Process (M/M/1 queuing model) ---
# M/M/1 wait = service / (1 - rho) - service = service * rho / (1 - rho)
# Wait time at 80% util
fl_wait_slow_ms_value = fl_service_slow_ms_value * fl_utilization_high_value / (1 - fl_utilization_high_value)
fl_total_slow_ms_value = fl_wait_slow_ms_value + fl_service_slow_ms_value

# New utilization with faster service (assuming constant arrival rate)
# rho_new = rho_old * (service_new / service_old)
fl_utilization_new_value = fl_utilization_high_value * (fl_service_fast_ms_value / fl_service_slow_ms_value)

fl_wait_fast_ms_value = fl_service_fast_ms_value * fl_utilization_new_value / (1 - fl_utilization_new_value)
fl_total_fast_ms_value = fl_wait_fast_ms_value + fl_service_fast_ms_value

fl_model_speedup_value = fl_service_slow_ms_value / fl_service_fast_ms_value
fl_system_speedup_value = fl_total_slow_ms_value / fl_total_fast_ms_value
fl_queuing_improvement_value = fl_wait_slow_ms_value / fl_wait_fast_ms_value

fl_inference_gain_ms_value = fl_service_slow_ms_value - fl_service_fast_ms_value

# --- Outputs (formatted strings for prose) ---
fl_utilization_high_pct_str = f"{fl_utilization_high_value * 100:.0f}"
fl_service_slow_ms_str = f"{fl_service_slow_ms_value}"
fl_wait_slow_ms_str = f"{fl_wait_slow_ms_value:.0f}"

fl_service_fast_ms_str = f"{fl_service_fast_ms_value}"
fl_utilization_new_pct_str = f"{fl_utilization_new_value * 100:.0f}"
fl_wait_fast_ms_str = f"{fl_wait_fast_ms_value:.1f}"

fl_inference_gain_ms_str = f"{fl_inference_gain_ms_value}"
fl_queuing_improvement_str = f"{fl_queuing_improvement_value:.0f}"

fl_total_slow_ms_str = f"{fl_total_slow_ms_value:.0f}"
fl_total_fast_ms_str = f"{fl_total_fast_ms_value:.1f}"
fl_system_speedup_str = f"{fl_system_speedup_value:.1f}"
fl_model_speedup_str = f"{fl_model_speedup_value:.1f}"
```

Engineers who optimize model inference expect proportional improvement in user-perceived latency\index{Latency!inference vs user-perceived}, but serving systems introduce latency sources absent from offline benchmarks. Under load, queuing delay dominates: @eq-mm1-wait shows that at `{python} fl_utilization_high_pct_str` percent utilization with `{python} fl_service_slow_ms_str`ms service time, average wait time is `{python} fl_wait_slow_ms_str`ms before inference even begins. Reducing inference from `{python} fl_service_slow_ms_str`ms to `{python} fl_service_fast_ms_str`ms changes service time but also shifts utilization from `{python} fl_utilization_high_pct_str` percent to `{python} fl_utilization_new_pct_str` percent, reducing queuing wait from `{python} fl_wait_slow_ms_str`ms to `{python} fl_wait_fast_ms_str`ms, a `{python} fl_queuing_improvement_str` $\times$ queuing improvement that dwarfs the `{python} fl_inference_gain_ms_str`ms inference gain. This nonlinear interaction between inference speed and queuing behavior means the *system-level* speedup (`{python} fl_total_slow_ms_str`ms → `{python} fl_total_fast_ms_str`ms, or `{python} fl_system_speedup_str` $\times$) far exceeds the *model-level* speedup (`{python} fl_service_slow_ms_str`ms → `{python} fl_service_fast_ms_str`ms, or `{python} fl_model_speedup_str` $\times$). Conversely, the gain holds only while the arrival rate per replica stays fixed: a speedup that is spent on removing replicas, or one that leaves serialization and preprocessing untouched, improves user-facing latency far less than the inference benchmark suggests. Serving optimization requires analyzing the complete latency budget, including serialization, queuing, preprocessing, and postprocessing, under realistic load conditions rather than profiling inference latency in isolation.
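
The comparison can be reproduced with a few lines of M/M/1 arithmetic. This is a minimal sketch with hypothetical function names, and it assumes the arrival rate stays constant so that utilization falls in proportion to service time.

```python
def mm1_times(service_ms, utilization):
    """Return (mean queue wait, mean time in system) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    wait_ms = service_ms * utilization / (1 - utilization)
    return wait_ms, wait_ms + service_ms


def speedups(old_service_ms, new_service_ms, old_utilization):
    """Return (model-level speedup, system-level speedup) at a fixed arrival rate."""
    new_utilization = old_utilization * new_service_ms / old_service_ms
    _, old_total = mm1_times(old_service_ms, old_utilization)
    _, new_total = mm1_times(new_service_ms, new_utilization)
    return old_service_ms / new_service_ms, old_total / new_total


model_speedup, system_speedup = speedups(5, 2, old_utilization=0.8)
print(f"model-level {model_speedup:.1f}x, system-level {system_speedup:.1f}x")
```

Passing the same service times with a lower starting utilization shrinks the gap between the two numbers, since there is less queuing delay to collapse.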

**Pitfall:** *Running serving infrastructure at high utilization to maximize cost efficiency.*

```{python}
#| label: fallacy-utilization-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: HIGH UTILIZATION LATENCY DEGRADATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Running serving infrastructure at high utilization to
# │          maximize cost efficiency"
# │
# │ Goal: Demonstrate the nonlinear latency explosion near system capacity.
# │ Show: That increasing utilization from 70% to 90% triples average latency.
# │ How: Model M/M/1 wait times to identify the practical utilization ceiling.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: fu_util_high_pct_str, fu_util_mod_pct_str,
# │          fu_wait_high_factor_str, fu_total_high_factor_str,
# │          fu_cost_reduction_str, fu_latency_increase_str,
# │          fu_service_ms_str, fu_p99_mod_ms_str, fu_p99_high_ms_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (utilization levels and service time) ---
fu_util_high_value = 0.9
fu_util_mod_value = 0.7
fu_service_ms_value = 5

# --- Process (M/M/1 queuing model at two utilization levels) ---
# M/M/1 wait factor = rho / (1-rho); total time = service / (1-rho)
fu_wait_high_factor_value = fu_util_high_value / (1 - fu_util_high_value)
fu_wait_high_ms_value = fu_service_ms_value * fu_wait_high_factor_value

fu_cost_reduction_pct_value = (1 - fu_util_mod_value / fu_util_high_value) * 100

fu_avg_latency_mod_value = fu_service_ms_value / (1 - fu_util_mod_value)
fu_avg_latency_high_value = fu_service_ms_value / (1 - fu_util_high_value)
fu_latency_increase_factor_value = fu_avg_latency_high_value / fu_avg_latency_mod_value

fu_p99_mod_value = 4.6 * fu_service_ms_value / (1 - fu_util_mod_value) # Approx formula
fu_p99_high_value = 4.6 * fu_service_ms_value / (1 - fu_util_high_value)

# --- Outputs (formatted strings for prose) ---
fu_util_high_pct_str = f"{fu_util_high_value * 100:.0f}"
fu_util_mod_pct_str = f"{fu_util_mod_value * 100:.0f}"
fu_wait_high_factor_str = f"{fu_wait_high_factor_value:.0f}" # 9
fu_total_high_factor_str = f"{1/(1-fu_util_high_value):.0f}" # 10

fu_cost_reduction_str = f"{fu_cost_reduction_pct_value:.0f}"
fu_latency_increase_str = f"{fu_latency_increase_factor_value:.0f}"

fu_service_ms_str = f"{fu_service_ms_value}"
fu_p99_mod_ms_str = f"{fu_p99_mod_value:.0f}" # ~77
fu_p99_high_ms_str = f"{fu_p99_high_value:.0f}" # ~230
```

Teams target `{python} fu_util_high_pct_str` percent utilization\index{Utilization!high utilization pitfall} to minimize idle capacity. In production, latency degrades nonlinearly as utilization approaches capacity. @eq-mm1-wait shows that at `{python} fu_util_high_pct_str` percent utilization, average time in system reaches `{python} fu_total_high_factor_str` $\times$ service time. Moving from `{python} fu_util_mod_pct_str` percent to `{python} fu_util_high_pct_str` percent utilization cuts infrastructure costs by `{python} fu_cost_reduction_str` percent but triples average latency. For a `{python} fu_service_ms_str`ms inference service, p99 latency jumps from ~`{python} fu_p99_mod_ms_str`ms to ~`{python} fu_p99_high_ms_str`ms (M/M/1 model). Systems provisioned for average load violate SLOs precisely when traffic increases during business-critical periods. Production systems targeting 60 to 70 percent utilization at peak load maintain the latency headroom needed to absorb traffic spikes.
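
Sweeping utilization makes the nonlinearity visible. The sketch below assumes an M/M/1 queue with a 5 ms service time, uses the exponential-distribution factor $\ln 100 \approx 4.6$ for p99, and approximates fleet cost as inversely proportional to utilization; all names are illustrative.

```python
import math

P99_FACTOR = math.log(100)  # about 4.6 for exponentially distributed time in system


def mm1_latency(service_ms, utilization):
    """Return (mean, p99) time in system for an M/M/1 queue."""
    mean_ms = service_ms / (1 - utilization)
    return mean_ms, P99_FACTOR * mean_ms


def fleet_savings(util_from, util_to):
    """Fraction of replicas saved by running hotter at the same traffic."""
    return 1 - util_from / util_to


for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    mean_ms, p99_ms = mm1_latency(5, rho)
    print(f"rho={rho:.2f}: mean {mean_ms:6.1f} ms, p99 {p99_ms:6.1f} ms")

print(f"70% -> 90% saves {fleet_savings(0.7, 0.9):.0%} of replicas")
```

Each step toward full utilization buys less hardware savings and costs more latency than the step before it.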

**Fallacy:** *Training accuracy guarantees serving accuracy.*

```{python}
#| label: fallacy-skew-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: TRAINING-SERVING SKEW
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Training accuracy guarantees serving accuracy"
# │
# │ Goal: Quantify the silent accuracy degradation from preprocessing mismatches
# │       between training and serving.
# │ Show: That a model at 95% validation accuracy drops to 90% in production
# │       from mismatches such as resize interpolation.
# │ How: Model validation vs. production accuracy for a mismatched vision pipeline.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: fs_val_acc_str, fs_prod_acc_str, fs_acc_drop_str,
# │          fs_resize_drop_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (accuracy values and skew range) ---
fs_val_acc_value = 95.0
fs_prod_acc_value = 90.0
fs_resize_drop_min_value = 0.5
fs_resize_drop_max_value = 1.0

# --- Process (accuracy drop) ---
fs_acc_drop_value = fs_val_acc_value - fs_prod_acc_value

# --- Outputs (formatted strings for prose) ---
fs_val_acc_str = f"{fs_val_acc_value:.0f}"
fs_prod_acc_str = f"{fs_prod_acc_value:.0f}"
fs_acc_drop_str = f"{fs_acc_drop_value:.0f}"
fs_resize_drop_str = f"{fs_resize_drop_min_value}-{fs_resize_drop_max_value}"
```

Engineers assume identical model weights preserve validation set performance\index{Training-Serving Skew!accuracy degradation}. In production, preprocessing differences silently shift inputs outside the training distribution. @sec-model-serving-trainingserving-skew-7b99 shows *how* training-serving skew causes accuracy degradation despite identical weights: PIL versus OpenCV resize interpolation alone can shift accuracy by `{python} fs_resize_drop_str` percent, float64 versus float32 normalization produces slightly different input values, and features computed at different times can diverge between pipelines. A model achieving `{python} fs_val_acc_str` percent validation accuracy can drop to `{python} fs_prod_acc_str` percent in production once these mismatches accumulate, a `{python} fs_acc_drop_str` percentage point loss. Because nothing crashes and nothing slows down, monitoring that checks only exceptions and latency violations never sees it. Production systems require either identical preprocessing code for training and serving, or statistical monitoring comparing input distributions to catch drift before accuracy degrades.
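
One of these mismatches is easy to observe directly. The sketch below normalizes the same synthetic image in float64 and float32 and measures the difference; the mean and standard deviation are the commonly used ImageNet statistics, chosen here only as an example, and the image is random data rather than a real input.

```python
import numpy as np

# Commonly used ImageNet channel statistics (an illustrative choice)
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize(image_u8, dtype):
    """Scale to [0, 1] and standardize per channel, entirely in the given dtype."""
    x = image_u8.astype(dtype) / dtype(255)
    return (x - np.asarray(MEAN, dtype=dtype)) / np.asarray(STD, dtype=dtype)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

train_side = normalize(image, np.float64)  # e.g., an offline pipeline
serve_side = normalize(image, np.float32)  # e.g., a serving runtime

max_diff = np.max(np.abs(train_side - serve_side.astype(np.float64)))
print(f"largest per-pixel difference: {max_diff:.2e}")
```

The per-pixel difference is tiny, but it is nonzero on nearly every input, and it stacks with larger sources such as interpolation choice. Sharing one preprocessing function between both pipelines removes the whole class of mismatch rather than bounding it.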

**Pitfall:** *Using average latency to evaluate serving system performance.*

```{python}
#| label: tail-latency-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: TAIL LATENCY AMPLIFICATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Using average latency to evaluate serving system
# │          performance"
# │
# │ Goal: Demonstrate the significant gap between mean and p99 latency in queues.
# │ Show: That at 70% utilization, average latency is ~17ms but p99 reaches
# │       ~77ms, a gap invisible to mean-based monitoring. The slowest 1% of
# │       requests often represent the highest-value users.
# │ How: Compare the M/M/1 mean time in system with its p99 approximation.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: tl_util_pct_str, tl_service_ms_str, tl_avg_ms_str,
# │          tl_p99_ms_str, tl_gap_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (utilization and service time) ---
tl_util_value = 0.7
tl_service_ms_value = 5

# --- Process (M/M/1 mean vs p99 latency) ---
# M/M/1: avg time in system = service / (1 - rho)
tl_avg_ms_value = tl_service_ms_value / (1 - tl_util_value)
# M/M/1 p99 approximation: 4.6 * avg
tl_p99_ms_value = 4.6 * tl_avg_ms_value
tl_gap_value = tl_p99_ms_value / tl_avg_ms_value

# --- Outputs (formatted strings for prose) ---
tl_util_pct_str = f"{tl_util_value * 100:.0f}"
tl_service_ms_str = f"{tl_service_ms_value}"
tl_avg_ms_str = f"{tl_avg_ms_value:.0f}"
tl_p99_ms_str = f"{tl_p99_ms_value:.0f}"
tl_gap_str = f"{tl_gap_value:.1f}"
```

Engineers monitor average latency\index{Mean Latency!monitoring pitfall} because it trends smoothly and is simple to compute. In production, averages hide the slowest requests. As @sec-model-serving-tail-latency-5376 demonstrates, at `{python} tl_util_pct_str` percent utilization with `{python} tl_service_ms_str`ms service time, average latency is `{python} tl_avg_ms_str`ms while p99 reaches `{python} tl_p99_ms_str`ms, a `{python} tl_gap_str` $\times$ gap invisible to mean-based monitoring. Teams optimizing average latency miss the tail that determines user satisfaction: the 1 percent of requests that take `{python} tl_p99_ms_str`ms or longer often include the most valuable transactions. Production SLOs specify percentile targets (p95, p99) precisely because averages mask tail behavior.
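
The gap also appears when latencies are measured rather than derived. The following sketch simulates a single-server queue with exponential arrivals and service times, then computes the statistics a dashboard would report. The simulation parameters, seed, and function name are assumptions for illustration.

```python
import numpy as np


def simulate_mm1(service_ms, utilization, n_requests=200_000, seed=0):
    """Simulate per-request time in system for a single-server FIFO queue."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(service_ms / utilization, n_requests)  # time to next arrival
    service = rng.exponential(service_ms, n_requests)
    latency = np.empty(n_requests)
    wait = 0.0
    for i in range(n_requests):
        latency[i] = wait + service[i]
        # Lindley recursion: the next request waits for whatever work is left
        wait = max(0.0, latency[i] - gaps[i])
    return latency


latency = simulate_mm1(service_ms=5.0, utilization=0.7)
mean_ms = latency.mean()
p50_ms, p95_ms, p99_ms = np.percentile(latency, [50, 95, 99])

print(f"mean {mean_ms:.1f} ms | p50 {p50_ms:.1f} | p95 {p95_ms:.1f} | p99 {p99_ms:.1f}")
```

The median sits below the mean while p99 sits several times above it, so a dashboard that plots only the mean describes a request that almost no user actually experiences.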

**Fallacy:** *Larger serving batches always improve throughput without affecting latency SLOs.*

```{python}
#| label: fallacy-batching-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACY: LARGER BATCHES ALWAYS IMPROVE THROUGHPUT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Larger serving batches always improve throughput without
# │          affecting latency SLOs"
# │
# │ Goal: Demonstrate the diminishing returns of increasing serving batch sizes.
# │ Show: That raising batch size from 16 to 32 gains only ~12% throughput while
# │       nearly doubling inference time (14ms→25ms), and that padding wastes
# │       15-30% of compute.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: fb_batch_small_str, fb_batch_large_str, fb_throughput_gain_str,
# │          fb_inf_small_ms_str, fb_inf_large_ms_str,
# │          fb_padding_waste_min_str, fb_padding_waste_max_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (batch size comparison data) ---
fb_batch_small_value = 16
fb_batch_large_value = 32
fb_throughput_small_value = 1143 # From earlier table
fb_throughput_large_value = 1280 # From earlier table
fb_inf_small_ms_value = 14
fb_inf_large_ms_value = 25

fb_padding_waste_min_pct_value = 15
fb_padding_waste_max_pct_value = 30

# --- Process (throughput gain percentage) ---
fb_throughput_gain_pct_value = (fb_throughput_large_value / fb_throughput_small_value - 1) * 100

# --- Outputs (formatted strings for prose) ---
fb_batch_small_str = f"{fb_batch_small_value}"
fb_batch_large_str = f"{fb_batch_large_value}"
fb_throughput_gain_str = f"{fb_throughput_gain_pct_value:.0f}"
fb_inf_small_ms_str = f"{fb_inf_small_ms_value}"
fb_inf_large_ms_str = f"{fb_inf_large_ms_value}"
fb_padding_waste_min_str = f"{fb_padding_waste_min_pct_value}"
fb_padding_waste_max_str = f"{fb_padding_waste_max_pct_value}"
```

Engineers maximize batch size\index{Batching!fallacy of larger batches} assuming GPU saturation improves cost efficiency under production load. In serving systems, however, batching introduces a latency-throughput tradeoff\index{Latency-Throughput Tradeoff!batching} governed by queuing dynamics absent from offline benchmarks. Accumulating requests into larger batches increases wait time for early arrivals: with a batch window of 10 ms, the first request can wait up to 10 ms before inference begins, directly adding to p99 latency. For ResNet-50 on V100, increasing batch size from `{python} fb_batch_small_str` to `{python} fb_batch_large_str` improves throughput by only `{python} fb_throughput_gain_str` percent while nearly doubling per-batch inference time from `{python} fb_inf_small_ms_str` ms to `{python} fb_inf_large_ms_str` ms. In workloads with variable input sizes, padding every input to the largest in the batch wastes a further `{python} fb_padding_waste_min_str` to `{python} fb_padding_waste_max_str` percent of compute. @sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d shows that for 50ms p99 targets, batch sizes above 32 routinely violate SLOs because batch formation delay plus increased per-batch inference time exceeds the latency budget. Serving batch optimization requires jointly tuning batch size, batch timeout, and concurrency against latency SLOs under realistic traffic patterns, not maximizing throughput in isolation.
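
The throughput and latency sides of this tradeoff can be tabulated together. The sketch uses the ResNet-50 batch timings quoted above and the 10 ms batch window from the example; the function name is illustrative, and the worst-case figure deliberately ignores any queuing before the batch window opens.

```python
def batch_profile(batch_size, inference_ms, window_ms):
    """Return (throughput in requests/s, worst-case latency in ms) for one batch."""
    throughput = batch_size / inference_ms * 1e3
    worst_case_ms = window_ms + inference_ms  # first arrival waits the full window
    return throughput, worst_case_ms


small = batch_profile(batch_size=16, inference_ms=14, window_ms=10)
large = batch_profile(batch_size=32, inference_ms=25, window_ms=10)

throughput_gain = large[0] / small[0] - 1
latency_cost = large[1] / small[1] - 1

print(f"batch 16: {small[0]:.0f} req/s, worst case {small[1]:.0f} ms")
print(f"batch 32: {large[0]:.0f} req/s, worst case {large[1]:.0f} ms")
print(f"throughput +{throughput_gain:.0%}, worst-case latency +{latency_cost:.0%}")
```

Comparing the two percentage changes side by side is a quick test for whether a larger batch is paying for itself.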

**Pitfall:** *Calibrating quantized models with training data rather than production traffic.*

```{python}
#| label: fallacy-calibration-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: CALIBRATION DATA MISMATCH
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Calibrating quantized models with training data rather
# │          than production traffic"
# │
# │ Goal: Quantify the accuracy loss from calibrating INT8 on mismatched data.
# │ Show: That calibrating on ImageNet data costs accuracy when production
# │       traffic differs (e.g., wildlife cameras): a 3.2pp drop (76.1%→72.9%)
# │       that is invisible to latency monitoring.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: fc_acc_loss_str, fc_imagenet_acc_str, fc_ood_acc_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (accuracy and calibration mismatch) ---
fc_acc_loss_pct_value = 3.2
fc_imagenet_acc_value = 76.1  # ResNet-50 INT8 on ImageNet
fc_ood_acc_value = fc_imagenet_acc_value - fc_acc_loss_pct_value

# --- Outputs (formatted strings for prose) ---
fc_acc_loss_str = f"{fc_acc_loss_pct_value}"
fc_imagenet_acc_str = f"{fc_imagenet_acc_value:.1f}"
fc_ood_acc_str = f"{fc_ood_acc_value:.1f}"
```

Teams calibrate with training data\index{Calibration!production traffic mismatch} because it is readily available and produced validation accuracy. In production, traffic distribution often differs from training data, making calibration scale factors suboptimal. Post-training quantization determines INT8 scale factors by measuring activation ranges on calibration data, but this assumes production inputs match the calibration distribution. One production system achieving `{python} fc_imagenet_acc_str` percent accuracy on ImageNet-calibrated INT8 dropped to `{python} fc_ood_acc_str` percent, a `{python} fc_acc_loss_str` percentage point loss, when serving wildlife camera images with different lighting and backgrounds. @sec-model-compression shows quantization error scales with activation range: miscalibration amplifies errors precisely on out-of-distribution inputs where activations exceed calibrated ranges. Effective quantization requires calibrating with representative samples of actual serving traffic, not convenience data.
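
The mechanism can be shown on synthetic data. In the sketch below, a symmetric INT8 scale is chosen from a calibration sample and then applied to two test sets, one matching the calibration distribution and one with a wider range. The Gaussian activations and helper names are assumptions made for illustration, not a model of any real network.

```python
import numpy as np


def int8_scale(calibration):
    """Symmetric per-tensor scale from the largest calibration magnitude."""
    return np.max(np.abs(calibration)) / 127.0


def quantize_dequantize(x, scale):
    """Round to the INT8 grid, clipping values outside the calibrated range."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


rng = np.random.default_rng(0)
calibration = rng.normal(0.0, 1.0, 10_000)  # stand-in for training-data activations
matched = rng.normal(0.0, 1.0, 10_000)      # serving traffic like the calibration set
shifted = rng.normal(0.0, 3.0, 10_000)      # serving traffic with a wider range

scale = int8_scale(calibration)
results = {}
for name, x in (("matched", matched), ("shifted", shifted)):
    mse = np.mean((x - quantize_dequantize(x, scale)) ** 2)
    clipped = np.mean(np.abs(x) > 127 * scale)
    results[name] = (mse, clipped)
    print(f"{name}: MSE {mse:.2e}, clipped {clipped:.1%}")
```

On matched data the error is limited to rounding; on shifted data, clipping dominates, which is why recalibrating on a sample of real serving traffic addresses the problem at its source.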

**Pitfall:** *Cold start latency only matters for the first request.*

```{python}
#| label: fallacy-coldstart-calc
#| echo: false
from mlsys.formatting import fmt

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PITFALL: COLD START COMPOUNDING
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Cold start latency only matters for the first request"
# │
# │ Goal: Demonstrate how cold starts compound during traffic spikes, when
# │      reliability matters most. Scaling up 10 instances with 30s TensorRT
# │      compilation each creates 300s of aggregate delay, and requests
# │      hitting cold instances see 500 ms instead of 5 ms (100×).
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: cs_new_instances_str, cs_compile_time_str, cs_aggregate_cold_str,
# │          cs_steady_latency_str, cs_cold_latency_str, cs_cold_mult_str
# └─────────────────────────────────────────────────────────────────────────────

# --- Inputs (scale-up scenario parameters) ---
cs_new_instances_value = 10
cs_compile_time_s_value = 30  # TensorRT compilation per instance
cs_steady_latency_ms_value = 5
cs_cold_latency_ms_value = 500  # First request during cold start

# --- Process (aggregate cold start and multiplier) ---
cs_aggregate_cold_s_value = cs_new_instances_value * cs_compile_time_s_value
cs_cold_multiplier_value = cs_cold_latency_ms_value / cs_steady_latency_ms_value

# --- Outputs (formatted strings for prose) ---
cs_new_instances_str = f"{cs_new_instances_value}"
cs_compile_time_str = f"{cs_compile_time_s_value}"
cs_aggregate_cold_str = f"{cs_aggregate_cold_s_value}"
cs_steady_latency_str = f"{cs_steady_latency_ms_value}"
cs_cold_latency_str = f"{cs_cold_latency_ms_value}"
cs_cold_mult_str = f"{cs_cold_multiplier_value:.0f}"
```

Engineers optimize steady-state latency\index{Cold Start!bursty traffic impact} on the assumption that most requests hit warm instances. In production, cold starts compound during the events that matter most: traffic spikes that require scale-up, deployments that roll out new versions, and recovery from instance failures. @sec-model-serving-model-loading-initialization-cc5a details the anatomy of a cold start: TensorRT compilation alone takes `{python} cs_compile_time_str` seconds per instance. During a traffic spike requiring `{python} cs_new_instances_str` new instances, aggregate cold start time reaches `{python} cs_aggregate_cold_str` seconds, summed across instances, during which the new capacity serves no traffic. Worse, requests hitting cold instances experience `{python} cs_cold_latency_str` ms latency versus `{python} cs_steady_latency_str` ms steady-state, a `{python} cs_cold_mult_str` $\times$ degradation that violates SLOs precisely when traffic is highest. Systems that ignore cold start meet SLOs during steady state but fail during scale-up events and deployment windows, when reliability matters most.
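The arithmetic of this scenario can be sketched as follows. The instance count, compile time, and latencies are the ones used above. The split between parallel and sequential startup and the `cold_fraction` values are illustrative assumptions, and the function names are hypothetical.

```python
def cold_start_summary(new_instances, compile_s, steady_ms, cold_ms):
    """Summarize a scale-up event in which every new instance must compile."""
    return {
        # Capacity-time spent compiling, summed over all new instances.
        "aggregate_s": new_instances * compile_s,
        # Wall-clock delay if all instances compile at the same time.
        "parallel_wall_clock_s": compile_s,
        # Wall-clock delay if instances are brought up one after another.
        "sequential_wall_clock_s": new_instances * compile_s,
        # Slowdown seen by a request that lands on a cold instance.
        "cold_multiplier": cold_ms / steady_ms,
    }


def tail_latency_ms(cold_fraction, steady_ms, cold_ms, percentile=0.99):
    """Latency at `percentile` when each request is either warm or cold."""
    warm_fraction = 1.0 - cold_fraction
    return steady_ms if warm_fraction >= percentile else cold_ms


summary = cold_start_summary(new_instances=10, compile_s=30, steady_ms=5, cold_ms=500)
print(summary)

# cold_fraction values are hypothetical; the text does not state one.
for cold_fraction in (0.005, 0.02):
    p99 = tail_latency_ms(cold_fraction, steady_ms=5, cold_ms=500)
    print(f"cold fraction {cold_fraction:.1%}: p99 = {p99} ms")
```

Under this two-point latency model, the tail percentile jumps from the steady-state value to the cold value as soon as the share of cold requests exceeds one minus the percentile, which is why a small fraction of cold hits is enough to break a p99 SLO.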

## Summary {#sec-model-serving-summary-9635}

Serving marks the transition from model development to production deployment, where the optimization priorities that governed training must be inverted. The shift from throughput maximization to latency minimization transforms every system design decision. The queuing theory foundations\index{Queuing Theory!serving foundations} established here reveal *why* this inversion is not merely a change in metrics but a change in the governing mathematics. The nonlinear relationship between utilization and latency means that systems behaving well at moderate load can suddenly violate SLOs when traffic increases modestly. Little's Law and the M/M/1 wait time equations provide the quantitative foundation for capacity planning, replacing intuition-based provisioning with engineering rigor.
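The nonlinearity can be made concrete with a short sketch of the standard M/M/1 mean-value formulas together with Little's Law. The 200 requests-per-second service rate (a 5 ms mean service time) and the function name `mm1` are assumptions chosen for illustration, and a real serving system may depart from M/M/1 assumptions such as exponential service times.

```python
def mm1(arrival_rate, service_rate):
    """Mean-value M/M/1 metrics; both rates are in requests per second."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("M/M/1 is stable only when 0 < arrival_rate < service_rate")
    service_time = 1.0 / service_rate
    time_in_system = 1.0 / (service_rate - arrival_rate)  # queuing plus service
    return {
        "utilization": arrival_rate / service_rate,
        "time_in_system_s": time_in_system,
        "time_in_queue_s": time_in_system - service_time,  # queuing only
        # Little's Law: mean requests in system = arrival rate * time in system.
        "requests_in_system": arrival_rate * time_in_system,
        # Time in system expressed as a multiple of the bare service time.
        "slowdown": time_in_system / service_time,
    }


SERVICE_RATE = 200.0  # hypothetical: 5 ms mean service time

for arrival_rate in (100.0, 160.0, 180.0, 190.0):
    m = mm1(arrival_rate, SERVICE_RATE)
    print(
        f"utilization {m['utilization']:.0%}: "
        f"time in system {m['time_in_system_s'] * 1e3:.1f} ms, "
        f"slowdown {m['slowdown']:.1f}x, "
        f"requests in system {m['requests_in_system']:.1f}"
    )
```

Sweeping the arrival rate toward the service rate shows the slowdown growing as the reciprocal of the remaining headroom, so each additional point of utilization costs more latency than the one before it.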

Effective serving optimization requires understanding the complete request path rather than focusing exclusively on model inference. Interface protocols like gRPC and efficient serialization formats minimize the "tax" of data movement, while preprocessing often consumes 45 to 70 percent of total latency when inference runs on optimized accelerators. The microsecond-scale overheads identified by Barroso, Patterson, and colleagues explain *why* serving latency often exceeds the sum of its measured parts, and *why* system-level optimization matters as much as model optimization. Training-serving skew represents another dimension of this complexity, silently degrading accuracy when preprocessing logic differs between training and production environments in ways that traditional testing cannot detect.

The traffic pattern analysis reveals *how* the deployment paradigm selected in @sec-ml-systems shapes every serving decision downstream. Server workloads with Poisson arrivals optimize dynamic batching windows, autonomous vehicles with streaming sensor data require synchronized batch formation, and mobile applications with single-user patterns eliminate batching entirely—each pattern a direct consequence of the physical constraints (power wall, memory wall, light barrier) that created the four paradigms in the first place. The MLPerf scenarios codify these patterns for standardized benchmarking, connecting the serving principles established here to the measurement frameworks explored in @sec-benchmarking. Node-level optimization techniques—graph compilation, operator fusion, and systematic profiling—bridge the gap between model-level decisions and hardware execution, often yielding 2–5 $\times$ additional speedup through better utilization of the accelerator's duty cycle. Precision selection and runtime optimization extend the quantization techniques from @sec-model-compression and Tensor Core capabilities from @sec-hardware-acceleration into the serving domain. Finally, the translation of these technical metrics into unit economics, as shown by the Llama-3 case study, demonstrates *how* engineering decisions regarding batching, precision, and hardware selection directly determine the financial viability of deployment—a pressure intensified by the intelligence deflation trend (@fig-intelligence-deflation) that continually compresses per-inference margins.

::: {.callout-takeaways title="Inverting Every Training Priority"}

* **Serving inverts training priorities**: Training optimizes throughput (samples/second); serving optimizes latency (ms/request). Different objectives require different system designs.
* **Queuing theory governs capacity planning**: Under M/M/1, mean time in the system (queuing plus service) is 5 $\times$ the service time at 80% utilization and 10 $\times$ at 90%. Small load increases cause disproportionate latency spikes.
* **Preprocessing dominates optimized systems**: When model inference is fast (5 ms), preprocessing (image decode, tokenization) consumes 45–70% of total latency. Optimize the pipeline, not just the model.
* **Batching strategy depends on traffic pattern**: Poisson arrivals (web APIs) use dynamic batching; streaming sensors use synchronized batches; mobile apps eliminate batching entirely.
* **Training-serving skew can degrade accuracy undetected**: Different preprocessing between training and serving (e.g., resize interpolation, normalization order) shifts inputs outside the training distribution, causing accuracy degradation that conventional monitoring cannot detect. Use identical code paths.
* **LLM serving is memory-bandwidth bound**: Each decode step reads the full model weights from VRAM, so at small batch sizes decode latency is limited by memory bandwidth rather than compute. KV cache management via PagedAttention and continuous batching is the primary throughput lever, achieving 2–4 $\times$ improvement over naive serving.
* **Precision and runtime selection directly determine infrastructure cost**: INT8 inference achieves ~3 $\times$ higher throughput than FP32, translating directly to proportionally fewer GPUs. Runtime optimization (TensorRT, ONNX Runtime) provides an additional 2–5 $\times$ speedup over framework-native serving, making these choices as impactful as model architecture decisions.

:::
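The memory-bandwidth takeaway above can be sanity-checked with a back-of-the-envelope bound. This is a sketch under stated assumptions: batch size 1, every weight read exactly once per generated token, and KV cache traffic ignored. The model size, the 2,000 GB/s bandwidth figure, and the function name `decode_latency_floor_ms` are hypothetical and do not describe any specific accelerator or the case study in this chapter.

```python
def decode_latency_floor_ms(num_params, bytes_per_param, mem_bandwidth_gb_s):
    """Lower bound on per-token decode latency when all weights are read once."""
    model_bytes = num_params * bytes_per_param
    seconds = model_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical 8-billion-parameter model on a 2,000 GB/s accelerator.
NUM_PARAMS = 8e9
BANDWIDTH_GB_S = 2000

fp16_ms = decode_latency_floor_ms(NUM_PARAMS, bytes_per_param=2, mem_bandwidth_gb_s=BANDWIDTH_GB_S)
int8_ms = decode_latency_floor_ms(NUM_PARAMS, bytes_per_param=1, mem_bandwidth_gb_s=BANDWIDTH_GB_S)

print(f"FP16 floor: {fp16_ms:.1f} ms/token ({1e3 / fp16_ms:.0f} tokens/s at batch 1)")
print(f"INT8 floor: {int8_ms:.1f} ms/token ({1e3 / int8_ms:.0f} tokens/s at batch 1)")
```

Because the bound depends only on bytes moved, halving the bytes per parameter halves the floor, and serving several requests in one decode step amortizes the same weight reads across more tokens.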

The serving principles established here (queuing theory for capacity planning, preprocessing optimization, batching strategy selection, and training-serving skew prevention) form the foundation for building production ML systems that meet real-world SLAs. Whether deploying a recommendation system serving millions of users or a medical AI where every millisecond affects patient outcomes, these principles translate mathematical understanding into engineering decisions that determine whether systems succeed or fail under load.

::: {.callout-chapter-connection title="From Node to Factory"}

This chapter engineered the single serving node: latency budgets decomposed each request, queuing theory sized the hardware, batching strategies maximized throughput, and runtime optimization extracted every available microsecond. But a single node is fragile. Models drift as the world changes. Deployments must roll out without downtime. Monitoring must detect the silent accuracy degradation that training-serving skew causes. Scaling events demand orchestration across dozens or hundreds of replicas. In @sec-ml-operations, we scale our perspective from the single request to the full system lifecycle—building the automated machinery (CI/CD pipelines, feature stores, model registries, and observability platforms) that keeps production ML systems running reliably through crashes, model drift, and continuous updates.

:::

<!-- This is here to make sure that quizzes are inserted properly before a part begins. -->
::: { .quiz-end }
:::