cs249r_book/book/quarto/contents/vol1/optimizations/model_compression.qmd

---
quiz: model_compression_quizzes.json
concepts: model_compression_concepts.yml
glossary: model_compression_glossary.json
crossrefs: model_compression_xrefs.json
engine: jupyter
---

# Model Compression {#sec-model-compression}

```{python}
#| echo: false
#| label: chapter-start
from mlsys.registry import start_chapter

start_chapter("vol1:model_compression")
```

::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::

\noindent
![](images/png/cover_model_optimizations.png){fig-alt="Construction site metaphor for model optimization showing workers with hard hats and cranes building and refining a multilayer neural network structure with scaffolding, tools, and heavy equipment."}

:::


## Purpose {.unnumbered}

\begin{marginfigure}
\mlsysstack{30}{25}{90}{25}{45}{0}{0}{10}
\end{marginfigure}

_Why do the models that win benchmarks rarely become the models that run in production?_

Training produced a capable model, yet capability alone does not guarantee deployability. Cloud, Edge, Mobile, and TinyML each impose constraints that research benchmarks ignore: memory budgets measured in megabytes rather than gigabytes, latency targets measured in milliseconds rather than seconds, power envelopes measured in milliwatts rather than kilowatts. Research optimizes for accuracy on held-out test sets; production optimizes for accuracy per dollar, accuracy per watt, accuracy per millisecond. The model that achieves state-of-the-art performance typically does so by being larger, slower, and more resource-intensive than any production constraint permits. This gap between research achievement and deployment viability is not a failure of either community but a reflection of different optimization targets.

Bridging that gap requires a systematic discipline of *compression*: trading capabilities the deployment does not need for constraints the deployment cannot violate. The key insight is that trained models are vastly over-specified for most production tasks — they carry more precision, more connections, and more capacity than the deployment context demands, and that surplus can be systematically removed. The techniques differ in what they trade away, but they share a common principle: find what the model has learned that the deployment does not need, and remove it without destroying what the deployment requires. When applied well, compression can reduce model size by one to two orders of magnitude, transforming a research artifact that runs only in a datacenter into a production asset that meets the physics of a phone, a sensor, or a microcontroller. Concretely, the discipline is not about making models smaller but about *making the right models possible* for their physical environment.

::: {.content-visible when-format="pdf"}
\newpage
:::

::: {.callout-tip title="Learning Objectives"}

- Explain the three-part optimization framework: model representation (**pruning**), numerical precision (**quantization**), and architectural efficiency (**distillation** and architecture search)
- Compare **quantization** strategies in terms of memory reduction, energy consumption, and inference accuracy
- Apply **pruning** techniques to reduce model parameters while quantifying the accuracy-sparsity trade-off
- Implement **knowledge distillation** to transfer capabilities from large teacher models to efficient student architectures
- Analyze **hardware-aware design** principles to align model operations with target platform capabilities
- Design integrated optimization pipelines combining quantization, pruning, and distillation under resource constraints

:::

```{python}
#| label: compression-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COMPRESSION SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Chapter-wide constants used across multiple sections, tables,
# │          and callouts in the Model Compression chapter
# │
# │ Goal: Centralize hardware and model parameters for the entire chapter.
# │ Show: A single source of truth for energy, scale, and cost constants.
# │ How: Retrieve constants from mlsys.constants and Digital Twins.
# │
# │ Imports: mlsys.constants (*), mlsys.formatting (fmt, sci)
# │ Exports: a100_tflops_fp16_str, a100_tflops_int8_str, a100_bw_tbs_str,
# │          a100_int8_speedup_str, int8_energy_reduction_str,
# │          energy_dram_str, energy_flop_fp32_str, energy_flop_int8_str,
# │          v100_bw_gbs_str, resnet_params_m_str, llm_7b_str,
# │          llm_175b_str, llm_175b_mem_str, smartphone_ram_str,
# │          mcu_ram_str, gpt3_training_flops_str, and others
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import *
from mlsys.formatting import fmt, check, sci

# --- Inputs (GPU specs) ---
a100_tflops_fp16_value = A100_FLOPS_FP16_TENSOR.to(TFLOPs / second).magnitude
a100_tflops_int8_value = A100_FLOPS_INT8.to(TFLOPs / second).magnitude
a100_bw_tbs_value = A100_MEM_BW.to(TB / second).magnitude
a100_int8_speedup_value = int(a100_tflops_int8_value / a100_tflops_fp16_value)

# --- Inputs (energy/perf illustrative values) ---
int8_energy_reduction_value = 20
mobilenet_int8_mj_value = 47
mobilenet_fp32_mj_value = 312
tpu_v4_tops_per_w_value = 0.9
v100_tops_per_w_value = 0.3
bandwidth_bound_speedup_value = 4

# --- Inputs (energy: multiply-add operations from constants) ---
energy_dram_value = ENERGY_DRAM_ACCESS_PJ.magnitude
energy_dram_per_byte_value = ENERGY_DRAM_PJ_PER_BYTE.magnitude
energy_flop_fp32_value = ENERGY_FLOP_FP32_PJ.magnitude
energy_flop_int8_value = ENERGY_FLOP_INT8_PJ.magnitude

# Energy for addition operations (Horowitz 2014, 45nm process)
energy_add_fp32_pj_value = ENERGY_ADD_FP32_PJ.to(ureg.picojoule).magnitude
energy_add_fp16_pj_value = ENERGY_ADD_FP16_PJ.to(ureg.picojoule).magnitude
energy_add_int32_pj_value = ENERGY_ADD_INT32_PJ.to(ureg.picojoule).magnitude
energy_add_int8_pj_value = ENERGY_ADD_INT8_PJ.to(ureg.picojoule).magnitude
energy_mul_fp32_pj_value = ENERGY_FLOP_FP32_PJ.magnitude

# INT8 vs FP32 energy ratio (MAC-to-MAC: multiply + add for each precision)
fp32_mac_pj_value = energy_mul_fp32_pj_value + energy_add_fp32_pj_value  # 3.7 + 0.9 = 4.6 pJ
int8_mac_pj_value = energy_flop_int8_value + energy_add_int8_pj_value    # 0.2 + 0.03 = 0.23 pJ
int8_fp32_energy_ratio_value = fp32_mac_pj_value / int8_mac_pj_value

# V100 specs
v100_bw_gbs_value = V100_MEM_BW.to(GB / second).magnitude
v100_tflops_fp32_value = V100_FLOPS_FP32.to(TFLOPs / second).magnitude

# Model specs
resnet_params_m_value = RESNET50_PARAMS.to(Mparam).magnitude
resnet_gflops_value = RESNET50_FLOPs.to(GFLOPs).magnitude
mobilenetv2_mflops_value = MOBILENETV2_FLOPs.to(GFLOPs).magnitude * 1000

# LLM parameter/memory calculations
llm_7b_params_value = 7
llm_7b_mem_fp16_gb_value = llm_7b_params_value * 2
llm_175b_params_value = GPT3_PARAMS.to(Bparam).magnitude
llm_175b_mem_fp16_gb_value = llm_175b_params_value * 2

# Device memory constraints
smartphone_ram_gb_value = SMARTPHONE_RAM_GB.to(GB).magnitude
mcu_ram_kb_value = MCU_RAM_KIB.to(KiB).magnitude

# GPT-3 training FLOPs
gpt3_training_flops_exp_value = 23

# --- Outputs (formatted strings for prose) ---
a100_tflops_fp16_str = fmt(a100_tflops_fp16_value, precision=0, commas=False)
a100_tflops_int8_str = fmt(a100_tflops_int8_value, precision=0, commas=False)
a100_bw_tbs_str = fmt(a100_bw_tbs_value, precision=1, commas=False)
a100_int8_speedup_str = fmt(a100_int8_speedup_value, precision=0, commas=False)
int8_energy_reduction_str = fmt(int8_energy_reduction_value, precision=0, commas=False)
mobilenet_int8_mj_str = fmt(mobilenet_int8_mj_value, precision=0, commas=False)
mobilenet_fp32_mj_str = fmt(mobilenet_fp32_mj_value, precision=0, commas=False)
tpu_v4_tops_per_w_str = fmt(tpu_v4_tops_per_w_value, precision=1, commas=False)
v100_tops_per_w_str = fmt(v100_tops_per_w_value, precision=1, commas=False)
bandwidth_bound_speedup_str = fmt(bandwidth_bound_speedup_value, precision=0, commas=False)

energy_dram_str = fmt(energy_dram_value, precision=0, commas=False)
energy_dram_per_byte_str = fmt(energy_dram_per_byte_value, precision=0, commas=False)
energy_flop_fp32_str = f"{energy_flop_fp32_value}"
energy_flop_int8_str = f"{energy_flop_int8_value}"

energy_add_fp32_str = f"{energy_add_fp32_pj_value}"
energy_add_fp16_str = f"{energy_add_fp16_pj_value}"
energy_add_int32_str = f"{energy_add_int32_pj_value}"
energy_add_int8_str = f"{energy_add_int8_pj_value}"
energy_mul_fp32_str = f"{energy_mul_fp32_pj_value}"

int8_fp32_energy_ratio_str = fmt(int8_fp32_energy_ratio_value, precision=1, commas=False)

v100_bw_gbs_str = fmt(v100_bw_gbs_value, precision=0, commas=False)
v100_tflops_fp32_str = fmt(v100_tflops_fp32_value, precision=1, commas=False)

resnet_params_m_str = fmt(resnet_params_m_value, precision=1, commas=False)
resnet_gflops_str = fmt(resnet_gflops_value, precision=1, commas=False)
mobilenetv2_mflops_str = fmt(mobilenetv2_mflops_value, precision=0, commas=False)

llm_7b_str = f"{llm_7b_params_value}"
llm_7b_mem_str = fmt(llm_7b_mem_fp16_gb_value, precision=0, commas=False)
llm_175b_str = fmt(llm_175b_params_value, precision=0, commas=False)
llm_175b_mem_str = fmt(llm_175b_mem_fp16_gb_value, precision=0, commas=False)
smartphone_ram_str = f"{smartphone_ram_gb_value}"
mcu_ram_str = f"{mcu_ram_kb_value}"
gpt3_training_flops_str = f"$3.14 \\times 10^{{{gpt3_training_flops_exp_value}}}$"
```


## Optimization Framework {#sec-model-compression-optimization-framework-9e21}

A `{python} llm_7b_str`-billion parameter language model requires `{python} llm_7b_mem_str` GB just to store its weights in FP16. Your deployment target is a smartphone with `{python} smartphone_ram_str` GB of RAM shared across the operating system, applications, and your model. *The math does not work.* No amount of clever engineering changes this arithmetic: `{python} llm_7b_mem_str` GB cannot fit in `{python} smartphone_ram_str` GB. Yet users expect the model to run: responsively, offline, without draining their battery in an hour. The gap between what training produces and what deployment permits (the Latency Budget, the maximum allowable end-to-end inference time, defined formally in @sec-model-serving) is not a minor inconvenience but a defining challenge of model compression.

Recall the **Silicon Contract** (@sec-introduction-iron-law-ml-systems-c32a), the implicit agreement every model makes with its hardware about which resource it will saturate. The three candidates are compute throughput, memory bandwidth, and memory capacity. During training, this contract is negotiated upward. Researchers select larger architectures, higher numerical precision, and deeper layers because the training environment, typically a GPU cluster with hundreds of gigabytes of memory, can afford those demands. In @sec-model-training, we used Mixed Precision (FP16) to speed up these training cycles while maintaining the ability to learn. Here, we go further—to INT8 and beyond—for inference, where we trade the ability to update weights for massive gains in execution efficiency. Deployment reverses these priorities. The production environment is smaller, power-constrained, and latency-sensitive, yet the model was designed for an environment with none of those limitations. Model compression is the systematic process of renegotiating that contract for its new execution context, reducing memory footprint, computational cost, and energy consumption while preserving the model's ability to perform its task.

The scale of this renegotiation makes model optimization an engineering discipline, not a collection of ad hoc tricks. A `{python} llm_175b_str` billion parameter model consumes over `{python} llm_175b_mem_str` GB in FP16 representation alone, yet a smartphone provides `{python} smartphone_ram_str` GB of RAM and a microcontroller offers `{python} mcu_ram_str` KB. Bridging six orders of magnitude requires systematic methods with predictable trade-offs, not trial and error. Every optimization technique removes something from the model (redundant parameters, numerical precision, or architectural complexity), and the engineer must understand exactly what is lost, what is preserved, and how these losses compose when techniques are combined.

\index{Overparameterization!systematic removal}
\index{Operator Fusion!framework introduction}
This chapter organizes these techniques along three complementary dimensions. *Structural optimization* removes redundancy from the model itself: pruning eliminates parameters that contribute little to output quality, knowledge distillation transfers a large model's learned behavior into a smaller architecture, and neural architecture search discovers designs that are inherently efficient. *Precision optimization* reduces the numerical bit-width of weights and activations, for example converting 32-bit floating point values to 8-bit integers (exploiting Tensor Cores discussed in @sec-hardware-acceleration-tensor-cores-771f), which shrinks memory footprint and accelerates arithmetic on hardware that supports lower-precision operations. *Hardware-level optimization* ensures that the resulting model executes efficiently on the target processor by fusing operations to reduce memory traffic and exploiting sparsity patterns that the hardware can accelerate. These dimensions are not alternatives but layers in an optimization stack. A practitioner deploying ResNet-50 to a mobile device might prune 50% of its filters, quantize the remaining weights to INT8, and fuse batch normalization into convolution, with each technique compounding the gains of the others.

Throughout this chapter, we ground each technique in concrete systems: ResNet-50 and MobileNetV2 (our **Lighthouse Models** from @sec-network-architectures) for vision workloads, transformer-based language models for sequence tasks, and the DS-CNN keyword spotter for TinyML deployment. These recurring models let us compare techniques under consistent conditions, making the trade-offs between accuracy, latency, memory, and energy tangible rather than abstract.

::: {.callout-definition title="Model Compression"}
***Model Compression***\index{Model Compression!definition} is the systematic renegotiation of the **Silicon Contract**. It transforms a research artifact, optimized for **Information Density**, into a deployment artifact, optimized for **Execution Efficiency**, by trading **Redundancy** and **Precision** for **Latency**, **Memory**, and **Energy** savings.
:::

The chapter follows the optimization stack from software to hardware. We begin with the *optimization framework* and *deployment context* that determine which techniques matter for a given target. We then examine each dimension in turn: @sec-model-compression-structural-optimization-ee93 (pruning, distillation, architecture search), @sec-model-compression-quantization-precision-cd46 (FP32 to INT8 and below), and @sec-model-compression-architectural-efficiency-8dd3 (operator fusion, sparsity exploitation, graph-level transformations). The chapter closes with practical guidance on selecting and composing these techniques for specific deployment constraints.

Model optimization is not a single technique but a *framework* with three complementary dimensions, each addressing different bottlenecks. These dimensions form a natural hierarchy: we first decide *what* computations the model should perform (representation), then *how precisely* to perform them (numerics), and finally *how efficiently* to execute them on physical hardware (implementation). As you trace the stack in @fig-3-sections from top to bottom, notice how each layer moves from pure software concerns toward hardware-level execution.

::: {#fig-3-sections fig-env="figure" fig-pos="htb" fig-cap="**Optimization Stack**: Model optimization progresses through three layers: efficient model representation, efficient numerics representation, and efficient hardware implementation." fig-alt="Three stacked rectangular boxes labeled from top to bottom: Efficient Model Representation, Efficient Numerics Representation, Efficient Hardware Implementation. A vertical arrow spans the stack with More software at top and More hardware at bottom."}
```{.tikz}
\resizebox{.45\textwidth}{!}{
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
  draw=black!90,
  line width=0.75pt,
  anchor=west,
  text width=54mm,align=flush center,
  minimum width=54mm, minimum height=9mm
  },
}
\node[Box,fill=red!30,anchor=south west](B1)at (0.33,0.5){Efficient Hardware Implementation};
\node[Box,fill=red!20,node distance=0.4,above=of B1](B2){Efficient Numerics Representation};
\node[Box,fill=red!10,node distance=0.4,above=of B2](B3){Efficient Model Representation};
\draw[latex-latex,line width=0.75pt](0,0)--++(90:4.85);

\node[left=1 of B1,rotate=90,anchor=north,font=\footnotesize\sf]{More hardware};
\node[left=1 of B3,rotate=90,anchor=north,font=\footnotesize\sf]{More software};
\end{tikzpicture}}
```
:::

The top layer, *efficient model representation*, focuses on eliminating redundancy in the model structure. Techniques like pruning, knowledge distillation, and Neural Architecture Search[^fn-nas] (NAS) reduce the number of parameters or operations required, addressing memory footprint and computational complexity at the algorithmic level.

[^fn-nas]: **Neural Architecture Search (NAS)**\index{NAS!etymology}: Pioneered by Barret Zoph and Quoc V. Le at Google Brain [@zoph2017neural], who used reinforcement learning to *learn* the architecture itself. The initial cost was staggering: 800 GPUs for 28 days. This expense catalyzed efficient NAS: weight-sharing approaches (ENAS, 2018) reduced search cost by 1000 $\times$, while hardware-aware NAS [@tan2019mnasnet] incorporated latency targets directly into the search objective. The resulting architectures---EfficientNet, MobileNetV3, NASNet---consistently outperform hand-designed models, demonstrating that architectural inductive biases (see @sec-network-architectures) can themselves be learned.

The middle layer, *efficient numerics representation*, optimizes how numerical values are stored and processed. Quantization and mixed-precision\index{Mixed-Precision Training} training reduce the bit-width of weights and activations (e.g., from 32-bit floating point to 8-bit integers), enabling faster execution and lower memory usage on specialized hardware.

The bottom layer, *efficient hardware implementation*, ensures operations run efficiently on target processors. Techniques like operator fusion, sparsity exploitation, and hardware-aware scheduling align computational patterns with hardware capabilities (memory hierarchy, vector units) to maximize utilization and throughput.

These dimensions are interdependent. Pruning reduces complexity but may require architectural changes for hardware efficiency. Quantization reduces precision but impacts execution logic. The most effective strategies combine techniques across all three layers. For practitioners seeking immediate guidance on which techniques to apply, @sec-model-compression-decision-framework-0d69 provides a decision framework that maps deployment constraints to specific technique recommendations. The intervening sections provide the technical foundation needed to apply that framework effectively.

To understand *why* numerics matter so deeply, consider the *physics of quantization* at the silicon level.

\index{Iron Law!quantization impact}

::: {.callout-notebook title="The Physics of Quantization"}
The **Energy-Movement Invariant** ($E_{move} \gg E_{compute}$) means that in the physics of silicon, **bits represent energy** (see @sec-machine-foundations-numerical-representations-c889 for a detailed comparison of FP32 vs. INT8 energy costs).

According to the **Iron Law** ($T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot \eta} + L_{lat}$), which decomposes execution time into data volume moved, operations performed, and fixed latency, reducing the bit-width of a weight has a quadratic effect on efficiency:

1.  **Memory Energy (Dvol)**: Fetching a 32-bit float from DRAM costs ≈ **`{python} energy_dram_str` pJ**. Fetching an 8-bit integer costs ≈ **`{python} energy_dram_per_byte_str` pJ**.
2.  **Compute Energy (O)**: A 32-bit FLOP costs ≈ **`{python} energy_flop_fp32_str` pJ**. An 8-bit integer OP costs ≈ **`{python} energy_flop_int8_str` pJ**.

| **Operation**   | **Bit-Width** | **Relative Energy** |
|:----------------|--------------:|--------------------:|
| **Integer Add** |         8-bit |          1 $\times$ |
| **Float Add**   |        32-bit |         30 $\times$ |
| **DRAM Read**   |        64-bit | **40,000 $\times$** |

**For Inference**: Moving from FP32 to INT8 doesn't just save 4 $\times$ memory; it can reduce the **energy per inference** by up to **`{python} int8_energy_reduction_str` $\times$** on hardware with dedicated INT8 units, depending on the compute-to-memory ratio of the workload.

 This is the difference between a battery lasting 1 hour or 20 hours.

These same physics apply at datacenter scale: distributed training systems use reduced precision to cut gradient communication overhead, a topic covered in @sec-model-training. For a deeper treatment of how silicon architectures exploit these energy differences, see @sec-hardware-acceleration.
:::

These energy savings explain why neural networks tolerate aggressive quantization: the energy cost of higher precision exceeds the accuracy benefit for most applications. The question then becomes how much precision we can sacrifice before accuracy collapses. Look closely at @fig-quantization-free-lunch and identify the two regimes: a "Free Lunch" zone where reducing precision has minimal impact on accuracy, and a "Cliff" where the model fails catastrophically.

```{python}
#| label: fig-quantization-free-lunch
#| echo: false
#| fig-cap: "**The Quantization Free Lunch.** Model accuracy vs. Bit-width. Most models exhibit a 'Free Lunch' plateau where reducing precision from FP32 to INT8 yields <1% accuracy loss. This robustness collapses at the 'Quantization Cliff' (typically 3-4 bits). Curves are illustrative and meant to show qualitative behavior."
#| fig-alt: "Line chart of Accuracy vs Precision (Bits). Blue line (CNN) stays flat down to 4 bits then drops. Red line (Transformer) drops earlier. Green shaded area marks the 'Free Lunch Zone' of minimal loss."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ THE QUANTIZATION FREE LUNCH (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-quantization-free-lunch in Optimization Framework section
# │
# │ Goal: Visualize the accuracy-vs-precision trade-off.
# │ Show: The "Free Lunch Zone" above 4 bits and the "cliff" below it.
# │ How: Plot model accuracy across bit-widths (FP32 to 1-bit).
# │
# │ Imports: numpy (np), mlsys (viz)
# │ Exports: (figure output only)
# └─────────────────────────────────────────────────────────────────────────────
import numpy as np
from mlsys import viz

fig, ax, COLORS, plt = viz.setup_plot()

# --- Data ---
bits = np.array([32, 16, 8, 4, 3, 2])
acc_cnn = np.array([76.1, 76.1, 76.0, 74.5, 55.0, 10.0])
acc_trans = np.array([84.0, 84.0, 83.5, 78.0, 40.0, 10.0])

# --- Plot ---
ax.plot(bits, acc_cnn, 'o-', color=COLORS['BlueLine'], label='CNN (ResNet-50)', markersize=5)
ax.plot(bits, acc_trans, 's-', color=COLORS['RedLine'], label='Transformer (BERT)', markersize=5)

ax.invert_xaxis()
ax.set_xlabel('Precision (Bits)')
ax.set_ylabel('Model Accuracy (%)')
ax.set_xticks(bits)
ax.set_xticklabels(['FP32', 'FP16', 'INT8', 'INT4', 'INT3', 'INT2'], rotation=90)

ax.axvspan(33, 7, color=COLORS['GreenL'], alpha=0.3)
ax.text(20, 50, "Free Lunch Zone\n(<1% Loss)", color=COLORS['GreenLine'], fontweight='bold', ha='center', fontsize=9)
ax.axvspan(5, 1, color=COLORS['RedL'], alpha=0.3)
ax.text(3.5, 30, "The Cliff", color=COLORS['RedLine'], fontweight='bold', ha='center', fontsize=9)
ax.legend(fontsize=8)
plt.show()
```

These physics-level savings translate directly into deployment capabilities. A model that cannot fit on a device at full precision may run comfortably, and faster, when quantized. The following calculation demonstrates this *quantization speedup* for a concrete LLM deployment scenario.

```{python}
#| label: quant-speedup-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ QUANTIZATION SPEEDUP CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Quantization Speedup"
# │
# │ Goal: Demonstrate the bandwidth-driven speedup from quantization.
# │ Show: That INT4 yields a 4× speedup for bandwidth-bound LLM generation.
# │ How: Calculate attainable throughput for FP16 vs. INT4 on a 7B model.
# │
# │ Imports: mlsys.formatting (fmt), mlsys.constants (BYTES_FP16, BYTES_INT4, byte)
# │ Exports: params_b_str, bytes_fp16_str, bytes_int4_str, device_ram_gb_str,
# │          mem_bw_gbs_str, kv_cache_gb_str, fp16_size_str, fp16_total_str,
# │          fp16_latency_str, fp16_toks_str, int4_size_str, int4_latency_str,
# │          int4_toks_str, speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import BYTES_FP16, BYTES_INT4, byte, MS_PER_SEC

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class QuantizationSpeedup:
    """
    Namespace for Quantization Speedup calculation.
    Scenario: Deploying a 7B LLM on a bandwidth-constrained device.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    params_b = 7
    bytes_fp16 = 2.0
    bytes_int4 = 0.5

    device_ram_gb = 16
    mem_bw_gbs = 50.0
    kv_cache_gb = 1.0

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Sizes
    fp16_size_gb = params_b * bytes_fp16
    fp16_total_gb = fp16_size_gb + kv_cache_gb

    int4_size_gb = params_b * bytes_int4

    # Latency (Bandwidth Bound)
    fp16_latency_ms = (fp16_size_gb / mem_bw_gbs) * 1000
    int4_latency_ms = (int4_size_gb / mem_bw_gbs) * 1000

    # Throughput (Tokens/sec)
    fp16_toks = MS_PER_SEC / fp16_latency_ms
    int4_toks = MS_PER_SEC / int4_latency_ms

    speedup = int4_toks / fp16_toks

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(3.5 <= speedup <= 4.5, f"INT4 should yield ~4x speedup vs FP16, got {speedup:.1f}x")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    params_b_str = f"{params_b}"
    bytes_fp16_str = f"{int(bytes_fp16)}"
    bytes_int4_str = f"{bytes_int4}"
    device_ram_gb_str = f"{device_ram_gb}"
    mem_bw_gbs_str = f"{int(mem_bw_gbs)}"
    kv_cache_gb_str = f"{int(kv_cache_gb)}"

    fp16_size_str = f"{int(fp16_size_gb)}"
    fp16_total_str = f"{int(fp16_total_gb)}"
    fp16_latency_str = f"{int(fp16_latency_ms)}"
    fp16_toks_str = f"{fp16_toks:.1f}"

    int4_size_str = f"{int4_size_gb:.1f}"
    int4_latency_str = f"{int(int4_latency_ms)}"
    int4_toks_str = f"{int(int4_toks)}"
    speedup_str = f"{int(speedup)}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
params_b_str = QuantizationSpeedup.params_b_str
bytes_fp16_str = QuantizationSpeedup.bytes_fp16_str
bytes_int4_str = QuantizationSpeedup.bytes_int4_str
device_ram_gb_str = QuantizationSpeedup.device_ram_gb_str
mem_bw_gbs_str = QuantizationSpeedup.mem_bw_gbs_str
kv_cache_gb_str = QuantizationSpeedup.kv_cache_gb_str
fp16_size_str = QuantizationSpeedup.fp16_size_str
fp16_total_str = QuantizationSpeedup.fp16_total_str
fp16_latency_str = QuantizationSpeedup.fp16_latency_str
fp16_toks_str = QuantizationSpeedup.fp16_toks_str
int4_size_str = QuantizationSpeedup.int4_size_str
int4_latency_str = QuantizationSpeedup.int4_latency_str
int4_toks_str = QuantizationSpeedup.int4_toks_str
speedup_str = QuantizationSpeedup.speedup_str
```

We call this phenomenon *the quantization speedup*.

::: {.callout-notebook title="The Quantization Speedup"}
**Problem**: You want to deploy a `{python} params_b_str` B parameter LLM on a device with `{python} device_ram_gb_str` GB RAM. The weights are FP16 (`{python} bytes_fp16_str` bytes).

**The Math**:

1.  **Model Size**: `{python} params_b_str` $\times 10^9$ $\times$ `{python} bytes_fp16_str` bytes = `{python} fp16_size_str` GB.
2.  **KV Cache**: Context window (4096 tokens) requires ≈ `{python} kv_cache_gb_str` GB.
3.  **Total Memory**: `{python} fp16_size_str` + `{python} kv_cache_gb_str` = `{python} fp16_total_str` GB. This barely fits, leaving no room for OS or buffers.
4.  **Bandwidth Cost**: Loading `{python} fp16_size_str` GB at `{python} mem_bw_gbs_str` GB/s takes **`{python} fp16_latency_str` ms** per token. That is `{python} fp16_toks_str` tokens/sec, too slow for chat.

**The Fix (INT4)**:

1.  **Quantization**: Convert weights to 4-bit integers (`{python} bytes_int4_str` bytes).
2.  **New Size**: `{python} params_b_str` $\times 10^9$ $\times$ `{python} bytes_int4_str` = `{python} int4_size_str` GB.
3.  **New Speed**: Loading `{python} int4_size_str` GB takes **`{python} int4_latency_str` ms**. Speed jumps to **`{python} int4_toks_str` tokens/sec**.

**The Conclusion**: Quantization is not just about "fitting" the model; it is a **`{python} speedup_str` $\times$ Linear Speedup** because LLM generation is bandwidth-bound.
:::

The relative importance of each dimension varies by deployment target. Cloud systems may tolerate larger models but demand throughput; mobile devices prioritize memory and energy; embedded systems face hard constraints on all resources simultaneously. Understanding these deployment contexts shapes which optimization dimensions to prioritize.


## Deployment Context {#sec-model-compression-deployment-context-0d88}

The optimization framework above identifies three dimensions of compression, but which dimensions matter most depends entirely on where the model will run. A datacenter GPU with 80 GB of HBM faces different binding constraints than a smartphone with shared RAM or a microcontroller with 256 KB of SRAM. @tbl-deployment-scenarios summarizes the key constraints across deployment environments.

| **Context**     | **Memory** | **Latency** | **Power** | **Primary Goal** |
|:----------------|:-----------|------------:|:----------|:-----------------|
| **Cloud**       | 10s GB     |   10–100 ms | Flexible  | Throughput, cost |
| **Mobile/Edge** | 100s MB–GB |    10–50 ms | Watts     | Size, latency    |
| **TinyML**      | KB–MB      |     1–10 ms | mW        | Size, energy     |

: **Deployment Constraints**: Each deployment context imposes different optimization priorities. {#tbl-deployment-scenarios}

### Deployment Scenarios {#sec-model-compression-deployment-scenarios-70c9}

Cloud inference centers on throughput (requests/second/dollar), where quantization enables serving more concurrent requests and operator fusion reduces per-request latency [@choudhary2020comprehensive; @dean2018new]. Mobile and edge deployments must fit device memory while meeting real-time targets. A camera app processing 30 fps has 33 ms per frame, so any optimization reducing inference below this threshold directly improves user experience.

TinyML\index{TinyML!model compression}[^fn-microcontroller-constraints] makes optimization existential, not optional. A microcontroller with 256 KB RAM cannot run a 100 MB model regardless of accuracy. The model must compress below hardware limits or deployment is impossible [@banbury2020benchmarking]. Even on mobile devices with comparatively generous resources, a single optimization technique can deliver a *4 $\times$ performance win* that means the difference between a feature that ships and one that never leaves the prototype stage.

\index{MobileNetV3!NAS-optimized architecture}

::: {.callout-example title="The 4 $\times$ MobileNet Win"}
**The Context**: A mobile app wants to add real-time "Background Blur" to video calls. The feature requires a segmentation model running at 30 FPS.

**The Bottleneck**: The unoptimized MobileNetV3 (FP32), a NAS-optimized successor to our MobileNetV2 lighthouse model, runs at 8 FPS on mid-tier Android phones. It is too slow to ship.

**The Optimization**:

1.  **Quantization**: Converting weights to INT8 reduces size by 4 $\times$ and uses the phone's DSP/NPU.
2.  **Result**: Speed jumps to 35 FPS. Energy per frame drops by 3 $\times$.

**The Business Value**: Compression did not just "optimize" the feature; it **enabled** it. Without INT8 quantization, the product simply could not exist for the target market.
:::

[^fn-microcontroller-constraints]: **Microcontroller Constraints**: Microcontrollers operate under severe constraints relative to servers and modern accelerators, often with *kilobytes to low megabytes* of RAM and limited persistent storage. A practical mental model is that you may have \(10^3\) to \(10^6\) bytes of memory available for the entire pipeline, which is why "model optimization" is often a prerequisite rather than an optional improvement in embedded deployments.

The deployment gap table below quantifies this mismatch using the Lighthouse models from @sec-ml-systems. The gap between model requirements and device capabilities explains *why* compression is not optional for resource-constrained deployment: without it, the models simply cannot run.

```{python}
#| label: model-device-comparison
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MODEL-DEVICE MEMORY COMPARISON
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-model-vs-device and deployment gap discussion
# │
# │ Goal: Contrast model requirements with device memory capacities.
# │ Show: The 6-order-of-magnitude gap from Cloud to TinyML.
# │ How: List VRAM and RAM constraints for standard hardware tiers.
# │
# │ Imports: mlsys.constants (GB, GiB, MiB, KiB, MB, KB, byte,
# │          CLOUD_MEM_GIB, MOBILE_MEM_GIB, TINY_MEM_KIB, DLRM_MODEL_SIZE_FP32)
# │ Exports: dlrm_str, gpt2_str, resnet_str, mobilenet_str, mobilenet_int8_str,
# │          dscnn_str, cloud_cap_str, mobile_cap_str, tiny_cap_str,
# │          dlrm_mobile_str, dlrm_tiny_str, gpt2_mobile_str, gpt2_tiny_str,
# │          resnet_tiny_str, mobilenet_tiny_str, mobilenet_int8_tiny_str,
# │          dscnn_tiny_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.constants import (GB, GiB, MiB, KiB, MB, KB, byte,
                             CLOUD_MEM_GIB, MOBILE_MEM_GIB, TINY_MEM_KIB,
                             DLRM_MODEL_SIZE_FP32)

# --- Inputs (device capacities and model sizes) ---
cloud_mem_value = CLOUD_MEM_GIB
mobile_mem_value = MOBILE_MEM_GIB
tiny_mem_value = TINY_MEM_KIB

dlrm_mem_value = DLRM_MODEL_SIZE_FP32
gpt2_mem_value = 6 * GiB
resnet_mem_value = 100 * MiB
mobilenet_mem_value = 14 * MiB
mobilenet_int8_mem_value = 3.5 * MiB
dscnn_mem_value = 500 * KiB

# --- Process (compute fit ratios) ---
def get_ratio(model_mem, device_mem):
    ratio = model_mem.to(byte).magnitude / device_mem.to(byte).magnitude
    if ratio < 1:
        return "ok"
    return f"no ({ratio:.0f}x)"

dlrm_mobile_value = get_ratio(dlrm_mem_value, mobile_mem_value)
dlrm_tiny_value = get_ratio(dlrm_mem_value, tiny_mem_value)

gpt2_mobile_value = get_ratio(gpt2_mem_value, mobile_mem_value)
gpt2_tiny_value = get_ratio(gpt2_mem_value, tiny_mem_value)

resnet_tiny_value = get_ratio(resnet_mem_value, tiny_mem_value)
mobilenet_tiny_value = get_ratio(mobilenet_mem_value, tiny_mem_value)
mobilenet_int8_tiny_value = get_ratio(mobilenet_int8_mem_value, tiny_mem_value)

# --- Outputs (formatted strings for prose) ---
dlrm_str = f"{dlrm_mem_value.to(GB).magnitude:.0f} GB"
gpt2_str = f"{gpt2_mem_value.to(GiB).magnitude:.0f} GB"
resnet_str = f"{resnet_mem_value.to(MiB).magnitude:.0f} MB"
mobilenet_str = f"{mobilenet_mem_value.to(MiB).magnitude:.0f} MB"
mobilenet_int8_str = f"{mobilenet_int8_mem_value.to(MiB).magnitude:.1f} MB"
dscnn_str = f"{dscnn_mem_value.to(KiB).magnitude:.0f} KB"

cloud_cap_str = f"~{cloud_mem_value.to(GiB).magnitude:.0f} GB"
mobile_cap_str = f"~{mobile_mem_value.to(GiB).magnitude:.0f} GB"
tiny_cap_str = f"~{tiny_mem_value.to(KiB).magnitude:.0f} KB"

dlrm_mobile_str = dlrm_mobile_value
dlrm_tiny_str = dlrm_tiny_value
gpt2_mobile_str = gpt2_mobile_value
gpt2_tiny_str = gpt2_tiny_value
resnet_tiny_str = resnet_tiny_value
mobilenet_tiny_str = mobilenet_tiny_value
mobilenet_int8_tiny_str = mobilenet_int8_tiny_value
dscnn_tiny_str = "ok"
```

| **Model**              | **Memory** **(Runtime)**      | **Storage** **(Weights)**     | **Cloud** **(`{python} cloud_cap_str`)** | **Mobile** **(`{python} mobile_cap_str`)** | **TinyML** **(`{python} tiny_cap_str`)** |
|:-----------------------|:------------------------------|:------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------------------|
| **DLRM**               | `{python} dlrm_str`           | `{python} dlrm_str`           | ok                                       | `{python} dlrm_mobile_str`                 | `{python} dlrm_tiny_str`                 |
| **GPT-2 XL**           | `{python} gpt2_str`           | `{python} gpt2_str`           | ok                                       | ok                                         | `{python} gpt2_tiny_str`                 |
| **ResNet-50**          | `{python} resnet_str`         | `{python} resnet_str`         | ok                                       | ok                                         | `{python} resnet_tiny_str`               |
| **MobileNetV2**        | `{python} mobilenet_str`      | `{python} mobilenet_str`      | ok                                       | ok                                         | `{python} mobilenet_tiny_str`            |
| **MobileNetV2 (INT8)** | `{python} mobilenet_int8_str` | `{python} mobilenet_int8_str` | ok                                       | ok                                         | `{python} mobilenet_int8_tiny_str`       |
| **DS-CNN (KWS)**       | `{python} dscnn_str`          | `{python} dscnn_str`          | ok                                       | ok                                         | `{python} dscnn_tiny_str`                |

: **The Deployment Gap**: Model memory requirements compared against typical device capacities. Even MobileNetV2 quantized to INT8 exceeds TinyML constraints by 7 $\times$, while the purpose-built DS-CNN keyword spotter fits comfortably. Numbers in parentheses show how many times the model exceeds device memory. {#tbl-model-vs-device}

\index{Environmental Impact!model compression benefits}
As @tbl-model-vs-device makes concrete, even aggressively optimized models like MobileNetV2 at INT8 precision exceed TinyML device memory by an order of magnitude. Optimization also contributes to sustainable and accessible AI deployment. Reducing a model's energy footprint is important as AI workloads scale, helping mitigate the environmental impact of large-scale ML training and inference [@patterson2021carbon]. At the same time, optimized models can expand the reach of machine learning, supporting applications in low-resource environments, from rural healthcare to autonomous systems operating in the field.

### Balancing Trade-offs {#sec-model-compression-balancing-tradeoffs-6ae3}
\index{Model Compression!accuracy-efficiency trade-off}
\index{Accuracy!compression trade-offs}
\index{Memory Footprint!deployment constraint}

The tension between accuracy and efficiency drives every optimization decision. Increasing model capacity generally enhances predictive performance while increasing computational cost, resulting in slower, more resource-intensive inference. The improvements introduce challenges related to memory footprint[^fn-memory-bandwidth-compression], inference latency, power consumption, and training efficiency.

[^fn-memory-bandwidth-compression]: Memory bandwidth (introduced in @sec-introduction) constrains how fast data moves between memory and processors. For compression, bandwidth differences across deployment targets matter: datacenter accelerators reach TB/s while mobile devices achieve only tens of GB/s. Compression techniques that reduce memory traffic often yield larger speedups than those that only reduce computation.

This tension manifests differently across deployment contexts. Training requires computational resources that scale with model size; inference demands strict latency and power constraints in real-time applications. Understanding where each optimization technique falls on the *compression-accuracy Pareto frontier* is essential for informed technique selection\index{Pareto Frontier!compression-accuracy}\index{Compression Ratio!accuracy trade-off}.

::: {.callout-perspective title="The Compression-Accuracy Tradeoff Curve"}
Optimization is a search for the **Pareto Frontier**.

*   **Region 1: The Free Lunch**. Techniques like "Operator Fusion" or "Dead Code Elimination" reduce latency without touching accuracy. Do these first.
*   **Region 2: The Efficient Trade**. Techniques like "INT8 Quantization" might drop accuracy by 0.5% but improve speed by 400%. This is usually a winning trade.
*   **Region 3: The Steep Drop**. Aggressive pruning (e.g., removing 90% of weights) might drop accuracy by 10% to gain another 20% speedup. This is the **danger zone** where the model becomes useless.

**Systems Rule**: Stop compressing when you hit the "knee" of the curve, where the marginal loss in accuracy exceeds the marginal gain in efficiency.
:::

@tbl-optimization-tradeoffs summarizes the key optimization techniques, their systems benefits, and their ML costs. These are empirical relationships—actual results depend on model architecture, task, and careful implementation.

| **Technique**                 | **Systems Gain**                           | **ML Cost**          | **Typical Impact**   | **Region** |
|:------------------------------|:-------------------------------------------|:---------------------|:---------------------|:----------:|
| **Operator Fusion**           | 10–30% latency reduction                   | None                 | No accuracy loss     |     1      |
| **FP32 → BF16**               | 2 $\times$ memory, ~2 $\times$ throughput  | Minimal              | <0.1% accuracy drop  |     1      |
| **FP16 → INT8**               | 2 $\times$ memory, 2–4 $\times$ throughput | Quantization error   | 0.5–1% accuracy drop |     2      |
| **50% Pruning**               | ~2 $\times$ smaller model                  | Capacity loss        | 0.5–1% accuracy drop |     2      |
| **Knowledge Distillation**    | 2–10 $\times$ smaller student              | Capability ceiling   | 1–3% accuracy drop   |     2      |
| **4-bit Quantization**        | 4 $\times$ memory reduction                | Significant error    | 2–5% accuracy drop   |    2–3     |
| **90% Pruning**               | ~10 $\times$ smaller model                 | Severe capacity loss | 5–15% accuracy drop  |     3      |
| **↑ Batch Size (8 $\times$)** | Higher throughput, better GPU util         | Generalization gap   | Requires LR scaling  |     —      |

: **The Optimization Tradeoffs.** Region 1 = Free Lunch, Region 2 = Efficient Trade, Region 3 = Danger Zone. Batch size affects training dynamics rather than model quality directly. These ranges are empirical guidelines from published benchmarks [@jacob2018quantization; @han2015deep; @hinton2015distilling]; actual results vary with architecture, task, and implementation quality. {#tbl-optimization-tradeoffs}

The table reveals a pattern: techniques that preserve model structure (fusion, precision reduction) tend to be "free" or cheap, while techniques that alter structure (pruning, distillation) extract more savings but require careful tuning. Before examining each technique in depth, verify your intuition about these trade-offs.

::: {.callout-checkpoint title="The Efficiency Frontier" collapse="false"}
Optimization is about trading one resource for another.

**Trade-offs**

- [ ] **Accuracy vs. Cost**: Do you understand why removing 50% of weights (Pruning) might drop accuracy by 1%, but reducing precision (Quantization) to INT8 might drop it by 0.5% while saving 4 $\times$ memory?
- [ ] **The "Free Lunch"**: Can you identify optimizations like **Operator Fusion** that improve speed without hurting accuracy?

**Technique Selection**

- [ ] **Pruning**: When should you prune (reduce parameters) vs. distill (train a smaller student)? (Hint: Pruning is for existing models; Distillation is for architectural changes).
:::

Each deployment context above imposes a binding constraint: memory capacity on mobile devices, latency on real-time systems, energy on battery-powered sensors. The optimization techniques that follow address these constraints at three successive levels of the stack. We begin with structural methods that modify *what* computations occur, reducing the model's parameter count and operation count to fit tighter memory and compute budgets. We then turn to precision techniques that reduce how many bits represent each value, directly shrinking memory footprint and accelerating arithmetic. Finally, we address architectural approaches that improve how efficiently the remaining operations execute on physical hardware, closing the gap between theoretical savings and measured performance.


## Structural Optimization {#sec-model-compression-structural-optimization-ee93}
\index{Model Compression!structural optimization}

Structural optimization addresses the first dimension of our framework, **Efficient Model Representation**, by modifying *what* the model computes. Modern neural networks are heavily overparameterized[^fn-gradient-checkpointing]: they carry far more parameters than any single task requires. This surplus is not a design flaw but a training necessity, since over-capacity helps optimization navigate complex loss landscapes. At deployment, however, every excess parameter translates directly into wasted memory, computation, and energy.

[^fn-gradient-checkpointing]: **Gradient Checkpointing**: Memory optimization technique that trades computation for memory by recomputing intermediate activations during backpropagation instead of storing them. Reduces memory usage by 20–50% in transformer models, enabling larger batch sizes or model sizes within same GPU memory.

\index{Conservation of Complexity!structural optimization}
Every technique in this chapter is governed by a single meta-law, analogous to conservation of energy in thermodynamics: the **Conservation of Complexity**. Just as a physical system cannot destroy energy but only convert it between forms, an ML system cannot destroy complexity but only relocate it between data, algorithm, and machine. This principle constrains all possible optimizations and explains why no compression technique achieves a free lunch. The engineer's task is to move complexity to where the cost is lowest given deployment constraints.

The challenge is removing that surplus without removing what matters, a direct manifestation of this law. You cannot destroy complexity, only move it. Pruning moves complexity from parameters to the hardware's ability to exploit sparse patterns: the model becomes simpler, but the system must now handle irregular memory access. Knowledge distillation moves complexity from inference compute to training compute: a smaller model at deployment, but a larger training budget to produce it. Neural Architecture Search moves complexity from human design effort to automated exploration: a more efficient architecture, but at the cost of a large search budget. Understanding where complexity should reside for your specific deployment target[^fn-pareto-frontier] is the central question of structural optimization.

[^fn-pareto-frontier]: **Pareto Frontier**\index{Pareto Frontier!definition}\index{Pareto Frontier!multi-objective optimization}: Named after Italian economist Vilfredo Pareto (1848-1923). In multi-objective optimization, the Pareto frontier represents the set of solutions where improving one objective (e.g., speed) necessarily requires sacrificing another (e.g., accuracy). EfficientNet traces this frontier: B0 (77.1%, 390M FLOPs) to B7 (84.4%, 37B FLOPs). Multi-objective NAS explicitly optimizes for Pareto-optimal architectures.

These three techniques address the challenge through complementary approaches. Pruning eliminates low-impact parameters from an existing model. Knowledge distillation transfers a large model's learned capabilities to a smaller architecture. NAS automates architecture design from the ground up, building optimized structures for specific constraints. In practice, these techniques are often combined: a NAS-designed architecture, distilled from a large teacher, then pruned for final deployment.

### Pruning {#sec-model-compression-pruning-d1cb}

\index{Pruning!etymology}
\index{Optimal Brain Damage!LeCun 1989}
\index{Optimal Brain Surgeon!Hassibi 1993}
\index{LeCun, Yann!optimal brain damage}
\index{Han, Song!modern pruning revival}
Consider a MobileNet trained for image classification on a wearable health monitor. The trained model occupies 14 MB, but the target microcontroller offers only 2 MB of flash memory. Retraining a smaller architecture from scratch would require weeks of data collection and validation — time the product schedule does not allow. Fortunately, analysis reveals that 85% of the model's weights hover near zero, contributing almost nothing to its predictions. Removing those weights and fine-tuning the remainder for a few epochs produces a model that fits in 2 MB with less than 1% accuracy loss. This is pruning in practice.

Pruning[^fn-pruning-etymology] directly addresses memory efficiency constraints by eliminating redundant parameters. Because neural networks carry far more weights than any single task demands (as established above), we can remove a significant fraction without substantial performance degradation. The central questions are *what* to prune (individual weights versus entire structures), *how* to decide what is expendable (magnitude, gradients, or activations), and *when* to prune (after training, during training, or even at initialization). As we will explore in @sec-hardware-acceleration, specialized hardware can further exploit the resulting sparse structures.

[^fn-pruning-etymology]: **Pruning**: Borrowed from horticulture, where gardeners prune branches to improve plant health and growth. The ML metaphor fits precisely: just as removing unproductive branches redirects resources to productive growth, removing low-magnitude weights redirects computational resources to parameters that matter. The technique dates to Yann LeCun's "Optimal Brain Damage" (Bell Labs, 1989) [@lecun1990optimal], which formalized the gardener's intuition with second-derivative analysis. Babak Hassibi and colleagues at Stanford extended this with "Optimal Brain Surgeon" (1993), using full Hessian information for more precise cuts. The field then lay largely dormant for two decades until Song Han (Stanford, later MIT) revived it for deep learning in 2015 [@han2015deep], demonstrating that modern networks could be pruned by 90% without accuracy loss—a finding that launched the modern model compression era.

::: {.callout-definition title="Pruning"}
***Pruning***\index{Pruning!definition} is the sparsification of the **Parameter Space**. It removes weights that contribute minimal information to the **Loss Landscape**, converting dense matrices into sparse structures to reduce **Memory Footprint** and, with specialized hardware, **FLOPs**, at the cost of **Regularity**.
:::

The goal of pruning is to find a sparse version of the model parameters $\hat{W}$ that minimizes the increase in prediction error (loss) while satisfying a fixed parameter budget $k$. Framing this goal mathematically clarifies both the objective and why approximate solutions are necessary:

$$
\min_{\hat{W}} \mathcal{L}(\hat{W}) \quad \text{subject to} \quad \|\hat{W}\|_0 \leq k
$$

\index{NP-hard!pruning optimization}
where $\|\hat{W}\|_0$ is the **L0-norm** (the count of non-zero parameters). Since minimizing the L0-norm is NP-hard[^fn-np-hard], we use heuristics[^fn-heuristic] like **magnitude-based pruning**. @lst-pruning_example demonstrates this approach, removing weights with small absolute values to transform a dense weight matrix into the sparse representation visualized in @fig-sparse-matrix.

[^fn-np-hard]: **NP-hard**: From computational complexity theory, "NP" stands for "nondeterministic polynomial time." A problem is NP-hard if solving it in polynomial time would imply P=NP, widely believed false. Finding the optimal sparse subnetwork requires examining exponentially many subsets, making exact solutions infeasible for networks with millions of parameters.

[^fn-heuristic]: **Heuristic**: From Greek "heuriskein" (to discover or find), the same root as "eureka." Heuristics are practical methods that find good solutions without guarantees of optimality. Magnitude-based pruning assumes larger weights are more important, a reasonable heuristic that works well empirically even though counterexamples exist.

\index{Pruning!binary mask}
\index{Hadamard Product!pruning mask}

::: {#lst-pruning_example lst-cap="**Magnitude-Based Pruning**: Removes weights below a threshold to create sparse matrices, reducing the number of nonzero parameters from 9 to 4 ($k=4$)."}
```{.python}
import torch

# Original dense weight matrix
weights = torch.tensor(
    [[0.8, 0.1, -0.7], [0.05, -0.9, 0.03], [-0.6, 0.02, 0.4]]
)

# Simple magnitude-based pruning: keep only the 4 largest weights
threshold = 0.5
mask = torch.abs(weights) >= threshold
pruned_weights = weights * mask

print("Original:", weights)
print("Pruned (4 non-zeros):", pruned_weights)
```
:::

\index{Sparse Matrix!storage efficiency}

::: {#fig-sparse-matrix fig-env="figure" fig-pos="htb" fig-cap="**Sparse Matrix Transformation**: Pruning removes small-magnitude weights (shown as white/zero in the right matrix) while preserving large-magnitude weights (shown in color), creating a sparse representation that reduces memory usage while maintaining model accuracy." fig-alt="Two 11x11 matrices side by side. Left matrix shows dense weights with colored cells indicating magnitudes. Right matrix shows sparse version with most cells white (zero) and only high-magnitude values retained in color."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}\footnotesize]
\tikzset{%
cell/.style={draw=black!80,line width=0.5pt, minimum size=\cellsize,
    minimum height=\cellheight}
}
\definecolor{Blue1}{RGB}{84,131,217}
\definecolor{Blue2}{RGB}{145,177,237}
\definecolor{Blue3}{RGB}{201,217,247}
\definecolor{Blue4}{RGB}{227,235,250}
\colorlet{Blue1}{RedFill}
\colorlet{Blue2}{RedFill}
\colorlet{Blue3}{RedFill}
\colorlet{Blue4}{RedFill}
\def\columns{3}
\def\rows{3}
\def\cellsize{8mm}
\def\cellheight{8mm}

%%LEFT
\begin{scope}[local bounding box=M1,shift={(0,0)}]
\def\columns{11}
\def\rows{11}
\def\br{M1}
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
\node[draw=black!80, fill=Blue4, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {0.01};
    }
}
%1
\foreach \c/\n/\f in {3/-1.9/Blue3,5/1.76/Blue3,8/3.75/Blue2,2/0.02/Blue4,4/0.02/Blue4,9/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-1M1){\n};
}
%2
\foreach \c/\n/\f in {1/7.93/Blue1,4/0.68/Blue3,7/-1.1/Blue3,2/0.02/Blue4,5/0.02/Blue4,9/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-2M1){\n};
}
%3
\foreach \c/\n/\f in {3/5.2/Blue2,4/0.2/Blue3,9/-6.2/Blue2,1/0.02/Blue4,5/0.02/Blue4,8/0.02/Blue4,10/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-3M1){\n};
}
%4
\foreach \c/\n/\f in {9/-2.5/Blue3,2/0.02/Blue4,7/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-4M1){\n};
}
%5
\foreach \c/\n/\f in {1/0.32/Blue3,3/-3.5/Blue3,5/0.88/Blue3,7/0.02/Blue4,9/0.02/Blue4,11/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-5M1){\n};
}
 %6
\foreach \c/\n/\f in {4/2.4/Blue3,6/-3.1/Blue2,11/8.26/Blue1,2/0.02/Blue4,3/0.02/Blue4,5/0.02/Blue4,9/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-6M1){\n};
}
%7
 \foreach \c/\n/\f in {1/0.96/Blue2,2/9.77/Blue1,3/0.92/Blue3,7/8.5/Blue1,8/6.6/Blue2}{
\node[cell,fill=\f]at(cell-\c-7M1){\n};
}
%8
\foreach \c/\n/\f in {2/0.8/Blue2,1/0.03/Blue4,4/0.03/Blue4,7/0.03/Blue4,6/0.02/Blue4,8/0.02/Blue4,9/0.02/Blue4,11/0.02/Blue4}{
\node[cell,fill=\f]at(cell-\c-8M1){\n};
}
%9
\foreach \c/\n/\f in {8/0.7/Blue3,9/14.8/Blue1,11/0.91/Blue3,2/0.02/Blue4,4/0.02/Blue4,7/0.03/Blue4}{
\node[cell,fill=\f]at(cell-\c-9M1){\n};
}
 %10
 \foreach \c/\n/\f in {7/-0.38/Blue2,11/10.1/Blue1,1/0.02/Blue4,2/0.02/Blue4,5/0.02/Blue4,10/0.03/Blue4}{
\node[cell,fill=\f,inner sep=0pt]at(cell-\c-10M1){\n};
}
 %11
 \foreach \c/\n/\f in {3/16.3/Blue1,6/2.9/Blue2,10/-5.4/Blue2,2/0.03/Blue4,4/0.03/Blue4,9/0.02/Blue4}{
\node[cell,fill=\f,inner sep=0pt]at(cell-\c-11M1){\n};
}
\end{scope}

%%RIGHT
\begin{scope}[local bounding box=M2,shift={(11,0)}]
\def\columns{11}
\def\rows{11}
\def\br{M2}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=white, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {0};
    }
}
\node[cell,fill=Blue3]at(cell-3-1M2){-1.9};
\node[cell,fill=Blue3]at(cell-5-1M2){1.76};
\node[cell,fill=Blue2]at(cell-8-1M2){3.75};
%2
\node[cell,fill=Blue1]at(cell-1-2M2){7.93};
\node[cell,fill=Blue3]at(cell-4-2M2){0.68};
\node[cell,fill=Blue3]at(cell-7-2M2){-1.1};
%3
\node[cell,fill=Blue2]at(cell-3-3M2){5.2};
\node[cell,fill=Blue3]at(cell-4-3M2){0.2};
\node[cell,fill=Blue2]at(cell-9-3M2){-6.2};
%4
 \node[cell,fill=Blue3]at(cell-9-4M2){-2.5};
%5
\node[cell,fill=Blue3]at(cell-1-5M2){0.32};
\node[cell,fill=Blue3]at(cell-3-5M2){-3.5};
\node[cell,fill=Blue3]at(cell-5-5M2){0.88};
%6
\node[cell,fill=Blue3]at(cell-4-6M2){2.4};
\node[cell,fill=Blue2]at(cell-6-6M2){-3.1};
\node[cell,fill=Blue1]at(cell-11-6M2){8.26};
%7
\node[cell,fill=Blue2]at(cell-1-7M2){0.96};
\node[cell,fill=Blue1]at(cell-2-7M2){9.77};
\node[cell,fill=Blue3]at(cell-3-7M2){0.92};
\node[cell,fill=Blue1]at(cell-7-7M2){8.5};
\node[cell,fill=Blue2]at(cell-8-7M2){6.6};
%8
 \node[cell,fill=Blue2]at(cell-2-8M2){0.8};
%9
\node[cell,fill=Blue3]at(cell-8-9M2){0.7};
\node[cell,fill=Blue1]at(cell-9-9M2){14.8};
\node[cell,fill=Blue3]at(cell-11-9M2){0.91};
%10
\node[cell,fill=Blue2]at(cell-7-10M2){-0.38};
\node[cell,fill=Blue1]at(cell-11-10M2){10.1};
%11
\node[cell,fill=Blue1]at(cell-3-11M2){16.3};
\node[cell,fill=Blue2]at(cell-6-11M2){2.9};
\node[cell,fill=Blue2]at(cell-10-11M2){-5.4};
\end{scope}
\node[above=0.2 of M1,align=center,
            font=\usefont{T1}{phv}{m}{n}\normalsize]{Weight matrix \\ (before pruning)};
\node[above=0.2 of M2,align=center,
            font=\usefont{T1}{phv}{m}{n}\normalsize]{Weight matrix \\ (after pruning -- very sparse)};
\path[draw=OrangeLine, line width=2mm, -{Triangle[length=4mm, bend]},
shorten >=1.1mm, shorten <=1.15mm](cell-11-1M1.north east) to [bend left] (cell-1-1M2.north west);
\end{tikzpicture}
```
:::

Notice how the sparse matrix on the right retains only the high-magnitude values (colored cells) while the near-zero weights become exactly zero. This transformation reveals an important property: the "important" information in neural network weights is often concentrated in a small fraction of parameters, while most weights contribute little to the final output. This observation motivates magnitude-based pruning as a practical heuristic.

To make pruning computationally feasible, practical methods often replace the hard L0 constraint with soft regularization\index{Pruning!L1 regularization} like L1-norm ($\lambda \| W \|_1$), which encourages small values that can later be thresholded to zero. Practitioners typically use **iterative pruning**, where parameters are removed in successive steps interleaved with fine-tuning\index{Fine-tuning!after pruning} to recover lost accuracy [@gale2019state; @blalock2020state].

#### Target Structures {#sec-model-compression-target-structures-1230}

The choice of what to prune depends on the deployment target's hardware constraints and which resource is the binding bottleneck.

When memory capacity is the primary constraint, as in fully connected classifiers destined for mobile deployment, neuron pruning\index{Pruning!neuron} offers the most direct relief: removing entire neurons along with their associated weights and biases reduces the width of a layer, shrinking the parameter count proportionally. Because fully connected layers dominate memory in many architectures, targeting neurons addresses the largest contributor to model size.

When inference latency on commodity accelerators is the bottleneck, channel pruning\index{Pruning!channel} (also called filter pruning) becomes the preferred approach. Eliminating entire channels or filters from convolutional layers reduces the depth of feature maps, which directly cuts the number of multiply-accumulate operations in subsequent layers. This reduction maps cleanly onto GPU and TPU execution patterns because the resulting model remains dense and regular, requiring no special sparse computation support. Channel pruning is therefore particularly effective for vision workloads where convolutional layers dominate computational cost.

When the most aggressive efficiency gains are required and the architecture has sufficient depth to absorb the loss, layer pruning\index{Pruning!layer} removes entire layers from the network. This approach yields the largest per-operation reduction because it eliminates all computation within a layer, but it also carries the highest risk: removing a layer reduces the model's representational depth, and the remaining layers must compensate for the lost capacity. Layer pruning therefore demands careful validation to ensure the model retains sufficient capacity to capture the patterns its task requires.

To see how these approaches differ in practice, compare the two sides of @fig-channel-layer-pruning. When a channel is pruned, the model's architecture must be adjusted to accommodate the structural change. Specifically, the number of input channels in subsequent layers must be modified, requiring alterations to the depths of the filters applied to the layer with the removed channel. In contrast, layer pruning removes all channels within a layer, necessitating more significant architectural modifications. In this case, connections between remaining layers must be reconfigured to bypass the removed layer. Regardless of the pruning approach, fine-tuning is important to adapt the remaining network and restore performance.

::: {#fig-channel-layer-pruning fig-env="figure" fig-pos="htb" fig-cap="**Channel vs. Layer Pruning.** Channel pruning adjusts filter sizes within layers, while layer pruning removes entire layers and necessitates reconnection of remaining network components. These approaches reduce model size and computational cost, but require fine-tuning to mitigate performance loss due to reduced model capacity." fig-alt="Side-by-side diagrams showing channel pruning (left) and layer pruning (right). Each shows three-stage CNN with feature maps as 3D blocks connected by dashed lines. Red highlights indicate pruned channels or layers."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
 Line/.style={line width=0.5pt,black!50,dashed},
 cubes/.pic={
\pgfkeys{/cubes/.cd, #1}
\begin{scope}[scale=\scalefac,every node/.style={scale=1*\scalefac}]
\pgfmathsetmacro{\cubex}{0.1}
\pgfmathsetmacro{\cubey}{1.5}
\pgfmathsetmacro{\cubez}{1.3}
\coordinate (\picname-tl) at (-\cubex,0,0); % top-left point
\coordinate (\picname-tr) at (0,0,0); % top-right point
\coordinate (\picname-br) at (0,-\cubey,0); % bottom-right point
\coordinate (\picname-bl) at (-\cubex,-\cubey,0); % bottom-left point
\coordinate (\picname-ztl) at (-\cubex,0,-\cubez); % ztop-left point
\coordinate (\picname-ztr) at (0,0,-\cubez); % ztop-right point
\coordinate (\picname-zbr) at (0,-\cubey,-\cubez); % zbottom-right point
\coordinate (\picname-zbl) at (-\cubex,-\cubey,-\cubez); %z bottom-left point
%front
\draw[draw=\drawcubecolor,fill=\cubecolor!15, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,-\cubey,0) -- ++(\cubex,0,0) -- cycle;
%right
\draw[draw=\drawcubecolor,fill=\cubecolor!30, \ifbox@dashed dashed\fi] (0,0,0) -- ++(0,0,-\cubez) -- ++(0,-\cubey,0) -- ++(0,0,\cubez) -- cycle;
%top
\draw[draw=\drawcubecolor,fill=\cubecolor!20, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,0,-\cubez) -- ++(\cubex,0,0) -- cycle;
            \end{scope}
        }
}
\newif\ifbox@dashed
\box@dashedfalse % default: not dashed

\pgfkeys{
  /cubes/.cd,
  cubecolor/.store in=\cubecolor,
  drawcubecolor/.store in=\drawcubecolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname, % ← nova linija
  cubecolor=red,
  drawcubecolor=BrownLine,
  scalefac=1,
  dashed/.is if=box@dashed,
  dashed/.default=true,
  picname=C
}
\newcommand{\Desno}[1]{
\foreach \i /\da in {1,...,9} {
   \pic at ({\i*0.22}, {-0.022*\i}) {cubes={cubecolor=BrownLine,picname=\i-cube#1}};
}
}
\newcommand{\Levo}[1]{
\foreach \i /\da in {1,2,3} {
\pic at ({\i*0.25}, {-0.025*\i}) {cubes={scalefac=1.65,cubecolor=BrownLine,picname=\i-cube#1}};
}
}
\newcommand{\Sredina}[2]{
\foreach \i /\clr/\dclr/\da in {#2} {
\pic at ({\i*0.22}, {-0.022*\i}) {cubes={scalefac=1.35, drawcubecolor=\dclr,
cubecolor=\clr,picname=\i-cube#1,\da}};
}
}
%%%%%%%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=ROW1,shift={(0,0)}]
\begin{scope}[local bounding box=G1,shift={(0,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=G2,shift={(-4,0.5)}]
\Sredina{2}{1/BrownLine/BrownLine/,
2/red/red/,
3/BrownLine/BrownLine/,
4/BrownLine/BrownLine/,
5/BrownLine/BrownLine/,
6/BrownLine/BrownLine/,
7/BrownLine/BrownLine/,
8/BrownLine/BrownLine/,
9/BrownLine/BrownLine/}
\end{scope}
\begin{scope}[local bounding box=G3,shift={(-7,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (9-cube2-br);
 \draw[Line] (1-cube1-tl) -- (9-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (9-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (9-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
\end{scope}
%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=ROW2,shift={(0,-4.5)}]
\begin{scope}[local bounding box=G1,shift={(0,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=G2,shift={(-4,0.5)}]
\Sredina{2}{1/BrownLine/BrownLine/,
2/green!30!/red/dashed,
3/BrownLine/BrownLine/,
4/BrownLine/BrownLine/,
5/BrownLine/BrownLine/,
6/BrownLine/BrownLine/,
7/BrownLine/BrownLine/,
8/BrownLine/BrownLine/,
9/BrownLine/BrownLine/}
\end{scope}
\begin{scope}[local bounding box=G3,shift={(-7,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (9-cube2-br);
 \draw[Line] (1-cube1-tl) -- (9-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (9-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (9-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
\end{scope}
%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=ROW3,shift={(0,-9)}]
\begin{scope}[local bounding box=G1,shift={(0,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=G2,shift={(-4,0.5)}]
\Sredina{2}{1/BrownLine/BrownLine/,
2/BrownLine/BrownLine/,
3/BrownLine/BrownLine/,
4/BrownLine/BrownLine/,
5/BrownLine/BrownLine/,
6/BrownLine/BrownLine/,
7/BrownLine/BrownLine/,
8/BrownLine/BrownLine/}
\end{scope}
\begin{scope}[local bounding box=G3,shift={(-7,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (8-cube2-br);
 \draw[Line] (1-cube1-tl) -- (8-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (8-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (8-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
\end{scope}
\node[draw,
      single arrow, draw=red, fill=red,rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=13mm, line width=1pt] (ST1)
      at($(ROW1.south)!0.75!(ROW2.north)$){};
\node[below right=1pt and 12pt of ST1.south,align=center,anchor=west]{Prune the selected\\ channel (in red)};
\node[draw,
      single arrow, draw=green!90!black, fill=green!90!black,rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=13mm, line width=1pt] (ST2)
      at($(ROW2.south)!0.75!(ROW3.north)$){};
\node[below right=1pt and 12pt of ST2.south,align=center,anchor=west]{Reconfigure model's\\
architecture to adjust \\ to the changes};
\node[above=2pt of ROW1]{\textbf{Channel/Filter Pruning}};
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%RIGHT
\begin{scope}[local bounding box=RROW1,shift={(11,0)}]
\begin{scope}[local bounding box=G1,shift={(0,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=G2,shift={(-4,0.5)}]
\Sredina{2}{1/red/red/,
2/red/red/,
3/red/red/,
4/red/red/,
5/red/red/,
6/red/red/,
7/red/red/,
8/red/red/,
9/red/red/}
\end{scope}
\begin{scope}[local bounding box=G3,shift={(-7,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (9-cube2-br);
 \draw[Line] (1-cube1-tl) -- (9-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (9-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (9-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
\end{scope}
%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=RROW2,shift={(11,-4.5)}]
\begin{scope}[local bounding box=RG1,shift={(0,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=RG2,shift={(-4,0.5)}]
\Sredina{2}{1/green!30!/red/dashed,
2/green!30!/red/dashed,
3/green!30!/red/dashed,
4/green!30!/red/dashed,
5/green!30!/red/dashed,
6/green!30!/red/dashed,
7/green!30!/red/dashed,
8/green!30!/red/dashed,
9/green!30!/red/dashed}
\end{scope}
\begin{scope}[local bounding box=RG3,shift={(-7,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (9-cube2-br);
 \draw[Line] (1-cube1-tl) -- (9-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (9-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (9-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
\end{scope}
%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=RROW3,shift={(11,-9)}]
\begin{scope}[local bounding box=RG1,shift={(-2,0)}]
 \Desno{1}
\end{scope}
\begin{scope}[local bounding box=RG3,shift={(-5.5,0.8)}]
\Levo{3}
\end{scope}
\draw[Line] (1-cube1-bl) -- (3-cube3-br);
 \draw[Line] (1-cube1-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube1-ztl) -- (3-cube3-ztr);
 %
\end{scope}
 \node[draw,
      single arrow, draw=red, fill=red,rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=13mm, line width=1pt] (RST1)
      at($(RROW1.south)!0.75!(RROW2.north)$){};
\node[below right=1pt and 12pt of RST1.south,align=center,anchor=west]{Prune the entire layer\\
(all channels in red)};
\node[draw,
      single arrow, draw=green!90!black, fill=green!90!black,rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=13mm, line width=1pt] (RST2)
      at($(RROW2.south)!0.75!(RROW3.north)$){};
\node[below right=1pt and 12pt of RST2.south,align=center,anchor=west]{Reconfigure model's\\
architecture to adjust \\ to the changes};
\node[above=2pt of RROW1]{\textbf{Layer Pruning}};
\draw[violet!30,line width=2pt]($(ROW1.north east)!0.5!(RROW1.north west)$)--
($(ROW3.south east)!0.26!(RROW3.south west)$);
\end{tikzpicture}
```
:::

#### Unstructured Pruning {#sec-model-compression-unstructured-pruning-1c47}

Unstructured pruning\index{Pruning!unstructured} removes individual weights while preserving the overall network architecture. Some connections become redundant during training, contributing little to the final output. Pruning these weak connections reduces memory requirements while preserving most of the model's accuracy.

Formalizing this process, let $W \in \mathbb{R}^{m \times n}$ represent a weight matrix in a given layer. Pruning removes a subset of weights by applying a binary mask $M \in \{0,1\}^{m \times n}$, yielding a pruned weight matrix:
$$
\hat{W} = M \odot W
$$
where $\odot$ represents the element-wise *Hadamard product*. The mask $M$ is constructed based on a pruning criterion, typically weight magnitude. A common approach is magnitude-based pruning, which removes a fraction $s$ of the lowest-magnitude weights. This is achieved by defining a threshold $\tau$ such that:
$$
M_{i,j} =
\begin{cases}
1, & \text{if } |W_{i,j}| > \tau \\
0, & \text{otherwise}
\end{cases}
$$
where $\tau$ is chosen to ensure that only the largest $(1 - s)$ fraction of weights remain. This method assumes that larger-magnitude weights contribute more to the network's function, making them preferable for retention.

The primary advantage of unstructured pruning is memory efficiency. By reducing the number of nonzero parameters, pruned models require less storage, which benefits deployment on embedded or mobile devices with limited memory.

Unstructured pruning does not necessarily improve computational efficiency on modern hardware, however. Standard GPUs and TPUs are optimized for dense matrix multiplications, and a sparse weight matrix often cannot fully utilize hardware acceleration unless specialized sparse computation kernels are available. Unstructured pruning therefore primarily benefits model storage rather than inference acceleration.

#### Structured Pruning {#sec-model-compression-structured-pruning-9692}

\index{Pruning!structured}
Where unstructured pruning removes individual weights, structured pruning [@li2017pruning][^fn-structured-pruning] eliminates entire computational units: neurons, filters, channels, or layers. This approach produces smaller dense models that map directly to modern machine learning accelerators. Because the resulting architecture remains fully dense, structured pruning leads to more efficient inference on general-purpose hardware than unstructured pruning, which requires specialized execution kernels to exploit its sparse weight matrices.

[^fn-structured-pruning]: **Structured Pruning**: Removing entire filters/channels rather than individual weights, enabling immediate hardware speedups without sparse computation support. ResNet-34 filter pruning achieves 50% FLOP reduction with 1.0% accuracy loss; MobileNetV2 channel pruning yields 3.2 $\times$ faster ARM inference at 96.5% accuracy. Importance metrics include magnitude, gradient, and Taylor expansion.

Neurons, filters, and layers vary dramatically in their contribution to a model's predictions. Some units primarily carry redundant or low-impact information, and removing them does not significantly degrade model performance. Identifying which structures can be pruned while preserving accuracy remains the core challenge.

\index{Pruning!regularity-compression trade-off}
Hardware-aware pruning\index{Pruning!hardware-aware} strategies, such as N:M structured sparsity\index{Sparsity!N:M structured}, enforce specific patterns (e.g., ensuring 2 out of every 4 weights are zero) to align with specialized accelerator capabilities. The hardware implementation details of these patterns, including how they leverage sparse tensor cores, are covered in @sec-hardware-acceleration.

To ground these distinctions, examine @fig-structured-unstructured from left to right. On the left, unstructured pruning removes individual weights (depicted as dashed connections), creating a sparse weight matrix. This can disrupt the original network structure, as shown in the fully connected network where certain connections have been randomly pruned. While this reduces the number of active parameters, the resulting sparsity requires specialized execution kernels to fully utilize computational benefits.

::: {#fig-structured-unstructured fig-env="figure" fig-pos="htb" fig-cap="**Unstructured vs. Structured Pruning.** Unstructured pruning (left) achieves sparsity by removing individual weights, requiring specialized hardware, while structured pruning (middle, right) removes entire neurons or filters, preserving network structure for standard hardware acceleration. Source: [@qi2021efficient]." fig-alt="Three-panel diagram. Left shows unstructured pruning with dashed connections in a neural network. Middle and right show structured pruning: fully connected network with pruned neurons and CNN with pruned filters shown as dashed squares."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
 Line/.style={line width=0.5pt,black!50,text=black},
  LineD/.style={line width=0.5pt,black!50,text=black,dashed},
}
\newif\ifbox@dashed
\box@dashedfalse % default: not dashed

\tikzset{
channel/.pic={
\pgfkeys{/channel/.cd, #1}
\node[rectangle,draw=\channelcolor,line width=1pt,fill=\channelcolor!10,
minimum size=56,\ifbox@dashed dashed\fi](\picname){};
\node[rectangle,draw=BrownLine,line width=0.5pt,fill=white,
minimum size=18](\smallpicname){};
        }
}

\tikzset{
circles/.pic={
\pgfkeys{/channel/.cd, #1}
\node[circle,draw=\channelcolor,line width=1pt,fill=\channelcolor!10,
minimum size=9mm,\ifbox@dashed dashed\fi](\picname){};
        }
}

\tikzset{
channelw/.pic={
\pgfkeys{/channel/.cd, #1}
\node[rectangle,draw=\channelcolor,line width=1pt,fill=\channelcolor!10,
minimum size=56,\ifbox@dashed dashed\fi](\picname){};
        }
}

\pgfkeys{
  /channel/.cd,
  channelcolor/.store in=\channelcolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname,
  smallpicname/.store in=\smallpicname,
  channelcolor=BrownLine,
  scalefac=1,
  dashed/.is if=box@dashed,
  dashed/.default=true,
  picname=C
}

\begin{scope}[local bounding box=CHANEL1,shift={(0,0)}]
\foreach \i/\da in {1/dashed,2/,3/dashed,4/} {
\pic at ({-\i*0.8}, {-0.8*\i}) {channel={picname=\i-CH1,smallpicname=\i-SCH1,\da}};
}
\end{scope}

\begin{scope}[local bounding box=CHANEL2,shift={(4.5,0)}]
\foreach \i/\da in {2/dashed,3/} {
\pic at ({-\i*0.8}, {-0.8*\i}) {channelw={picname=\i-CH2,smallpicname=\i-SCH2,\da}};
}
\end{scope}
\node[below =5pt of CHANEL2,align=center]{Convolutional\\ neural network};
\draw[Line](4-SCH1.center)--++(120:3.2)coordinate(CE1);
\draw[Line](2-SCH1.north)--(CE1);
\draw[Line](1-SCH1.north)--(CE1)node[above,align=center,text=black]{Convolutional\\ kernel};
%%
\coordinate(CE2)at ($(3-CH2.north west)!0.35!(3-CH2.south east)$);
\coordinate(CE3)at ($(3-CH2.north east)!0.2!(3-CH2.south west)$);
\coordinate(CE4)at ($(1-CH1.north east)!0.15!(1-CH1.south west)$);

\draw[Line](4-SCH1.north east)--(CE2);
\draw[Line](4-SCH1.south east)--(CE2);
\foreach \i in {1,2,3}{
\draw[Line](\i-SCH1.east)--(CE2);
}
\draw[Line](CE3)--++(80:1.8)node[above]{Channels}--(CE4);
%%
\begin{scope}[local bounding box=CIRCLE1,shift={($(CHANEL1)+(-5.6,0)$)}]
\foreach \i/\da in {1/,2/dashed,3/} {
  \pgfmathsetmacro{\y}{(2-\i)*1.5}
  \pic at (0,\y) {circles={channelcolor=OrangeLine,picname=1CL\i,\da}};
}
%right -2 neurons
\foreach \j/\da in {1/dashed,2/} {
  \pgfmathsetmacro{\y}{(1-\j)*1.5 + 0.6}
  \pic at (1.8,\y) {circles={channelcolor=OrangeLine,picname=1CR\j,\da}};
}
\end{scope}

\draw[Line](1CL3)--(1CR2);
\draw[Line](1CL1)--(1CR2);
\foreach \i in {1,2,3}{
  \foreach \j in {1,2}{
\draw[LineD](1CL\i)--(1CR\j);
}}

\scoped[on background layer]
\node[draw=BlueLine,inner xsep=8,inner ysep=9,yshift=-2mm,
minimum height=57mm,
           fill=BlueL!20,fit=(CIRCLE1)(CHANEL1)(CHANEL2),line width=1.0pt](BB1){};
\node[above=2pt of BB1.south,anchor=south]{Structured pruning};
%%
\begin{scope}[local bounding box=CIRCLE2,shift={($(CIRCLE1)+(-4.6,0)$)}]
\foreach \i/\da in {1/,2/,3/} {
  \pgfmathsetmacro{\y}{(2-\i)*1.5}
  \pic at (0,\y) {circles={channelcolor=OrangeLine,picname=2CL\i,\da}};
}
%right -2 neurons
\foreach \j/\da in {1/,2/} {
  \pgfmathsetmacro{\y}{(1-\j)*1.5 + 0.6}
  \pic at (1.8,\y) {circles={channelcolor=OrangeLine,picname=2CR\j,\da}};
}
\draw[Line](2CL3)--(2CR1);
\draw[Line](2CL1)--(2CR2);
\draw[Line](2CL2)--(2CR2);

\foreach \i in {1,2,3}{
  \foreach \j in {1,2}{
\draw[LineD](2CL\i)--(2CR\j);
}}
\end{scope}
\scoped[on background layer]
\node[draw=OliveLine,inner xsep=10,inner ysep=9,yshift=-2mm,
minimum height=57mm,
           fill=yellow!10,fit=(CIRCLE2),line width=1.0pt](BB1){};
\node[above=2pt of BB1.south,anchor=south]{Unstructured pruning};
\end{tikzpicture}
```
:::

In contrast, structured pruning (depicted in the middle and right sections of @fig-structured-unstructured) removes entire neurons or filters while preserving the network's overall structure. In the middle section, a pruned fully connected network retains its fully connected nature but with fewer neurons. On the right, structured pruning is applied to a CNN by removing convolutional kernels or entire channels (dashed squares). This method maintains the CNN's core convolutional operations while reducing the computational load, making it more compatible with hardware accelerators.

A common approach to structured pruning is magnitude-based pruning\index{Pruning!magnitude-based}, where entire neurons or filters are removed based on the magnitude of their associated weights. The intuition is that parameters whose magnitude falls below the layer's pruning threshold contribute negligibly to the model's output, making them candidates for elimination. The importance of a neuron or filter is measured using a norm function, such as the $\ell_1$-norm or $\ell_2$-norm, applied to the weights associated with that unit. If the norm falls below a predefined threshold, the corresponding neuron or filter is pruned. This method is straightforward to implement and requires no additional computational overhead beyond computing norms across layers.

\index{Pruning!activation-conditioned}
Another strategy is activation-based pruning\index{Pruning!activation-based}, which evaluates the average activation values of neurons or filters over a dataset. Neurons that consistently produce low activations contribute less information to the network's decision process and can be safely removed. This method captures the dynamic behavior of the network rather than relying solely on static weight values. Activation-based pruning requires profiling the model over a representative dataset to estimate the average activation magnitudes before making pruning decisions.

Gradient-based pruning\index{Pruning!gradient-based} uses information from the training process to identify less significant neurons or filters. Units with smaller gradient magnitudes contribute less to reducing the loss function, making them candidates for removal. By ranking neurons based on their gradient values, structured pruning can remove those with the least impact on model optimization. Unlike magnitude-based or activation-based pruning, which rely on static properties of the trained model, gradient-based pruning requires access to gradient computations and is typically applied during training rather than as a post-processing step.

These three methods form a progression from static to dynamic assessment of parameter importance, and each presents distinct trade-offs. Magnitude-based pruning is computationally inexpensive and straightforward to implement, making it the default starting point, but it does not account for how neurons behave across different data distributions. Activation-based pruning captures more of this dynamic behavior by evaluating neurons over representative inputs, though it requires additional computation to estimate neuron importance. Gradient-based pruning leverages training dynamics most directly but may introduce prohibitive complexity for large-scale models. In practice, the choice depends on the specific constraints of the target deployment environment: magnitude-based methods suffice for most production scenarios, while gradient-based approaches justify their overhead only when accuracy preservation is paramount.

#### Dynamic Pruning {#sec-model-compression-dynamic-pruning-b794}

Traditional pruning methods, whether unstructured or structured, involve static pruning\index{Pruning!static}: parameters are permanently removed after training or at fixed intervals during training, assuming that parameter importance is fixed. Dynamic pruning\index{Pruning!dynamic} relaxes this assumption by adapting pruning decisions based on input data or training dynamics, allowing the model to adjust its structure in real time.

Dynamic pruning can be implemented using runtime sparsity techniques, where the model actively determines which parameters to utilize based on input characteristics. Activation-conditioned pruning exemplifies this approach by selectively deactivating neurons or channels that exhibit low activation values for specific inputs [@dynamicpruning2023]. This method introduces input-dependent sparsity patterns, effectively reducing the computational workload during inference without permanently modifying the model architecture.

For instance, consider a convolutional neural network processing images with varying complexity. During inference of a simple image containing mostly uniform regions, many convolutional filters may produce negligible activations. Dynamic pruning identifies these low-impact filters and temporarily excludes them from computation, improving efficiency while maintaining accuracy for the current input. This adaptive behavior is particularly advantageous in latency-sensitive applications, where computational resources must be allocated judiciously based on input complexity. @sec-benchmarking presents measurement strategies for evaluating such efficiency gains.

Another class of dynamic pruning operates during training, gradually introducing and adjusting sparsity throughout the optimization process. Methods such as gradual magnitude pruning start with a dense network and progressively increase the fraction of pruned parameters as training progresses. Instead of permanently removing parameters, these approaches allow the network to recover from pruning-induced capacity loss by regrowing connections that prove to be important in later stages of training.

Dynamic pruning offers several advantages over its static counterpart. By allowing models to adapt to different workloads, it improves efficiency while maintaining accuracy across a wider range of inputs. Where static pruning risks over-pruning and permanently degrading performance, dynamic pruning can selectively reactivate parameters when they prove necessary for a particular input. The cost of this flexibility is additional computational overhead, as pruning decisions must be made in real time during training or inference, making dynamic pruning harder to integrate into standard machine learning pipelines. The kind of sophisticated production deployment strategies and monitoring frameworks required are covered in @sec-ml-operations.

Despite these challenges, dynamic pruning excels in edge computing and efficient AI contexts discussed in @sec-introduction, where resource constraints and real-time efficiency requirements vary across inputs.

#### Pruning Trade-offs {#sec-model-compression-pruning-tradeoffs-fada}

The three pruning approaches represent distinct positions on the regularity-versus-compression trade-off. Unstructured pruning achieves the highest compression ratios because it can remove any individual weight, but the resulting irregular sparsity patterns are difficult for hardware to exploit: accelerators optimized for dense matrix operations[^fn-flops] cannot skip individual zero values without specialized sparse execution kernels. Structured pruning sacrifices some compression potential by removing entire channels, filters, or layers; the resulting dense sub-network runs efficiently on commodity hardware without sparse computation support. Dynamic pruning adapts pruning decisions to each input at runtime, offering the most flexibility at the cost of implementation complexity and computational overhead. @tbl-pruning formalizes these comparisons across the dimensions that matter most for deployment.

[^fn-flops]: **FLOPs (Floating-Point Operations)**: Computational complexity metric counting multiply-add operations. ResNet-50 requires approximately `{python} resnet_gflops_str` billion FLOPs per inference [@he2016deep], GPT-3 training required an estimated 3.14E23 FLOPs [@patterson2021carbon]. Modern GPUs achieve 100-300 TFLOPS (trillion FLOPs/second), making FLOP reduction important for efficiency.

| **Aspect**                 | **Unstructured Pruning**                                                                      | **Structured Pruning**                                                    | **Dynamic Pruning**                                    |
|:---------------------------|:----------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------|:-------------------------------------------------------|
| **What is removed?**       | Individual weights in the model                                                               | Entire neurons, channels, filters, or layers                              | Adjusts pruning based on runtime conditions            |
| **Model structure**        | Sparse weight matrices; original architecture remains unchanged                               | Model architecture is modified; pruned layers are fully removed           | Structure adapts dynamically                           |
| **Impact on memory**       | Reduces model storage by eliminating nonzero weights                                          | Reduces model storage by removing entire components                       | Varies based on real-time pruning                      |
| **Impact on computation**  | Limited; dense matrix operations still required unless specialized sparse computation is used | Directly reduces FLOPs and speeds up inference                            | Balances accuracy and efficiency dynamically           |
| **Hardware compatibility** | Sparse weight matrices require specialized execution support for efficiency                   | Works efficiently with standard deep learning hardware                    | Requires adaptive inference engines                    |
| **Fine-tuning required?**  | Often necessary to recover accuracy after pruning                                             | More likely to require fine-tuning due to larger structural modifications | Adjusts dynamically, reducing the need for fine-tuning |
| **Use cases**              | Memory-efficient model compression for cloud deployment                                       | Real-time inference optimization, mobile/edge AI, and efficient training  | Adaptive AI applications, real-time systems            |

: **Pruning Strategies**: Unstructured, structured, and dynamic pruning each modify model weights differently, impacting both model size and computational efficiency. Unstructured pruning offers the greatest compression but requires specialized hardware, while dynamic pruning adapts to input data for a balance between accuracy and resource usage. {#tbl-pruning}

#### Pruning Strategies {#sec-model-compression-pruning-strategies-03f7}

Beyond the broad categories of unstructured, structured, and dynamic pruning, different pruning workflows can impact model efficiency and accuracy retention. Two widely used pruning strategies are iterative pruning and one-shot pruning, each with distinct benefits and trade-offs.

##### Iterative Pruning {#sec-model-compression-iterative-pruning-ee12}

Iterative pruning\index{Pruning!iterative} removes structure gradually through multiple cycles of pruning followed by fine-tuning. During each cycle, the algorithm removes a small subset of structures based on predefined importance metrics. The model then undergoes fine-tuning to adapt to these structural modifications before proceeding to the next pruning iteration. This gradual approach helps prevent sudden drops in accuracy while allowing the network to progressively adjust to reduced complexity.

Follow the three rows of @fig-iterative-pruning to see this gradual process in action on a convolutional neural network where six channels are pruned. Rather than removing all channels simultaneously, iterative pruning eliminates two channels per iteration over three cycles. Following each pruning step, the model undergoes fine-tuning to recover performance. The first iteration, which removes two channels, results in an accuracy decrease from 0.995 to 0.971, but subsequent fine-tuning restores accuracy to 0.992. After completing two additional pruning-tuning cycles, the final model achieves 0.991 accuracy, which represents only a 0.4% reduction from the original, while operating with 27% fewer channels. By distributing structural modifications across multiple iterations, the network maintains its performance capabilities while achieving improved computational efficiency.

::: {#fig-iterative-pruning fig-env="figure" fig-pos="htb" fig-cap="**Iterative Pruning Performance**: Three rows depict successive prune-then-fine-tune cycles, each removing two of the original 22 channels. Accuracy drops from 0.995 to 0.971 after the first prune, recovers to 0.992 after fine-tuning, and settles at 0.991 after all three cycles, a 0.4% loss with 27% fewer channels." fig-alt="Three-row workflow showing iterative pruning. Each row displays CNN architecture, prune step with red arrow, accuracy drop box, fine-tune gears icon, and accuracy recovery. Values progress from 0.995 to 0.991 final accuracy."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{
 Line/.style={line width=0.5pt,black!50,dashed},
 cubes/.pic={
\pgfkeys{/cubes/.cd, #1}
\begin{scope}[scale=\scalefac,every node/.style={scale=1*\scalefac}]
\pgfmathsetmacro{\cubex}{0.08}
\pgfmathsetmacro{\cubey}{1.6}
\pgfmathsetmacro{\cubez}{1.6}
%front
\coordinate (\picname-tl) at (-\cubex,0,0); % top-left point
\coordinate (\picname-tr) at (0,0,0); % top-right point
\coordinate (\picname-br) at (0,-\cubey,0); % bottom-right point
\coordinate (\picname-bl) at (-\cubex,-\cubey,0); % bottom-left point
\coordinate (\picname-ztl) at (-\cubex,0,-\cubez); % ztop-left point
\coordinate (\picname-ztr) at (0,0,-\cubez); % ztop-right point
\coordinate (\picname-zbr) at (0,-\cubey,-\cubez); % zbottom-right point
\coordinate (\picname-zbl) at (-\cubex,-\cubey,-\cubez); %z bottom-left point
\draw[draw=\cubecolor,fill=\cubecolor!15, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,-\cubey,0) -- ++(\cubex,0,0) -- cycle;
%right
\draw[draw=\cubecolor,fill=\cubecolor!30, \ifbox@dashed dashed\fi] (0,0,0) -- ++(0,0,-\cubez) -- ++(0,-\cubey,0) -- ++(0,0,\cubez) -- cycle;
%top
\draw[draw=\cubecolor,fill=\cubecolor!20, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,0,-\cubez) -- ++(\cubex,0,0) -- cycle;
            \end{scope}
        }
}
\newif\ifbox@dashed
\box@dashedfalse % default: not dashed

\pgfkeys{
  /cubes/.cd,
  cubecolor/.store in=\cubecolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname, % ← nova linija
  cubecolor=red,
  scalefac=1,
  dashed/.is if=box@dashed,
  dashed/.default=true,
  picname=C
}
\newcommand{\Iteration}[8]{%
\begin{scope}[local bounding box=G1,shift={(0,0)}]
\foreach \i in {#1} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==#2 || \i==#3 ,
      "red", "BrownLine")}
\pic at ({\i*0.15}, {-0.02*\i}) {cubes={cubecolor=\colorname,picname=\i-cube1}};
}
\end{scope}

\begin{scope}[local bounding box=G2,shift={(-2.75,0.4)}]
\foreach \i in {#5} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==#6 || \i==#7,
      "red", "BrownLine")}
\pic at ({\i*0.18}, {-0.02*\i}) {cubes={scalefac=1.35,cubecolor=\colorname,picname=\i-cube2}};
}
\end{scope}

\begin{scope}[local bounding box=G3,shift={(-5.5,0.6)}]
\foreach \i in {1,2,3} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==13 || \i==14,"red", "BrownLine")}
\pic at ({\i*0.30}, {-0.02*\i}) {cubes={scalefac=1.5,cubecolor=\colorname,picname=\i-cube3}};
}
\end{scope}

\begin{scope}[local bounding box=G4,shift={(-7.75,1.0)}]
\foreach \i in {1} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==13 || \i==14,"red", "BrownLine")}
\pic at ({\i*0.25}, {-0.02*\i}) {cubes={scalefac=1.8,cubecolor=\colorname,picname=\i-cube4}};
}
\end{scope}
\draw[Line] (1-cube1-bl) -- (6-cube2-br);
 \draw[Line] (1-cube1-tl) -- (6-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (6-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (6-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
 %
  \draw[Line] (1-cube3-bl) -- (1-cube4-br);
 \draw[Line] (1-cube3-tl) -- (1-cube4-tr);
 \scoped[on background layer]
\draw[Line] (1-cube3-zbl) -- (1-cube4-zbr);
 \draw[Line] (1-cube3-ztl) -- (1-cube4-ztr);

\scoped[on background layer]
\node[draw=BlueLine,inner xsep=8,inner ysep=14,yshift=0mm,
           fill=BlueL!10,fit=(G1)(G4),line width=1.0pt](BB1){};
\node[fill=BlueL,below left=5.5pt and 11pt of  BB1.north east,anchor=north east,align=center]{
Starting Accuracy:\\ \textbf{#4}};
\node[above=1pt of  BB1.north west,anchor=south west,align=left]{\large #8};
}

%%%%%%
% #1 number of teeth
% #2 radius intern
% #3 radius extern
% #4 angle from start to end of the first arc
% #5 angle to decale the second arc from the first
% #6 inner radius to cut off
\newcommand{\gear}[6]{%
  (0:#2)
  \foreach \i [evaluate=\i as \n using {\i-1)*360/#1}] in {1,...,#1}{%
    arc (\n:\n+#4:#2) {[rounded corners=1.5pt] -- (\n+#4+#5:#3)
    arc (\n+#4+#5:\n+360/#1-#5:#3)} --  (\n+360/#1:#2)
  }%
  (0,0) circle[radius=#6];
}
\newcommand{\Test}[2]{%
\begin{scope}[local bounding box=GEAR,shift={($(BB1.east)+(8,0)$)},
scale=1.0, every node/.append style={transform shape}]
\def\ra{20mm}
\tikzset{%
 Arrow/.style={-{Triangle[width=15pt,length=8pt]}, line width=7pt,}
}
\draw[Arrow,violet!60] (-80:0.5*\ra)
arc[radius=0.5*\ra, start angle=-80, end angle= 80]coordinate(K1);
\draw[Arrow,orange!80!black!90] (100:0.5*\ra)
arc[radius=0.5*\ra, start angle=100, end angle= 260]coordinate(K2);
\node[circle,minimum size=\ra](KR){};

\fill[draw=none,fill=black,even odd rule,xshift=-2mm]\gear{10}{0.23}{0.28}{10}{2}{0.1};
\fill[draw=none,fill=black,even odd rule,xshift=3mm,yshift=2mm]\gear{10}{0.18}{0.22}{10}{2}{0.08};

\scoped[on background layer]
\node[draw=BlueLine,minimum width=27mm,minimum height=29mm,
           fill=BlueL!20,fit=(K1)(K2)(KR),line width=1.0pt](BB3){};
\end{scope}
%
\begin{scope}[local bounding box=TA1,shift={($(BB1.east)+(3.25,0)$)},
scale=1.0, every node/.append style={transform shape}]
\node[draw=BrownLine,
minimum width=27mm,minimum height=29mm,
           fill=brown!10,line width=1.0pt](BB2){};
\node[below=0.2 of BB2.north](TTA1){\textbf{Test Accuracy:}};
\node[below right= 0.7 and -0.6 of TTA1, draw,
      single arrow, draw=red, fill=red, rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST1) {};
\node[left=0.3 of ST1,anchor=north east](BR1){\large\textbf{#1}};
\end{scope}
%%
\begin{scope}[local bounding box=TA2,shift={($(BB3.east)+(2.5,0)$)},
scale=1.0, every node/.append style={transform shape}]
\node[draw=GreenLine,
minimum width=27mm,minimum height=29mm,
           fill=yellow!10,line width=1.0pt](BB4){};
\node[below=0.2 of BB4.north](TTA1){\textbf{Test Accuracy:}};
\node[below right= 1.1 and -0.9 of TTA1, draw,
      single arrow, draw=GreenLine, fill=GreenL!90!black, rotate=90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST1) {};

\node[left=0.3 of ST1,anchor=south east](BR1){\large\textbf{#2}};
\end{scope}
     \node[draw,
      single arrow, draw=red, fill=red,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=16mm, line width=1pt] (ST1)
      at($(BB1.east)!0.5!(BB2.west)$){};
      \node[draw,
      single arrow, draw=VioletLine, fill=VioletL,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=16mm, line width=1pt] (ST2)
      at($(BB2.east)!0.5!(BB3.west)$){};
      \node[draw,
      single arrow, draw=GreenLine, fill=GreenL!90!black,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST3)
      at($(BB3.east)!0.5!(BB4.west)$){};
\node[align=center,above=3pt of ST1]{Prune\\ selected\\ channels};
\node[align=center,above=3pt of ST2]{Fine-tune\\ on new\\ structure};
}

%%%%%%%%%%%%%%%%%%%%
%#1 number of plants - first group
%#2 and #3 red plants - first group
%#4 starting accuracy - first group
%#5 number of plants - second group
%#6 and #7 red plants - second group
%#8 iteration name
%\Iteration{#1}{#1}{#2}{#3}{#4}{#5}{#6}{#7}{#8}
%%%%%%%%%%%%%%%%%%%%%%%
\begin{scope}[local bounding box=ROW1,shift={(0,0)}]
\Iteration{1,...,12}{3}{4}{0.995}{1,...,6}{53}{54}{1st Iteration}\Test{0.971}{0.992};
\end{scope}
\begin{scope}[local bounding box=ROW2,shift={(0,-6)}]
\Iteration{1,2,5,6,...,12}{3}{4}{0.992}{1,...,6}{3}{4}{2nd Iteration}\Test{0.956}{0.993};
\end{scope}
\begin{scope}[local bounding box=ROW3,shift={(0,-12)}]
\Iteration{1,2,5,6,...,12}{9}{10}{0.993}{1,2,5,6}{3}{4}{3rd Iteration}\Test{0.967}{0.991};
\end{scope}
\end{tikzpicture}
```
:::

##### One-shot Pruning {#sec-model-compression-oneshot-pruning-12a1}

One-shot pruning\index{Pruning!one-shot} removes multiple architectural components in a single step, followed by an extensive fine-tuning phase to recover model accuracy. This aggressive approach compresses the model quickly but risks greater accuracy degradation, as the network must adapt to significant structural changes simultaneously.

Consider applying one-shot pruning to the same network from the iterative pruning example. Instead of removing two channels at a time over multiple iterations, one-shot pruning eliminates all six channels simultaneously. Compare the single-row workflow in @fig-oneshot-pruning to the iterative case: removing 27% of the network's channels simultaneously causes the accuracy to drop significantly, from 0.995 to 0.914. Even after fine-tuning, the network only recovers to an accuracy of 0.943, which is a 5% degradation from the original unpruned network. While both iterative and one-shot pruning ultimately produce identical network structures, the gradual approach of iterative pruning better preserves model performance.

::: {#fig-oneshot-pruning fig-env="figure" fig-pos="htb" fig-cap="**One-Shot Pruning Impact**: All six channels (27%) are removed simultaneously, causing accuracy to drop from 0.995 to 0.914. Fine-tuning recovers only to 0.943, a 5% degradation compared to the 0.4% loss from iterative pruning, illustrating why gradual removal preserves accuracy more effectively." fig-alt="Single-row workflow showing one-shot pruning. CNN with six red-highlighted channels to prune, followed by accuracy drop from 0.995 to 0.914, fine-tuning gears, and partial recovery to 0.943."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{
    Line/.style={line width=0.5pt,black!50,dashed},
 cubes/.pic={
   \pgfkeys{/cubes/.cd, #1}
\begin{scope}[scale=\scalefac,every node/.style={scale=1*\scalefac}]
\pgfmathsetmacro{\cubex}{0.08}
\pgfmathsetmacro{\cubey}{1.6}
\pgfmathsetmacro{\cubez}{1.6}
%front
\coordinate (\picname-tl) at (-\cubex,0,0); % top-left point
\coordinate (\picname-tr) at (0,0,0); % top-right point
\coordinate (\picname-br) at (0,-\cubey,0); % bottom-right point
\coordinate (\picname-bl) at (-\cubex,-\cubey,0); % bottom-left point
\coordinate (\picname-ztl) at (-\cubex,0,-\cubez); % ztop-left point
\coordinate (\picname-ztr) at (0,0,-\cubez); % ztop-right point
\coordinate (\picname-zbr) at (0,-\cubey,-\cubez); % zbottom-right point
\coordinate (\picname-zbl) at (-\cubex,-\cubey,-\cubez); %z bottom-left point
\draw[draw=\cubecolor,fill=\cubecolor!15, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,-\cubey,0) -- ++(\cubex,0,0) -- cycle;
%right
\draw[draw=\cubecolor,fill=\cubecolor!30, \ifbox@dashed dashed\fi] (0,0,0) -- ++(0,0,-\cubez) -- ++(0,-\cubey,0) -- ++(0,0,\cubez) -- cycle;
%top
\draw[draw=\cubecolor,fill=\cubecolor!20, \ifbox@dashed dashed\fi] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,0,-\cubez) -- ++(\cubex,0,0) -- cycle;
            \end{scope}
        }
}
\newif\ifbox@dashed
\box@dashedfalse % default: not dashed

\pgfkeys{
  /cubes/.cd,
  cubecolor/.store in=\cubecolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname, % ← nova linija
  cubecolor=red,
  scalefac=1,
  dashed/.is if=box@dashed,
  dashed/.default=true,
  picname=C
}
\begin{scope}[local bounding box=G1,shift={(0,0)}]
\foreach \i in {1,...,12} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==3 || \i==4 || \i==9 || \i==10,
      "red", "BrownLine")}
\pic at ({\i*0.15}, {-0.02*\i}) {cubes={cubecolor=\colorname,picname=\i-cube1}};
}
\end{scope}

\begin{scope}[local bounding box=G2,shift={(-2.75,0.4)}]
\foreach \i in {1,...,6} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==3 || \i==4,
      "red", "BrownLine")}
\pic at ({\i*0.18}, {-0.02*\i}) {cubes={scalefac=1.35,cubecolor=\colorname,picname=\i-cube2}};
}
\end{scope}

\begin{scope}[local bounding box=G3,shift={(-5.5,0.6)}]
\foreach \i in {1,2,3} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==13 || \i==14,"red", "BrownLine")}
\pic at ({\i*0.30}, {-0.02*\i}) {cubes={scalefac=1.5,cubecolor=\colorname,picname=\i-cube3}};
}
\end{scope}

\begin{scope}[local bounding box=G4,shift={(-7.75,1.0)}]
\foreach \i in {1} {
  \pgfmathsetmacro\colorname{%
    ifthenelse(\i==13 || \i==14,"red", "BrownLine")}
\pic at ({\i*0.25}, {-0.02*\i}) {cubes={scalefac=1.8,cubecolor=\colorname,picname=\i-cube4}};
}
\end{scope}
\draw[Line] (1-cube1-bl) -- (6-cube2-br);
 \draw[Line] (1-cube1-tl) -- (6-cube2-tr);
 \scoped[on background layer]
\draw[Line] (1-cube1-zbl) -- (6-cube2-zbr);
 \draw[Line] (1-cube1-ztl) -- (6-cube2-ztr);
 %
 \draw[Line] (1-cube2-bl) -- (3-cube3-br);
 \draw[Line] (1-cube2-tl) -- (3-cube3-tr);
 \scoped[on background layer]
\draw[Line] (1-cube2-zbl) -- (3-cube3-zbr);
 \draw[Line] (1-cube2-ztl) -- (3-cube3-ztr);
 %
  \draw[Line] (1-cube3-bl) -- (1-cube4-br);
 \draw[Line] (1-cube3-tl) -- (1-cube4-tr);
 \scoped[on background layer]
\draw[Line] (1-cube3-zbl) -- (1-cube4-zbr);
 \draw[Line] (1-cube3-ztl) -- (1-cube4-ztr);

\scoped[on background layer]
\node[draw=BlueLine,inner xsep=8,inner ysep=14,yshift=0mm,
           fill=BlueL!10,fit=(G1)(G4),line width=1.0pt](BB1){};
\node[fill=BlueL,below left=5.5pt and 11pt of  BB1.north east,anchor=north east,align=center]{
Starting Accuracy:\\ \textbf{0.995}};
\node[above=1pt of  BB1.north west,anchor=south west,align=left]{\large One-shot (a single iteration)};

\begin{scope}[local bounding box=GEAR,
shift={($(BB1.east)+(8,0)$)},
scale=1.0, every node/.append style={transform shape}]
\def\ra{20mm}
\tikzset{%
 Arrow/.style={-{Triangle[width=15pt,length=8pt]}, line width=7pt,}
}
\draw[Arrow,violet!60] (-80:0.5*\ra)
arc[radius=0.5*\ra, start angle=-80, end angle= 80]coordinate(K1);
\draw[Arrow,orange!80!black!90] (100:0.5*\ra)
arc[radius=0.5*\ra, start angle=100, end angle= 260]coordinate(K2);
\node[circle,minimum size=\ra](KR){};

% #1 number of teeth
% #2 radius intern
% #3 radius extern
% #4 angle from start to end of the first arc
% #5 angle to decale the second arc from the first
% #6 inner radius to cut off
\newcommand{\gear}[6]{%
  (0:#2)
  \foreach \i [evaluate=\i as \n using {\i-1)*360/#1}] in {1,...,#1}{%
    arc (\n:\n+#4:#2) {[rounded corners=1.5pt] -- (\n+#4+#5:#3)
    arc (\n+#4+#5:\n+360/#1-#5:#3)} --  (\n+360/#1:#2)
  }%
  (0,0) circle[radius=#6];
}

\fill[draw=none,fill=black,even odd rule,xshift=-2mm]\gear{10}{0.23}{0.28}{10}{2}{0.1};
\fill[draw=none,fill=black,even odd rule,xshift=3mm,yshift=2mm]\gear{10}{0.18}{0.22}{10}{2}{0.08};

\scoped[on background layer]
\node[draw=BlueLine,%inner xsep=8,inner ysep=8,yshift=0mm,
minimum width=27mm,minimum height=29mm,
           fill=BlueL!20,fit=(K1)(K2)(KR),line width=1.0pt](BB3){};
\end{scope}

%%
\begin{scope}[local bounding box=TA1,
shift={($(BB1.east)+(3.25,0)$)},
scale=1.0, every node/.append style={transform shape}]
\node[draw=BrownLine,
minimum width=27mm,minimum height=29mm,
           fill=brown!10,line width=1.0pt](BB2){};
\node[below=0.2 of BB2.north](TTA1){\textbf{Test Accuracy:}};
\node[below right= 0.7 and -0.9 of TTA1, draw,
      single arrow, draw=red, fill=red, rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST1) {};

\node[below right= 0.7 and -0.9 of TTA1, draw, xshift=6mm,
      single arrow, draw=red, fill=red, rotate=-90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST2) {};
\node[left=0.3 of ST1,anchor=north east](BR1){\large\textbf{0.914}};
\end{scope}
%%
\begin{scope}[local bounding box=TA2,
shift={($(BB3.east)+(2.5,0)$)},
scale=1.0, every node/.append style={transform shape}]
\node[draw=GreenLine,
minimum width=27mm,minimum height=29mm,
           fill=yellow!10,line width=1.0pt](BB4){};
\node[below=0.2 of BB4.north](TTA1){\textbf{Test Accuracy:}};
\node[below right= 1.1 and -0.9 of TTA1, draw,
      single arrow, draw=GreenLine, fill=GreenL!90!black, rotate=90,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST1) {};

\node[left=0.3 of ST1,anchor=south east](BR1){\large\textbf{0.943}};
\end{scope}

\node[draw,
      single arrow, draw=red, fill=red,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=16mm, line width=1pt] (ST1)
      at($(BB1.east)!0.5!(BB2.west)$){};
      \node[draw,
      single arrow, draw=VioletLine, fill=VioletL,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=16mm, line width=1pt] (ST2)
      at($(BB2.east)!0.5!(BB3.west)$){};
      \node[draw,
      single arrow, draw=GreenLine, fill=GreenL!90!black,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=9mm, line width=1pt] (ST3)
      at($(BB3.east)!0.5!(BB4.west)$){};
\node[align=center,above=3pt of ST1]{Prune\\ selected\\ channels};
\node[align=center,above=3pt of ST2]{Fine-tune\\ on new\\ structure};
\end{tikzpicture}
```
:::

The choice between strategies depends on three interrelated factors. First, the sparsity target: higher reduction targets often necessitate iterative approaches to maintain accuracy, while moderate goals may be achievable with one-shot methods. Second, available resources: iterative pruning demands significant compute for multiple fine-tuning cycles, whereas one-shot approaches trade accuracy for speed. Third, the deployment timeline and target platform: one-shot methods enable faster deployment, but certain hardware architectures better support specific sparsity patterns, making iterative approaches more advantageous when time permits.

#### Lottery Ticket Hypothesis {#sec-model-compression-lottery-ticket-hypothesis-1b3d}

\index{Lottery Ticket Hypothesis!etymology}
\index{Frankle, Jonathan!lottery ticket hypothesis}
The pruning strategies discussed above share a common assumption: we start with a trained network and then decide which parameters to remove. What if the relationship between network structure and trainability runs deeper? What if the "winning" sparse network already exists at initialization, hidden within the dense structure?

Traditional pruning methods eliminate weights based on magnitude, structure, or dynamic conditions. But pruning may also reveal something more fundamental: inherently efficient subnetworks hidden within the original model.

This perspective leads to the Lottery Ticket Hypothesis\index{Lottery Ticket Hypothesis!winning tickets}\index{Lottery Ticket Hypothesis!sparse subnetworks}[^fn-lottery-ticket] (LTH), which challenges conventional pruning workflows by proposing that within large neural networks, there exist small, well-initialized subnetworks ("winning tickets") that can achieve comparable accuracy to the full model when trained in isolation. Rather than viewing pruning as a post-training compression step, LTH suggests it can serve as a discovery mechanism to identify these efficient subnetworks early in training.

[^fn-lottery-ticket]: **Lottery Ticket Hypothesis**: Named for the intuition that training a large network is like buying many lottery tickets: most lose, but a few "winning tickets" (sparse subnetworks with lucky initializations) can win (train to full accuracy) on their own. Frankle and Carbin [@frankle2019lottery] showed ResNet-18 subnetworks at 10–20% original size achieve 93.2% vs. 94.1% accuracy. BERT-base winning tickets retain 97% performance with 90% fewer parameters.

\index{Lottery Ticket Hypothesis!iterative pruning validation}
LTH is validated through an iterative pruning process. Trace the cycle in @fig-winning-ticket: a large network is first trained to convergence. The lowest-magnitude weights are then pruned, and the remaining weights are reset to their original initialization rather than being re-randomized. This process is repeated iteratively, gradually reducing the network's size while preserving performance. After multiple iterations, the remaining subnetwork (the "winning ticket") proves capable of training to the same or higher accuracy as the original full model.

::: {#fig-winning-ticket fig-env="figure" fig-pos="htb" fig-cap="**Lottery Ticket Iteration Cycle.** A dense network is trained to convergence, the smallest-magnitude weights are pruned, and the surviving weights are reset to their original initialization. Repeating this cycle progressively identifies a sparse subnetwork (the winning ticket) that matches or exceeds the full model's accuracy." fig-alt="Cyclic flowchart with four stages: dense network, train to convergence, prune smallest weights, reset remaining weights to initial values. Arrows form iterative loop that progressively identifies winning ticket subnetwork."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
Line/.style={line width=0.5pt,black!50,text=black},
LineD/.style={-{Triangle[width=7pt,length=8pt]},red,line width=1.25pt},
}

\begin{scope}[local bounding box=CIRCLES,shift={($(0,0)+(-5.6,0)$)}]
\tikzset{
circles/.pic={
\pgfkeys{/channel/.cd, #1}
\node[circle,draw=\channelcolor,line width=\linewidth,fill=\channelcolor!10,
minimum size=9mm](\picname){};
        }
}
\pgfkeys{
  /channel/.cd,
  linewidth/.store in=\linewidth,
  channelcolor/.store in=\channelcolor,
  scalefac/.store in=\scalefac,
  picname/.store in=\picname,
  channelcolor=BlueLine,
  scalefac=1,
  linewidth=0.3pt,
  picname=C
}

\def\vi{1.75}
\foreach \i in {1,...,7} {
  \pgfmathsetmacro{\y}{(7-\i)*\vi}
  \pic at (0,\y) {circles={channelcolor=VioletLine2!70!,picname=2CI\i}};
}
 %
\foreach \i in {2,4,6} {
  \pgfmathsetmacro{\y}{(7-\i)*\vi}
  \pic at (0,\y) {circles={channelcolor=red,linewidth=1pt}};
}
\foreach \i in {1,...,7} {
  \pgfmathsetmacro{\y}{(7-\i)*\vi}
  \pic at (3.5,\y) {circles={channelcolor=VioletLine2!70!,,picname=3CI\i}};
}
\foreach \i in {2,5} {
  \pgfmathsetmacro{\y}{(7-\i)*\vi}
  \pic at (3.5,\y) {circles={channelcolor=red,linewidth=1pt}};
}
%right -2 neurons
\foreach \j in {1,...,2} {
  \pgfmathsetmacro{\y}{(4-\j)*\vi + 0.6}
  \pic at (6,\y) {circles={channelcolor=red,linewidth=1pt,picname=4CI\j}};
}
%left -4 neurons
\foreach \j in {1,...,4} {
  \pgfmathsetmacro{\y}{(5-\j)*\vi + 0.6}
  \pic at (-2.85,\y) {circles={channelcolor=red,linewidth=1pt,picname=1CI\j}};
}
\foreach \i in {1,...,4} {
  \foreach \j in {1,...,7} {
\draw[VioletLine2!70!,](1CI\i )--(2CI\j);
}}
\foreach \i in {1,...,7} {
  \foreach \j in {1,...,7} {
\draw[VioletLine2!70!,](2CI\i )--(3CI\j);
}}
\foreach \i in {1,...,7} {
  \foreach \j in {1,...,2} {
\draw[VioletLine2!70!,](3CI\i )--(4CI\j);
}}
\draw[LineD](1CI1)--(2CI2);
\draw[LineD](1CI1)--(2CI4);
\draw[LineD](1CI2)--(2CI4);
\draw[LineD](1CI3)--(2CI6);
\draw[LineD](1CI4)--(2CI6);
\draw[LineD](1CI4)--(2CI2);
\draw[LineD](2CI2)--(3CI2);
\draw[LineD](2CI4)--(3CI5);
\draw[LineD](2CI6)--(3CI5);
\draw[LineD](2CI6)--(3CI2);
\draw[LineD](3CI2)--(4CI1);
\draw[LineD](3CI2)--(4CI2);
\draw[LineD](3CI5)--(4CI1);
\draw[LineD](3CI5)--(4CI2);
\end{scope}
%%%%%%%%%%%%%%
%left figure
\begin{scope}[local bounding box=krug,shift={($(CIRCLES)+(-10.2,-2.1)$)}]
\def\ra{65mm}
\draw[{Triangle[width=18pt,length=8pt]}-, line width=10pt,violet!60] (1:0.5*\ra)
arc[radius=0.5*\ra, start angle=1, end angle= 57];
\draw[{Triangle[width=18pt,length=8pt]}-, line width=10pt,cyan!80!black!90] (123:0.5*\ra)
arc[radius=0.5*\ra, start angle=123, end angle= 180];
\draw[{Triangle[width=18pt,length=8pt]}-, line width=10pt,orange!70] (245:0.53*\ra)
arc[radius=0.53*\ra, start angle=245, end angle= 290];
\node[]at(0,0){\large Iterate};
%%top
\begin{scope}[local bounding box=GEAR,shift={($(90: 0.5*\ra)+(0,-0.75)$)},
scale=1.0, every node/.append style={transform shape}]
% #1 number of teeth
% #2 radius intern
% #3 radius extern
% #4 angle from start to end of the first arc
% #5 angle to decale the second arc from the first
% #6 inner radius to cut off
\newcommand{\gear}[6]{%
  (0:#2)
  \foreach \i [evaluate=\i as \n using {\i-1)*360/#1}] in {1,...,#1}{%
    arc (\n:\n+#4:#2) {[rounded corners=1.5pt] -- (\n+#4+#5:#3)
    arc (\n+#4+#5:\n+360/#1-#5:#3)} --  (\n+360/#1:#2)
  }%
  (0,0) circle[radius=#6];
}
\fill[draw=none,fill=green!40!black,even odd rule,xshift=-2mm]\gear{12}{0.4}{0.33}{10}{2}{0.1};
\fill[draw=none,fill=green!40!black,even odd rule,xshift=4mm,yshift=4mm]\gear{10}{0.22}{0.28}{10}{2}{0.08};
\node[align=center](TTN) at (0,1.25){Train the network\\ until convergence};
\node[align=center](TTN1) at (0,-0.5){};

\scoped[on background layer]
\node[draw=BlueLine,minimum width=27mm,minimum height=27mm,
           fill=BlueL!20,fit=(TTN1)(TTN),line width=1.0pt](5BB3){};
\end{scope}
%%right
\begin{scope}[local bounding box=PRUNE,shift={($(330: 0.5*\ra)+(0,-0.5)$)},
scale=1.0, every node/.append style={transform shape}]
\node[align=center](TTN) at (0,1.25){Prune a \\percentage of\\ the lowest weights};
\node[align=center](TTN1) at (0,-0.5){};

\begin{scope}[local bounding box=MC,shift={($(TTN)+(-0.2,-0.5)$)},
scale=1.0, every node/.append style={transform shape}]

\foreach \i in {1,...,3}{
  \pgfmathsetmacro{\y}{(-\i)*0.37}
  \node[circle,draw, minimum size=2.5mm,inner sep=0pt,fill=red](2K\i) at(0,\y){};
}
\foreach \i in {1,...,3}{
  \pgfmathsetmacro{\y}{(-\i)*0.37}
  \node[circle,draw, minimum size=2.5mm,inner sep=0pt,fill=brown](3K\i) at(0.6,\y){};
}
\foreach \i in {1,...,2}{
  \pgfmathsetmacro{\y}{-(3.5-\i)*0.37}
  \node[circle,draw, minimum size=2.5mm,inner sep=0pt,fill=cyan](1K\i) at(-0.6,\y){};
}
\foreach \i in {1}{
  \pgfmathsetmacro{\y}{-(3.0-\i)*0.37}
  \node[circle,draw, minimum size=2.5mm,inner sep=0pt,fill=green!40!black](4K\i) at(1.2,\y){};
}
\foreach \i in {1,...,2}{
  \foreach \j in {1,...,3}{
\draw[Line](1K\i)--(2K\j);
}}
\foreach \i in {1,...,3}{
  \foreach \j in {1,...,3}{
\draw[Line](2K\i)--(3K\j);
}}\foreach \i in {1,...,3}{
  \foreach \j in {1}{
\draw[Line](3K\i)--(4K\j);
}}
\end{scope}
\scoped[on background layer]
\node[draw=BlueLine,minimum width=30mm,minimum height=27mm,
           fill=BlueL!20,fit=(TTN1)(TTN),line width=1.0pt](6BB3){};
\end{scope}
%%left
\begin{scope}[local bounding box=PUMPE,shift={($(210: 0.5*\ra)+(0,-0.3)$)},
scale=1.0, every node/.append style={transform shape}]
\node[align=center](TTN) at (0,1.25){Reset weights\\ to initial values};
\node[align=center](TTN1) at (0,-0.5){};
\scoped[on background layer]
\node[draw=BlueLine,yshift=-1mm,
minimum width=30mm,minimum height=27mm,
           fill=BlueL!20,fit=(TTN1)(TTN),line width=1.0pt](BB3){};
%
\begin{scope}[local bounding box=RESETA,shift={($(TTN)+(0.25,-1.2)$)}]
\def\ra{12mm}
\tikzset{%
 Arrow/.style={{Triangle[width=15pt,length=8pt]}-, line width=7pt,}
}
\draw[Arrow,violet!60] (-80:0.5*\ra)
arc[radius=0.5*\ra, start angle=-80, end angle= 80]coordinate(K1);
\node[circle,minimum size=\ra](KR){};
\end{scope}
\node[]at(-0.50,0){$\left[\begin{array}{c} 0.5\\ 0.08\\ 0.45\\ 0.98\end{array}\right]$};
\end{scope}
\end{scope}
%%%%TOP
\begin{scope}[local bounding box=KOCKICE,shift={($(GEAR)+(0,3.2)$)},
scale=1.0, every node/.append style={transform shape}]
\node[align=center](TTN) at (0,1.25){Randomly\\ initialize\\ the weights};
\node[align=center](TTN1) at (0,-0.5){};
\scoped[on background layer]
\node[draw=RedLine,yshift=-1mm,
minimum width=30mm,minimum height=27mm,
           fill=RedL!20,fit=(TTN1)(TTN),line width=1.0pt](2BB3){};
%
\begin{scope}[local bounding box=VK,shift={($(TTN)+(0.2,-1.0)$)},
scale=0.15, every node/.append style={transform shape}]
\pgfmathsetmacro{\cubex}{4}
\pgfmathsetmacro{\cubey}{4}
\pgfmathsetmacro{\cubez}{3}
\draw[fill=yellow!10] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,-\cubey,0) -- ++(\cubex,0,0) -- cycle;
\draw[fill=yellow!60] (0,0,0) -- ++(0,0,-\cubez) -- ++(0,-\cubey,0) -- ++(0,0,\cubez) -- cycle;
\draw[fill=yellow!30] (0,0,0) -- ++(-\cubex,0,0) -- ++(0,0,-\cubez) -- ++(\cubex,0,0) -- cycle;
 \node[circle,draw, minimum size=5mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.5!(-\cubex,-\cubey,0)$){};
 \node[circle,draw, minimum size=5mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.22!(-\cubex,-\cubey,0)$){};
 \node[circle,draw, minimum size=5mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.78!(-\cubex,-\cubey,0)$){};
 \node[circle,draw, minimum size=5mm,inner sep=0pt,fill=green!40!black]
at($(-\cubex,0,0)!0.78!(0,-\cubey,0)$){};
 \node[circle,draw, minimum size=5mm,inner sep=0pt,fill=green!40!black]
at($(-\cubex,0,0)!0.22!(0,-\cubey,0)$){};
%
 \node[ellipse,draw, minimum width=6mm,minimum height=3mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.5!(-\cubex,0,-\cubez)$){};
 \node[ellipse,draw, minimum width=3mm,minimum height=6mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.3!(0,-\cubey,-\cubez)$){};
 \node[ellipse,draw, minimum width=3mm,minimum height=6mm,inner sep=0pt,fill=green!40!black]
at($(0,0,0)!0.7!(0,-\cubey,-\cubez)$){};
\end{scope}
\end{scope}
\path[red](2BB3.north east)--++(0:3.4)coordinate(GO)|-coordinate(DO)(6BB3.south east);
\draw[brown!60,line width=2pt,dash pattern={on 10pt off 8pt}](GO)--(DO);
\node[draw,
      single arrow, draw=VioletLine, fill=VioletL,rotate=270,
      minimum width=8pt, single arrow head extend=3pt,
      minimum height=8mm, line width=1pt] (1ST2)
      at($(KOCKICE.south)!0.45!(GEAR.north)$){};
%
      \node[draw, align=left,anchor=south west,
      single arrow, draw=BlueLine, fill=BlueL,
      minimum width=8pt, single arrow head extend=10pt,
      minimum height=8mm, line width=1pt] (2ST2)
      at($(KOCKICE.east)+(2.26,-0.5)$){Remaining structure\\constitutes the winning\\
      lottery ticket subnetwork};
\end{tikzpicture}
```
:::

The implications of the Lottery Ticket Hypothesis\index{Lottery Ticket Hypothesis!implications} extend beyond conventional pruning. Instead of training large models and pruning them later, LTH suggests that compact, high-performing subnetworks could be trained directly from the start, eliminating the need for overparameterization. This insight challenges the traditional assumption that model size is necessary for effective learning. It also emphasizes the importance of initialization, as winning tickets only retain their performance when reset to their original weight values, raising deeper questions about how initialization shapes a network's learning trajectory.

The hypothesis further reinforces the effectiveness of iterative pruning over one-shot pruning. Gradually refining the model structure allows the network to adapt at each stage, preserving accuracy more effectively than removing large portions of the model in a single step. This process aligns well with practical pruning strategies used in deployment, where preserving accuracy while reducing computation is important.

Despite its promise, applying LTH in practice remains computationally expensive because identifying winning tickets requires multiple cycles of pruning and retraining. Ongoing research explores whether winning subnetworks can be detected early without full training, potentially enabling more efficient sparse training. If such methods become practical, LTH could reshape model training, shifting the focus from pruning large networks after training to discovering and training only the important components from the beginning.

While LTH presents a compelling theoretical perspective on pruning, practical implementations rely on established framework-level tools to integrate structured and unstructured pruning techniques.

#### Pruning in Practice {#sec-model-compression-pruning-practice-8059}

Modern machine learning frameworks provide dedicated APIs to automate the pruning and fine-tuning workflow. In PyTorch\index{Framework Toolkits!pruning support}, the `torch.nn.utils.prune` module provides a flexible interface for pruning individual layers or entire models. Users can apply unstructured pruning (e.g., `l1_unstructured`) or structured pruning (e.g., `ln_structured`) with just a few lines of code. PyTorch uses "masks" to handle pruning: the original parameters are preserved, but a binary mask is multiplied element-wise during the forward pass. To realize actual memory savings for deployment, these masks must be "permanently" applied using `prune.remove(module, 'weight')`.

TensorFlow takes a different approach through the TensorFlow Model Optimization Toolkit (TF-MOT). Unlike PyTorch's post-training workflow, TF-MOT often integrates pruning into the training process itself. By using `prune_low_magnitude`, the framework gradually increases sparsity during training, allowing the model to adapt its remaining weights to the sparse structure in real-time.

```{python}
#| label: mobilenet-pruning-stats
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILENET PRUNING STATISTICS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Pruning in Practice" section - MobileNet example
# │
# │ Goal: Demonstrate realistic pruning gains for a mobile-class model.
# │ Show: 85% pruning with significant size reduction (14MB to 2MB).
# │ How: Define standard MobileNet pruning anchor points.
# │
# │ Imports: (none)
# │ Exports: mobilenet_pruning_pct_str, mobilenet_original_size_str, mobilenet_pruned_size_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class MobileNetCompressionAnchor:
    """
    Namespace for MobileNet pruning anchor.
    """
    pruning_pct = 85
    original_size_mb = 14
    pruned_size_mb = 2

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
mobilenet_pruning_pct_str = f"{MobileNetCompressionAnchor.pruning_pct}%"
mobilenet_original_size_str = f"{MobileNetCompressionAnchor.original_size_mb}MB"
mobilenet_pruned_size_str = f"{MobileNetCompressionAnchor.pruned_size_mb}MB"
```

\index{BERT!compression techniques}
These trade-offs become concrete when examining real-world deployments. Several high-profile models have successfully integrated pruning to optimize performance. MobileNet, designed for mobile and embedded applications, has been pruned to reduce inference latency while preserving accuracy [@howard2017mobilenets]. Concretely, removing `{python} mobilenet_pruning_pct_str` of near-zero weights in a MobileNet can reduce its size from `{python} mobilenet_original_size_str` to `{python} mobilenet_pruned_size_str` with less than 1% accuracy loss. BERT[^fn-bert-compression], a widely used transformer model for natural language processing, has undergone structured pruning of attention heads and intermediate layers to create efficient versions such as DistilBERT and TinyBERT, which retain much of the original performance while reducing computational overhead [@sanh2019distilbert].
 In computer vision, EfficientNet[^fn-efficientnet-pruning] has been pruned to remove unnecessary filters, optimizing it for deployment in resource-constrained environments [@tan2019efficientnet].

[^fn-bert-compression]: **BERT Compression**: BERT-Base (110M params) can be compressed to 67M params (39% reduction) with only 1.2% GLUE score drop. Attention head pruning removes 144 of 192 heads with minimal impact, while layer pruning reduces 12 layers to 6 layers maintaining 97.8% performance.

[^fn-efficientnet-pruning]: **EfficientNet Pruning**: Compound scaling makes EfficientNet amenable to structured pruning. EfficientNet-B0 with 70% pruning maintains 75.8% accuracy (vs. 77.1% baseline), achieving 2.8 $\times$ speedup. Channel pruning reduces FLOPs from 390M to 140M, enabling sub-20 ms inference on Pixel 4. Iterative magnitude pruning with fine-tuning preserves accuracy better than one-shot approaches.

Pruning is powerful but has an inherent limitation: it starts with an existing architecture and carves away pieces. The pruned model inherits its structure from the original—same layer types, same connectivity patterns, just fewer parameters. What if the original architecture itself is inefficient for deployment? What if we want a model with a completely different structure, such as a 6-layer transformer instead of a 12-layer one, that still captures the original model's capabilities?

This limitation motivates **knowledge distillation**, a categorically different approach. Rather than modifying an existing model's weights, distillation trains a new, compact "student" model to mimic the behavior of a larger "teacher" model. The student inherits the teacher's learned knowledge without inheriting its computational overhead.

### Knowledge Distillation {#sec-model-compression-knowledge-distillation-1842}

\index{Knowledge Distillation!etymology}
\index{Hinton, Geoffrey!knowledge distillation}
A large language model achieves state-of-the-art accuracy on medical question-answering, but at hundreds of billions of parameters it cannot run on a hospital's on-premise server constrained to a single GPU. Pruning alone cannot bridge this gap — the target architecture needs to be fundamentally different, not merely sparser. Knowledge distillation solves this problem by training a compact "student" model to replicate the large "teacher" model's behavior, achieving 90% or more of the teacher's accuracy at a fraction of the compute. The key insight is that the teacher's predictions carry far more information than the raw training labels.

Knowledge distillation[^fn-distillation-etymology]\index{Knowledge Distillation!teacher-student framework}\index{Knowledge Distillation!definition} trains a smaller *student*\index{Knowledge Distillation!student model} model using guidance from a larger, pre-trained *teacher*\index{Knowledge Distillation!teacher model} model. A well-trained teacher provides a richer learning signal than simple ground-truth labels. While a hard label\index{Knowledge Distillation!hard labels} is binary (e.g., $[1, 0, 0]$ for cat), a teacher's probability distribution (e.g., $[0.85, 0.10, 0.05]$) reveals **inter-class similarity**\index{Knowledge Distillation!soft labels}, showing that a cat shares more features with a dog than a fox. Notice in @fig-kd-targets how this "dark knowledge" embedded in the teacher's probability distribution reveals inter-class relationships that guide the student to generalize better.

[^fn-distillation-etymology]: **Distillation**: Borrowed from chemistry, where distillation separates mixtures by selective evaporation and condensation, extracting the essence while leaving impurities behind. The concept predates its famous 2015 formulation: Rich Caruana and colleagues at Cornell demonstrated in 2006 that ensemble models could be "compressed" into single networks by training on the ensemble's predictions, achieving a model "a thousand times smaller and faster." Geoffrey Hinton, Oriol Vinyals, and Jeff Dean at Google [@hinton2015distilling] transformed this into a practical technique for deep networks by introducing temperature-scaled softmax, which controls how much "dark knowledge" about class relationships the student can absorb. The temperature parameter $T$ even mirrors the literal temperature control in chemical distillation—higher temperatures reveal more of the teacher's uncertainty structure.

::: {#fig-kd-targets fig-env="figure" fig-pos="htb" fig-cap="**Soft Target Distribution**: The teacher's relative confidence levels indicate which classes are semantically similar (e.g., cat vs. dog), providing a much richer supervision signal than a binary \"correct\" label." fig-alt="Bar chart showing probability distribution across three animal classes: Cat at 85 percent, Dog at 10 percent, Fox at 5 percent. Demonstrates how soft labels capture inter-class similarity."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{Softmax}{HTML}{FDAE61}
\definecolor{ReLU}{HTML}{ABDDA4}
\definecolor{Tanh}{HTML}{2B83BA}
\begin{axis}[
  width=65mm,
  height=55mm,
   axis line style={draw=none},
    ylabel={Probability},
    xlabel={Animal},
    ymin=0,
    axis lines=left,
   axis line style={thick,-latex},
 ytick={0,20,40,60,80,100},
  yticklabels={0\%,20\%,40\%,60\%,80\%,100\%},
    tick label style={/pgf/number format/assume math mode=true},
    yticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},
    /pgf/number format/.cd, fixed, fixed zerofill, precision=2},
    xticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
    ylabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    xlabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    ymax=101,
    enlarge x limits=0.3,
   y tick style={draw=none},
    x tick style={draw=black,thin},
    tick align=outside,
    major tick length=1mm,
    bar width=30pt,
     grid=both,
    major grid style={thin,black!60},
    minor tick num=1,
    xtick={1,2,3},
    xticklabels={Cat,Dog,Fox},
nodes near coords={\pgfmathprintnumber{\pgfplotspointmeta}\%},
    every node near coord/.append style={yshift=0pt,
  font=\scriptsize\usefont{T1}{phv}{m}{n}, anchor=south,black,
  /pgf/number format/assume math mode=true,fill=white,
   /pgf/number format/.cd, fixed, fixed zerofill, precision=2,zerofill=false},
    every axis plot/.append style={
          ybar,
          bar width=0.55,
          bar shift=0pt,
          fill
        }]
      \addplot[red]coordinates {(1,85)};
      \addplot[Tanh]coordinates{(2,10)};
      \addplot[ReLU]coordinates{(3,5)};
\end{axis}
\end{tikzpicture}
```
:::

The distillation workflow, laid out in @fig-kd-overview, trains the student model to minimize a combination of two loss functions:

::: {#fig-kd-overview fig-env="figure" fig-pos="htb" fig-cap="**Knowledge Distillation Workflow**: An input sample passes through both the teacher and the student network. The teacher produces soft labels via temperature-scaled softmax, while the student output is compared against both the soft labels (distillation loss) and the hard labels (student loss)." fig-alt="Block diagram showing knowledge distillation. Input flows to both teacher and student models. Teacher outputs soft labels via temperature-scaled softmax. Student outputs feed into distillation loss and student loss functions."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]

\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50},
Box/.style={inner xsep=2pt,
    node distance=0.7,
    draw=GreenLine,
    line width=0.75pt,
    fill=GreenL,
    align=flush center,
    minimum width=15mm, minimum height=9mm
  },
Box2/.style={Box, minimum width=25mm, minimum height=9mm}
}

\node[Box,fill=BrownL,draw=BrownLine](B1){Layer 1};
\node[Box,right=of B1,fill=BrownL,draw=BrownLine](B2){Layer 2};
\node[, node distance=0.7,right=of B2,fill=none,draw=none,
           font=\Large\bfseries](B0){$\cdots$};
\node[Box,right=of B0,fill=BrownL,draw=BrownLine](B3){Layer n};
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)--(B0);
\draw[Line,-latex](B0)--(B3);
\scoped[on background layer]
\node[draw=BrownLine,inner xsep=4mm,inner ysep=5mm,
yshift=2.5mm,fill=none,fit=(B1)(B3),line width=0.75pt](BB2){};
\node[below=4pt of  BB2.north,inner sep=0pt,
anchor=north]{Student (distilled) model};
%%
\node[Box,above=1.95 of B1,fill=RedL,draw=RedLine](GB1){Layer 1};
\node[Box,right=of GB1,fill=RedL,draw=RedLine](GB2){Layer 2};
\node[, node distance=0.7,right=of GB2,fill=none,draw=none,
           font=\Large\bfseries](GB0){$\cdots$};
\node[Box,right=of GB0,fill=RedL,draw=RedLine](GB3){Layer m};
\draw[Line,-latex](GB1)--(GB2);
\draw[Line,-latex](GB2)--(GB0);
\draw[Line,-latex](GB0)--(GB3);
\scoped[on background layer]
\node[draw=red,inner xsep=4mm,inner ysep=5mm,
yshift=2.5mm,fill=none,fit=(GB1)(GB3),line width=0.75pt](GBB2){};
\node[below=4pt of  GBB2.north,inner sep=0pt,
anchor=north]{Teacher model};
%%
\node[Box, rounded corners=7pt, left=2of $(GB1)!0.5!(B1)$](IN){Input x};
\draw[Line,-latex](IN.east)--++(0:0.4)|-(GB1);
\draw[Line,-latex](IN.east)--++(0:0.4)|-(B1);
%%
\node[Box2, right= 1.3of GB3,fill=OliveL,draw=OliveLine](S1){Softmax (T = t)};
\node[Box2,above right=0 and 1.3 of B3,fill=OliveL,draw=OliveLine](S2){Softmax (T = t)};
\node[Box2,below right=0 and 1.3 of B3,fill=OliveL,draw=OliveLine](S3){Softmax (T = 1)};
%
\node[Box2, right= 1.3of S1,fill=BlueL,draw=BlueLine](SL1){Soft labels};
\node[Box2, right= 1.3of S2,fill=BlueL,draw=BlueLine](SL2){Soft predictions};
\node[Box2, right= 1.3of S3,fill=BlueL,draw=BlueLine](SL3){Hard predictions};
\node[Box, rounded corners=7pt, below=1.0of SL3](HL){Hard\\ label y};
%
\node[Box,right=2of $(SL1)!0.5!(SL2)$,fill=OliveL,draw=OliveLine](L1){Loss Fn};
\node[Box,below right=0.2 and 0.7of SL3,fill=OliveL,draw=OliveLine](L2){Loss Fn};
%%
\node[left=2pt of L1,align=right,violet]{Distillation\\ loss};
\node[left=2pt of L2,align=right,violet]{Student\\ loss};
\node[below=2pt of HL,align=center]{(Ground truth)};
%
\draw[Line,-latex](GB3)--(S1);
\draw[Line,-latex](S1)--(SL1);
\draw[Line,-latex](SL1)-|(L1);
\draw[Line,-latex](SL2)-|(L1);
%
\draw[Line,-latex](B3.east)--++(0:0.74)|-(S2);
\draw[Line,-latex](B3.east)--++(0:0.74)|-(S3);
\draw[Line,-latex](S2)--(SL2);
\draw[Line,-latex](S3)--(SL3);
\draw[Line,-latex](SL3)-|(L2);
\draw[Line,-latex](L2)|-(HL);
%
\end{tikzpicture}
```
:::

1.  **Distillation Loss**\index{Knowledge Distillation!distillation loss}: Typically the Kullback-Leibler (KL) divergence\index{Kullback-Leibler divergence}[^fn-kl-divergence] between the teacher's softened output distribution and the student's distribution.
2.  **Student Loss**\index{Knowledge Distillation!student loss}: Standard cross-entropy loss against the ground-truth hard labels.

[^fn-kl-divergence]: **Kullback-Leibler Divergence**: Named after Solomon Kullback and Richard Leibler, who introduced it at the National Security Agency in 1951 for cryptanalysis. Measures how one probability distribution differs from a reference distribution. In information theory, KL(P||Q) quantifies the extra bits needed to encode samples from P using a code optimized for Q. Zero when distributions match; always non-negative.

##### Distillation Mathematics {#sec-model-compression-distillation-mathematics-4af6}

\index{Knowledge Distillation!loss function}
\index{Softmax!temperature scaling}
To reveal the inter-class similarity information, we use a **temperature parameter**\index{Knowledge Distillation!temperature parameter}[^fn-temperature-softmax] $T$ to soften the probability distribution. The softmax output for class $i$ becomes:

[^fn-temperature-softmax]: **Temperature in Softmax**: Borrowed from statistical mechanics, where the Boltzmann distribution $p_i \propto \exp(-E_i/kT)$ describes particle states at temperature $T$. Higher temperature means more uniform distribution across states. Hinton adopted this analogy for neural networks: temperature $T$ in softmax controls how "soft" the probability distribution becomes. At $T=1$ (standard softmax), peaks are sharp; at $T \to \infty$, the distribution becomes uniform.

$$
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

\index{Cross-Entropy Loss!student loss}
\index{Kullback-Leibler divergence!etymology}
A higher $T$ (typically 3 to 5) produces a smoother distribution, allowing the student to learn from the "uncertainty" the teacher assigns to incorrect classes. The total loss $\mathcal{L}_{\text{distill}}$ balances standard cross-entropy with the KL divergence:

$$
\mathcal{L}_{\text{distill}} = (1 - \alpha) \mathcal{L}_{\text{CE}}(y_s, y) + \alpha T^2 \text{KL}(p_{\text{teacher}}^T, p_{\text{student}}^T)
$$

The factor $T^2$ ensures that gradient scales remain consistent when $T$ is changed. This hybrid approach enables compact models (like DistilBERT) to achieve up to 97% of their teacher's performance with a fraction of the memory and compute.

##### Efficiency Gains and Trade-offs {#sec-model-compression-efficiency-gains-tradeoffs-fb5b}

Distillation's primary advantage over pruning is that it produces a *dense* model, not a sparse one. A distilled student runs efficiently on commodity hardware — GPUs, TPUs, edge AI chips — without requiring specialized sparse execution kernels. Models such as DistilBERT\index{DistilBERT}\index{Knowledge Distillation!DistilBERT}[^fn-distilbert-metrics] retain up to 97% of the teacher's accuracy with 40% fewer parameters and 60% faster inference, a compression level difficult to achieve through pruning alone [@sanh2019distilbert]. MobileNet distillation variants [@howard2017mobilenets] demonstrate similar results in computer vision. The student also inherits the teacher's generalization properties\index{Knowledge Distillation!memory efficiency}: large models trained on extensive datasets are less sensitive to noise and data shifts, and well-trained students inherit this robustness — particularly valuable in low-data regimes where training a small model from scratch leads to poor generalization.

[^fn-distilbert-metrics]: **DistilBERT Performance**: Achieves 97% of BERT-Base performance with 40% fewer parameters (66M vs. 110M) and 60% faster inference. On SQuAD v1.1, DistilBERT scores 86.9 F1 vs. BERT's 88.5 F1, while reducing memory from 1.35 GB to 0.54 GB and latency from 85 ms to 34 ms.

Distillation also enables *multi-task deployment*: a single large teacher can guide multiple specialized students for different tasks (e.g., language-specific NLP models, task-specific vision models), amortizing the teacher's training cost across many deployment targets. The resulting students can be further optimized with pruning and quantization for hardware-specific acceleration [@gordon2020compressing].

\index{Knowledge Distillation!limitations}
The limitations are real, however. Distillation requires training a new model, which means higher upfront computational cost than pruning (which modifies an existing model in place). The effectiveness depends on teacher quality — a poorly trained teacher transfers incorrect biases. And designing an appropriate student architecture requires care: overly small students lack the capacity to absorb the teacher's knowledge, while overly large students defeat the purpose of compression. @sec-benchmarking provides structured evaluation approaches for measuring these efficiency gains.

Compared to pruning, knowledge distillation preserves accuracy better but demands higher training complexity: it requires training a new model rather than modifying an existing one. Pruning, conversely, provides more direct computational efficiency gains, especially in its structured form. In practice, combining the two often yields the best trade-off, as DistilBERT and MobileBERT demonstrate: pruning first reduces unnecessary parameters, then distillation optimizes a final student model. @tbl-kd-pruning contrasts the key trade-offs between knowledge distillation and pruning across accuracy retention, training cost, inference speed, hardware compatibility, and implementation complexity.

| **Criterion**              | **Knowledge Distillation**                                | **Pruning**                                                                   |
|:---------------------------|:----------------------------------------------------------|:------------------------------------------------------------------------------|
| **Accuracy retention**     | High – Student learns from teacher, better generalization | Varies – Can degrade accuracy if over-pruned                                  |
| **Training cost**          | Higher – Requires training both teacher and student       | Lower – Only fine-tuning needed                                               |
| **Inference speed**        | High – Produces dense, optimized models                   | Depends – Structured pruning is efficient, unstructured needs special support |
| **Hardware compatibility** | High – Works on standard accelerators                     | Limited – Sparse models may need specialized execution                        |
| **Ease of implementation** | Complex – Requires designing a teacher-student pipeline   | Simple – Applied post-training                                                |

: **Model Compression Trade-Offs**: Knowledge distillation and pruning represent distinct approaches to reducing model size and improving efficiency, each with unique strengths and weaknesses regarding accuracy, computational cost, and implementation complexity. Distillation prioritizes preserving accuracy through knowledge transfer, while pruning directly reduces computational demands by eliminating redundant parameters, making their combined use a common strategy for optimal performance. {#tbl-kd-pruning}

Knowledge distillation is frequently used alongside pruning and quantization for deployment-ready models. How distillation interacts with these complementary techniques determines the effectiveness of multi-stage optimization pipelines.

Pruning and distillation both reduce the number of parameters a model carries, but they take the parameter count as given and decide which parameters to keep or how to transfer their knowledge. Neither technique questions whether the model's internal representations are efficiently organized. A 4096 $\times$ 4096 weight matrix in a transformer layer may have an effective rank of only 128 — meaning 97% of its information content can be captured by a much smaller pair of matrices. Structured approximation methods exploit exactly this mathematical redundancy.

### Structured Approximations {#sec-model-compression-structured-approximations-4798}

Rather than eliminating parameters through pruning or transferring knowledge through distillation, structured approximation methods decompose large weight matrices and tensors into lower-dimensional components. These techniques exploit the mathematical structure of neural network parameters, leveraging the observation that high-dimensional representations often admit compact, low-rank approximations. The following subsections examine low-rank factorization and tensor decomposition as complementary strategies for achieving this compression.

```{python}
#| label: lowrank-bandwidth-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LOW-RANK BANDWIDTH CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: SVD footnote and "The Bandwidth-Compute Trade-off" callout
# │
# │ Goal: Demonstrate memory savings from low-rank factorization.
# │ Show: The 16× storage reduction achieved by rank-128 SVD on a 4096×4096 matrix.
# │ How: Contrast weight counts for full vs. factored matrix representations.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: full_mb_str, factored_mb_str, data_reduction_str,
# │          mat_dim_str, rank_k_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import MIB_TO_BYTES

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class LowRankFactorization:
    """
    Namespace for Low-Rank Factorization Bandwidth calculation.
    Scenario: Factoring a 4096 x 4096 matrix into rank 128 components.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    rank_k = 128
    bytes_per_param = 4  # FP32

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    # Full Matrix: N x N
    full_params = mat_dim * mat_dim
    full_mb = (full_params * bytes_per_param) / MIB_TO_BYTES

    # Factored: 2 x N x K
    factored_params = 2 * mat_dim * rank_k
    factored_mb = (factored_params * bytes_per_param) / MIB_TO_BYTES

    data_reduction = full_mb / factored_mb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(data_reduction > 10, f"Low-rank reduction ({data_reduction:.1f}x) is too low for K=128.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    mat_dim_str = f"{mat_dim}"
    rank_k_str = f"{rank_k}"
    full_mb_str = fmt(full_mb, precision=0, commas=False)
    factored_mb_str = fmt(factored_mb, precision=1, commas=False)
    data_reduction_str = fmt(data_reduction, precision=0, commas=False)

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
mat_dim_str = LowRankFactorization.mat_dim_str
rank_k_str = LowRankFactorization.rank_k_str
full_mb_str = LowRankFactorization.full_mb_str
factored_mb_str = LowRankFactorization.factored_mb_str
data_reduction_str = LowRankFactorization.data_reduction_str
    rank_k = 128
    bytes_fp32 = 4

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    full_bytes = mat_dim * mat_dim * bytes_fp32
    full_mb = full_bytes / MIB_TO_BYTES
    factored_bytes = (mat_dim * rank_k + rank_k * mat_dim) * bytes_fp32
    factored_mb = factored_bytes / MIB_TO_BYTES

    reduction_factor = full_mb / factored_mb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(reduction_factor >= 10, f"Low-rank reduction ({reduction_factor:.1f}x) is too small.")


    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    full_mb_str = f"{int(full_mb)}"
    factored_mb_str = f"{int(factored_mb)}"
    data_reduction_str = f"{int(reduction_factor)}"
    mat_dim_str = f"{mat_dim}"
    rank_k_str = f"{rank_k}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
full_mb_str = LowRankFactorization.full_mb_str
factored_mb_str = LowRankFactorization.factored_mb_str
data_reduction_str = LowRankFactorization.data_reduction_str
mat_dim_str = LowRankFactorization.mat_dim_str
rank_k_str = LowRankFactorization.rank_k_str
```

#### Low-Rank Factorization {#sec-model-compression-lowrank-factorization-955e}

\index{Eckart-Young Theorem!SVD optimality}
Low-Rank Matrix Factorization (LRMF)\index{Low-Rank Factorization!definition}\index{Model Compression!low-rank factorization} approximates weight matrices with lower-rank representations. Given a matrix $A \in \mathbb{R}^{m \times n}$, LRMF finds matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{k \times n}$ such that:
$$
A \approx UV
$$
where $k \ll m, n$ is the approximation rank. This is typically computed via singular value decomposition (SVD)\index{Singular Value Decomposition (SVD)}[^fn-svd], retaining only the top $k$ singular values.

[^fn-svd]: **Singular Value Decomposition (SVD)**: Factorizes matrix $A = U \Sigma V^T$ where singular values in $\Sigma$ indicate importance. Truncating to top-k values minimizes Frobenius norm error (Eckart-Young theorem). For a `{python} mat_dim_str` $\times$ `{python} mat_dim_str` weight matrix, rank-`{python} rank_k_str` SVD reduces storage from `{python} full_mb_str`MB to `{python} factored_mb_str`MB while preserving 95% of spectral energy. GPU implementations achieve O(mn min(m,n)) complexity.

This factorization reveals a fundamental *bandwidth-compute trade-off* that recurs throughout systems design.

::: {.callout-notebook title="The Bandwidth-Compute Trade-off"}
**Reducing the Memory Pressure**: Low-rank factorization illustrates a classic systems trade-off: **trading computation for bandwidth reduction**. Storing a `{python} mat_dim_str` $\times$ `{python} mat_dim_str` matrix requires `{python} full_mb_str` MB (at FP32). Fetching this matrix for a single inference is a massive memory bandwidth hit, especially when limited by physical memory bandwidth constraints.

If we factorize it with rank k = `{python} rank_k_str`, we store two matrices (`{python} mat_dim_str` $\times$ `{python} rank_k_str` and `{python} rank_k_str` $\times$ `{python} mat_dim_str`), totaling only `{python} factored_mb_str` MB, a **`{python} data_reduction_str` $\times$ reduction in data movement**. While the number of floating-point operations (FLOPs) actually *increases* slightly because we perform two smaller matrix multiplies instead of one large one, the system speedup is often dramatic. By reducing data movement by `{python} data_reduction_str` $\times$, we allow the processor to spend more time computing and less time waiting for memory.
:::

This bandwidth-compute trade-off reflects the broader memory wall[^fn-memory-wall-preview] phenomenon where memory access becomes the dominant bottleneck.

[^fn-memory-wall-preview]: **Memory Wall**: The growing disparity between processor speed and memory access speed, where memory bandwidth becomes the dominant bottleneck. We explore this constraint in depth in @sec-hardware-acceleration.

To see why this matters, study @fig-matrix-factorization: the matrix $M$ can be approximated by the product of matrices $L_k$ and $R_k^T$. For intuition, most fully connected layers in networks are stored as a projection matrix $M$, which requires $m \times n$ parameters to be loaded during computation. However, by decomposing and approximating it as the product of two lower-rank matrices, we only need to store $m \times k + k \times n$ parameters in terms of storage while incurring an additional compute cost of the matrix multiplication. So long as $k < n/2$, this factorization has fewer total parameters to store while adding a computation of runtime $O(mkn)$ [@gu2023deep].

::: {#fig-matrix-factorization fig-env="figure" fig-pos="htb" fig-cap="**Low-Rank Factorization**: A weight matrix $M$ of size $m \times n$ is approximated as the product of two smaller matrices, $L_k$ ($m \times k$) and $R_k^T$ ($k \times n$), reducing storage from $m \times n$ to $m \times k + k \times n$ parameters at the cost of one additional matrix multiplication during inference." fig-alt="Three rectangular boxes showing matrix factorization. Large M matrix of size m by n approximately equals product of narrower L matrix of size m by k and wider R-transpose matrix of size k by n."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
  draw=black!90,node distance=0.8,
  line width=0.65pt,
  anchor=west,
  align=flush center,
  minimum width=12mm,
  minimum height=17mm
  },
}
\node[Box,fill=red!30](B1)at (0.33,0.5){\textit{M}};
\node[Box,fill=Brown!20,minimum width=8mm,
             right=of B1](B2){\textit{L\textsubscript{k}}};
\node[Box,fill=BlueL!90,minimum width=12mm,   minimum height=8mm,
             right=of B2](B3){\textit{R\textsubscript{k}\kern-3pt\textsuperscript{T}}};
\node[]at($(B1)!0.53!(B2)$){$\boldsymbol{\approx}$};
\node[]at($(B2)!0.47!(B3)$){$\boldsymbol{\times}$};
\node[below=2pt of B1]{\textit{m $\boldsymbol{\times}$ n}};
\node[below=2pt of B2]{\textit{m $\boldsymbol{\times}$ k}};
\node[below=2pt of B3]{\textit{k $\boldsymbol{\times}$ n}};
\end{tikzpicture}
```
:::

LRMF applies to fully connected layers (large weight matrices) and convolutional layers (via depthwise-separable convolutions). The key trade-off: storage reduces from $O(mn)$ to $O(mk + kn)$, but inference requires an additional matrix multiplication. Choosing rank $k$ balances compression against information loss.

#### Tensor Decomposition {#sec-model-compression-tensor-decomposition-5e9e}

\index{Tensor Decomposition!etymology}
Tensor decomposition\index{Tensor Decomposition!definition}\index{Model Compression!tensor decomposition}[^fn-tensor-etymology-decomposition] extends factorization to multi-dimensional tensors common in convolutional layers and attention mechanisms.

[^fn-tensor-etymology-decomposition]: **Tensor**: An n-dimensional array---scalars (0D), vectors (1D), matrices (2D), and higher---as introduced in @sec-neural-computation. Tensor decomposition exploits the fact that high-dimensional tensors can often be approximated as products of smaller factor matrices, reducing both storage and computation.

@fig-tensor-decomposition breaks down a 3D tensor into its factor matrices—pay attention to how each rank-one component contributes to the reconstruction. Common decomposition methods include:

::: {#fig-tensor-decomposition fig-env="figure" fig-pos="htb" fig-cap="**Tensor Decomposition**: A 3D tensor with dimensions $M \times N \times T$ is decomposed into a sum of rank-one components, each formed by the outer product of three factor vectors (U, V, W). This extends low-rank matrix factorization to multi-dimensional data, reducing storage and computation for convolutional layers. Source: [@xinyu]." fig-alt="3D tensor cube with dimensions M, N, T decomposed into sum of three factor matrices U, V, W of reduced dimensions. Small highlighted element shows how single tensor entry decomposes into factor products."}
```{.tikz}
\scalebox{0.7}{%
\begin{tikzpicture}[line width=0.35pt,line join=round]

\begin{scope}
\newcommand{\Depth}{3}
\newcommand{\Height}{3}
\newcommand{\Width}{3}
\coordinate (O) at (0,0,0);
\coordinate (A) at (0,\Width,0);
\coordinate (B) at (0,\Width,\Height);
\coordinate (C) at (0,0,\Height);
\coordinate (D) at (\Depth,0,0);
\coordinate (E) at (\Depth,\Width,0);
\coordinate (F) at (\Depth,\Width,\Height);
\coordinate (G) at (\Depth,0,\Height);

\draw[GreenLine,fill=GreenFill] (O) -- (C) -- (G) -- (D) -- cycle;% Bottom Face
\draw[GreenLine,fill=GreenFill] (O) -- (A) -- (E) -- (D) -- cycle;% Back Face
\draw[GreenLine,fill=GreenFill] (O) -- (A) -- (B) -- (C) -- cycle;% Left Face
\draw[GreenLine,fill=none] (D) -- (E) -- (F) -- (G) -- cycle;% Right Face
\draw[GreenLine,fill=none] (C) -- (B) -- (F) -- (G) -- (C);% Front Face
\draw[GreenLine,fill=none] (A) -- (B) -- (F) -- (E) -- cycle;% Top Face
%
\draw[GreenLine,line width=0.75pt](B)--(C)--(G)--(F)--(B)
(A)--(E)--(D)--(G)
(B)--(A) (F)--(E);
\path [every edge/.append style={line width=0.75pt,draw=blue, |-|}](C)+(0,-7pt)coordinate (C2)
edge [auto, text=blue, "$N$"']  (C2 -|G)
(G) +(4.5pt,-4.5pt) coordinate (G2) edge [text=blue,"$T$"'] ([xshift=4.5pt,yshift=-4.5pt]D)
(C) +(-7pt,0) coordinate (C1) edge [blue,"$M$"] (C1 |- B);
\end{scope}

\begin{scope}[shift={(0.75,0.75)},line width=0.5pt]
\newcommand{\Depth}{0.4}
\newcommand{\Height}{0.4}
\newcommand{\Width}{0.4}
\coordinate (MO) at (0,0,0);
\coordinate (MA) at (0,\Width,0);
\coordinate (MB) at (0,\Width,\Height);
\coordinate (MC) at (0,0,\Height);
\coordinate (MD) at (\Depth,0,0);
\coordinate (ME) at (\Depth,\Width,0);
\coordinate (MF) at (\Depth,\Width,\Height);
\coordinate (MG) at (\Depth,0,\Height);

\draw[RedLine,fill=RedFill] (MO) -- (MC) -- (MG) -- (MD) -- cycle;% Bottom Face
\draw[RedLine,fill=RedFill] (MO) -- (MA) -- (ME) -- (MD) -- cycle;% Back Face
\draw[RedLine,fill=RedFill] (MO) -- (MA) -- (MB) -- (MC) -- cycle;% Left Face
\draw[RedLine,fill=none] (MD) -- (ME) -- (MF) -- (MG) -- cycle;% Right Face
\draw[RedLine,fill=none] (MC) -- (MB) -- (MF) -- (MG) -- cycle;% Front Face
\draw[RedLine,fill=none] (MA) -- (MB) -- (MF) -- (ME) -- cycle;% Top Face
\draw[latex-]($(MC)!0.5!(MG)$)--++(260:0.81)node[below,text=black]{$(i,j,t)$-th};
\node[RedLine,below right=0pt and 0pt of MG]{$\boldsymbol{y_{ijt}}$};
%
\draw[RedLine,line width=0.75pt](MB)--(MC)--(MG)--(MF)--(MB)
(MA)--(ME)--(MD)--(MG)
(MB)--(MA) (MF)--(ME);
%
\node[below=0.8of $(C)!0.5!(G)$]{$y\in\mathbb{R}^{M\times N\times T}$};
\end{scope}

%the second
\begin{scope}[shift={(5,-0.50)}]
\newcommand{\Depth}{1}
\newcommand{\Height}{0.5}
\newcommand{\Width}{3}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[BrownLine,fill=brown!07] (O2) -- (C2) -- (G2) -- (D2) -- cycle;% Bottom Face
\draw[BrownLine,fill=brown!07] (O2) -- (A2) -- (E2) -- (D2) -- cycle;% Back Face
\draw[BrownLine,fill=brown!07] (O2) -- (A2) -- (B2) -- (C2) -- cycle;% Left Face
\draw[BrownLine,fill=none] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[BrownLine,fill=none] (C2) -- (B2) -- (F2) -- (G2) -- cycle;% Front Face
\draw[BrownLine,fill=none] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
\draw[BrownLine,line width=0.75pt](B2)--(C2)--(G2)--(F2)--(B2)
(A2)--(E2)--(D2)--(G2)
(B2)--(A2) (F2)--(E2);
%
\node[below=0.3 of $(C2)!0.5!(G2)$]{$U\in\mathbb{R}^{M\times R}$};
\end{scope}

%the second small
\begin{scope}[shift={(5,0.950)},line width=0.5pt]
\newcommand{\Depth}{1}
\newcommand{\Height}{0.5}
\newcommand{\Width}{0.4}
\coordinate (MO2) at (0,0,0);
\coordinate (MA2) at (0,\Width,0);
\coordinate (MB2) at (0,\Width,\Height);
\coordinate (MC2) at (0,0,\Height);
\coordinate (MD2) at (\Depth,0,0);
\coordinate (ME2) at (\Depth,\Width,0);
\coordinate (MF2) at (\Depth,\Width,\Height);
\coordinate (MG2) at (\Depth,0,\Height);

\draw[RedLine,fill=magenta!10] (MO2) -- (MC2) -- (MG2) -- (MD2) -- cycle;% Bottom Face
\draw[RedLine,fill=magenta!10] (MO2) -- (MA2) -- (ME2) -- (MD2) -- cycle;% Back Face
\draw[RedLine,fill=magenta!10] (MO2) -- (MA2) -- (MB2) -- (MC2) -- cycle;% Left Face
\draw[RedLine,fill=none] (MD2) -- (ME2) -- (MF2) -- (MG2) -- cycle;% Right Face
\draw[RedLine,fill=none] (MC2) -- (MB2) -- (MF2) -- (MG2) -- cycle;% Front Face
\draw[RedLine,fill=none] (MA2) -- (MB2) -- (MF2) -- (ME2) -- cycle;% Top Face
\draw[BrownLine,fill=none,line width=0.75pt] (F2) -- (G2) -- cycle;% Right Face
\draw[RedLine,line width=0.75pt](MB2)--(MC2)--(MG2)--(MF2)--(MB2)
(MA2)--(ME2)--(MD2)--(MG2)
(MB2)--(MA2) (MF2)--(ME2);
%
\node[RedLine,left=1pt of $(MB2)!0.5!(MC2)$](UI){$\boldsymbol{u_i}$};
\node[left=0.17 of UI,font=\Large]{$\boldsymbol{\approx}$};
\end{scope}

%%%%%%%%
%the threed
\begin{scope}[shift={(7,4)}]
\newcommand{\Depth}{1}
\newcommand{\Height}{3}
\newcommand{\Width}{0.5}
\coordinate (O3) at (0,0,0);
\coordinate (A3) at (0,\Width,0);
\coordinate (B3) at (0,\Width,\Height);
\coordinate (C3) at (0,0,\Height);
\coordinate (D3) at (\Depth,0,0);
\coordinate (E3) at (\Depth,\Width,0);
\coordinate (F3) at (\Depth,\Width,\Height);
\coordinate (G3) at (\Depth,0,\Height);

\draw[BlueLine,fill=BlueFill] (O3) -- (C3) -- (G3) -- (D3) -- cycle;% Bottom Face
\draw[BlueLine,fill=BlueFill] (O3) -- (A3) -- (E3) -- (D3) -- cycle;% Back Face
\draw[BlueLine,fill=BlueFill] (O3) -- (A3) -- (B3) -- (C3) -- cycle;% Left Face
\draw[BlueLine,fill=none] (D3) -- (E3) -- (F3) -- (G3) -- cycle;% Right Face
\draw[BlueLine,fill=none] (C3) -- (B3) -- (F3) -- (G3) -- cycle;% Front Face
\draw[BlueLine,fill=none] (A3) -- (B3) -- (F3) -- (E3) -- cycle;% Top Face
\draw[BlueLine,line width=0.75pt](B3)--(C3)--(G3)--(F3)--(B3)
(A3)--(E3)--(D3)--(G3)
(B3)--(A3) (F3)--(E3);
%
\node[right=0.3 of $(G3)!0.5!(D3)$]{$X\in\mathbb{R}^{T\times R}$};
\end{scope}

%the threed small
\begin{scope}[shift={(6.55,3.55)}]
\newcommand{\Depth}{1}
\newcommand{\Height}{0.4}
\newcommand{\Width}{0.5}
\coordinate (MO3) at (0,0,0);
\coordinate (MA3) at (0,\Width,0);
\coordinate (MB3) at (0,\Width,\Height);
\coordinate (MC3) at (0,0,\Height);
\coordinate (MD3) at (\Depth,0,0);
\coordinate (ME3) at (\Depth,\Width,0);
\coordinate (MF3) at (\Depth,\Width,\Height);
\coordinate (MG3) at (\Depth,0,\Height);

\draw[RedLine,fill=magenta!10] (MO3) -- (MC3) -- (MG3) -- (MD3) -- cycle;% Bottom Face
\draw[RedLine,fill=magenta!10] (MO3) -- (MA3) -- (ME3) -- (MD3) -- cycle;% Back Face
\draw[RedLine,fill=magenta!10] (MO3) -- (MA3) -- (MB3) -- (MC3) -- cycle;% Left Face
\draw[RedLine,fill=none] (MD3) -- (ME3) -- (MF3) -- (MG3) -- cycle;% Right Face
\draw[RedLine,fill=none] (MC3) -- (MB3) -- (MF3) -- (MG3) -- cycle;% Front Face
\draw[RedLine,fill=none] (MA3) -- (MB3) -- (MF3) -- (ME3) -- cycle;% Top Face
\draw[RedLine,line width=0.75pt](MB3)--(MC3)--(MG3)--(MF3)--(MB3)
(MA3)--(ME3)--(MD3)--(MG3)
(MB3)--(MA3) (MF3)--(ME3);
%
\draw[BlueLine,fill=none,line width=0.75pt] (F3) -- (E3) -- cycle;% Right Face
%
\node[right=0.3 of $(G3)!0.5!(D3)$]{$X\in\mathbb{R}^{T\times R}$};
\node[RedLine,left=2pt of $(MB3)!0.9!(MA3)$](UI){$\boldsymbol{x_i}$};
\end{scope}

%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%
%the fourth
\begin{scope}[shift={(7,1)}]
\newcommand{\Depth}{3}
\newcommand{\Height}{0.4}
\newcommand{\Width}{1}
\coordinate (O4) at (0,0,0);
\coordinate (A4) at (0,\Width,0);
\coordinate (B4) at (0,\Width,\Height);
\coordinate (C4) at (0,0,\Height);
\coordinate (D4) at (\Depth,0,0);
\coordinate (E4) at (\Depth,\Width,0);
\coordinate (F4) at (\Depth,\Width,\Height);
\coordinate (G4) at (\Depth,0,\Height);

\draw[OliveLine,fill=yellow!10] (O4) -- (C4) -- (G4) -- (D4) -- cycle;% Bottom Face
\draw[OliveLine,fill=yellow!10] (O4) -- (A4) -- (E4) -- (D4) -- cycle;% Back Face
\draw[OliveLine,fill=yellow!10] (O4) -- (A4) -- (B4) -- (C4) -- cycle;% Left Face
\draw[OliveLine,fill=none] (D4) -- (E4) -- (F4) -- (G4) -- cycle;% Right Face
\draw[OliveLine,fill=none] (C4) -- (B4) -- (F4) -- (G4) -- cycle;% Front Face
\draw[OliveLine,fill=none] (A4) -- (B4) -- (F4) -- (E4) -- cycle;% Top Face
\draw[OliveLine,line width=0.75pt](B4)--(C4)--(G4)--(F4)--(B4)
(A4)--(E4)--(D4)--(G4)
(B4)--(A4)  (F4)--(E4);
%
\node[below=0.6 of $(C4)!0.5!(G4)$]{$V\in\mathbb{R}^{N\times R}$};
\end{scope}
%
%the fourth small
\begin{scope}[shift={(8.8,1)}]
\newcommand{\Depth}{0.4}
\newcommand{\Height}{0.4}
\newcommand{\Width}{1}
\coordinate (MO4) at (0,0,0);
\coordinate (MA4) at (0,\Width,0);
\coordinate (MB4) at (0,\Width,\Height);
\coordinate (MC4) at (0,0,\Height);
\coordinate (MD4) at (\Depth,0,0);
\coordinate (ME4) at (\Depth,\Width,0);
\coordinate (MF4) at (\Depth,\Width,\Height);
\coordinate (MG4) at (\Depth,0,\Height);

\draw[RedLine,fill=magenta!10] (MO4) -- (MC4) -- (MG4) -- (MD4) -- cycle;% Bottom Face
\draw[RedLine,fill=magenta!10] (MO4) -- (MA4) -- (ME4) -- (MD4) -- cycle;% Back Face
\draw[RedLine,fill=magenta!10] (MO4) -- (MA4) -- (MB4) -- (MC4) -- cycle;% Left Face
\draw[RedLine,fill=none] (MD4) -- (ME4) -- (MF4) -- (MG4) -- cycle;% Right Face
\draw[RedLine,fill=none] (MC4) -- (MB4) -- (MF4) -- (MG4) -- cycle;% Front Face
\draw[RedLine,fill=none] (MA4) -- (MB4) -- (MF4) -- (ME4) -- cycle;% Top Face
\draw[RedLine,line width=0.75pt](MB4)--(MC4)--(MG4)--(MF4)--(MB4)
(MA4)--(ME4)--(MD4)--(MG4)
(MB4)--(MA4)  (MF4)--(ME4);
\node[RedLine,below=2pt of $(MC4)!0.5!(MG4)$](UI){$\boldsymbol{v_i}$};
%
\draw[OliveLine,fill=none,line width=0.75pt] (B4) -- (F4) -- cycle;% Right Face
\end{scope}
\end{tikzpicture}}
```
:::

- **CP decomposition**\index{Tensor Decomposition!CP decomposition}: Expresses a tensor as a sum of rank-one components: $\mathcal{A} \approx \sum_{r=1}^{k} u_r \otimes v_r \otimes w_r$
- **Tucker decomposition**\index{Tensor Decomposition!Tucker decomposition}: Uses a core tensor with factor matrices: $\mathcal{A} \approx \mathcal{G} \times_1 U \times_2 V \times_3 W$
- **Tensor-Train (TT)**\index{Tensor Decomposition!Tensor-Train}: Factorizes into a sequence of lower-rank matrices, particularly effective for very high-dimensional tensors

Tensor decomposition applies to convolutional filters (approximating 4D weight tensors), attention mechanisms in transformers, and embedding layers in NLP models. The trade-offs mirror LRMF: compression versus information loss, and the additional computational overhead of tensor contractions during inference.

@tbl-lrmf-tensor compares LRMF and tensor decomposition:

| **Feature**                   | **Low-Rank Matrix Factorization (LRMF)**                            | **Tensor Decomposition**                                                          |
|:------------------------------|:--------------------------------------------------------------------|:----------------------------------------------------------------------------------|
| **Applicable Data Structure** | Two-dimensional matrices                                            | Multi-dimensional tensors                                                         |
| **Compression Mechanism**     | Factorizes a matrix into two or more lower-rank matrices            | Decomposes a tensor into multiple lower-rank components                           |
| **Common Methods**            | Singular Value Decomposition (SVD), Alternating Least Squares (ALS) | CP Decomposition, Tucker Decomposition, Tensor-Train (TT)                         |
| **Computational Complexity**  | Generally lower, often $ O(mnk) $ for a rank-$ k $ approximation    | Higher, due to iterative optimization and tensor contractions                     |
| **Storage Reduction**         | Reduces storage from $ O(mn) $ to $ O(mk + kn) $                    | Achieves higher compression but requires more complex storage representations     |
| **Inference Overhead**        | Requires additional matrix multiplication                           | Introduces additional tensor operations, potentially increasing inference latency |
| **Primary Use Cases**         | Fully connected layers, embeddings, recommendation systems          | Convolutional filters, attention mechanisms, multi-modal learning                 |
| **Implementation Complexity** | Easier to implement, often involves direct factorization methods    | More complex, requiring iterative optimization and rank selection                 |

: **Dimensionality & Factorization**: Low-rank matrix factorization (LRMF) and tensor decomposition reduce model storage requirements by representing data with fewer parameters, but introduce computational trade-offs during inference; LRMF applies to two-dimensional matrices, while tensor decomposition extends this approach to multi-dimensional tensors for greater compression potential. {#tbl-lrmf-tensor}

In practice, LRMF and tensor decomposition can be combined: fully connected layers compressed via LRMF while convolutional kernels use tensor decomposition. The choice depends on the model's structure and whether memory or latency is the primary constraint.

The techniques explored so far (pruning, distillation, and factorization) all optimize *existing* architectures. Neural Architecture Search takes a different approach: discovering architectures that are efficient *by construction*.

### Neural Architecture Search {#sec-model-compression-neural-architecture-search-cf12}

\index{Neural Architecture Search (NAS)!bi-level optimization}
Pruning, knowledge distillation, and other techniques explored in previous sections rely on human expertise to determine optimal model configurations.\index{Neural Architecture Search (NAS)!definition} Selecting optimal architectures requires extensive experimentation, and even experienced practitioners may overlook more efficient designs [@elsken2019neural]. Neural Architecture Search (NAS) automates this process by systematically exploring large spaces of possible architectures to identify those that best balance accuracy, computational cost, memory efficiency, and inference latency [@zoph2017neural].

The three-stage feedback loop in @fig-nas-flow captures the essence of how NAS works. NAS[^fn-hardware-aware-nas] operates through three interconnected stages: defining the search space (architectural components and constraints), applying search strategies (reinforcement learning[@zoph2017neural], evolutionary algorithms, or gradient-based methods) to explore candidate architectures, and evaluating performance to ensure discovered designs satisfy accuracy and efficiency objectives. The key insight is that this feedback loop allows the search to learn from each evaluation, progressively focusing on promising regions of the architecture space. This automation enables the discovery of novel architectures that often match or surpass human-designed models while requiring substantially less expert effort.

[^fn-hardware-aware-nas]: **Hardware-Aware NAS**: Architecture search [@tan2019mnasnet] directly optimizing for target hardware latency rather than proxy metrics like FLOPs. MnasNet (2019) uses actual measured latency in the search objective, finding architectures with 1.8 $\times$ speedup over MobileNetV2 at higher accuracy. Platform-specific search discovers that optimal architectures differ significantly between mobile CPUs, GPUs, and TPUs.

::: {#fig-nas-flow fig-env="figure" fig-pos="htb" fig-cap="**Neural Architecture Search Flow**: Three components form a feedback loop: a Search Space defines candidate operations, a Search Strategy selects architectures, and a Performance Estimation Strategy evaluates each candidate. The strategy iterates by feeding performance estimates back into the search until convergence." fig-alt="Three-box flowchart showing NAS process. Search Space box feeds into Search Strategy box, which exchanges Architecture and Performance estimate with Performance Estimation Strategy box in a feedback loop."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}\small]
\tikzset{%
    Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={align=center,
    inner xsep=2pt,
    node distance=2.7,
    draw=BlueLine,
    line width=0.75pt,
    fill=BlueL,
    text width=32mm,
    minimum width=32mm, minimum height=10mm
  },
Box2/.style={Box,fill=VioletL2,draw=VioletLine}
 }

\node[Box](B1){Search Space \\ $\mathcal{A}$};
\node[Box2,right=of B1](B2){Search Strategy};
\node[Box2,right=of B2](B3){Performance\\ Estimation Strategy};
  \scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,minimum height=40mm,
yshift=6.5mm,fill=BackColor!30,fit=(B2)(B3),line width=1pt](BB1){};
\node[below=4pt of BB1.north,inner sep=0pt,
anchor=north,align=center]{One-shot approach:\\
learning model architecture parameters and weights together};
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2.8)--node[above,align=center]{Architecture \\ $A\in\mathcal{A}$}(B3.172);
\draw[Line,latex-](B2.352)--node[below,align=center]{Performance\\ estimate of $A$}(B3.188);
\end{tikzpicture}
```
:::

The effectiveness of NAS depends on three design decisions: what architectures to search over (the search space), how to explore that space efficiently (the search strategy[^fn-rl-nas][^fn-evolutionary-nas]), and how to evaluate each candidate's fitness for deployment (the performance metrics[^fn-nas-evaluation-metrics]). The following subsections formalize each decision, beginning with the optimization problem that NAS must solve.

[^fn-rl-nas]: **Reinforcement Learning NAS**: Uses RL controller networks to generate architectures, with accuracy as reward signal. Google's NASNet controller was trained for 22,400 GPU-hours on 800 GPUs, but discovered architectures achieving 82.7% ImageNet accuracy, 28% better than human-designed ResNet at similar FLOP budgets.

[^fn-evolutionary-nas]: **Evolutionary NAS**: Treating architectures as genomes evolved through mutation (adding/removing layers) and crossover (combining parent architectures). AmoebaNet required 3,150 GPU-days achieving 83.9% ImageNet accuracy. Regularized evolution outperformed RL-based NAS in head-to-head comparisons. Modern approaches combine evolution with weight-sharing for 1000 $\times$ speedup.

[^fn-nas-evaluation-metrics]: **NAS Evaluation Metrics**: Multi-objective optimization balancing accuracy, latency, memory, and energy creates Pareto frontiers of non-dominated architectures. Practitioners select architectures based on deployment constraints: edge devices prioritize latency/energy; servers prioritize throughput. Scalarization weights or evolutionary multi-objective methods explore these tradeoffs systematically.

#### The NAS Optimization Problem {#sec-model-compression-nas-optimization-problem-7f8e}

NAS faces a chicken-and-egg problem: we cannot know how good an architecture is until we train it, but training is expensive. This creates two nested decisions—choosing which operations to include (the architecture) and finding the best parameters for those operations (the weights). The architecture defines *what* to optimize; the weights define *how well* that architecture can perform.

NAS is therefore a **bi-level optimization problem**: the outer loop searches the architecture space $\mathcal{A}$, while the inner loop trains candidate architectures to evaluate performance. Formally, we seek the optimal architecture $\alpha^*$ that minimizes validation loss $\mathcal{L}_{\text{val}}$ under constraints $C$ (latency, memory):

$$
\alpha^* = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{subject to} \quad C(\alpha) \leq C_{\text{max}}
$$

where $w^*(\alpha)$ represents the optimal weights for architecture $\alpha$, obtained by minimizing training loss:

$$
w^*(\alpha) = \arg\min_{w} \mathcal{L}_{\text{train}}(w, \alpha)
$$

The core challenge is the cost of the inner loop: evaluating each candidate requires expensive training. A search space with just 10 choices across 20 layers yields $10^{20}$ architectures, making exhaustive search impossible. Efficient NAS methods address this by restricting the search space, using faster search strategies, or accelerating evaluation.

#### Search Space Design {#sec-model-compression-search-space-design-62db}

The search space defines what architectures NAS can discover. Well-designed search spaces incorporate domain knowledge to focus search on promising regions while remaining flexible enough to discover novel patterns.

##### Cell-Based Search Spaces {#sec-model-compression-cellbased-search-spaces-9eb7}

Rather than searching entire network architectures, cell-based NAS searches for reusable computational blocks (cells) that can be stacked to form complete networks. For example, a convolutional cell might choose from operations like 3 $\times$ 3 convolution, 5 $\times$ 5 convolution, depthwise separable convolution, max pooling, or identity connections. A simplified cell with 4 nodes and 2 operations per edge yields roughly 10,000 possible cell designs, far more tractable than searching full architectures. EfficientNet uses this approach to discover scalable cell designs that generalize across different model sizes.

##### Hardware-Aware Search Spaces {#sec-model-compression-hardwareaware-search-spaces-9841}

Hardware-aware NAS\index{Neural Architecture Search (NAS)!hardware-aware}\index{Hardware-Aware Optimization!neural architecture search} extends search spaces to include deployment constraints as first-class objectives. Rather than optimizing solely for accuracy and FLOPs, the search explicitly minimizes actual latency on target hardware (mobile CPUs, GPUs, edge accelerators). MobileNetV3's search space includes a latency prediction model that estimates inference time for each candidate architecture on Pixel phones without actually deploying them. This hardware-in-the-loop approach ensures discovered architectures run efficiently on real devices rather than just achieving low theoretical FLOP counts.

#### Search Strategies {#sec-model-compression-search-strategies-84ad}

Search strategies determine how to explore the architecture space efficiently without exhaustive enumeration. @tbl-nas-strategies compares the trade-offs between search cost, architectural diversity, and optimality guarantees for each approach.

| **Strategy**                | **Search Efficiency** | **When to Use**                     | **Key Challenge**                       |
|:----------------------------|----------------------:|:------------------------------------|:----------------------------------------|
| **Reinforcement Learning**  |     400-1000 GPU-days | Novel domains, unconstrained search | High computational cost                 |
| **Evolutionary Algorithms** |      200-500 GPU-days | Parallel infrastructure available   | Requires large populations              |
| **Gradient-Based (DARTS)**  |          1-4 GPU-days | Limited compute budget              | May converge to suboptimal local minima |

: **NAS Search Strategy Comparison**: Trade-offs between search efficiency, use cases, and limitations for different NAS approaches. Reinforcement learning offers unconstrained exploration at high cost, evolutionary methods leverage parallelism, and gradient-based approaches achieve dramatic speedups with potential optimality trade-offs. {#tbl-nas-strategies}

\index{Reinforcement Learning!NAS application}
\index{Neural Architecture Search (NAS)!reinforcement learning strategy}
\index{NASNet!RL-discovered architecture}
Reinforcement learning based NAS treats architecture search as a sequential decision problem where a controller generates architectures and receives accuracy as reward. The controller (typically an LSTM) learns to propose better architectures over time through policy gradient optimization. While this approach discovered groundbreaking architectures like NASNet, the sequential nature limits parallelism and requires hundreds of GPU-days.

\index{Evolutionary Algorithms!NAS application}
\index{Neural Architecture Search (NAS)!evolutionary strategy}
\index{AmoebaNet!evolutionary NAS}
Evolutionary algorithms maintain a population of candidate architectures and iteratively apply mutations (changing operations, adding connections) and crossover (combining parent architectures) to generate offspring. Fitness-based selection retains high-performing architectures for the next generation. AmoebaNet used evolution to achieve state-of-the-art results, with massive parallelism amortizing the cost across thousands of workers.

Gradient-based methods like DARTS\index{DARTS (Differentiable Architecture Search)} (Differentiable Architecture Search) [@liu2019darts] represent the search space as a continuous relaxation where all possible operations are weighted combinations. Rather than discrete sampling, DARTS optimizes architecture weights and model weights jointly using gradient descent. By making the search differentiable, DARTS reduces search cost from hundreds to just 1-4 GPU-days, though the continuous relaxation may miss discrete architectural patterns that discrete search methods discover.

\index{MnasNet!hardware-aware NAS}
Hardware-aware NAS moves beyond FLOPs as a proxy for efficiency, directly optimizing for actual deployment metrics. MnasNet's search incorporates a latency prediction model trained on thousands of architecture-latency pairs measured on actual mobile phones. The search objective combines accuracy and latency through a weighted product:

$$
\text{Reward}(\alpha) = \text{Accuracy}(\alpha) \times \left(\frac{L_{\text{lat}}(\alpha)}{L_{\text{lat,target}}}\right)^\beta
$$

where $L_{\text{lat}}(\alpha)$ is measured latency, $L_{\text{lat,target}}$ is the latency constraint, and $\beta$ controls the accuracy-latency trade-off. This formulation penalizes architectures that exceed latency targets while rewarding those that achieve high accuracy within the budget. MnasNet discovered that inverted residuals with varying expansion ratios achieve better accuracy-latency trade-offs than uniform expansion, a design insight that manual exploration likely would have missed.

#### When to Use NAS {#sec-model-compression-use-nas-2b47}

Neural Architecture Search is a powerful tool, but its significant computational cost demands careful consideration of when the investment is justified.

NAS becomes worthwhile for novel hardware platforms with unique constraints (new accelerator architectures, extreme edge devices) where existing architectures are poorly optimized. It also makes sense at massive deployment scale (billions of inferences) where even 1-2% efficiency improvements justify the upfront search cost, or when multiple deployment configurations require architecture families (cloud, edge, mobile) that amortize one search across many variants.

Conversely, avoid NAS when working with standard deployment constraints (e.g., ResNet-50 accuracy on NVIDIA GPUs) where well-optimized architectures already exist. Similarly, if the compute budget is limited (less than 100 GPU-days available), even efficient NAS methods like DARTS become infeasible. Rapidly changing requirements also make NAS impractical, as architecture selection may become obsolete before the search completes.

For most practitioners, starting with existing NAS-discovered architectures (EfficientNet, MobileNetV3, MnasNet) provides better ROI than running NAS from scratch. These architectures are highly tuned and generalize well across tasks. Reserve custom NAS for scenarios with truly novel constraints or deployment scales that justify the investment.

#### Architecture Examples {#sec-model-compression-architecture-examples-b1ce}

NAS-discovered architectures consistently demonstrate design insights that manual exploration would likely miss. EfficientNet\index{EfficientNet}\index{Neural Architecture Search (NAS)!EfficientNet}\index{Compound Scaling} discovered that depth, width, and resolution should scale with fixed compound coefficients rather than independently — a principle that achieves higher accuracy with fewer parameters across the entire model family from mobile to cloud deployment. MobileNetV3\index{MobileNetV3}\index{Neural Architecture Search (NAS)!MobileNetV3}\index{Hardware-Aware Optimization!mobile deployment} optimized specifically for mobile hardware, discovering that inverted residual blocks with squeeze-and-excitation layers and the h-swish activation function achieve better accuracy-latency trade-offs than any prior MobileNet variant. FBNet extended this to real-time inference on mobile CPUs by incorporating device-specific latency constraints directly into the search objective [@radosavovic2020designing].

Beyond convolutional networks, NAS has been applied to transformer architectures: NAS-BERT discovers efficient structures that retain strong language understanding while reducing compute and memory overhead, and similar approaches design lightweight vision transformers with attention mechanisms tailored for edge deployment. The common thread is that encoding efficiency constraints directly into the search process produces architectures that are more computationally efficient and hardware-adapted than manual design.

The structural techniques covered so far (pruning, distillation, factorization, and NAS) all optimize *what* computations the model performs: which parameters exist, which connections remain, and how the architecture is structured. These techniques can dramatically reduce parameter counts and theoretical FLOPs. Even a perfectly pruned model with an optimal architecture, however, faces a fundamental constraint: every surviving weight and activation must be stored and processed at some numerical precision.

This brings us to the second dimension of our optimization framework: *how precisely* should those computations be performed? Numerical precision directly determines memory footprint and arithmetic cost, two resources that structural optimization cannot touch. A 32-bit floating-point number uses 4 bytes of memory and requires expensive floating-point arithmetic; an 8-bit integer uses 1 byte and enables fast integer math. For many models, this 4 $\times$ reduction in precision translates directly to 4 $\times$ reduction in memory bandwidth, and since LLM inference is bandwidth-bound, this means 4 $\times$ faster token generation. The accuracy cost is often less than 1%.

Quantization is arguably the single most impactful optimization technique for deployment, especially for large language models. It requires no architectural changes, applies post-training in many cases, and delivers immediate, hardware-agnostic benefits. Before examining the techniques in detail, the following checkpoint tests understanding of the structural optimization methods covered so far.

::: {.callout-checkpoint title="Structural Optimization Checkpoint" collapse="true"}

Test your understanding of the structural optimization techniques covered so far:

- [ ] Can you explain the key difference between structured and unstructured pruning in terms of hardware efficiency? Consider how each interacts with GPU and TPU execution patterns.
- [ ] Do you understand why knowledge distillation typically preserves accuracy better than aggressive pruning? Think about what information each method retains from the original model.
- [ ] Can you identify when to choose Neural Architecture Search over manual architecture design? Consider the trade-offs in computational cost, design space coverage, and hardware-specific optimization.
:::


## Quantization and Precision {#sec-model-compression-quantization-precision-cd46}
\index{Model Compression!precision optimization}

A `{python} llm_7b_str` billion parameter language model stored in FP16 consumes `{python} llm_7b_mem_str` GB, yet users expect it to run on a smartphone with `{python} smartphone_ram_str` GB of shared RAM. Structural optimization alone cannot bridge this gap: even aggressive pruning rarely exceeds 50--70% parameter reduction, leaving a model far too large for the target device. The remaining leverage comes from a different dimension entirely: reducing the number of bits used to represent each parameter. *Quantization*, the process of reducing numerical precision, offers one of the most impactful optimizations for deployment, because it trades bits for speed and efficiency with minimal accuracy loss.

::: {.callout-definition title="Quantization"}

***Quantization***\index{Quantization} is the reduction of **Information Fidelity** to match the **Noise Floor** of the model. It maps continuous values to discrete bins (e.g., FP32 to INT8), linearly reducing **Memory Bandwidth** and **Energy** consumption while exploiting the robustness of neural networks to low-precision arithmetic.

:::

\index{Quantization!etymology}
\index{Shannon, Claude!quantization theory}
Quantization[^fn-quantization-etymology] affects every neural network weight and activation stored at some numerical precision: FP32 (32 bits), FP16 (16 bits), INT8 (8 bits), or lower.

[^fn-quantization-etymology]: **Quantization**: From Latin "quantus" (how much), via quantum physics where it describes discrete energy levels. The technique's theoretical foundations trace to Claude Shannon at Bell Labs, whose 1948 paper "A Mathematical Theory of Communication" established the fundamental limits of representing continuous signals with discrete values. In signal processing (1940s-1960s), quantization meant mapping continuous values to discrete levels, introducing "quantization error"—and Shannon proved exactly how much precision you could sacrifice before information was irretrievably lost. ML borrowed both the term and the mathematics directly: converting FP32 weights to INT8 maps continuous values to 256 discrete levels, trading precision for efficiency. When we analyze quantization error in neural networks, we are applying Shannon's six-decade-old framework to a new domain.

This choice directly impacts three system properties. Memory shrinks because an INT8 model is 4 $\times$ smaller than FP32, enabling deployment on devices that could never hold the full-precision weights. Bandwidth demand drops proportionally: loading INT8 weights requires 4 $\times$ less memory traffic, directly accelerating the bandwidth-bound inference that dominates LLM generation. Compute cost falls as well, since INT8 arithmetic is faster and cheaper than FP32 on most hardware with dedicated low-precision units [@gupta2015deep; @wang2019benchmarking].

The accuracy cost of reduced precision varies by model and technique. CNNs typically tolerate INT8 quantization with <1% accuracy loss; transformers may require more care. This section covers three approaches in increasing complexity: **post-training quantization**\index{Quantization!post-training (PTQ)} (PTQ) for rapid deployment, **quantization-aware training**\index{Quantization!aware training (QAT)} (QAT) for production systems requiring minimal accuracy loss, and **extreme quantization**\index{Quantization!extreme (INT4, binary)} (INT4, binary) for the most constrained environments.

### Precision and Energy {#sec-model-compression-precision-energy-cc8e}

\index{Energy Efficiency!numerical precision impact}
Efficient numerical representations reduce storage requirements, computation latency, and power usage, benefiting mobile AI, embedded systems, and cloud inference alike. Precision levels can be tuned to specific hardware capabilities, maximizing throughput on AI accelerators such as GPUs, TPUs, NPUs, and edge AI chips.

#### Energy Costs {#sec-model-compression-energy-costs-d1d5}

\index{DRAM!energy cost of access}
Beyond computational and memory benefits, the energy costs associated with different numerical precisions reinforce the case for reduced precision. @fig-quantized-energy quantifies these energy differences: a 32-bit floating-point addition (FAdd) consumes approximately `{python} energy_add_fp32_str` pJ, whereas a 16-bit floating-point addition requires only `{python} energy_add_fp16_str` pJ. Similarly, a 32-bit integer addition costs `{python} energy_add_int32_str` pJ, while an 8-bit integer addition is just `{python} energy_add_int8_str` pJ. These savings compound across large-scale models operating over billions of operations, supporting both cost reduction and sustainability goals.

::: {#fig-quantized-energy fig-env="figure" fig-pos="htb" fig-cap="**Energy per Operation by Precision.** Bar chart comparing energy in picojoules for arithmetic operations (FP32 multiply: 3.7 pJ, INT8 add: 0.03 pJ) and SRAM memory accesses (5 to 50 pJ by cache size). Lower precision yields order-of-magnitude energy savings. Source: IEEE Spectrum." fig-alt="Bar chart comparing energy consumption in picojoules for arithmetic operations and memory accesses. FP32 multiply uses 3.7 pJ, INT8 add uses 0.03 pJ. SRAM reads range from 5 to 50 pJ depending on cache size."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{Softmax}{HTML}{FDAE61}
\definecolor{ReLU}{HTML}{ABDDA4}
\definecolor{Tanh}{HTML}{2B83BA}
%RIGHT
 \begin{scope}[local bounding box=RR2,shift={(7,0)}]
\begin{axis}[
    axis line style={draw=none},
    width=105mm,
    height=75mm,
    xlabel={Operation},
    ylabel={Energy (pJ)},
    title={Energy Consumption of Different Operations},
    title style={yshift=-4pt},
    ymin=-1.1,ymax=53,
    ytick={0,10,...,50},
    tick label style={/pgf/number format/assume math mode=true},
    yticklabel style={font=\footnotesize\usefont{T1}{phv}{m}{n},
    /pgf/number format/.cd, fixed, fixed zerofill, precision=0},
    xticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},rotate=25,anchor=north east},
    ylabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    enlarge x limits=0.1,
    grid=both,
    minor tick num=1,
    major grid style={black!60},
    tick style={draw=none},
    nodes near coords,
    every node near coord/.append style={yshift=2pt,
  font=\scriptsize\usefont{T1}{phv}{m}{n}, anchor=south,black,
  /pgf/number format/assume math mode=true,fill=white,
   /pgf/number format/.cd, fixed, fixed zerofill, precision=2,zerofill=false,},
    major tick length=1mm,
    xtick={1,2,3,4,5,6,7,8},
    xticklabels={Integer ADD (8b), Integer ADD (16b), Integer ADD (32b),
                Integer MULT (8b), Integer MULT (32b),
                8 KB SRAM Read (32b), 32 KB SRAM Read (32b), 1 MB SRAM Read (32b) },
    every axis plot/.append style={
          ybar,
          bar width=9mm,
          bar shift=0pt,
          fill
        }]
      \addplot[VioletLine]coordinates {(1,0.03)};
      \addplot[BrownLine]coordinates{(2,0.05)};
      \addplot[BlueLine]coordinates{(3,0.1)};
      \addplot[Softmax]coordinates{(4,0.2)};
      \addplot[Softmax]coordinates {(5,3.1)};
      \addplot[Tanh]coordinates{(6,5)};
      \addplot[ReLU]coordinates{(7,10)};
      \addplot[RedLine]coordinates{(8,50)};
%
\coordinate(L)at(axis cs:2,0.05);
\coordinate(D)at(axis cs:8,25);
\coordinate(S1)at(axis cs:0,27);
\coordinate(S2)at(axis cs:2,30);
\end{axis}
\node[fill=white,text=red, font=\bfseries\large\usefont{T1}{phv}{m}{n}] at (S2) {100 $\times$};
\draw[red,-latex,line width=2pt](L)--(D);
\end{scope}
%LEFT
\path[red](S1)--++(180:5)coordinate(S);
%%
\begin{scope}[local bounding box=RR1,shift={(S)}]
\colorlet{col1}{BrownLine!35}
\colorlet{col2}{BrownLine!15}
\colorlet{col3}{BrownLine!5}
\matrix(T)[%nodes in empty cells,
  matrix of nodes,
  row sep =3\pgflinewidth,
  column sep = 3\pgflinewidth,
  nodes={text height=1.5ex,text depth=0.25ex, text width=2mm, draw=white,
  line width=0.25pt, font=\footnotesize\usefont{T1}{phv}{m}{n}},
  row 1/.style={nodes={align=center,fill=col1}},
  column 2/.style = {nodes={text width=40mm,align=left}},
  column 3/.style = {nodes={text width=20mm,align=center}},
  ]
  {
&\textbf{Operation}&\textbf{Energy\_pJ}\\
1&|[fill=col3]| Integer ADD (8b) &|[fill=col3]| 0.03\\
2&|[fill=col2]| Integer ADD (16b)&|[fill=col2]| 0.05\\
3&|[fill=col3]| Integer ADD (32b)&|[fill=col3]| 0.10\\
4&|[fill=col2]| Integer MULT (8b)&|[fill=col2]| 0.20\\
5&|[fill=col3]| Integer MULT (32b)&|[fill=col3]|3.10\\
6&|[fill=col2]| 8 KB SRAM Read (32b)&|[fill=col2]|5.00\\
7&|[fill=col3]| 32 KB SRAM Read (32b)&|[fill=col3]|10.00\\
8&|[fill=col2]| 1 MB SRAM Read (32b)&|[fill=col2]|50.00\\
  };
\end{scope}
\end{tikzpicture}

```
:::

\index{DLRM!embedding quantization}
These energy savings take on a different character for models where memory capacity, not compute, is the binding constraint. *DLRM embedding quantization* illustrates this distinction.

::: {.callout-lighthouse title="DLRM and Embedding Quantization"}
**The Memory Capacity Constraint**: Our **DLRM Lighthouse** (@sec-network-architectures) presents a unique compression challenge. Unlike ResNet or GPT, which are constrained by compute or bandwidth, DLRM is constrained by **Memory Capacity**. Its embedding tables can reach terabytes in size, far exceeding GPU memory.

For DLRM, quantization is not about faster math; it's about **storage density**. Quantizing embedding tables from FP32 to INT8 (or INT4) reduces memory footprint by 4–8 $\times$, allowing larger tables to fit on fewer GPUs. This is a pure **Information Density** optimization: we compress the lookup table so the **Machine** (Physics) can hold the **Algorithm** (Logic).
:::

\index{Keyword Spotting (KWS)!quantization imperative}
While DLRM operates at the terabyte scale, our Smart Doorbell Lighthouse faces the opposite extreme, demonstrating a phenomenon we call the *TinyML quantization imperative*, where compression becomes an existential requirement.

::: {.callout-lighthouse title="The TinyML Quantization Imperative"}
**The Energy and Storage Constraint**: Our **Smart Doorbell Lighthouse** operates at the opposite extreme of the Iron Law from DLRM. While DLRM optimizes for terabyte-scale capacity, the Smart Doorbell's Keyword Spotting (KWS) model must operate within a 100 KB budget to run on a microcontroller with 256 KB RAM.

In FP32, even the compact DS-CNN architecture consumes 4 $\times$ more memory bandwidth and energy per inference than in INT8. For an always-on device running on a coin cell battery, this 4 $\times$ energy difference translates directly to battery life: a device that lasts 1 month on FP32 might last 4 months on INT8. Here, quantization is the primary lever for the **Energy Term** ($O / (R_{peak} \cdot \eta)$) of the Iron Law.
:::

Beyond direct compute savings, reducing numerical precision has a significant impact on memory energy consumption, which often dominates total system power. Lower-precision representations reduce data storage requirements and memory bandwidth usage, leading to fewer and more efficient memory accesses. Accessing memory, particularly off-chip DRAM, is far more energy-intensive than performing arithmetic operations: DRAM accesses require orders of magnitude more energy (1.3-2.6 nJ) compared to cache accesses (e.g., 10 pJ for an 8 KB L1 cache access). An instruction's total energy can therefore be dominated by memory access patterns rather than computation[^fn-energy-efficiency-metrics].

[^fn-energy-efficiency-metrics]: **Energy Efficiency Metrics**: INT8 quantization reduces energy consumption by 4–8 $\times$ over FP32 on supported hardware. MobileNetV2 INT8 consumes `{python} mobilenet_int8_mj_str`mJ vs. `{python} mobilenet_fp32_mj_str`mJ FP32 per inference on Cortex-A75. ResNet-50 on TPU v4 achieves `{python} tpu_v4_tops_per_w_str` TOPS/Watt vs. `{python} v100_tops_per_w_str` TOPS/Watt on V100 GPU.

Reducing numerical precision thus improves efficiency on two fronts: faster computation and less data movement. This dual benefit is especially valuable for hardware accelerators and edge devices, where memory bandwidth and power efficiency are binding constraints.

#### Performance Gains {#sec-model-compression-performance-gains-0656}

The practical payoff of quantization becomes concrete in @fig-quantization_impact. Compare the left bars (inference time) and right bars (model size) in each category to see the gains when moving from FP32 to INT8. Quantized models achieve up to $4\times$ faster inference while reducing storage requirements by the same factor, making them well suited for deployment in resource-constrained environments.

::: {#fig-quantization_impact fig-env="figure" fig-pos="htb" fig-cap="**Quantization Impact**: Moving from FP32 to INT8 reduces inference time by up to 4 times while decreasing model size by a factor of 4, making models more efficient for resource-constrained environments." fig-alt="Two stacked bar charts comparing FP32 and INT8. Left chart shows inference time in milliseconds for Inception, MobileNet, and ResNet. Right chart shows model size in megabytes. INT8 consistently smaller and faster."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{other}{HTML}{D7191C}
\definecolor{WeightGradient}{HTML}{FDAE61}
\definecolor{Optimization}{HTML}{ABDDA4}
\definecolor{Activation}{HTML}{2B83BA}

\pgfplotsset{
  mybarstyle/.style={
   /pgf/number format/.cd,
   1000 sep={},
    width=75mm,
    height=75mm,
    axis line style={draw=none},
    ybar stacked, ymin=0,
    bar width=11mm,
    title style={font=\fontsize{8pt}{8}\selectfont\usefont{T1}{phv}{m}{n},yshift=-2pt},
    symbolic x coords={Inception\_v3,MobileNet\_v1, ResNet\_v2},
    xtick=data,
   legend style={at={(0.85,0.92)}, anchor=north},
   legend cell align=left,
   legend style={fill=BrownL!40,draw=BrownLine,row sep=1.85pt,
   font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
    enlarge x limits=0.2,
    tick label style={/pgf/number format/assume math mode=true},
    ticklabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    grid=major,
    major grid style={black!60},
   every node near coord/.append style={xshift=1pt,
   /pgf/number format/assume math mode=true,
    font=\fontsize{6pt}{6}\selectfont\usefont{T1}{phv}{m}{n}, anchor=center},
    %
    yticklabel style={font=\fontsize{7pt}{8}\selectfont\usefont{T1}{phv}{m}{n}},
    xticklabel style={font=\fontsize{7pt}{8}\selectfont\usefont{T1}{phv}{m}{n},yshift=-3pt},
    ylabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
    xlabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
 }
}
\begin{scope}[local bounding box=RR,shift={(0,0)}]
\begin{axis}[mybarstyle,
    ymin=0,
    ytick={0,250,500,750,1000,1250},
    ylabel={Value},
    title={Inference\_Time},
   nodes near coords={\pgfmathprintnumber{\pgfplotspointmeta}~ms},
    ]
    \addplot [fill=WeightGradient!80,draw=none] coordinates {
        ({Inception\_v3},800)
        ({MobileNet\_v1},30)
        ({ResNet\_v2},300)};
    \addplot [fill=Activation!90,draw=none] coordinates {
        ({Inception\_v3},500)
        ({MobileNet\_v1},700)
        ({ResNet\_v2},70)};
\end{axis}
 \end{scope}
 %%%%RIGHT
 \begin{scope}[local bounding box=RR2,shift={(7,0)}]
\begin{axis}[mybarstyle,
  title={Model\_Size},
  nodes near coords={\pgfmathprintnumber{\pgfplotspointmeta}~MB},
    ]
    \addplot [fill=WeightGradient!80,draw=none] coordinates {
        ({Inception\_v3},135)
        ({MobileNet\_v1},4)
        ({ResNet\_v2},24)};
    \addplot [fill=Activation!90,draw=none] coordinates {
        ({Inception\_v3},71)
        ({MobileNet\_v1},45)
        ({ResNet\_v2},13)};

\legend{FP32,INT8}
\coordinate (legend) at (axis description cs:0.85,0.92);
\end{axis}
\node[fill=white,above=1pt of legend,anchor=south,
 font=\fontsize{8pt}{8}\selectfont\usefont{T1}{phv}{m}{n}]{Precision};
 \end{scope}
 \node[draw=none,inner sep=0pt,fit=(RR)(RR2)](BB){};
 \node[above=0pt of BB]{Impact of Quantization on Inference Time and Model Size};
  \node[below=-2pt of BB]{Model};
  \end{tikzpicture}
```
:::

To make these gains concrete, consider the *quantization savings* when deploying a modern large language model at reduced precision.

```{python}
#| label: quant-savings-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ QUANTIZATION SAVINGS CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "Quantization Savings" — Llama 3 8B deployment
# │
# │ Goal: Demonstrate storage savings from LLM quantization.
# │ Show: That INT4 compression enables an 8B model to fit on an 8GB GPU.
# │ How: Calculate total weight bytes for FP16 vs. INT4 precision.
# │
# │ Imports: mlsys.formatting (fmt), mlsys.constants (BYTES_FP16, BYTES_INT4, byte, GB)
# │ Exports: llm_params_b_str, fp16_bytes_str, int4_bytes_str,
# │          fp16_size_gb_str, int4_size_gb_str, compression_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import BYTES_FP16, BYTES_INT4, byte, GB

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class QuantizationSavings:
    """
    Namespace for Quantization Savings calculation.
    Scenario: FP16 vs INT4 storage for an 8B model.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    params_b = 8
    bytes_fp16 = 2.0
    bytes_int4 = 0.5

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    fp16_size_gb = params_b * bytes_fp16
    int4_size_gb = params_b * bytes_int4

    ratio = fp16_size_gb / int4_size_gb

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(ratio == 4.0, f"FP16/INT4 ratio should be exactly 4.0, got {ratio}")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    llm_params_b_str = f"{params_b}"
    fp16_bytes_str = f"{int(bytes_fp16)}"
    int4_bytes_str = f"{bytes_int4}"
    fp16_size_gb_str = f"{int(fp16_size_gb)}"
    int4_size_gb_str = f"{int(int4_size_gb)}"
    compression_ratio_str = f"{int(ratio)}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
llm_params_b_str = QuantizationSavings.llm_params_b_str
fp16_bytes_str = QuantizationSavings.fp16_bytes_str
int4_bytes_str = QuantizationSavings.int4_bytes_str
fp16_size_gb_str = QuantizationSavings.fp16_size_gb_str
int4_size_gb_str = QuantizationSavings.int4_size_gb_str
compression_ratio_str = QuantizationSavings.compression_ratio_str
```

::: {.callout-notebook title="Quantization Savings"}
**Scenario**: Deploying Llama 3 `{python} llm_params_b_str` B (`{python} llm_params_b_str` billion parameters).

**FP16 (Half Precision)**

- **Size**: `{python} llm_params_b_str` $\times 10^9$ $\times$ `{python} fp16_bytes_str` bytes (16-bit) = `{python} fp16_size_gb_str` GB
- **Hardware Req**: Requires 24 GB GPU (e.g., A10G, 3090, 4090).

**INT4 (4-bit Quantization)**

- **Size**: `{python} llm_params_b_str` $\times 10^9$ $\times$ `{python} int4_bytes_str` bytes (4-bit) = `{python} int4_size_gb_str` GB
- **Hardware Req**: Fits comfortably on 8 GB GPU (e.g., T4, consumer laptops).

**Impact**: `{python} compression_ratio_str` $\times$ compression allows deployment on commodity hardware, reducing cost by 5–10 $\times$.
:::

Beyond storage savings, quantization also accelerates computation through hardware parallelism. The speedup is not merely about arithmetic being cheaper—it emerges from how modern processors pack more operations into the same hardware resources when working with smaller data types. The following calculation illustrates the impact of *the SIMD multiplier*.

```{python}
#| label: simd-throughput-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SIMD THROUGHPUT CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The SIMD Multiplier"
# │
# │ Goal: Demonstrate the throughput gain from hardware SIMD parallelism.
# │ Show: That INT8 quantization yields a 4× speedup on fixed-width registers.
# │ How: Calculate operations-per-register for 32-bit vs. 8-bit types.
# │
# │ Imports: mlsys.formatting (fmt),
# │          mlsys.constants (SIMD_REGISTER_BITS, FP32_BITS, INT8_BITS)
# │ Exports: simd_fp32_str, simd_int8_str, simd_gain_str,
# │          simd_register_bits_str, simd_fp32_bits_str, simd_int8_bits_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES
from mlsys.constants import SIMD_REGISTER_BITS, FP32_BITS, INT8_BITS

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class SIMDThroughput:
    """
    Namespace for SIMD Throughput calculation.
    Scenario: Comparing ops per register for FP32 vs INT8.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    register_bits = 512
    fp32_bits = 32
    int8_bits = 8

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    ops_fp32 = register_bits // fp32_bits
    ops_int8 = register_bits // int8_bits
    gain = ops_int8 // ops_fp32

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(gain == 4, f"INT8 vs FP32 should yield 4x ops, got {gain}x")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    simd_fp32_str = f"{ops_fp32}"
    simd_int8_str = f"{ops_int8}"
    simd_gain_str = f"{gain}"
    simd_register_bits_str = f"{register_bits}"
    simd_fp32_bits_str = f"{fp32_bits}"
    simd_int8_bits_str = f"{int8_bits}"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
simd_fp32_str = SIMDThroughput.simd_fp32_str
simd_int8_str = SIMDThroughput.simd_int8_str
simd_gain_str = SIMDThroughput.simd_gain_str
simd_register_bits_str = SIMDThroughput.simd_register_bits_str
simd_fp32_bits_str = SIMDThroughput.simd_fp32_bits_str
simd_int8_bits_str = SIMDThroughput.simd_int8_bits_str
```

::: {.callout-notebook title="The SIMD Multiplier"}
**The Throughput Physics**: Why is INT8 faster than FP32 on the same processor?

**Mechanism**: SIMD (Single Instruction, Multiple Data). A CPU or GPU core processes data in fixed-width vector registers (e.g., AVX-512 is 512 bits wide).

**The Math**:

1.  **Register Width**: `{python} simd_register_bits_str` bits.
2.  **FP32 Capacity**: `{python} simd_register_bits_str` / `{python} simd_fp32_bits_str` = `{python} simd_fp32_str` operations per cycle.
3.  **INT8 Capacity**: `{python} simd_register_bits_str` / `{python} simd_int8_bits_str` = `{python} simd_int8_str` operations per cycle.

**The Multiplier**: By switching to INT8, you pack **`{python} simd_gain_str` $\times$ more elements** into the same register.
Throughput Gain = INT8 Ops/Cycle / FP32 Ops/Cycle = `{python} simd_int8_str` / `{python} simd_fp32_str` = `{python} simd_gain_str` $\times$

**The Systems Conclusion**: Quantization delivers a **`{python} simd_gain_str` $\times$ speedup** on compute-bound layers solely due to vector packing, even before considering memory bandwidth savings.
:::

Reducing numerical precision introduces trade-offs, however. Lower-precision formats can cause numerical instability and quantization noise, potentially affecting model accuracy. Some architectures, such as large transformer-based NLP models, tolerate quantization well, whereas others may experience significant degradation. Selecting the appropriate numerical precision therefore requires balancing accuracy constraints, hardware support, and efficiency gains.

To appreciate how precision loss manifests in practice, examine the representative quantization error distribution in @fig-quantization: the bell-shaped curve centered near zero shows that most values quantize with minimal error, but the tails reveal outlier errors that can accumulate and influence model accuracy. Understanding this noise is essential, but practitioners ultimately care about end-to-end speedup, and the magnitude of *the quantization speedup* depends on whether a workload is compute-bound or memory-bound.

![**Quantization Error Distribution**: Histogram of quantization error weighted by probability density p(x), showing a bell-shaped curve centered near zero with tails that introduce quantization noise affecting model accuracy.](images/svg/quantization_error.svg){#fig-quantization width=80% fig-alt="Histogram showing quantization error distribution weighted by probability density. Bell-shaped curve centered near zero with tails extending to positive and negative errors, illustrating typical quantization noise pattern."}

::: {.callout-notebook title="The Quantization Speedup (Compute-Bound)"}
**Problem**: You have a compute-bound matrix multiplication (e.g., in a Transformer MLP block). You switch from FP16 to INT8. What is the expected speedup?

**The Math**: On modern hardware with dedicated INT8 units:

1. **Tensor Core Throughput**: NVIDIA A100 delivers `{python} a100_tflops_fp16_str` TFLOPS for FP16 vs `{python} a100_tflops_int8_str` TOPS for INT8, a `{python} a100_int8_speedup_str` $\times$ peak throughput increase.
2. **Memory Bandwidth**: INT8 weights are half the size, so loading them from memory takes half the time.
3. **Combined Effect**: For compute-bound operations, the speedup is primarily from compute throughput: **~`{python} a100_int8_speedup_str` $\times$ speedup**.

**The Systems Insight**: The speedup from quantization depends on the bottleneck. Compute-bound operations (large batch sizes, high arithmetic intensity) see ~`{python} a100_int8_speedup_str` $\times$ from faster INT8 units. The bandwidth-bound case (demonstrated in the Optimization Framework's napkin math) achieves up to `{python} bandwidth_bound_speedup_str` $\times$ because memory traffic dominates, so halving data size nearly doubles effective throughput for both bandwidth and compute.
:::

### Numerical Format Comparison {#sec-model-compression-numerical-format-comparison-8fa2}

@tbl-numerics compares commonly used numerical precision formats in machine learning, each exhibiting distinct trade-offs in storage efficiency, computational speed, and energy consumption. Emerging formats like FP8\index{FP8} and TF32\index{TF32 (TensorFloat-32)} have been introduced to further optimize performance, especially on AI accelerators.

| **Precision Format**                       | **Bit-Width** | **Storage Reduction (vs FP32)** |                   **Compute Speed (vs FP32)** | **Power Consumption** | **Use Cases**                                               |
|:-------------------------------------------|--------------:|--------------------------------:|----------------------------------------------:|:----------------------|:------------------------------------------------------------|
| **FP32 (Single-Precision Floating Point)** |        32-bit |           Baseline (1 $\times$) |                         Baseline (1 $\times$) | High                  | Training & inference (general-purpose)                      |
| **FP16 (Half-Precision Floating Point)**   |        16-bit |              2 $\times$ smaller |  2 $\times$ faster on FP16-optimized hardware | Lower                 | Accelerated training, inference (NVIDIA Tensor Cores, TPUs) |
| **bfloat16 (Brain Floating Point)**        |        16-bit |              2 $\times$ smaller |   Similar speed to FP16, better dynamic range | Lower                 | Training on TPUs, transformer-based models                  |
| **TF32 (TensorFloat-32)**                  |        19-bit |                 Similar to FP16 | Up to 8 $\times$ faster on NVIDIA Ampere GPUs | Lower                 | Training on NVIDIA GPUs                                     |
| **FP8 (Floating-Point 8-bit)**             |         8-bit |              4 $\times$ smaller |                Faster than INT8 in some cases | Significantly lower   | Efficient training/inference (H100, AI accelerators)        |
| **INT8 (8-bit Integer)**                   |         8-bit |              4 $\times$ smaller |                 4–8 $\times$ faster than FP32 | Significantly lower   | Quantized inference (Edge AI, mobile AI, NPUs)              |
| **INT4 (4-bit Integer)**                   |         4-bit |              8 $\times$ smaller |                            Hardware-dependent | Extremely low         | Ultra-low-power AI, experimental quantization               |
| **Binary/Ternary (1-bit / 2-bit)**         |       1–2-bit |          16–32 $\times$ smaller |                     Highly hardware-dependent | Lowest                | Extreme efficiency (binary/ternary neural networks)         |

: **Numerical Precision Formats**: Comparison of precision formats by bit width, memory reduction, computational efficiency, accuracy retention, and typical use cases across deployment contexts. {#tbl-numerics}

\index{bfloat16!dynamic range preservation}
\index{FP16!reduced dynamic range}
FP16\index{FP16!training}\index{Quantization!FP16 (half precision)} and bfloat16\index{bfloat16}\index{Quantization!bfloat16} formats provide moderate efficiency gains while preserving model accuracy. Many AI accelerators, such as NVIDIA Tensor Cores and TPUs, include dedicated support for FP16 computations, enabling $2\times$ faster matrix operations compared to FP32. BFloat16, in particular, retains the same 8-bit exponent as FP32 but with a reduced 7-bit mantissa, allowing it to maintain a similar dynamic range (~$10^{-38}$ to $10^{38}$) while sacrificing precision. In contrast, FP16, with its 5-bit exponent and 10-bit mantissa, has a significantly reduced dynamic range (~$10^{-5}$ to $10^5$), making it more suitable for inference rather than training. Since BFloat16 preserves the exponent size of FP32, it better handles extreme values encountered during training, whereas FP16 may struggle with underflow or overflow. This makes BFloat16 a more robust alternative for deep learning workloads that require a wide dynamic range.

Compare the three bit layouts in @fig-3float to see exactly where the bits go—and why the trade-off between precision and numerical range[^fn-floating-point-dynamic-range] differs so sharply across formats.

[^fn-floating-point-dynamic-range]: **Floating-Point Dynamic Range**: The dynamic range of a floating-point format is determined by its exponent bit-width and bias. FP32 and BFloat16 both use an 8-bit exponent with a bias of 127, resulting in an exponent range of $[-126, 127]$ and an approximate numerical range of $10^{-38}$ to $10^{38}$. FP16, with a 5-bit exponent and a bias of 15, has an exponent range of $[-14, 15]$, leading to a more constrained numerical range of roughly $10^{-5}$ to $10^5$. This reduced range in FP16 can lead to numerical instability in training, whereas BFloat16 retains FP32's broader range, making it more suitable for training deep neural networks.

::: {#fig-3float fig-env="figure" fig-pos="htb" fig-cap="**Floating-Point Precision**: Reduced-precision formats like FP16 and bfloat16 trade off numerical range for computational efficiency and memory savings. Bfloat16 maintains the exponent size of FP32, preserving its dynamic range and suitability for training, while FP16's smaller exponent limits its use to inference or carefully scaled training scenarios." fig-alt="Three horizontal bit-layout diagrams. FP32 shows 1-bit sign, 8-bit exponent, 23-bit mantissa. FP16 shows 1-bit sign, 5-bit exponent, 10-bit mantissa. BFloat16 shows 1-bit sign, 8-bit exponent, 7-bit mantissa."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{col1}{RGB}{239,230,197}
\definecolor{col2}{RGB}{245,208,122}
\definecolor{col3}{RGB}{242,162,57}
\colorlet{col1}{VioletL}
\colorlet{col2}{RedL}
\colorlet{col3}{RedLine!50}
\tikzset{
  Box/.style={inner xsep=2pt,
  %rounded corners,
  node distance=0,
  draw=black!90,
    line width=0.75pt,
    anchor=west,
    align=flush center,
    minimum width=54mm, minimum height=12mm
  },
}

\node[Box,fill=col1,anchor=south west,minimum width=30
](B1){\textbf{1-bit}\\sign};
\node[Box,fill=col2,right=of B1,minimum width=160
](B2){\textbf{8-bit} exponent};
 \node[Box,fill=col3,right=of B2,minimum width=470,name path=GG,
](B3){\textbf{23-bit} mantissa};
\node[left=2mmof B1]{\textbf{Float32}};

\begin{scope}[shift={(0,-2)}]
\node[Box,fill=col1,anchor=south west,minimum width=30
](BB1){\textbf{1-bit}\\sign};
\node[Box,fill=col2,right=of BB1,minimum width=100
](BB2){\textbf{5-bit} exponent};
 \node[Box,fill=col3,right=of BB2,minimum width=200
](BB3){\textbf{10-bit} mantissa};
\node[left=2mmof BB1]{\textbf{Float16}};
\end{scope}

\begin{scope}[shift={(0,-4)}]
\node[Box,fill=col1,anchor=south west,minimum width=30
](DB1){\textbf{1-bit}\\sign};
\node[Box,fill=col2,right=of DB1,minimum width=160
](DB2){\textbf{8-bit} exponent};
 \node[Box,fill=col3,right=of DB2,minimum width=140
](DB3){\textbf{7-bit} mantissa};
\node[left=2mmof DB1]{\textbf{BFloat16}};
\end{scope}

\draw[dashed,line width=0.75pt](DB3.south east)--++(270:0.5);
\draw[dashed,line width=0.75pt,name path=D](DB3.south east)--++(90:6);
%\node[Box,fill=cyan!10,minimum width=640](B13){S};

\path [name intersections={of=D and GG,by={X,Y}}];

\draw[align=center,
text width=62mm,
decoration={brace,amplitude=13pt},
decorate,thick] ([yshift=5mm,xshift=0mm]B1.north west) -- ([yshift=5mm]X)
node [midway,above=5mm] {\textbf{16 bits}};

\draw[align=center,
text width=62mm,
decoration={brace,amplitude=13pt},
decorate,thick] ([yshift=5mm,xshift=0mm]X) -- ([yshift=5mm]B3.north east)
node [midway,above=5mm] {\textbf{16 bits}};
\end{tikzpicture}
```
:::

INT8\index{INT8!efficiency gains}\index{Quantization!INT8 (8-bit integer)} precision offers more aggressive efficiency improvements for inference workloads. Many quantized models use INT8 for inference, reducing storage by $4\times$ while accelerating computation by 4–8 $\times$ on optimized hardware. INT8 is widely used in mobile and embedded AI, where energy constraints are significant. As the quantization savings calculation demonstrated earlier, this $4\times$ reduction in model size enables a `{python} llm_params_b_str` billion parameter model to fit on a single consumer GPU rather than requiring data center hardware.

Binary and ternary networks\index{Quantization!binary networks}\index{Quantization!ternary networks} represent the extreme end of quantization, where weights and activations are constrained to 1-bit (binary) or 2-bit (ternary) values. This results in massive storage and energy savings, but model accuracy often degrades significantly unless specialized architectures are used. Our *keyword spotting* lighthouse lives precisely in this regime, necessitating strategies for *keyword spotting and extreme compression*.

::: {.callout-lighthouse title="Keyword Spotting and Extreme Compression"}
**The Extreme Constraint**: Our **Keyword Spotting (KWS) Lighthouse** (@sec-network-architectures) lives here. Running on a microcontroller with 256 KB of SRAM means "standard" compression is not enough.

For KWS, INT8 quantization is often just the *starting point*. To fit complex acoustic models into embedded sensors, engineers push toward **INT4** or even **Binary** weights. In this regime, the **Information** (Data) is noisy audio, the **Logic** (Algorithm) is highly simplified, and the **Physics** (Machine) has a power budget measured in microwatts.
:::

### Energy Efficiency and Sustainability {#sec-model-compression-energy-efficiency-sustainability-7150}
\index{Energy Efficiency!quantization benefits}
\index{Model Compression!energy efficiency}

Quantization reduces energy consumption through two complementary mechanisms: smaller data types require fewer memory accesses (the dominant energy cost, as established in the Physics of Quantization analysis above), and lower-precision arithmetic units consume less power per operation. An INT8 operation, for example, uses roughly `{python} int8_fp32_energy_ratio_str` $\times$ less energy than its FP32 equivalent, compounding the memory energy savings into substantial reductions at system scale.

These gains depend on hardware support. AI accelerators with dedicated low-precision units (Tensor Cores for FP16/INT8, or the newer FP8 units on H100-class hardware) realize the full energy benefit, while general-purpose CPUs lacking such units see limited improvement. The practical implication is that energy efficiency from quantization is not a software-only optimization; it requires matching the chosen precision format to the target hardware's arithmetic capabilities.

### Precision Reduction Strategies {#sec-model-compression-precision-reduction-strategies-db83}

The preceding sections established *why* reduced precision matters: fewer bits means less energy, less memory traffic, and faster arithmetic. The question now is *how* to reduce precision without destroying model accuracy. Naive quantization introduces errors that degrade predictions, so practitioners need structured strategies that control *where* and *how* precision is reduced.

Three approaches form a complexity ladder. Post-training quantization (PTQ) reduces precision after training, requiring no retraining and minimal engineering effort. Quantization-aware training (QAT) incorporates quantization effects into the training loop, enabling models to adapt to lower precision and retain higher accuracy. Mixed-precision training assigns different precision levels to different operations, matching precision to each layer's sensitivity.

Before diving into each strategy, orient yourself with @fig-quantization-roadmap, which maps quantization techniques into three progressive tiers based on implementation complexity, resource requirements, and target use cases.

::: {#fig-quantization-roadmap fig-env="figure" fig-pos="htb" fig-cap="**Quantization Complexity Roadmap**: Three progressive tiers of quantization techniques, from foundational approaches suitable for quick deployment to research frontier methods for extreme resource constraints, reflecting increasing implementation effort, resource requirements, and potential accuracy trade-offs." fig-alt="Tiered diagram with three levels. Foundation tier includes PTQ and basic INT8. Production tier adds QAT and mixed precision. Research frontier tier shows INT4, binary, and ternary quantization with icons for increasing complexity."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]

\tikzset{
Box/.style={align=flush center,
    inner xsep=2pt,
    node distance=1.4,
    draw=OrangeLine,
    line width=0.75pt,
    rounded corners,
    fill=OrangeL!40,
    text width=50mm,
    minimum width=50mm, minimum height=18mm
  }
  }

\tikzset{%
planet/.style = {circle, draw=yellow!50!red!90,semithick, fill=yellow!30,line width=1.5pt,
                    font=\usefont{T1}{phv}{m}{n}\bfseries,
                    minimum size=24mm, inner sep=1mm,align=flush center},
satelliteI/.style = {circle, draw=none, semithick, node distance=5,%fill=#1!10,
                    text width=35mm, inner sep=1pt, align=flush center,minimum size=20mm,minimum height=12mm},
satellite/.style = {circle, draw=none, semithick, fill=#1!10,
                    text width=26mm, inner sep=1pt, align=flush center,minimum size=20mm,minimum height=12mm},
TxtC/.style = {font=\small\usefont{T1}{phv}{m}{n},text width=44mm,align=flush center},
arr/.style = {-{Triangle[length=3mm,width=6mm]}, color=#1!60,
                    line width=3mm, shorten <=1mm, shorten >=1mm},
Line/.style = {},
LineA/.style = {violet!60,{Circle[line width=1.5pt,fill=white,length=7.5pt]}-,line width=2.0pt,shorten <=-4pt},
LineAA/.style={violet!30,dashed, line width=1.0pt,{-{Triangle[width=1.0*6pt,length=1.6*6pt]}},shorten <=3pt,shorten >=2pt}
}

\tikzset{pics/brain/.style = {
        code = {
        \pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=BRAIN,scale=\scalefac, every node/.append style={transform shape}]
\draw[fill=\filllcolor,line width=\Linewidth](-0.3,-0.10)to(0.08,0.60)
to[out=60,in=50,distance=3](-0.1,0.69)to[out=160,in=80](-0.26,0.59)to[out=170,in=90](-0.46,0.42)
to[out=170,in=110](-0.54,0.25)to[out=210,in=150](-0.54,0.04)
to[out=240,in=130](-0.52,-0.1)to[out=300,in=240]cycle;
\draw[fill=\filllcolor,line width=\Linewidth]
(-0.04,0.64)to[out=120,in=0](-0.1,0.69)(-0.19,0.52)to[out=120,in=330](-0.26,0.59)
(-0.4,0.33)to[out=150,in=280](-0.46,0.42)
%
(-0.44,-0.03)to[bend left=30](-0.34,-0.04)
(-0.33,0.08)to[bend left=40](-0.37,0.2) (-0.37,0.12)to[bend left=40](-0.45,0.14)
(-0.26,0.2)to[bend left=30](-0.24,0.13)
(-0.16,0.32)to[bend right=30](-0.27,0.3)to[bend right=30](-0.29,0.38)
(-0.13,0.49)to[bend left=30](-0.04,0.51);
\draw[rounded corners=0.8pt,line width=2*\Linewidth,\drawcircle,-{Circle[fill=\filllcolor,length=5.5pt]}](-0.23,0.03)--(-0.15,-0.03)--(-0.19,-0.18)--(-0.04,-0.28);
\draw[rounded corners=0.8pt,line width=2*\Linewidth,\drawcircle,-{Circle[fill=\filllcolor,length=5.5pt]}](-0.17,0.13)--(-0.04,0.05)--(-0.06,-0.06)--(0.14,-0.11);
\draw[rounded corners=0.8pt,line width=2*\Linewidth,\drawcircle,-{Circle[fill=\filllcolor,length=5.5pt]}](-0.12,0.23)--(0.31,0.0);
\draw[rounded corners=0.8pt,line width=2*\Linewidth,\drawcircle,-{Circle[fill=\filllcolor,length=5.5pt]}](-0.07,0.32)--(0.06,0.26)--(0.16,0.33)--(0.34,0.2);
\draw[rounded corners=0.8pt,line width=2*\Linewidth,\drawcircle,-{Circle[fill=\filllcolor,length=5.5pt]}](-0.01,0.43)--(0.06,0.39)--(0.18,0.51)--(0.31,0.4);
\coordinate(PO)at(-0.1,0.2);
\node[circle,draw=white,line width=1pt,fill=\filllcirclecolor,minimum size=5mm,inner sep=0pt](LV)at(PO){};
\node[draw=none,rotate=40,rounded corners=2pt,rectangle,minimum width=1.2mm,inner sep=1pt,
fill=\filllcirclecolor,minimum height=6mm,anchor=north]at(PO){};
\node[circle,draw=none,fill=white,minimum size=3.0mm,inner sep=0pt](LM)at(PO){};
\node[font=\tiny\bfseries]at(LM){...};
\end{scope}
     }
  }
}
\tikzset{pics/factory/.style = {
        code = {
        \pgfkeys{/channel/.cd, #1}
\begin{scope}[local bounding box=FACTORY,scale=\scalefac, every node/.append style={transform shape}]
\node[rectangle,draw=\drawcolor,fill=\filllcolor!50,minimum height=15,minimum width=23,,line width=\Linewidth](R1){};
\draw[fill=\filllcolor!50,line width=1.0pt]($(R1.40)+(0,-0.01)$)--++(110:0.2)--++(180:0.12)|-($(R1.40)+(0,-0.01)$);
\draw[,line width=\Linewidth,fill=green](-0.68,-0.27)--++(88:1.10)--++(0:0.15)--(-0.48,-0.27)--cycle;
\draw[line width=2.5pt](-0.8,-0.27)--(0.55,-0.27);

\foreach \x in{0.25,0.45,0.65}{
\node[rectangle,fill=black,minimum height=2,minimum width=5,thick,inner sep=0pt]
at ($(R1.north)!\x!(R1.south)$){};
}
\foreach \x in{0.25,0.45,0.65}{
\node[rectangle,fill=black,minimum height=2,minimum width=5,thick,inner sep=0pt]
at ($(R1.130)!\x!(R1.230)$){};
}
\foreach \x in{0.25,0.45,0.65}{
\node[rectangle,fill=black,minimum height=2,minimum width=5,thick,inner sep=0pt]
at ($(R1.50)!\x!(R1.310)$){};
}
\end{scope}
     }
  }
}
%brick
\tikzset{
  cigla/.style={ inner sep=0pt,anchor=west,
    node distance=1.4pt,
    draw=none,
    line width=0.1pt,
    rounded corners=1pt,
    fill=\filllcolor,
    minimum width=4mm, minimum height=2mm
  },
    cigla1/.style={cigla,fill=\filllcirclecolor},
pics/brick/.style = {
        code = {
        \pgfkeys{/channel/.cd, #1}
\begin{scope}[shift={($(0,0)+(0,0)$)},scale=\scalefac,every node/.append style={transform shape}]
\path[clip] (-1.05,-0.52)rectangle (0.71,0.45);
\node[cigla](C1) at (-1.03,-0.4){};
\node[cigla1,right= of C1](C2){};
\node[cigla,right= of C2](C3){};
\node[cigla1,right= of C3](C4){};
%
\node[cigla,above right= of C1,anchor=south](C11){};
\node[cigla1,right= of C11](C12){};
\node[cigla,right= of C12](C13){};
\node[cigla1,right= of C13](C14){};
%
\node[cigla,above right= of C11,anchor=south](C21){};
\node[cigla1,right= of C21](C22){};
\node[cigla,right= of C22](C23){};
%
\node[cigla,above right= of C21,anchor=south](C31){};
\node[cigla1,right= of C31](C32){};
\node[cigla,right= of C32](C33){};
\end{scope}
    }
  }
}
\pgfkeys{
  /channel/.cd,
   Depth/.store in=\Depth,
  Height/.store in=\Height,
  Width/.store in=\Width,
  filllcirclecolor/.store in=\filllcirclecolor,
  filllcolor/.store in=\filllcolor,
  drawcolor/.store in=\drawcolor,
  drawcircle/.store in=\drawcircle,
  scalefac/.store in=\scalefac,
  Linewidth/.store in=\Linewidth,
  picname/.store in=\picname,
  filllcolor=BrownLine,
  filllcirclecolor=violet!20,
  drawcolor=black,
  drawcircle=violet,
  scalefac=1,
  Linewidth=0.5pt,
  Depth=1.3,
  Height=0.8,
  Width=1.1,
  picname=C
}

\def\radius{3.2}
\def\startangle{90}

\node[satelliteI,fill=green!79!black!10](S1){};
\node[satelliteI,fill=red!10,right=of S1](S2){};
\node[satelliteI,fill=gray!10,right=of S2](S3){};
%logos
\pic[shift={(0.15,-0.5)}] at  (S3){brain={scalefac=2.3,picname=1,filllcolor=orange!30!, filllcirclecolor=cyan!55!black!60, Linewidth=0.5pt}};
\pic[shift={(0.2,-0.40)}] at  (S2){factory={scalefac=1.8,picname=1,filllcolor=brown!, Linewidth=0.5pt}};
\pic[shift={(0.15,0.15)}] at  (S1) {brick={scalefac=1.5,picname=1,filllcolor=red!70!black!80, Linewidth=1.0pt,filllcirclecolor=red!90!black!50}};
 \def\ra{26mm}
\foreach \i [count=\k from 1] in{180,180,180}{
\pgfmathtruncatemacro{\newX}{\i + 90} %
\draw[Line,line width=2.6pt,violet]
   (S\k)+(\i:0.7*\ra) arc[start angle=\i, end angle=\newX, radius=0.7*\ra];
}
\draw[LineA](S1.220)--++(220:1.75)coordinate(MA);
 \node[Box,anchor=north](FO)at(MA){\textbf{Foundational}\\ Post-Training Quantization
FP32/FP16/INT8 Basic Calibration};
 \draw[LineA](S2.220)--++(220:1.75)coordinate(ST);
 \node[Box,anchor=north](PR)at(ST){\textbf{Production}\\ Quantization-Aware Training
Mixed-Precision Per-Channel Quantization};
 \draw[LineA](S3.220)--++(220:1.75)coordinate(ST1);
 \node[Box,anchor=north](RE)at(ST1){\textbf{Research Frontier}\\ INT4/INT2
Binary/Ternary Networks Extreme Quantization};
%
\node[TxtC,below=2pt of FO]{Quick deployment\\ Minimal training cost\\ 0.5-2\% accuracy loss};
\node[TxtC,below=2pt of PR]{Production systems\\ Requires retraining\\ 0.2-1\% accuracy loss};
\node[TxtC,below=2pt of RE]{Extreme constraints\\ Architectural changes \\ 2-10\% accuracy loss};
%
\draw[-{Triangle[width=18pt,length=8pt]}, line width=10pt,cyan!40,shorten >=5pt, shorten <=5pt]
(S1)--node[above,text=black]{Increasing}(S2);
\draw[-{Triangle[width=18pt,length=8pt]}, line width=10pt,cyan!40,shorten >=5pt, shorten <=5pt]
(S2)--node[above,text=black]{Complexity}(S3);
 \end{tikzpicture}
```
:::

#### Post-Training Quantization {#sec-model-compression-posttraining-quantization-bb5b}

Post-training quantization\index{Post-Training Quantization (PTQ)!definition} (PTQ) reduces numerical precision after training, converting weights and activations from high-precision formats (FP32) to lower-precision representations (INT8\index{INT8!inference} or FP16\index{FP16!inference}) without retraining [@jacob2018quantization]. This achieves smaller model sizes, faster computation, and reduced energy consumption, making it practical for resource-constrained environments such as mobile devices, edge AI systems, and cloud inference platforms [@wu2020integer].

\index{Framework Toolkits!quantization support}
PTQ's key advantage is low computational cost: it requires no retraining or access to training data. However, reducing precision introduces quantization error\index{Quantization!error} that can degrade accuracy, especially for tasks requiring fine-grained numerical precision. Machine learning frameworks (TensorFlow Lite, ONNX Runtime, PyTorch) provide built-in PTQ support.

##### PTQ Functionality {#sec-model-compression-ptq-functionality-bd56}

\index{Quantization!scaling factor}
The core mechanism of PTQ is uniform quantization\index{Quantization!uniform}, which maps floating-point values to discrete integer levels using a consistent scaling factor. Because the interval between each quantized value is constant, uniform quantization simplifies implementation and enables efficient hardware execution. The quantized value $q$ is computed as:
$$
q = \text{round} \left(\frac{x}{s} \right)
$$
where:

- $q$ is the quantized integer representation,
- $x$ is the original floating-point value,
- $s$ is a scaling factor that maps the floating-point range to the available integer range.

@lst-quantization_example demonstrates uniform quantization from FP32 to INT8, achieving 4 $\times$ memory reduction while measuring the resulting quantization error.

::: {#lst-quantization_example lst-cap="**Uniform Quantization**: Converts FP32 weights to INT8 format, achieving 4 $\times$ memory reduction while measuring quantization error."}
```{.python}
import torch

# Original FP32 weights
weights_fp32 = torch.tensor(
    [0.127, -0.084, 0.392, -0.203], dtype=torch.float32
)
print(f"Original FP32: {weights_fp32}")
print(f"Memory per weight: 32 bits")

# Simple uniform quantization to INT8 (-128 to 127)
# Step 1: Find scale factor
max_val = weights_fp32.abs().max()
scale = max_val / 127  # 127 is max positive INT8 value

# Step 2: Quantize using our formula q = round(x/s)
weights_int8 = torch.round(weights_fp32 / scale).to(torch.int8)
print(f"Quantized INT8: {weights_int8}")
print(f"Memory per weight: 8 bits (reduced from 32)")

# Step 3: Dequantize to verify
weights_dequantized = weights_int8.float() * scale
print(f"Dequantized: {weights_dequantized}")
print(
    f"Quantization error: "
    f"{(weights_fp32 - weights_dequantized).abs().mean():.6f}"
)
```
:::

Once quantized, inference is performed using integer arithmetic, which is significantly more efficient than floating-point operations on most hardware platforms [@gholami2021survey].

An alternative, non-uniform quantization\index{Quantization!non-uniform}, assigns finer-grained precision to numerical ranges that are more densely populated, which can preserve accuracy for models whose weight distributions concentrate around specific values. Non-uniform schemes require more complex calibration and are less common in production, but they can be effective for models particularly sensitive to precision changes.

PTQ works well for computer vision models, where CNNs often tolerate quantization without significant accuracy loss. Models that rely on small numerical differences, such as NLP transformers or speech recognition systems, may require quantization-aware training or non-uniform strategies to retain performance.

##### Calibration {#sec-model-compression-calibration-c650}

An important aspect of PTQ is the calibration\index{Quantization!calibration} step, which selects the clipping range [$\alpha$, $\beta$] for quantizing model weights and activations. The effectiveness of precision reduction depends heavily on this chosen range: without proper calibration, quantization may cause significant accuracy degradation. Calibration ensures that the chosen range minimizes information loss and preserves model performance.

Walk through the PTQ pipeline in @fig-ptq-calibration step by step. A calibration dataset, a representative subset of training or validation data, is passed through the pre-trained model to estimate the numerical distribution of activations and weights. This distribution then defines the clipping range for quantization. The quantization step converts model parameters to a lower-precision format, producing the final quantized model.

::: {#fig-ptq-calibration fig-env="figure" fig-pos="htb" fig-cap="**Post-Training Quantization**: Calibration with a representative dataset determines optimal quantization ranges for model weights and activations, minimizing information loss during quantization to create efficient, lower-precision models. This process converts a pre-trained model into a quantized version suitable for deployment on resource-constrained devices." fig-alt="Vertical flowchart with four boxes connected by arrows. Pre-trained model and Calibration data feed into Calibration step, which feeds into Quantization step, producing final Quantized model output."}
```{.tikz}
\begin{tikzpicture}[font=\footnotesize\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
   node distance=0.5,
    draw=black!90,
    line width=0.75pt,
    anchor=west,
    align=flush center,
    minimum width=64mm,
    minimum height=7.5mm
  },
Line/.style={line width=1.0pt,black!50,-latex}
}

\node[Box,fill=GreenL](B1){Quantized model};
\node[Box,fill=BlueL,above=of B1](B2){Quantization};
\node[Box,fill=BlueL,above=of B2](B3){Calibration};
\node[Box,fill=GreenL,above=of B3.north west,minimum width=30mm,
anchor= south west](B4){Pre-trained model};
\node[Box,fill=BrownL,above=of B3.north east,minimum width=30mm,
anchor= south east](B5){Calibration data};
\draw[Line](B2)--(B1);
\draw[Line](B3)--(B2);
\draw[Line](B4)--(B4|-B3.north);
\draw[Line](B5)--(B5|-B3.north);
\end{tikzpicture}
```
:::

For example, consider quantizing activations that originally range between -6 and 6 to 8-bit integers. Simply using the full integer range of -128 to 127 might not be the most effective approach. Calibration passes a representative dataset through the model, observes the actual activation range, and uses that observed range to set a tighter quantization range, reducing information loss.

Common calibration methods include **Max**\index{Quantization!calibration!max method} (uses maximum absolute value, simple but susceptible to outliers), **Entropy**\index{Quantization!calibration!entropy method} (minimizes KL divergence between original and quantized distributions, TensorRT's default), and **Percentile**\index{Quantization!calibration!percentile method} (clips to a percentile, e.g., 99%, avoiding outlier impact). @fig-resnet-activations-histogram shows *why* outlier handling matters: ResNet50 activations exhibit long tails where outliers can skew the quantization range.

![**Activation Distribution**: Resnet50 layer activations exhibit a long tail, with outlier values that can lead to inefficient precision use if not handled carefully. Source: [@wu2020integer].](images/svg/activation_histogram.svg){#fig-resnet-activations-histogram width=85% fig-alt="Histogram of ResNet50 activation values showing right-skewed distribution. Most values cluster near zero with long tail extending to outliers around 2.1, demonstrating challenge for quantization range selection."}

Calibration ranges can be **symmetric**\index{Quantization!symmetric} (equal positive and negative scaling) or **asymmetric**\index{Quantization!asymmetric} (different scaling factors for each side, useful when distributions are skewed). The choice of method and range significantly affects quantized model accuracy.

##### Tuning Quantization Ranges {#sec-model-compression-tuning-quantization-ranges-439d}

A key challenge in post-training quantization is selecting the appropriate calibration range $[\alpha, \beta]$ to map floating-point values into a lower-precision representation. The choice of this range directly affects the quantization error and, consequently, the accuracy of the quantized model. @fig-calibration-ranges contrasts the two primary calibration strategies: symmetric calibration and asymmetric calibration.

::: {#fig-calibration-ranges fig-env="figure" fig-pos="htb" fig-cap="**Calibration Range Selection**: Symmetric calibration uses a fixed range around zero, while asymmetric calibration adapts the range to the data distribution, potentially minimizing quantization error and preserving model accuracy. Choosing an appropriate calibration strategy balances precision with the risk of saturation for outlier values." fig-alt="Two side-by-side mapping diagrams. Left shows symmetric calibration with range from -1 to 1 mapping to -127 to 127 with zero aligned. Right shows asymmetric calibration with range -0.5 to 1.5 mapping with shifted zero point."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\large\usefont{T1}{phv}{m}{n}]
\tikzset{
Line/.style={line width=1.0pt,red,text=black},
LineT/.style={black,line width=0.5pt,-latex,shorten >=2pt},
LineD/.style={dashed,black,line width=0.75pt},
}
%\node[]at(-0.7,1.7){ \includegraphics[scale=1.0]{1}};

\begin{scope}[local bounding box=RIGHT,shift={($(10.5,0)+(0,0)$)}]
\coordinate(RD1)at(0,0);
\coordinate(RD2)at(2.3,0);
\coordinate(RD3)at(4.42,0);
\coordinate(RD4)at(5.67,0);
\coordinate(RD5)at(8.79,0);
\coordinate(RG1)at(0.33,3.35);
\coordinate(RGG1)at($(RG1)+(-0.5,0)$);
\coordinate(RG2)at(1.81,3.35);
\coordinate(RG3)at(3.15,3.35);
\coordinate(RG4)at(4.43,3.35);
\coordinate(RG5)at(5.09,3.35);
\coordinate(RG6)at(6.87,3.35);
\coordinate(RG7)at(8.23,3.35);
\coordinate(RGG7)at($(RG7)+(0.5,0)$);
\draw[LineD,latex-latex](RGG1)--(RGG7)node[right,text=black]{$r$};
\draw[Line](RD1)--(RD5)node[right=3pt,text=black]{$Q$};
\draw[Line](RG2)node[above=2pt]{$\alpha=-0.5$}--(RG6);
\draw[LineT](RG2)--(RD1)node[below=2pt]{$-128$};
\draw[LineT](RG1)--(RD1);
\draw[LineT](RG3)node[above=2pt]{$0$}--(RD2)node[below=2pt]{$-Z$};
\draw[LineT](RG4)node[above=2pt]{$SZ$}--(RD3)node[below=2pt]{$0$};
\draw[LineT](RG5)--(RD4);
\draw[LineT](RG6)node[above=2pt]{$\beta=1.5$}--(RD5);
\draw[LineT](RG7)--(RD5)node[below=2pt]{$127$};
\foreach \i/\cl in {1/red,2/red,3/green!70!black,4/red,5/red,6/red,7/red} {
\fill[\cl](RG\i)circle(2.5pt);
}
\foreach \i/\cl in {1/red,2/red,3/green!70!black,4/red,5/red} {
\fill[\cl](RD\i)circle(2.5pt);
}\end{scope}

\begin{scope}[local bounding box=LEFT,shift={($(0,0)+(0,0)$)}]
\coordinate(D1)at(0,0);
\coordinate(D2)at(2.3,0);
\coordinate(D3)at(4.42,0);
\coordinate(D4)at(5.67,0);
\coordinate(D5)at(8.79,0);
\coordinate(G1)at(0.33,3.35);
\coordinate(GG1)at($(G1)+(-0.5,0)$);
\coordinate(G2)at(1.81,3.35);
\coordinate(G3)at(3.15,3.35);
\coordinate(G4)at(4.43,3.35);
\coordinate(G5)at(5.09,3.35);
\coordinate(G6)at(6.87,3.35);
\coordinate(G7)at(8.23,3.35);
\coordinate(GG7)at($(G7)+(0.5,0)$);
\draw[LineD,latex-latex](GG1)--(GG7)node[right,text=black]{$r$};
\draw[Line](D1)--(D5)node[right=3pt,text=black]{$Q$};
\draw[Line](G2)node[above=2pt]{$\alpha=-1$}--(G6);
\draw[LineT](G2)--(D1)node[below=2pt]{$-127$};
\draw[LineT](G1)--(D1);
\draw[LineT](G3)--(D2);
\draw[LineT](G4)node[above=2pt]{$0$}--(D3)node[below=2pt]{$0$};
\draw[LineT](G5)--(D4);
\draw[LineT](G6)node[above=2pt]{$\beta=1$}--(D5);
\draw[LineT](G7)--(D5)node[below=2pt]{$127$};
\foreach \i/\cl in {1/red,2/red,3/red,4/green!70!black,5/red,6/red,7/red} {
\fill[\cl](G\i)circle(2.5pt);
}
\foreach \i/\cl in {1/red,2/red,3/green!70!black,4/red,5/red} {
\fill[\cl](D\i)circle(2.5pt);
}
\end{scope}
\end{tikzpicture}
```
:::

Compare the two mapping diagrams side by side in @fig-calibration-ranges. Symmetric calibration (left) maps $[-1, 1]$ to $[-127, 127]$ with zero preserved, making it simpler to implement and well suited for zero-centered weight distributions. Asymmetric calibration (right) uses different ranges ($\alpha = -0.5$, $\beta = 1.5$), better utilizing the quantized range for skewed distributions at the cost of additional complexity. Most frameworks (TensorRT, PyTorch) support both modes. The conceptual difference is clear from the diagrams, but the actual computation of scale and zero-point parameters requires a concrete formula—which the following worked example of *calculating scale and zero-point* derives step by step.

```{python}
#| label: quantization-math-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ QUANTIZATION MATH: SCALE AND ZERO-POINT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Calculating Scale and Zero-Point" callout
# │
# │ Goal: Demonstrate the derivation of affine quantization parameters.
# │ Show: How to map a floating-point range to the 8-bit integer domain.
# │ How: Calculate scale and zero-point for a representative distribution.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: alpha_str, beta_str, range_str, steps_str, scale_str,
# │          zero_point_str, x_val_str, x_q_str, x_recon_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES

# --- Inputs (activation range example) ---
alpha_value = -1.0
beta_value = 3.0
bits_value = 8
x_val_value = 0.0  # value to quantize

# --- Process (calculate affine parameters) ---
# 1. Calculate Scale (s)
#    s = (beta - alpha) / (2^b - 1)
int_steps_value = 2**bits_value - 1
scale_value = (beta_value - alpha_value) / int_steps_value

# 2. Calculate Zero-Point (z)
#    z = round(-alpha / s)
#    Note: z maps the real value 0.0 to an integer
zero_point_value = round(-alpha_value / scale_value)

# 3. Quantize a value
#    x_q = clamp(round(x / s) + z, 0, 2^b - 1)
x_q_raw = round(x_val_value / scale_value) + zero_point_value
x_q_value = max(0, min(int_steps_value, x_q_raw))

# 4. Dequantize (reconstruct)
#    x_recon = (x_q - z) * s
x_recon_value = (x_q_value - zero_point_value) * scale_value

# --- Outputs (formatted strings for prose) ---
alpha_str = fmt(alpha_value, precision=1, commas=False)       # "-1.0"
beta_str = fmt(beta_value, precision=1, commas=False)         # "3.0"
range_str = fmt(beta_value - alpha_value, precision=1, commas=False) # "4.0"
steps_str = f"{int_steps_value}"                              # "255"
scale_str = fmt(scale_value, precision=4, commas=False)       # "0.0157"
zero_point_str = f"{int(zero_point_value)}"                   # "64"
x_val_str = fmt(x_val_value, precision=1, commas=False)       # "0.0"
x_q_str = f"{int(x_q_value)}"                                 # "64"
x_recon_str = fmt(x_recon_value, precision=2, commas=False)   # "0.00"
```

::: {.callout-notebook title="Calculating Scale and Zero-Point"}

**The Affine Quantization Formula**: How do we map a floating-point range $[\alpha, \beta]$ to an integer range $[0, 2^b-1]$ (e.g., UINT8)?

**The Math**:
We need a linear mapping $x \approx s(x_q - z)$, where $s$ is the **scale** (step size) and $z$ is the **zero-point** (integer value corresponding to real zero). The affine quantization process consists of three steps formalized in @eq-quant-scale, @eq-quant-zero-point, and @eq-quantize.

1.  **Calculate Scale ($s$)**: Divide the real range by the integer range.
    $$s = \frac{\beta - \alpha}{2^b - 1}$$ {#eq-quant-scale}

2.  **Calculate Zero-Point ($z$)**: Shift the range so that real zero maps to an integer.
    $$z = \text{round}\left(\frac{-\alpha}{s}\right)$$ {#eq-quant-zero-point}

3.  **Quantize ($x \to x_q$)**:
    $$x_q = \text{clamp}\left(\text{round}\left(\frac{x}{s} + z\right), 0, 2^b - 1\right)$$ {#eq-quantize}

**Example**:
Suppose your activations range from $\alpha$ = `{python} alpha_str` to $\beta$ = `{python} beta_str`. You want to quantize to UINT8 ($b=8$).

*   **Range**: $\beta - \alpha$ = `{python} range_str`.
*   **Steps**: $2^8 - 1$ = `{python} steps_str`.
*   **Scale**: $s$ = `{python} range_str` / `{python} steps_str` $\approx$ `{python} scale_str`.
*   **Zero-Point**: $z = \text{round}(-(\alpha) / s)$ = round(−`{python} alpha_str` / `{python} scale_str`) = round(63.75) = `{python} zero_point_str`.

So, the real value $0.0$ is represented by the integer `{python} zero_point_str`. This ensures that zero-padding (common in CNNs) is represented exactly, preventing "quantization drift" where padding introduces non-zero noise.
:::

##### Granularity {#sec-model-compression-granularity-93b3}

After determining the clipping range, the next optimization step is adjusting the granularity of that range to retain as much accuracy as possible. In CNNs, the input activations of a layer undergo convolution with multiple filters, each of which may have a unique range of values. The quantization process must account for these differences to preserve model performance.

This variation is strikingly visible in @fig-quantization-granularity: notice how the range for Filter 1 is significantly smaller than that for Filter 3. The precision with which the clipping range [$\alpha$, $\beta$] is determined becomes an important factor in effective quantization. This variability in ranges is why different granularity-based quantization strategies are employed.

::: {#fig-quantization-granularity fig-env="figure" fig-pos="htb" fig-cap="**Quantization Range Variation**: Different convolutional filters exhibit unique activation ranges, necessitating per-filter quantization to minimize accuracy loss during quantization. Adjusting the granularity of clipping ranges, as shown by the differing scales for each filter, optimizes the trade-off between model size and performance. Source: [@gholami2021survey]." fig-alt="Four rows showing CNN filters with Gaussian weight distributions. Each filter has different clipping ranges shown as red and blue dashed lines. Layer-wise clipping uses same range; channel-wise uses per-filter ranges."}
```{.tikz}
\resizebox{.8\textwidth}{!}{%
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=2pt,
    node distance=1.2,
    draw=VioletLine2,
    line width=0.75pt,
    fill=VioletL2,
    text width=27mm,align=flush center,
    minimum width=27mm, minimum height=10mm
  },
}
\pgfmathdeclarefunction{agausss}{2}{%
  \pgfmathparse{1/(#2*sqrt(2*pi))*exp(-((x-#1)^2)/(2*#2^2))}%
}
\pgfmathdeclarefunction{gauss}{3}{%
\pgfmathparse{1/(#3*sqrt(2*pi))*exp(-((#1-#2)^2)/(2*#3^2))}%
}
%first row
\begin{scope}
\begin{scope}[local bounding box=F1,line width=0.5pt]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{1.3}
\newcommand{\Width}{1.3}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[fill=GreenL] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[fill=GreenL] (C2) -- (B2) -- (F2) -- (G2) -- (C2);% Front Face
\draw[fill=GreenL] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
%
\draw($(C2)!0.33!(G2)$)--($(B2)!0.33!(F2)$)--($(A2)!0.33!(E2)$);
\draw($(C2)!0.66!(G2)$)--($(B2)!0.66!(F2)$)--($(A2)!0.66!(E2)$);
\draw($(B2)!0.33!(C2)$)--($(F2)!0.33!(G2)$)--($(E2)!0.33!(D2)$);
\draw($(B2)!0.66!(C2)$)--($(F2)!0.66!(G2)$)--($(E2)!0.66!(D2)$);
\draw($(B2)!0.33!(A2)$)--($(F2)!0.33!(E2)$)--($(G2)!0.33!(D2)$);
\draw($(B2)!0.66!(A2)$)--($(F2)!0.66!(E2)$)--($(G2)!0.66!(D2)$);
\node[below=0.1of $(C2)!0.5!(G2)$]{Filter 1};
\end{scope}

\begin{scope}[line width=0.5pt,shift={(4.5,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3.9:3.9,samples=51,
            smooth,xscale=0.5,yscale=2.0] (\x,{2*exp(-\x*\x/3});
\draw[red,dashed](3.2,0)--(3.2,4);
\draw[red,dashed](-3.2,0)--(-3.2,4);
\end{scope}

\begin{scope}[line width=0.5pt,shift={(10,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3.9:3.9,samples=51,
             smooth,xscale=0.5,yscale=2.0] (\x,{2*exp(-\x*\x/3});
\draw[blue,dashed](1.95,0)--(1.95,4);
\draw[blue,dashed](-1.95,0)--(-1.95,4);
\end{scope}
\end{scope}
%%%%%%%%%%%%%%%%%%%%%%%%%%
%second row
\begin{scope}[shift={(0,-3.0)}]
\begin{scope}[line width=0.5pt]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{1.3}
\newcommand{\Width}{1.3}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[fill=white] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[fill=white] (C2) -- (B2) -- (F2) -- (G2) -- (C2);% Front Face
\draw[fill=white] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
%
\draw($(C2)!0.33!(G2)$)--($(B2)!0.33!(F2)$)--($(A2)!0.33!(E2)$);
\draw($(C2)!0.66!(G2)$)--($(B2)!0.66!(F2)$)--($(A2)!0.66!(E2)$);
\draw($(B2)!0.33!(C2)$)--($(F2)!0.33!(G2)$)--($(E2)!0.33!(D2)$);
\draw($(B2)!0.66!(C2)$)--($(F2)!0.66!(G2)$)--($(E2)!0.66!(D2)$);
\draw($(B2)!0.33!(A2)$)--($(F2)!0.33!(E2)$)--($(G2)!0.33!(D2)$);
\draw($(B2)!0.66!(A2)$)--($(F2)!0.66!(E2)$)--($(G2)!0.66!(D2)$);
\node[below=0.1of $(C2)!0.5!(G2)$]{Filter 2};
\end{scope}

\begin{scope}[line width=0.5pt,shift={(4.5,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3:3,samples=51,
            smooth,xscale=0.8,yscale=1.5] (\x,{2*exp(-\x*\x/3});
\draw[red,dashed](3.2,0)--(3.2,3.5);
\draw[red,dashed](-3.2,0)--(-3.2,3.5);
\end{scope}

\begin{scope}[line width=0.5pt,shift={(10,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3:3,samples=51,
             smooth,xscale=0.8,yscale=1.5] (\x,{2*exp(-\x*\x/3});
\draw[blue,dashed](2.4,0)--(2.4,3.5);
\draw[blue,dashed](-2.4,0)--(-2.4,3.5);
\end{scope}
\end{scope}
%%%%%%%%%%%%%%%%%%%%%%%%%
%third row
\begin{scope}[shift={(0,-6.0)}]
\begin{scope}[local bounding box=SF3,line width=0.5pt]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{1.3}
\newcommand{\Width}{1.3}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[fill=white] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[fill=white] (C2) -- (B2) -- (F2) -- (G2) -- (C2);% Front Face
\draw[fill=white] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
%
\draw($(C2)!0.33!(G2)$)--($(B2)!0.33!(F2)$)--($(A2)!0.33!(E2)$);
\draw($(C2)!0.66!(G2)$)--($(B2)!0.66!(F2)$)--($(A2)!0.66!(E2)$);
\draw($(B2)!0.33!(C2)$)--($(F2)!0.33!(G2)$)--($(E2)!0.33!(D2)$);
\draw($(B2)!0.66!(C2)$)--($(F2)!0.66!(G2)$)--($(E2)!0.66!(D2)$);
\draw($(B2)!0.33!(A2)$)--($(F2)!0.33!(E2)$)--($(G2)!0.33!(D2)$);
\draw($(B2)!0.66!(A2)$)--($(F2)!0.66!(E2)$)--($(G2)!0.66!(D2)$);
\node[below=0.1of $(C2)!0.5!(G2)$](F3){Filter 3};
\end{scope}

\begin{scope}[local bounding box=S1,line width=0.5pt,shift={(4.5,-0.5)},scale=0.7]
\draw[line width=2pt] plot[domain=-4:4,samples=51,
             smooth,xscale=0.8] (\x,{1.7*exp(-\x*\x/3});
\draw[red,dashed](3.2,0)--(3.2,2);
\draw[red,dashed](-3.2,0)--(-3.2,2);
\end{scope}

\begin{scope}[local bounding box=S2,line width=0.5pt,shift={(10,-0.5)},scale=0.7]
\draw[line width=2pt] plot[domain=-4:4,samples=51,
            smooth,xscale=0.8] (\x,{1.7*exp(-\x*\x/3});
\draw[blue,dashed](3.2,0)--(3.2,2);
\draw[blue,dashed](-3.2,0)--(-3.2,2);
\end{scope}
\end{scope}
%%%%%%%%%%%%%%%%%%%%%%%%%%
%fourth row
\begin{scope}[shift={(0,-9.25)}]
\begin{scope}[local bounding box=SFC,line width=0.5pt]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{1.3}
\newcommand{\Width}{1.3}
\coordinate (CO2) at (0,0,0);
\coordinate (CA2) at (0,\Width,0);
\coordinate (CB2) at (0,\Width,\Height);
\coordinate (CC2) at (0,0,\Height);
\coordinate (CD2) at (\Depth,0,0);
\coordinate (CE2) at (\Depth,\Width,0);
\coordinate (CF2) at (\Depth,\Width,\Height);
\coordinate (CG2) at (\Depth,0,\Height);

\draw[fill=white] (CD2) -- (CE2) -- (CF2) -- (CG2) -- cycle;% Right Face
\draw[fill=white] (CC2) -- (CB2) -- (CF2) -- (CG2) -- (CC2);% Front Face
\draw[fill=white] (CA2) -- (CB2) -- (CF2) -- (CE2) -- cycle;% Top Face
%
\draw($(CC2)!0.33!(CG2)$)--($(CB2)!0.33!(CF2)$)--($(CA2)!0.33!(CE2)$);
\draw($(CC2)!0.66!(CG2)$)--($(CB2)!0.66!(CF2)$)--($(CA2)!0.66!(CE2)$);
\draw($(CB2)!0.33!(CC2)$)--($(CF2)!0.33!(CG2)$)--($(CE2)!0.33!(CD2)$);
\draw($(CB2)!0.66!(CC2)$)--($(CF2)!0.66!(CG2)$)--($(CE2)!0.66!(CD2)$);
\draw($(CB2)!0.33!(CA2)$)--($(CF2)!0.33!(CE2)$)--($(CG2)!0.33!(CD2)$);
\draw($(CB2)!0.66!(CA2)$)--($(CF2)!0.66!(CE2)$)--($(CG2)!0.66!(CD2)$);
\node[below=0.1of $(CC2)!0.5!(CG2)$](FC){Filter C};
\end{scope}

\begin{scope}[local bounding box=S3,line width=0.5pt,shift={(4.5,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3:3,samples=51,
             smooth,xscale=0.8,yscale=1.5] (\x,{2*exp(-\x*\x/3});
\draw[red,dashed](3.2,0)coordinate(X2)--(3.2,3.5);
\draw[red,dashed](-3.2,0)coordinate(X1)--(-3.2,3.5);
\node[below=0.35of $(X1)!0.5!(X2)$,align=center]{Layerwise\\ Quantization};
\end{scope}

\begin{scope}[local bounding box=S4,line width=0.5pt,shift={(10,-1)},scale=0.7]
\draw[line width=2pt] plot[domain=-3:3,samples=51,
            smooth,xscale=0.8,yscale=1.5] (\x,{2*exp(-\x*\x/3});
\draw[blue,dashed](2.4,0)coordinate(D2)--(2.4,3.5);
\draw[blue,dashed](-2.4,0)coordinate(D1)--(-2.4,3.5);
\node[below=0.35of $(D1)!0.5!(D2)$,align=center]{Channelwise\\ Quantization};
\end{scope}
\node[rotate=90,font=\Large\bfseries]at($(S1)!0.42!(S3)$){...};
\node[rotate=90,font=\Large\bfseries]at($(S2)!0.42!(S4)$){...};
\node[rotate=90,font=\Large\bfseries]at($(SF3)!0.48!(SFC)$){...};
\end{scope}
%%%%
%diagram
\begin{scope}[local bounding box=DI,line width=0.5pt,shift={(-5,-1)}]
\node[Box](B1){Layer N};
\node[Box,below=of B1](B2){Layer N – 1};
\node[Box,node distance=2.2,,below=of B2](B3){Layer 2};
\node[Box,below=of B3,fill=RedL,draw=RedLine](B4){Layer 1};
\node[rotate=90,font=\Large\bfseries](B0)at($(B2)!0.5!(B3)$){...};
\draw[Line,-latex](B4)--(B3);
\draw[Line,-latex](B3)--(B0);
\draw[Line,-latex](B0)--(B2);
\draw[Line,-latex](B2)--(B1);
\draw[Line,-latex](B1)--++(90:1.3)node[above]{Output: $\hat{y}$};
\draw[Line,latex-](B4)--++(270:1.3)node[below]{Input: $x$};
\end{scope}
\draw[dashed,red,thick](B4.north east)--(F1.north west);
\draw[dashed,red,thick](B4.south east)--(FC.south west);
\end{tikzpicture}}
```
:::

Quantization granularity\index{Quantization!granularity} determines how many parameters share the same clipping range:

- **Layerwise**\index{Quantization!layerwise}: One range per layer. Simple but suboptimal when filter ranges vary widely.
- **Groupwise**\index{Quantization!groupwise}: Filters grouped with shared ranges. Used in Q-BERT [@sheng2019qbert] for transformer attention layers.
- **Channelwise**\index{Quantization!channelwise}: One range per filter. The current standard, balancing accuracy and efficiency.
- **Sub-channelwise**\index{Quantization!sub-channelwise}: Ranges within each filter. Maximum precision but significant overhead.

Channelwise quantization has become the dominant approach, providing significant accuracy improvements over layerwise quantization with minimal computational overhead.

With granularity determined, the next consideration is what to quantize. Neural networks contain two primary numerical components: the static weights learned during training and the dynamic activations computed during inference. Each presents distinct quantization challenges.

##### Weights vs. Activations {#sec-model-compression-weights-vs-activations-5efb}

\index{Quantization!weight vs activation}
Weight Quantization\index{Quantization!weight} involves converting the continuous, high-precision weights of a model into lower-precision values, such as converting 32-bit floating-point (Float32) weights to 8-bit integer (INT8) weights. In @fig-weight-activations-quantization, focus on the second step (red squares) to see where weight quantization enters the multiply-accumulate pipeline. This process significantly reduces the model size, decreasing both the memory required to store the model and the computational resources needed for inference. For example, a weight matrix in a neural network layer with Float32 weights like $[0.215, -1.432, 0.902,\ldots]$ might be mapped to INT8 values such as $[27, -183, 115, \ldots]$, leading to a significant reduction in memory usage.

::: {#fig-weight-activations-quantization fig-env="figure" fig-pos="htb" fig-cap="**Quantization and Weight Precision**: Color-coded matrix multiplication diagram showing three steps: blue squares represent input activations, red squares represent quantized weights, and green squares represent output activations. Reducing precision from float32 to INT8 lowers model size and computational cost at the potential expense of accuracy. Source: HarvardX." fig-alt="Matrix multiplication diagram with three steps. Blue squares show input activations, red squares show quantized weights, and green squares show output activations. Arrows indicate computation flow through multiply-accumulate operations."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]

\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=2pt,
    node distance=0.5,
    draw=BlueLine,
    line width=0.75pt,
    fill=BlueL,
    text width=19mm,align=flush center,
    minimum width=19mm, minimum height=9mm
  },
}

\node[Box](B1){Matrix\\ Multiplication};
\node[Box,right=of B1,fill=RedL,draw=RedLine](B2){Int32\\ Output};
\node[Box,right=of B2,fill=GreenL,draw=GreenLine](B3){Quantization};
\node[Box,node distance=1.4,right=of B3,fill=BrownL,draw=BrownLine](B4){Activation};
\node[Box,right=of B4,fill=RedL,draw=RedLine](B5){Float 16\\ Output};
\node[Box,above left=0.8 and 0 of B1,fill=VioletL2,draw=VioletLine2](GB2){Quantization};
\node[Box,left=of GB2,fill=VioletL2,draw=VioletLine2](GB1){Float input};
\node[Box,below left=0.8 and 0 of B1,fill=VioletL2,draw=VioletLine2](DB2){Quantization};
\node[Box,left=of DB2,fill=VioletL2,draw=VioletLine2](DB1){Float input};
%%%
\begin{scope}[font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},scale=0.8,shift={($(B4)+(0,-3.5)$)}]
\draw[line width=0.5pt, -{Latex[length=6pt,width=4pt]}] (-3,0)--(3,0)node[below, xshift=-0.12cm]{$x$};
\draw[line width=0.5pt, -{Latex[length=6pt,width=4pt]}] (0,-1.8)--(0,2.2)node[left, yshift=-0.15cm]{$y$};
\draw[xscale=1, yscale=1, line width=1.25pt, domain=-2.9:2.9,smooth,
            variable=\x, BlueLine] plot ({\x},{rad(atan(\x))});
\draw[line width=0.5pt] (0,1.4)--(2.95,1.4);
\draw[line width=0.5pt] (0,-1.4)--(-2.95,-1.4);
\draw[line width=0.5pt] (-0.1,1.4)node[left]{1}--(0,1.4);
\draw[line width=0.5pt] (-0.1,0.7)node[left]{0.5}--(0.1,0.7);
\draw[line width=0.5pt] (-0.1,-0.7)--(0.1,-0.7)node[right]{–0.5};
\draw[line width=0.5pt] (-0.1,-1.4)--(0,-1.4)node[right]{–1};
%
\draw[red, line width=1.0pt](-2.9,0.1)to[out=0,in=180](2.9,1.3);
\draw[green!90!red!90, , line width=1.0pt](0,0)--++(190:3);
\draw[VioletLine,line width=1.0pt](0,0)--++(45:3);
\end{scope}
%%
\draw[Line,-latex](GB1)--(GB2);
\draw[Line,-latex](GB2)-|node[above,pos=0.3]{Int8}(B1);
\draw[Line,-latex](DB1)--(DB2);
\draw[Line,-latex](DB2)-|node[below,pos=0.3]{Int8}(B1);
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)--(B3);
\draw[Line,-latex](B3)--node[above]{Float 16}(B4);
\draw[Line,-latex](B4)--(B5);
\end{tikzpicture}
```
:::

Activation Quantization refers to the process of quantizing the activation values, or outputs of the layers, during model inference. This quantization can reduce the computational resources required during inference, particularly when targeting hardware optimized for integer arithmetic. It introduces challenges related to maintaining model accuracy, as the precision of intermediate computations is reduced. For instance, in a CNN, the activation maps (or feature maps) produced by convolutional layers, originally represented in Float32, may be quantized to INT8 during inference. This can significantly accelerate computation on hardware capable of efficiently processing lower-precision integers.

\index{LLaMA!AWQ quantization}
Recent advancements have explored **Activation-aware Weight Quantization (AWQ)**\index{Quantization!activation-aware weight (AWQ)}[^fn-awq] for the compression and acceleration of large language models (LLMs). This approach is particularly relevant for our **GPT-2 / Llama Lighthouse**, which is memory-bandwidth bound. By protecting only a small fraction of the most salient weights (approximately 1%) based on activation magnitude, AWQ enables effective 4-bit weight quantization. This reduces the memory traffic required to load the massive parameter set for every token generation, directly attacking the primary bottleneck of generative inference [@lin2023awq].

[^fn-awq]: **Activation-aware Weight Quantization (AWQ)**: Observes that only approximately 1% of weights disproportionately affect accuracy based on activation patterns. By protecting these salient weights while aggressively quantizing others, AWQ achieves INT4 quantization of LLaMA-7B with <1% perplexity degradation. Delivers 3.2 $\times$ speedup on A100 GPUs by reducing memory bandwidth requirements from 14 GB to 3.5 GB per inference.

##### Static vs. Dynamic Quantization {#sec-model-compression-static-vs-dynamic-quantization-4791}

After determining the type and granularity of the clipping range, practitioners must decide *when* the clipping ranges are calculated. Two primary approaches exist for quantizing activations: static quantization\index{Quantization!static} and dynamic quantization\index{Quantization!dynamic}.

In static quantization, the clipping range is pre-calculated and remains fixed during inference. This method introduces no additional computational overhead at runtime, making it efficient. The fixed range can, however, lead to lower accuracy compared to dynamic quantization. A typical implementation involves running calibration inputs to compute the typical activation range [@jacob2018quantization; @yao2021hawq].

Dynamic quantization instead calculates the range for each activation map at runtime. This allows the quantization process to adjust based on the input, potentially yielding higher accuracy since the range is computed per activation. The trade-off is higher computational overhead, since the range must be recalculated at each step, which can be expensive at scale.

These timing and granularity decisions interact with the broader choice of quantization methodology. @tbl-quantization_methods compares post-training quantization, quantization-aware training, and dynamic quantization, each offering distinct strengths and trade-offs for different deployment scenarios.

| **Aspect**                    | **Post Training Quantization** | **Quantization-Aware Training** | **Dynamic Quantization** |
|:------------------------------|:-------------------------------|:--------------------------------|:-------------------------|
| **Pros**                      |                                |                                 |                          |
| **Simplicity**                | ✓                              | ✗                               | ✗                        |
| **Accuracy Preservation**     | ✗                              | ✓                               | ✓                        |
| **Adaptability**              | ✗                              | ✗                               | ✓                        |
| **Optimized Performance**     | ✗                              | ✓                               | △                        |
| **Cons**                      |                                |                                 |                          |
| **Accuracy Degradation**      | ✓                              | ✗                               | △                        |
| **Computational Overhead**    | ✗                              | ✓                               | ✓                        |
| **Implementation Complexity** | ✗                              | ✓                               | ✓                        |
| **Tradeoffs**                 |                                |                                 |                          |
| **Speed vs. Accuracy**        | ✓                              | ✗                               | ✗                        |
| **Accuracy vs. Cost**         | ✗                              | ✓                               | ✗                        |
| **Adaptability vs. Overhead** | ✗                              | ✗                               | ✓                        |

: **Quantization Method Comparison**: Post-training quantization, quantization-aware training, and dynamic quantization represent distinct approaches to model compression. Legend: ✓ = present, ✗ = absent, △ = input-dependent. {#tbl-quantization_methods}

This comparison highlights the diverse strategies available for precision reduction. Before proceeding to advanced techniques, verify your grasp of these core quantization modes.

::: {.callout-checkpoint title="The Quantization Gate" collapse="false"}
Precision reduction is the most impactful deployment optimization.

**Quantization Modes**

- [ ] **Post-Training Quantization (PTQ)**: Why is this the default for CNNs? (Activations are stable, error is low).
- [ ] **Quantization-Aware Training (QAT)**: Why do you need QAT for MobileNet? (Compact models have less redundancy to absorb quantization noise).

**System Implications**

- [ ] **Latency vs. Throughput**: Why does INT8 improve throughput on GPUs but only latency on CPUs? (Hint: Parallelism vs. Instruction Efficiency).
:::

##### PTQ in Practice {#sec-model-compression-ptq-practice-2800}

The preceding subsections reveal PTQ's core trade-off: simplicity versus accuracy control. PTQ requires no retraining and can be applied to any pre-trained model in minutes, making it the default starting point for deployment optimization. For rapid deployment scenarios with production deadlines under two weeks and acceptable accuracy loss of 1–2%, PTQ with appropriate calibration often provides a complete solution.

The limitation is that PTQ offers no mechanism to recover from accuracy loss. If the quantized model's accuracy drops below the production threshold — a common outcome for transformer-based architectures where attention mechanisms amplify small numerical differences — the only recourse is to choose a less aggressive precision format, which sacrifices the efficiency gains that motivated quantization in the first place. This ceiling on PTQ's accuracy preservation motivates a more powerful approach: rather than applying quantization as a post-hoc transformation, we can integrate precision constraints directly into the training process itself.

#### Quantization-Aware Training {#sec-model-compression-quantizationaware-training-4032}

QAT\index{Quantization-Aware Training (QAT)!definition} integrates quantization constraints directly into the training process, simulating low-precision arithmetic during forward passes to allow the model to adapt to quantization effects [@jacob2018quantization]. Production systems requiring less than 1% accuracy loss benefit most from this approach, which recovers accuracy through fine-tuning with quantization simulation at the cost of 20–50% additional training time. This approach is particularly important for models requiring fine-grained numerical precision, such as transformers used in NLP and speech recognition systems [@nagel2021white]. The QAT pipeline, outlined in @fig-qat, applies quantization to a pre-trained model and then fine-tunes it so the weights learn to compensate for low-precision constraints.

::: {#fig-qat fig-env="figure" fig-pos="htb" fig-cap="**Quantization-Aware Training**: Vertical flowchart showing the QAT pipeline: a pre-trained model passes through a quantization step that simulates low-precision arithmetic, then undergoes retraining with training data to adapt weights to quantization constraints, producing a final quantized model optimized for efficient inference." fig-alt="Vertical flowchart showing QAT process. Pre-trained model feeds into Quantization step, then Retraining/Finetuning step with Training data input, producing final Quantized model output."}
```{.tikz}
\begin{tikzpicture}[font=\footnotesize\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
   node distance=0.5,
    draw=black!90,
    line width=0.75pt,
    anchor=west,
    align=flush center,
    minimum width=64mm,
    minimum height=7.5mm
  },
Line/.style={line width=1.0pt,black!50,-latex}
}
\node[Box,fill=GreenL](B1){Quantized model};
\node[Box,fill=BlueL,above=of B1](B2){Retraining/Finetuning};
\node[Box,fill=BlueL,above=of B2.north west,minimum width=30mm,
anchor= south west](B3){Quantization};
\node[Box,fill=GreenL,above=of B3.north east,minimum width=30mm,
anchor= south east](B4){Pre-trained model};
\node[Box,fill=none,above=of B2.north east,minimum width=30mm,
draw=none,anchor= south east](B5){};
\draw[Line](B2)--(B1);
\draw[Line](B4)--(B4|-B3.north);
\draw[Line](B3)--(B3|-B2.north);
\path[red](B2.north east)|-coordinate(A)(B4.north east);
\path[blue](A)-|(B5.south west)coordinate(B);
\draw[line width=0.75pt, draw=black!90, fill=OrangeL]
(A) rectangle (B) node[pos=0.5] {Training data};
\coordinate(S)at($(B)!0.5!(B5.south east)$);
\draw[Line](S)--(S|-B2.north);
\end{tikzpicture}

```
:::

In many cases, QAT can also build off PTQ (discussed in detail in the previous section). Trace the two-stage pipeline in @fig-ptq-qat: instead of starting from a full-precision model, PTQ is first applied to produce an initial quantized model using calibration data. This quantized model then serves as the starting point for QAT, where additional fine-tuning with training data helps the model better adapt to low-precision constraints. This hybrid approach combines PTQ's efficiency with QAT's accuracy preservation, reducing the degradation typically associated with post-training approaches alone.

::: {#fig-ptq-qat fig-env="figure" fig-pos="htb" fig-cap="**PTQ-to-QAT Pipeline.** Two grouped stages: the PTQ stage quantizes and calibrates a pretrained model using calibration data, then the QAT stage fine-tunes the result with training data. This hybrid approach combines PTQ's efficiency with QAT's accuracy preservation." fig-alt="Vertical flowchart with two grouped stages. PTQ stage shows pretrained model through quantize and calibrate steps. QAT stage shows fine-tuning step. Calibrate data feeds PTQ; Training data feeds QAT."}
```{.tikz}
\scalebox{0.8}{
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]

\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50},
Box/.style={inner xsep=2pt,
    node distance=0.6,
    draw=GreenLine,
    line width=0.75pt,
    fill=GreenL,
    text width=27mm,align=flush center,
    minimum width=27mm, minimum height=9mm
  },
}

\node[Box](B1){Pretrained model};
\node[Box,below=of B1,fill=RedL,draw=RedLine](B2){Quantize model};
\node[Box,below=of B2,fill=RedL,draw=RedLine](B3){Calibrate model};
\node[Box,below=of B3](B4){PTQ model};
\node[Box,below=of B4,fill=RedL,draw=RedLine](B5){Fine-tune model};
\node[Box,below=of B5](B6){QAT model};
%
\node[Box,node distance=1.6,left=of B3,fill=BlueL,draw=BlueLine](B7){Calibrate data};
\node[Box,node distance=1.6,left=of B5,fill=BlueL,draw=BlueLine](B8){Training data};

\foreach \x in{1,...,5}{
\pgfmathtruncatemacro{\newX}{\x + 1}
\draw[Line,-latex](B\x)--(B\newX);
}
\draw[Line,-latex](B7)--(B3);
\draw[Line,-latex](B8)--(B5);
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=3mm,
yshift=0mm,
fill=BackColor,fit=(B2)(B3)(B7),line width=0.75pt](BB1){};
\node[below=4pt of  BB1.north,inner sep=0pt,
anchor=north,fill=BackColor]{PTQ};

\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=3mm,
yshift=0mm,
fill=OliveL!30,fit=(B5)(B8),line width=0.75pt](BB2){};
\node[below=4pt of  BB2.north,inner sep=0pt,
anchor=north,fill=BackColor]{QAT};
\end{tikzpicture}}
```
:::

##### Training Mathematics {#sec-model-compression-training-mathematics-699e}

During forward propagation, weights and activations are quantized and dequantized to mimic reduced precision. Let $x$ be a full-precision value, $s$ the scaling factor that maps floating-point values into a lower-precision range, and $q$ the simulated quantized value. This process is typically represented as:
$$
q = \text{round} \left(\frac{x}{s} \right) \times s
$$
where $q$ represents the simulated quantized value, $x$ denotes the full-precision weight or activation, and $s$ is the scaling factor mapping floating-point values to lower-precision integers.

\index{Straight-Through Estimator (STE)!etymology}
\index{Bengio, Yoshua!straight-through estimator}
Although the forward pass utilizes quantized values, gradient calculations during backpropagation remain in full precision. The Straight-Through Estimator (STE) accomplishes this\index{Straight-Through Estimator (STE)}[^fn-ste], which approximates the gradient of the quantized function by treating the rounding operation as if it had a derivative of one. In effect, the STE pretends quantization is the identity function during backpropagation, allowing gradients to flow unchanged through otherwise non-differentiable operations. This approach prevents the gradient from being obstructed due to the non-differentiable nature of the quantization operation, thereby allowing effective model training [@bengio2013estimating].

[^fn-ste]: **Straight-Through Estimator (STE)**: Gradient approximation technique for non-differentiable functions, introduced by Bengio et al. [@bengio2013estimating]. Sets gradient of step function to 1 everywhere, enabling backpropagation through quantization layers. Crucial for training binarized neural networks and quantization-aware training, despite theoretical limitations around zero.

Integrating quantization effects during training enables the model to learn weight and activation distributions that minimize numerical precision loss. The resulting model, when deployed using true low-precision arithmetic (e.g., INT8 inference), maintains significantly higher accuracy than one that is quantized post hoc [@krishnamoorthi2018quantizing].

##### Fake Quantization Nodes and Implementation {#sec-model-compression-fake-quantization-nodes-implementation-e822}

QAT implementation relies on fake quantization\index{Quantization-Aware Training (QAT)!fake quantization} operations that simulate quantization during forward propagation while maintaining full precision for gradient computation. These operations insert quantize-dequantize pairs into the computational graph, creating a training-time simulation of inference-time behavior.

A fake quantization node performs three operations sequentially:

1. **Quantization**: Map floating-point value to discrete quantization level
2. **Clipping**: Enforce range constraints based on bit width
3. **Dequantization**: Convert back to floating-point for subsequent operations

Mathematically, for symmetric quantization with bit width $b$, given a floating-point input value $x$:
$$
\begin{aligned}
q_{level} &= \text{clip}\left(\text{round}\left(\frac{x}{s}\right), -2^{b-1}, 2^{b-1} - 1\right) \\
x_{fake} &= q_{level} \times s
\end{aligned}
$$
where $s = \frac{\max(|x|)}{2^{b-1} - 1}$ is the scale factor computed from the input distribution, and $x_{fake}$ represents the fake-quantized output that mimics INT8 values but remains in floating-point format.

For asymmetric quantization supporting unsigned integers:
$$
\begin{aligned}
s &= \frac{\max(x) - \min(x)}{2^b - 1} \\
z &= \text{round}\left(-\frac{\min(x)}{s}\right) \\
q_{level} &= \text{clip}\left(\text{round}\left(\frac{x}{s} + z\right), 0, 2^b - 1\right) \\
x_{fake} &= (q_{level} - z) \times s
\end{aligned}
$$
where $z$ is the zero-point offset enabling asymmetric range representation.

The following implementation demonstrates how frameworks simulate quantization during training while maintaining gradient flow. The forward pass applies quantization to both inputs and weights before convolution, mimicking INT8 inference behavior, while the implementation maintains floating-point precision throughout to allow gradients to flow during backpropagation. @lst-qat-conv-forward demonstrates the computational graph for a quantized convolution layer, which contains fake quantization nodes for both weights and activations:

::: {#lst-qat-conv-forward lst-cap="**QAT Convolution Forward Pass**: Fake quantization nodes simulate integer quantization during training while maintaining gradient flow through the straight-through estimator."}
```{.python}
# Forward pass with fake quantization
def qat_conv_forward(x, weight):
    # Fake quantize input activations
    x_scale = compute_scale(x, bits=8, symmetric=False)
    x_zero = compute_zero_point(x, x_scale, bits=8)
    x_quant = fake_quantize(x, x_scale, x_zero, bits=8)

    # Fake quantize weights (typically symmetric)
    w_scale = compute_scale(weight, bits=8, symmetric=True)
    w_quant = fake_quantize(weight, w_scale, zero=0, bits=8)

    # Convolution with fake-quantized values
    output = conv2d(x_quant, w_quant)
    return output
```
:::

The critical aspect of fake quantization is gradient handling during backpropagation. The rounding and clipping operations are non-differentiable, requiring gradient approximation through the Straight-Through Estimator:
$$
\frac{\partial x_{fake}}{\partial x} = \begin{cases}
1 & \text{if } x \in [x_{min}, x_{max}] \\
0 & \text{otherwise}
\end{cases}
$$
This approximation treats the quantization function as identity within the valid range, allowing gradients to flow unchanged through the fake quantization nodes except for values that exceed clipping bounds.

During backpropagation, the full-precision gradient $\frac{\partial \mathcal{L}}{\partial x_{fake}}$ propagates directly to $x$ for values within the quantization range. For weights and activations exceeding the range, gradients become zero, preventing further updates that would push values beyond representable limits. This gradient behavior encourages the model to learn weight distributions that naturally fit within quantization constraints.

In practice, frameworks like PyTorch and TensorFlow implement fake quantization as custom autograd operators whose forward pass performs the quantize-dequantize round trip while the backward pass applies the STE mask directly. Two implementation details deserve attention. First, scale factors should not remain static throughout training — as weight distributions evolve, scales must track these changes via exponential moving averages to prevent scale mismatch between training and deployment. Second, batch normalization layers require special handling\index{Batch Normalization!QAT interaction}: their running statistics must be computed on fake-quantized activations rather than full-precision values, ensuring that inference with true INT8 operations uses parameters calibrated for quantized distributions.

##### QAT Trade-offs {#sec-model-compression-qat-tradeoffs-e8ef}

QAT's[^fn-qat-performance] primary advantage is accuracy preservation under low-precision inference. By incorporating quantization noise during training, the model learns weight distributions that tolerate reduced precision, and AI processors with dedicated integer units (TPUs, NPUs, edge accelerators) can then exploit INT8 arithmetic for faster, lower-energy inference without significant accuracy degradation [@wu2020integer; @gholami2021survey]. QAT particularly benefits quantization-sensitive architectures: transformers for NLP, speech recognition encoders, and high-resolution vision models where attention mechanisms amplify small numerical differences.

[^fn-qat-performance]: **Quantization-Aware Training**: QAT enables INT8 inference with minimal accuracy loss - ResNet-50 maintains 76.1% vs. 76.2% FP32 ImageNet accuracy, while MobileNetV2 achieves 71.8% vs. 72.0%. BERT-Base INT8 retains 99.1% of FP32 performance on GLUE, compared to 96.8% with post-training quantization alone.

The cost is additional engineering complexity. Simulated quantization at every forward pass adds 20-50% to training time, and QAT introduces hyperparameters (quantization schemes, scale factor update schedules) that require careful tuning [@choukroun2019low]. This overhead makes QAT less practical for very large models where training budgets are already constrained.

In practice, the choice between PTQ and QAT follows a simple decision rule: start with PTQ and measure accuracy on the validation set. If accuracy meets the production threshold, ship it — the engineering cost of QAT is not justified. If PTQ falls short, invest in QAT to recover the gap. A hybrid approach, starting with PTQ calibration and applying QAT fine-tuning only for accuracy-critical layers, often provides the best balance.

Both PTQ and QAT typically target 8-bit or 4-bit precision while maintaining near-original accuracy. Some deployment scenarios, however, demand even more aggressive compression, pushing precision to the absolute limits of what neural networks can tolerate.

### Extreme Quantization {#sec-model-compression-extreme-quantization-aba4}

Beyond INT8 and INT4\index{INT4!extreme quantization}\index{Quantization!INT4 (4-bit integer)} quantization, extreme quantization techniques use 1-bit (binarization)\index{Quantization!binarization} or 2-bit (ternarization)\index{Quantization!ternarization} representations to achieve dramatic reductions in memory usage and computational requirements [@Courbariaux2016]. Binarization\index{Binary Neural Networks} constrains weights and activations to two values (typically -1 and +1, or 0 and 1), drastically reducing model size and accelerating inference on specialized hardware like binary neural networks [@Rastegari2016]. However, this constraint severely limits model expressiveness, often degrading accuracy on tasks requiring high precision such as image recognition or natural language processing [@Hubara2018].

\index{Ternary Networks!definition}
Ternarization extends binarization by allowing three values (-1, 0, +1), providing additional flexibility that slightly improves accuracy over pure binarization [@Zhu2017]. The zero value enables greater sparsity while maintaining more representational power. Both techniques require gradient approximation methods like Straight-Through Estimator (STE) to handle non-differentiable quantization operations during training [@bengio2013estimating], with QAT integration helping mitigate accuracy loss [@Choi2019].

#### Challenges and Limitations {#sec-model-compression-challenges-limitations-a99c}

Despite enabling ultra-low-power machine learning for embedded systems and mobile devices, binarization and ternarization face significant challenges. Performance maintenance is difficult with such drastic quantization, requiring specialized hardware capable of efficiently handling binary or ternary operations [@Umuroglu2017]. Traditional processors lack optimization for these computations, necessitating custom hardware accelerators.

Accuracy loss remains a critical concern. These methods suit tasks where high precision is not critical or where QAT can compensate for precision constraints. Despite challenges, the ability to drastically reduce model size while maintaining acceptable accuracy makes them attractive for edge AI and resource-constrained environments [@jacob2018quantization]. Future advances in specialized hardware and training techniques will likely enhance their role in efficient, scalable AI.

The following checkpoint tests understanding of quantization concepts before proceeding to architectural efficiency techniques.

::: {.callout-checkpoint title="Quantization and Precision Checkpoint" collapse="true"}

Test your understanding of quantization before moving to architectural efficiency:

- [ ] Can you explain why INT8 quantization provides roughly 4 $\times$ memory reduction but potentially more than 4 $\times$ energy reduction? Consider the energy cost of floating-point versus integer arithmetic units.
- [ ] Do you understand the key advantage of quantization-aware training (QAT) over post-training quantization (PTQ), and when the extra training cost is justified?
- [ ] Can you explain why weight quantization provides nearly linear speedup with bit-width reduction for memory-bandwidth-bound LLM inference? Think about which resource is the bottleneck during autoregressive generation.
:::

We have now covered two optimization dimensions: structural optimization (pruning, distillation, NAS) determines *what* to compute, and precision optimization (quantization) determines *how precisely* to compute it. Together, these techniques can reduce a model's theoretical complexity by 80% or more—from removing half the parameters through pruning to compressing weights from 32-bit floats to 4-bit integers through quantization.

Yet practitioners often discover a frustrating gap between theory and practice: a model pruned to 50% parameters and quantized to INT8 may achieve only 20% latency improvement on actual hardware. Why does theoretical compression fail to translate into proportional speedup? This theory-practice gap reveals that optimization must extend beyond the model itself to how computations execute on physical hardware.

The gap arises from several sources. Sparse matrices stored in dense format waste memory bandwidth loading zeros—the hardware cannot skip what it does not know is zero. Operations that could run in parallel execute sequentially due to data dependencies the compiler cannot resolve. Simple inputs receive the same computational budget as complex ones because the model has no mechanism to exit early. Closing the gap between "optimized on paper" and "optimized in practice" is the domain of our third optimization dimension: **architectural efficiency**. This dimension ensures that structural and precision optimizations translate into real-world speedups by aligning computation patterns with hardware capabilities.


## Architectural Efficiency {#sec-model-compression-architectural-efficiency-8dd3}

Architectural efficiency optimization ensures that computations execute efficiently on target hardware by aligning model operations with processor capabilities and memory hierarchies. Where representation optimization determines *what* computations to perform and precision optimization determines *how precisely* to compute, architectural efficiency addresses *how* operations are scheduled, memory is accessed, and workloads adapt to input characteristics. This third dimension closes the gap between theoretical compression ratios and real-world speedups.

Four complementary approaches to architectural efficiency are examined: hardware-aware design principles that proactively integrate deployment constraints during model development, sparsity exploitation techniques that accelerate computation on pruned models, dynamic computation strategies that adapt workload to input complexity, and operator fusion methods that reduce memory traffic by combining operations. These techniques transform algorithmic optimizations into realized performance gains.

### Hardware-Aware Design {#sec-model-compression-hardwareaware-design-c561}

Closing the gap between theoretical complexity reduction and real-world performance requires incorporating hardware constraints directly into the model design process.\index{Hardware-Aware Design!principles}\index{Hardware-Aware Optimization!design principles} Hardware-aware design addresses this by ensuring that architectural decisions account for memory bandwidth, parallelism capabilities, and energy budgets from the outset, rather than treating hardware compatibility as an afterthought.

#### Efficient Design Principles {#sec-model-compression-efficient-design-principles-b015}

Designing for hardware efficiency requires structuring architectures to account for computational cost, memory usage, inference latency, and power consumption while maintaining strong predictive performance. A key aspect involves leveraging the strengths of specific hardware platforms (GPUs, TPUs, mobile or edge devices) to maximize parallelism, optimize memory hierarchies, and minimize latency through hardware-optimized operations. @tbl-hardware-efficient-design categorizes these design principles, each addressing a core aspect of computational and system constraints.

| **Principle**             | **Goal**                                                                                                                                                                     | **Example Networks**            |
|:--------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
| **Scaling Optimization**  | Adjust model depth, width, and resolution to balance efficiency and hardware constraints.                                                                                    | EfficientNet, RegNet            |
| **Computation Reduction** | Minimize redundant operations to reduce computational cost, using hardware-specific optimizations (e.g., depthwise separable convolutions on mobile chips).                  | MobileNet, ResNeXt              |
| **Memory Optimization**   | Ensure efficient memory usage by reducing activation and parameter storage requirements, using hardware-specific memory hierarchies (e.g., local and global memory in GPUs). | DenseNet, SqueezeNet            |
| **Hardware-Aware Design** | Optimize architectures for specific hardware constraints (e.g., low power, parallelism, high throughput).                                                                    | TPU-optimized models, MobileNet |

: **Hardware-Aware Design Principles**: Categorizing model design choices by their impact on computational cost, memory usage, and inference latency enables structured optimization for diverse hardware platforms and deployment scenarios. MobileNet exemplifies computation reduction through depthwise separable convolutions, while DenseNet and SqueezeNet demonstrate memory optimization strategies. {#tbl-hardware-efficient-design}

The principles in @tbl-hardware-efficient-design work synergistically: scaling optimization sizes models appropriately for available resources, computation reduction eliminates redundant operations through techniques like depthwise separable convolutions[^fn-depthwise-separable-efficiency], memory optimization aligns access patterns with hardware hierarchies, and hardware-aware design ensures architectural decisions match platform capabilities. Together, these principles enable models that balance accuracy with efficiency while maintaining consistent behavior across deployment environments.

[^fn-depthwise-separable-efficiency]: **Depthwise Separable Convolutions**: Factorizes standard convolution into depthwise (per-channel) and pointwise (1 $\times$ 1) operations, reducing computation by 8–9 $\times$. MobileNetV2 achieves 72% ImageNet accuracy with `{python} mobilenetv2_mflops_str` M FLOPs vs. ResNet-50's 76% with `{python} resnet_gflops_str` B FLOPs (13.7 $\times$ fewer operations). Enables real-time inference on mobile devices.

The following subsections examine each principle in detail, beginning with how to scale model dimensions effectively.

#### Scaling Optimization {#sec-model-compression-scaling-optimization-6ad9}

Scaling a model's architecture involves balancing accuracy with computational cost and aligning it with the capabilities of the target hardware. The question is not simply "how big should the model be?" but rather "how should size be distributed across different dimensions?" Each component of a model—whether its depth, width, or input resolution—impacts resource consumption differently. In hardware-aware design, these dimensions should not only be optimized for accuracy but also for efficiency in memory usage, processing power, and energy consumption, especially when the model is deployed on specific hardware like GPUs, TPUs, or edge devices.

Different hardware platforms interact with scaling dimensions in distinct ways. Deeper models capture more complex representations, but excessive depth increases inference latency, training time, and memory consumption, problems that are particularly acute on resource-constrained platforms. Wider models process more information in parallel, benefiting GPUs and TPUs with high parallelism, but at the cost of increased memory usage. Higher input resolution provides finer details for tasks like image classification but exponentially increases computational costs, potentially overloading hardware memory or causing power inefficiencies on edge devices.

Mathematically, the total FLOPs for a convolutional model can be approximated as:
$$
\text{FLOPs} \propto d \cdot w^2 \cdot r^2,
$$
where $d$ is depth, $w$ is width, and $r$ is the input resolution. Increasing all three dimensions without considering the hardware limitations can result in suboptimal performance, especially on devices with limited computational power or memory bandwidth.

For efficient model scaling, managing these parameters in a balanced way becomes essential, ensuring that the model remains within hardware limits while maximizing performance. How do we find the right balance? Trial and error across three dimensions (depth $\times$ width $\times$ resolution) creates an enormous search space. This is where compound scaling\index{Compound Scaling!definition} comes into play. Instead of adjusting depth, width, and resolution independently, compound scaling balances all three dimensions together by applying fixed ratios $(\alpha, \beta, \gamma)$ relative to a base model:
$$
d = \alpha^\phi d_0, \quad w = \beta^\phi w_0, \quad r = \gamma^\phi r_0
$$
Here, $\phi$ is a scaling coefficient, and $\alpha$, $\beta$, and $\gamma$ are scaling factors determined based on hardware constraints and empirical data. This approach ensures that models grow in a way that optimizes hardware resource usage, keeping them efficient while improving accuracy.

\index{EfficientNet!compound scaling validation}
For example, the NAS-discovered **EfficientNet** (@sec-model-compression-neural-architecture-search-cf12) empirically validated this principle. Its search algorithm found that carefully balancing depth, width, and resolution via **compound scaling** yielded models that were both computationally efficient and high-performing, outperforming architectures that scaled dimensions arbitrarily. Compound scaling reduces computational cost while preserving accuracy, making it a key consideration for hardware-aware model design. This approach is particularly beneficial when deploying models on GPUs or TPUs, where parallelism can be fully leveraged, but memory and power usage need to be carefully managed. @sec-benchmarking examines performance evaluation methods for measuring these efficiency gains.

This principle extends beyond convolutional models to other architectures like transformers. Adjusting the number of layers, attention heads, or embedding dimensions impacts computational efficiency similarly. Hardware-aware scaling has become central to optimizing model performance across various computational constraints, especially when working with large models or resource-constrained devices.

#### Computation Reduction {#sec-model-compression-computation-reduction-13de}

Modern architectures leverage factorized computations to decompose complex operations into simpler components, reducing computational overhead while maintaining representational power. Standard convolutions apply filters uniformly across all spatial locations and channels, creating computational bottlenecks on resource-constrained hardware. Factorization techniques address this inefficiency by restructuring operations to minimize redundant computation.

Depthwise separable convolutions\index{Depthwise Separable Convolution}\index{Model Compression!depthwise separable convolution}, introduced in MobileNet, exemplify this approach by decomposing standard convolutions into two stages: depthwise convolution (applying separate filters to each input channel independently) and pointwise convolution (1 $\times$ 1 convolution mixing outputs across channels). The computational complexity of standard convolution with input size $h \times w$, $C_{\text{in}}$ input channels, and $C_{\text{out}}$ output channels is:
$$
\mathcal{O}(h w C_{\text{in}} C_{\text{out}} k^2)
$$
where $k$ is kernel size. Depthwise separable convolutions reduce this to:
$$
\mathcal{O}(h w C_{\text{in}} k^2) + \mathcal{O}(h w C_{\text{in}} C_{\text{out}})
$$
eliminating the $k^2$ factor from channel-mixing operations, achieving 5 $\times$-10 $\times$ FLOP reduction. This directly translates to reduced memory bandwidth requirements and improved inference latency on mobile and edge devices.

\index{Grouped Convolution!ResNeXt}
\index{Bottleneck Layers!ResNet dimensionality reduction}
\index{SqueezeNet!1x1 convolution parameter reduction}
Complementary factorization techniques extend these benefits. Grouped convolutions (ResNeXt) partition feature maps into independent groups processed separately before merging, maintaining accuracy while reducing redundant operations. Bottleneck layers (ResNet) apply 1 $\times$ 1 convolutions to reduce feature dimensionality before expensive operations, concentrating computation where it provides maximum value. Combined with sparsity and hardware-aware scheduling, these techniques maximize accelerator utilization across GPUs, TPUs, and specialized edge processors.

While reducing computation is essential, memory constraints often prove more limiting than compute capacity on resource-constrained devices. The next section addresses these memory bottlenecks directly.

#### Memory Optimization {#sec-model-compression-memory-optimization-8c43}

Memory optimization[^fn-memory-optimization] addresses performance bottlenecks arising when memory demands for activations, feature maps, and parameters exceed hardware capacity on resource-constrained devices. Modern architectures employ memory-efficient strategies to reduce storage requirements while maintaining performance, ensuring computational tractability and energy efficiency on GPUs, TPUs, and edge AI platforms.

[^fn-memory-optimization]: **Memory Optimization**: Techniques reducing peak memory during training and inference. DenseNet-121's feature reuse significantly reduces activation memory compared to ResNet-50; gradient checkpointing typically trades 15–25% additional compute for substantial memory reduction (up to 10 $\times$ for some architectures). In-place operations, memory pooling, and operator fusion further reduce footprint, enabling LLM inference on consumer GPUs.

\index{DenseNet!feature reuse}
One effective technique is feature reuse\index{Memory Optimization!feature reuse}, employed in DenseNet. In traditional convolutional networks, each layer computes a new set of feature maps, increasing the model's memory footprint. DenseNet reduces redundant activations by reusing feature maps from previous layers and selectively applying transformations, lowering memory requirements without sacrificing accuracy. In a standard convolutional network with $N_L$ layers, if each layer generates $k$ new feature maps, the total number of feature maps grows linearly:
$$
\mathcal{O}(N_L k)
$$

In contrast, DenseNet reuses feature maps from earlier layers, reducing the number of feature maps stored. This leads to improved parameter efficiency and a reduced memory footprint, which is important for hardware with limited memory resources.

Activation checkpointing\index{Activation Checkpointing}\index{Memory Optimization!activation checkpointing}[^fn-activation-checkpointing-compression] complements feature reuse by trading computation for memory during training. As established in @sec-model-training, this technique stores only a subset of forward-pass activations and recomputes the rest during backpropagation, reducing peak memory from $\mathcal{O}(A_{\text{total}})$ to $\mathcal{O}\big(\sqrt{A_{\text{total}}}\big)$. In the compression context, checkpointing enables training of larger models within fixed memory budgets, which in turn provides more capacity for subsequent pruning or distillation to exploit.

[^fn-activation-checkpointing-compression]: **Activation Checkpointing**: Also called gradient checkpointing. Reduces memory usage by 20--50% in large transformers with only 15--20% training time overhead by recomputing activations instead of storing them. See @sec-model-training for the full etymology and systems analysis.

Parameter reduction is another important technique. SqueezeNet, for instance, applies $1\times 1$ convolutions to reduce the number of input channels before applying standard convolutions, significantly reducing model size without compromising expressive power. The number of parameters in a standard convolutional layer is:
$$
\mathcal{O}(C_{\text{in}} C_{\text{out}} k^2)
$$

By reducing $C_{\text{in}}$ using $1\times 1$ convolutions, SqueezeNet[^fn-squeezenet] reduces the number of parameters, achieving a 50 $\times$ reduction in parameter count compared to AlexNet while maintaining similar performance. This method is well-suited for edge devices that have strict memory and storage constraints.

[^fn-squeezenet]: **SqueezeNet**: DeepScale/Berkeley architecture using fire modules (squeeze + expand layers) achieves AlexNet-level accuracy (57.5% top-1 ImageNet) with 50 $\times$ fewer parameters (1.25M vs 60M). Model size drops from 240 MB to ~5 MB uncompressed, enabling deployment on smartphones and embedded systems with limited storage.

Feature reuse, activation checkpointing, and parameter reduction form key components of hardware-aware model design, allowing models to fit within memory limits of modern accelerators while reducing power consumption through fewer memory accesses. Specialized accelerators like TPUs and GPUs leverage memory hierarchies, caching, and high bandwidth memory to efficiently handle sparse or reduced-memory representations, enabling faster inference with minimal overhead.

Beyond reducing what data must be stored, substantial efficiency gains emerge from optimizing how operations access memory. The next technique addresses this by combining multiple operations to reduce memory traffic.

```{python}
#| label: fusion-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ OPERATOR FUSION CALCULATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Operator Fusion section — Conv-BN-ReLU, GEMM, bandwidth analysis
# │
# │ Goal: Quantify the latency and bandwidth benefits of operator fusion.
# │ Show: The 3× reduction in kernel launches and memory traffic for standard layers.
# │ How: Model memory elimination and launch overhead for ResNet-50.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: conv_bn_relu_intermediate_mb_str, gemm_intermediate_mb_str,
# │          feat_map_kb_str, weights_mb_str, bn_params_kb_str,
# │          unfused_conv_mb_str, unfused_bn_mb_str, unfused_relu_mb_str,
# │          total_unfused_mb_str, total_fused_mb_str,
# │          bandwidth_reduction_pct_str, kernels_unfused_str,
# │          kernels_fused_str, saved_latency_ms_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES

# --- Inputs (Conv-BN-ReLU) ---
conv_channels_value = 256
conv_spatial_value = 28
bytes_per_element_value = 4

# GEMM
gemm_hidden_value = 768
gemm_seq_value = 512

# Memory Bandwidth Analysis (ResNet-50 layer)
# Feature map: 256 channels × 28 × 28 spatial × 4 bytes/element (FP32)
feat_map_mb_value = conv_channels_value * conv_spatial_value * conv_spatial_value * bytes_per_element_value / MILLION  # SI MB
weights_mb_value = 2.4
bn_params_mb_value = 0.002

# Kernel Launch
kernels_unfused_value = 159
kernels_fused_value = 53
latency_per_kernel_us_value = 10

# --- Process ---
# Conv-BN-ReLU intermediate
conv_bn_relu_intermediate_bytes = 2 * conv_channels_value * conv_spatial_value * conv_spatial_value * bytes_per_element_value
conv_bn_relu_intermediate_mb_value = conv_bn_relu_intermediate_bytes / (1024**2)

# GEMM intermediate
gemm_intermediate_bytes = gemm_hidden_value * gemm_seq_value * bytes_per_element_value
gemm_intermediate_mb_value = gemm_intermediate_bytes / (1024**2)

# Bandwidth Analysis
unfused_conv_mb_value = feat_map_mb_value * 2 + weights_mb_value
unfused_bn_mb_value = feat_map_mb_value * 2 + bn_params_mb_value
unfused_relu_mb_value = feat_map_mb_value * 2
total_unfused_mb_value = unfused_conv_mb_value + unfused_bn_mb_value + unfused_relu_mb_value

total_fused_mb_value = feat_map_mb_value * 2 + weights_mb_value
bandwidth_reduction_pct_value = (1 - total_fused_mb_value / total_unfused_mb_value) * 100

# Kernel Launch
saved_latency_us_value = (kernels_unfused_value - kernels_fused_value) * latency_per_kernel_us_value
saved_latency_ms_value = saved_latency_us_value / 1000

# V100 timing analysis (memory-bound)
v100_bw_gbs_local_value = v100_bw_gbs_value  # from earlier cell
unfused_time_us_value = total_unfused_mb_value / v100_bw_gbs_local_value * 1000  # MB / (GB/s) * 1000 = us
fused_time_us_value = total_fused_mb_value / v100_bw_gbs_local_value * 1000
fusion_speedup_value = unfused_time_us_value / fused_time_us_value

# --- Outputs (formatted strings for prose) ---
conv_bn_relu_intermediate_mb_str = fmt(conv_bn_relu_intermediate_mb_value, precision=1, commas=False)
gemm_intermediate_mb_str = fmt(gemm_intermediate_mb_value, precision=1, commas=False)

feat_map_kb_str = fmt(feat_map_mb_value * 1000, precision=0, commas=False)
weights_mb_str = fmt(weights_mb_value, precision=1, commas=False)
bn_params_kb_str = fmt(bn_params_mb_value * KIB_TO_BYTES, precision=0, commas=False)

unfused_conv_mb_str = fmt(unfused_conv_mb_value, precision=1, commas=False)
unfused_bn_mb_str = fmt(unfused_bn_mb_value, precision=1, commas=False)
unfused_relu_mb_str = fmt(unfused_relu_mb_value, precision=1, commas=False)
total_unfused_mb_str = fmt(total_unfused_mb_value, precision=1, commas=False)
total_fused_mb_str = fmt(total_fused_mb_value, precision=1, commas=False)
bandwidth_reduction_pct_str = fmt(bandwidth_reduction_pct_value, precision=0, commas=False)

kernels_unfused_str = fmt(kernels_unfused_value, precision=0, commas=False)
kernels_fused_str = fmt(kernels_fused_value, precision=0, commas=False)
saved_latency_ms_str = fmt(saved_latency_ms_value, precision=0, commas=False)
unfused_time_us_str = fmt(unfused_time_us_value, precision=0, commas=False)
fused_time_us_str = fmt(fused_time_us_value, precision=1, commas=False)
fusion_speedup_str = fmt(fusion_speedup_value, precision=2, commas=False)
```

#### Operator Fusion {#sec-model-compression-operator-fusion-ac1d}

\index{Operator Fusion!GEMM fusion}
Operator fusion\index{Operator Fusion!definition}\index{Architectural Efficiency!operator fusion} combines multiple computational operations into single fused kernels, reducing intermediate memory traffic and kernel launch overhead. To understand why this matters, consider a typical neural network layer: convolution followed by batch normalization followed by ReLU. Without fusion, each operation writes its output to GPU global memory, then the next operation reads that output back. This results in three memory round-trips for what could be computed entirely in fast on-chip registers. This graph-level optimization technique addresses a pervasive inefficiency in deep learning execution: sequential operations that write intermediate results to memory only to immediately read them back for the next operation. By fusing operations, inference engines eliminate these redundant memory transactions, improving both throughput and latency on memory-bound workloads.

Modern neural networks consist of sequences of operations such as convolution, batch normalization, activation functions, and element-wise operations. When executed independently, each operation requires four steps:

1. Loading input tensors from global memory
2. Performing computation
3. Writing output tensors back to global memory
4. Launching the next kernel

This pattern creates memory bandwidth bottlenecks for operations with low arithmetic intensity (FLOPs/byte accessed). The memory traffic for $N$ unfused operations operating on tensors of size $M$ bytes is:
$$
\text{Memory}_{\text{unfused}} = 2NM
$$
where each operation reads ($M$ bytes) and writes ($M$ bytes) intermediate results. Operator fusion reduces this to:
$$
\text{Memory}_{\text{fused}} = 2M
$$
by reading inputs once, computing all operations in sequence, and writing final outputs once.

Common fusion patterns in neural network inference optimize specific operation sequences that appear repeatedly in modern architectures:

##### Convolution-BatchNorm-ReLU Fusion {.unnumbered}

\index{Operator Fusion!Conv-BN-ReLU}This ubiquitous pattern appears in nearly every modern CNN architecture. @lst-conv-bn-relu-fusion shows how fusion reduces three memory round-trips to a single kernel launch:

::: {#lst-conv-bn-relu-fusion lst-cap="**Conv-BN-ReLU Fusion**: Combining three operations into a single kernel reduces memory traffic from 6 transfers to 2, eliminating intermediate memory writes."}
```{.python}
# === UNFUSED: 3 kernel launches, 6 memory transfers ===
conv_out = conv2d(input, weight)
bn_out = batch_norm(conv_out, ...)
relu_out = relu(bn_out)


# === FUSED: 1 kernel launch, 2 memory transfers ===
def conv_bn_relu_fused(input, weight, gamma, beta, mean, var):
    # Read input and weight once
    conv = conv2d(input, weight)

    # Apply batch norm in registers (no memory write)
    bn = gamma * (conv - mean) / sqrt(var + eps) + beta

    # Apply ReLU in registers (no memory write)
    output = max(bn, 0)

    # Write final result once
    return output
```
:::

```{python}
#| label: conv-fusion-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CONV-BN-RELU FUSION TRANSFER REDUCTION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Operator fusion prose, after @lst-conv-bn-relu-fusion
# │
# │ Goal: Demonstrate the memory traffic reduction from Conv-BN-ReLU fusion.
# │ Show: A 3× reduction in data movement (6 transfers to 2).
# │ How: Calculate memory read/write counts for fused vs. unfused execution.
# │
# │ Imports: mlsys.formatting (fmt, md_math)
# │ Exports: transfer_reduction_str, conv_bn_relu_mem_md
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check, md_math

# --- Inputs (transfer counts) ---
unfused_transfers_value = 6  # read/write for each of conv, BN, ReLU
fused_transfers_value = 2    # read input, write output

# --- Process ---
transfer_reduction_value = unfused_transfers_value / fused_transfers_value

# --- Outputs (formatted strings for prose) ---
transfer_reduction_str = fmt(transfer_reduction_value, precision=0, commas=False)
conv_bn_relu_mem_md = md_math(f"2 \\times 256 \\times 28 \\times 28 \\times 4 \\text{{ bytes}} \\approx \\text{{{conv_bn_relu_intermediate_mb_str} MB}}")
```

The arithmetic operations remain identical, but memory traffic drops from 6 transfers to 2 transfers (`{python} transfer_reduction_str` $\times$ reduction). For a ResNet-50 layer with 256 channels and spatial size $28 \times 28$, this eliminates `{python} conv_bn_relu_mem_md` of intermediate memory traffic per layer.

\index{Operator Fusion!attention (FlashAttention)}
The same principle extends beyond CNNs. GEMM-bias-activation fusion eliminates intermediate writes in transformer linear layers by computing element-wise operations in registers immediately after each matrix multiplication output element. Attention tiling, as in FlashAttention[^fn-flashattention-compression], reduces HBM traffic from $O(n^2)$ to $O(n)$ for long-context transformers by processing attention in SRAM-sized tiles rather than materializing the full $n \times n$ attention matrix, as detailed in @sec-model-training-flash-attention-ioaware-attention-optimization-3da0.

[^fn-flashattention-compression]: FlashAttention\index{FlashAttention} (introduced in @sec-network-architectures) demonstrates fusion's power for memory-bound attention: tiling to SRAM yields 2–4 $\times$ speedup and enables 64K context windows by reducing memory traffic from O(n^2) to O(n). This exemplifies how operator fusion can transform memory-bound bottlenecks into compute-bound operations.

Memory bandwidth analysis quantifies these fusion benefits concretely. Consider a Conv-BN-ReLU sequence operating on a $28 \times 28 \times 256$ feature map (`{python} feat_map_kb_str` KB). Without fusion, each operation performs its own memory round-trip: Conv reads input (`{python} feat_map_kb_str` KB) plus weights (`{python} weights_mb_str` MB) and writes output (`{python} feat_map_kb_str` KB), totaling `{python} unfused_conv_mb_str` MB. BN then reads that output, adds its parameters (`{python} bn_params_kb_str` KB), and writes again, for `{python} unfused_bn_mb_str` MB. ReLU repeats the pattern for another `{python} unfused_relu_mb_str` MB. The total unfused memory traffic is `{python} total_unfused_mb_str` MB. With fusion, the entire sequence reads input and weights once and writes the final output once, requiring only `{python} total_fused_mb_str` MB — a `{python} bandwidth_reduction_pct_str`% bandwidth reduction.

On a V100 GPU with `{python} v100_bw_gbs_str` GB/s HBM bandwidth and `{python} v100_tflops_fp32_str` TFLOPS FP32 compute, the unfused sequence takes approximately `{python} unfused_time_us_str` microseconds (memory-bound), while the fused version takes approximately `{python} fused_time_us_str` microseconds (`{python} fusion_speedup_str` $\times$ speedup). The speedup comes entirely from reducing memory traffic, as the compute remains identical.

Fusion effectiveness varies by workload characteristics. Memory-bound operations benefit most, while compute-bound operations see minimal improvement:

- **Element-wise operations**: 2–4 $\times$ speedup (highly memory-bound, low arithmetic intensity)
- **Conv-BN-Act patterns**: 1.5–2 $\times$ speedup (mixed memory/compute characteristics)
- **GEMM-based operations**: 1.2–1.5 $\times$ speedup (compute-bound, fusion reduces memory-bound tail)
- **Attention mechanisms**: 2–4 $\times$ speedup on long sequences (quadratic memory scaling)

Fusion also reduces kernel launch overhead. Each CUDA kernel launch incurs approximately 5-10 microseconds of latency. For a ResNet-50 with 53 convolutional layers, unfused execution launches `{python} kernels_unfused_str` kernels (Conv + BN + ReLU), while fused execution launches `{python} kernels_fused_str` kernels, saving approximately `{python} saved_latency_ms_str` millisecond from launch overhead alone.

\index{ML Compiler!framework-level fusion}\index{ML Compiler!graph-level fusion}
Fusion implementation spans the software stack, from framework-level pattern matching (PyTorch's TorchScript, TensorFlow's Grappler) through compiler-level optimization (XLA, TVM, TensorRT) to runtime fusion that adapts to input shapes and hardware characteristics. @sec-hardware-acceleration examines the compiler and hardware dimensions of fusion in detail, including register pressure constraints, graph pattern matching strategies, and platform-specific trade-offs across GPU, TPU, and edge accelerators.

Operator fusion optimizes how operations execute by reducing memory traffic between fixed computational steps. A complementary approach asks a different question: must we execute all computational steps at all? This leads to adaptive computation methods that vary the amount of work performed based on input characteristics.

### Adaptive Computation Methods {#sec-model-compression-adaptive-computation-methods-d164}

\index{Adaptive Inference!definition}
The preceding techniques (hardware-aware design and operator fusion) optimize models uniformly: every input receives the same computational treatment regardless of its complexity. Consider image classification: a photo of a cat against a plain white background requires less analysis than a cat partially hidden in a cluttered room. Adaptive computation methods challenge the uniform-computation assumption by allowing models to vary the amount of work performed on a per-input basis. This flexibility enables significant efficiency gains in practice, as many real-world inputs are simple enough to classify correctly with only a fraction of the full network. The following subsections examine dynamic schemes that adjust computation at inference time and conditional execution strategies that selectively activate model components.

#### Dynamic Schemes {#sec-model-compression-dynamic-schemes-e9ff}

If some inputs are simpler than others, why should all inputs receive the same computational budget? Dynamic schemes answer this question by modifying the computation graph at inference time, enabling models to allocate resources proportionally to input difficulty. These approaches range from early exit architectures that terminate processing when confidence is sufficient to mixture-of-experts models that activate only relevant subnetworks. Each strategy offers a distinct mechanism for reducing average-case computation while preserving worst-case accuracy.

##### Early Exit Architectures {#sec-model-compression-early-exit-architectures-eeb9}

While operator fusion optimizes memory access patterns for fixed computation graphs, adaptive computation\index{Adaptive Computation!early exit} methods address a different inefficiency: conventional models apply uniform processing to all inputs regardless of complexity, wasting resources on simple cases. Dynamic computation allows models to skip layers or operations for simple inputs while engaging deeper networks for complex cases, optimizing efficiency, energy consumption, and latency while preserving predictive performance. This capability is essential for resource-constrained hardware in mobile devices, embedded systems, and autonomous vehicles where real-time processing is critical.

Early exit architectures\index{Early Exit Architectures!definition} allow a model to make predictions at intermediate points in the network rather than completing the full forward pass for every input. This approach is effective for real-time applications and energy-efficient inference, as it enables selective computation based on the complexity of individual inputs [@teerapittayanon2016branchynet].

The core mechanism in early exit architectures involves multiple exit points embedded within the network. Simpler inputs, which can be classified with high confidence early in the model, exit at an intermediate layer, reducing unnecessary computations. Conversely, more complex inputs continue processing through deeper layers to ensure accuracy.

A well-known example is BranchyNet\index{BranchyNet}\index{Early Exit Architectures!BranchyNet}[^fn-branchynet], which introduces multiple exit points throughout the network. For each input, the model evaluates intermediate predictions using confidence thresholds. If the prediction confidence exceeds a predefined threshold at an exit point, the model terminates further computations and outputs the result. Otherwise, it continues processing until the final layer [@teerapittayanon2016branchynet]. This approach minimizes inference time without compromising performance on challenging inputs.

[^fn-branchynet]: **BranchyNet**: Pioneered adaptive inference with early exit branches at multiple network depths, achieving significant speedups on image classification tasks. Reduces average inference time substantially for simple inputs while maintaining full computation for complex cases, enabling real-time processing on mobile devices. The exact speedup depends on input distribution and confidence thresholds.

Another example is multi-exit vision transformers, which extend early exits to transformer-based architectures. These models use lightweight classifiers at various transformer layers, allowing predictions to be generated early when possible [@scardapane2020should]. This technique significantly reduces inference time while maintaining robust performance for complex samples.

Early exit models are advantageous for resource-constrained devices, such as mobile processors and edge accelerators. By dynamically adjusting computational effort, these architectures reduce power consumption and processing latency, making them ideal for real-time decision-making [@hu2021triple].

When deployed on hardware accelerators such as GPUs and TPUs, early exit architectures can be further optimized by exploiting parallelism. For instance, different exit paths can be evaluated concurrently, thereby improving throughput while preserving the benefits of adaptive computation [@chen2024eellm]. This approach is illustrated in @fig-early-exit-transformers, where each transformer layer is followed by a classifier and an optional early exit mechanism based on confidence estimation or latency-to-accuracy trade-offs (LTE). At each stage, the system may choose to exit early if sufficient confidence is achieved, or continue processing through deeper layers, enabling dynamic allocation of computational resources.

::: {#fig-early-exit-transformers fig-env="figure" fig-pos="htb" fig-cap="**Early Exit Architecture**: Transformer layers dynamically adjust computation by classifying each layer's output and enabling early termination if sufficient confidence is reached, reducing latency and power consumption for resource-constrained devices. This approach allows for parallel evaluation of different exit paths, improving throughput on hardware accelerators like GPUs and TPUs. Source: [@xin-etal-2021-berxit]." fig-alt="Flowchart with input feeding n transformer layers in sequence. Each layer connects to a classifier, confidence estimator, and exit point. Arrows show continue paths for low confidence."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=2pt,
    node distance=1.3,
    draw=GreenLine,
    line width=0.75pt,
    fill=GreenL,
    text width=25mm,align=flush center,
    minimum width=25mm, minimum height=9mm
  },
}

\node[Box, ellipse,text width=14mm,minimum width=12mm,
             minimum height=11mm, fill=RedL,draw=RedLine](B1){Input};
\node[Box,right=of B1](B2){Transformer 1};
\node[Box,right=of B2](B3){Transformer 2};
\node[Box, node distance=2.5,right=of B3](B4){Transformer n};
\node[font=\tiny](B0)at($(B3)!0.5!(B4)$){$\bullet$ $\bullet$ $\bullet$};
%
\def\di{0.55}
\node[Box,node distance=\di,below=of B2,fill=VioletL2,draw=VioletLine](C1){Classifier 1};
\node[Box,node distance=\di,below=of C1,fill=BlueL,draw=BlueLine](C2){Confidence / LTE};
\node[Box,node distance=\di,below=of C2,ellipse,text width=14mm,minimum width=12mm,
             minimum height=11mm, fill=RedL,draw=RedLine](C3){Exit};
%
\node[Box,node distance=\di,below=of B3,fill=VioletL2,draw=VioletLine](2C1){Classifier 2};
\node[Box,node distance=\di,below=of 2C1,fill=BlueL,draw=BlueLine](2C2){Confidence / LTE};
\node[Box,node distance=\di,below=of 2C2,ellipse,text width=14mm,minimum width=12mm,
             minimum height=11mm, fill=RedL,draw=RedLine](2C3){Exit};
%
\node[Box,node distance=\di,below=of B4,fill=VioletL2,draw=VioletLine](3C1){Classifier n};
\node[Box,node distance=\di,below=of 3C1,fill=BlueL,draw=BlueLine](3C2){Confidence / LTE};
\node[Box,node distance=\di,below=of 3C2,ellipse,text width=14mm,minimum width=12mm,
             minimum height=11mm, fill=RedL,draw=RedLine](3C3){Exit};
%
\node[font=\tiny](2B0)at($(2C1)!0.5!(3C1)$){$\bullet$ $\bullet$ $\bullet$};
\node[font=\tiny](3B0)at($(2C2)!0.5!(3C2)$){$\bullet$ $\bullet$ $\bullet$};
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)--(B3);
\draw[Line,-latex](B3)--(B0);
\draw[Line,-latex](B0)--(B4);
%
\draw[Line,-latex](B2)--(C1);
\draw[Line,-latex](C1)--(C2);
\draw[Line,-latex](C2)--(C3);
%
\draw[Line,-latex](B3)--(2C1);
\draw[Line,-latex](2C1)--(2C2);
\draw[Line,-latex](2C2)--(2C3);
%
\draw[Line,-latex](B4)--(3C1);
\draw[Line,-latex](3C1)--(3C2);
\draw[Line,-latex](3C2)--(3C3);
%
\draw[Line,-latex](C2)-|node[left=6pt,pos=0.76,rotate=90]{Continue}($(B2)!0.5!(B3)$);
\draw[Line,-latex](2C2.east)-|node[right=9pt,pos=0.35,rotate=90]{Continue}($(B3)!0.68!(B0)$);
\end{tikzpicture}
```
:::

##### Conditional Computation {#sec-model-compression-conditional-computation-3087}

Conditional computation\index{Conditional Computation!definition}\index{Adaptive Computation!conditional} refers to the ability of a neural network to decide which parts of the model to activate based on the input, thereby reducing unnecessary computation. This approach can be highly beneficial in resource-constrained environments, such as mobile devices or real-time systems, where reducing the number of operations directly translates to lower computational cost, power consumption, and inference latency [@bengio2015conditional].

Where early exit architectures make a single exit-or-continue decision at each layer, conditional computation dynamically selects which layers, units, or paths to activate based on input characteristics. Mechanisms such as gating functions or dynamic routing "turn off" parts of the network that are unnecessary for a particular input, focusing computational resources where they are most needed.

One example of conditional computation is SkipNet\index{SkipNet}\index{Conditional Computation!SkipNet}, which uses a gating mechanism to skip layers in a CNN when the input is deemed simple enough. The gating mechanism uses a lightweight classifier to predict if the layer should be skipped. This prediction is made based on the input, and the model adjusts the number of layers used during inference accordingly [@wang2018skipnet]. If the gating function determines that the input is simple, certain layers are bypassed, resulting in faster inference. However, for more complex inputs, the model uses the full depth of the network to achieve the necessary accuracy.

\index{Capsule Networks!dynamic routing}
Another example is Dynamic Routing Networks, such as in the Capsule Networks (CapsNets), where routing mechanisms dynamically choose the path that activations take through the network. In these networks, the decision-making process involves selecting specific pathways for information flow based on the input's complexity, which can significantly reduce the number of operations and computations required [@sabour2017dynamic]. This mechanism introduces adaptability by using different routing strategies, providing computational efficiency while preserving the quality of predictions.

These strategies offer significant advantages in real-world applications with limited computational resources. In autonomous driving, for example, the system must process inputs of varying complexity: straightforward lane markings can follow a simpler path, while detecting obstacles or performing detailed scene understanding requires the full model capacity. Conditional computation adapts to input complexity in real time, improving both speed and efficiency [@seo2023neuroflow].

##### Gate-Based Computation {#sec-model-compression-gatebased-computation-cf6b}

Gate-based conditional computation introduces learned gating mechanisms that dynamically control which parts of a neural network are activated based on input complexity. Unlike static architectures that process all inputs with the same computational effort, this approach enables dynamic activation of sub-networks or layers by learning decision boundaries during training [@shazeer2017outrageously].

Gating mechanisms are typically implemented using binary or continuous gating functions, wherein a lightweight control module (often called a router or gating network) predicts whether a particular layer or path should be executed. This decision-making occurs dynamically at inference time, allowing the model to allocate computational resources adaptively.

A well-known example of this paradigm is the Dynamic Filter Network (DFN)\index{Dynamic Computation!filter networks}, which applies input-dependent filtering by selecting different convolutional kernels at runtime. DFN reduces unnecessary computation by avoiding uniform filter application across inputs, tailoring its computations based on input complexity [@jia2016dynamic].

Another widely adopted strategy is the Mixture of Experts (MoE)\index{Mixture of Experts (MoE)!definition} framework. The key insight is that different inputs may benefit from different types of processing. A question about mathematics and a question about history might best be handled by different "specialist" subnetworks. In this architecture, a gating network\index{Mixture of Experts (MoE)!gating network} selects a subset of specialized expert subnetworks to process each input [@shazeer2017outrageously]. This allows only a small portion of the total model to be active for any given input, significantly improving computational efficiency without sacrificing model capacity. A notable instantiation of this idea is Google's Switch Transformer\index{Switch Transformer}\index{Mixture of Experts (MoE)!Switch Transformer}[^fn-switch-transformer], which extends the transformer architecture with expert-based conditional computation [@fedus2021switch]. While we introduce MoE principles here for single-system context, large-scale MoE deployments involving distributed expert placement are explored in advanced coverage of large-scale systems.

[^fn-switch-transformer]: **Switch Transformer**: Scales to 1.6 trillion parameters while activating only 2 billion per token, achieving 7 $\times$ faster pretraining than dense T5 at equivalent FLOPs. Routes each token to a single expert (vs. top-k), reducing communication overhead. Training instability from load imbalance requires auxiliary loss terms; production deployments use capacity factors of 1.25–2 $\times$ to handle routing variance.

Look at the right side of @fig-switch-transformer to see this routing in action: the Switch Transformer replaces the traditional feedforward layer with a Switching FFN Layer. For each token, a lightweight router selects a single expert from a pool of feedforward networks. The router outputs a probability distribution over available experts, and the highest-probability expert is activated per token. This design enables large models to scale parameter count without proportionally increasing inference cost.

::: {#fig-switch-transformer fig-env="figure" fig-pos="htb" fig-cap="**Conditional Computation**: Switch transformers enhance efficiency by dynamically routing tokens to specialized expert subnetworks, enabling parallel processing and reducing the computational load per input. This architecture implements a form of mixture of experts where a gating network selects which experts process each token, allowing for increased model capacity without a proportional increase in computation. *Source: [@fedus2021switch]*." fig-alt="Two-part diagram. Left shows Switch Transformer block with self-attention, add-normalize, switching FFN layer, and add-normalize. Right shows expanded view with router selecting one of four FFN experts per token based on probability."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}]

\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=2pt,
    node distance=0.8,
    draw=GreenLine,
    line width=0.75pt,
    fill=GreenL,
    text width=75mm,align=flush center,
    minimum width=75mm, minimum height=8mm
  },
Box2/.style={inner xsep=2pt,
    node distance=0.15,
    draw=VioletLine,
    line width=0.75pt,
    fill=VioletL2,
    text width=10mm,align=flush center,
    minimum width=10mm, minimum height=7mm
  },
Box3/.style={inner xsep=2pt,
    node distance=0.8,
    draw=VioletLine,
    line width=0.75pt,
    fill=VioletL2,
    text width=33mm,align=flush center,
    minimum width=33mm, minimum height=8mm
  },
do path picture/.style={%
    path picture={%
      \pgfpointdiff{\pgfpointanchor{path picture bounding box}{south west}}%
        {\pgfpointanchor{path picture bounding box}{north east}}%
      \pgfgetlastxy\x\y%
      \tikzset{x=\x/2,y=\y/2}%
      #1
    }
  },
  cross/.style={do path picture={
    \draw [line cap=round] (-1,-1) -- (1,1) (-1,1) -- (1,-1);
  }},
plus/.style={do path picture={
    \draw [line cap=round] (-3/4,0) -- (3/4,0) (0,-3/4) -- (0,3/4);
  }}
}

\node[Box,fill=RedL,draw=RedLine](P1){Self-Attention};
\node[Box,above=0.5 of P1,fill=BrownL,draw=BrownLine](P2){Add + Normalize};
\node[Box,node distance=6.6,above=of P2,fill=BrownL,draw=BrownLine](P3){Add + Normalize};
\draw[Line,-latex](P2.172)coordinate(DPR1)--++(90:1)coordinate(PR1);
\draw[Line,-latex](P2.8)coordinate(DPR2)--++(90:1)coordinate(PR2);
\draw[Line,-latex](P3.172)coordinate(DAN1)--++(90:0.5)coordinate(AN1);
\draw[Line,-latex](P3.8)coordinate(DAN2)--++(90:0.5)coordinate(AN2);
\draw[Line,-latex](P1.172)--(P1.172|-P2.south);
\draw[Line,-latex](P1.8)--(P1.8|-P2.south);
%%%Router-1
\begin{scope}[local bounding box=R1,line width=0.5pt,shift={($(PR1)+(-0.4,0.2)$)}]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{0.7}
\newcommand{\Width}{0.4}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[fill=GreenL] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[fill=GreenL] (C2) -- (B2) -- (F2) -- (G2) -- (C2);% Front Face
\draw[fill=GreenL] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
%
\node[]at($(B2)!0.5!(G2)$){Router};

\begin{scope}[local bounding box=BB1,line width=0.5pt,inner sep=3.6pt]
\def\dx{0.25}
\def\dy{0.5}
\def\dz{0.2}
% bottom-left corner coordinate (bar origin)
\def\x{0}
\def\y{0.21}
\def\z{0}

% boje
%\filldraw[fill=blue!30, draw=black] (\x,\y,\z) -- (\x,\y,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z) -- cycle; % leva strana
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
%\filldraw[fill=blue!20, draw=black] (\x,\y,\z) -- (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y,\z) -- cycle; % donja strana
%\filldraw[fill=blue!40, draw=black] (\x,\y,\z) -- (\x+\dx,\y,\z) -- (\x+\dx,\y+\dy,\z) -- (\x,\y+\dy,\z) -- cycle; % zadnja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}

\begin{scope}[local bounding box=BB1,shift={(0.25,0)}]
\def\dx{0.25}
\def\dy{1.0}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z)coordinate(NB1) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\begin{scope}[local bounding box=BB1,shift={(0.5,0)}]
\def\dx{0.25}
\def\dy{0.2}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
%
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\begin{scope}[local bounding box=BB1,shift={(0.75,0)}]
\def\dx{0.25}
\def\dy{0.6}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
%
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\end{scope}
%%%%%%%%%%%%%%%%%%
%%%Router-2
\begin{scope}[local bounding box=R1,line width=0.5pt,shift={($(PR2)+(-0.4,0.2)$)}]
\newcommand{\Depth}{1.3}
\newcommand{\Height}{0.7}
\newcommand{\Width}{0.4}
\coordinate (O2) at (0,0,0);
\coordinate (A2) at (0,\Width,0);
\coordinate (B2) at (0,\Width,\Height);
\coordinate (C2) at (0,0,\Height);
\coordinate (D2) at (\Depth,0,0);
\coordinate (E2) at (\Depth,\Width,0);
\coordinate (F2) at (\Depth,\Width,\Height);
\coordinate (G2) at (\Depth,0,\Height);

\draw[fill=GreenL] (D2) -- (E2) -- (F2) -- (G2) -- cycle;% Right Face
\draw[fill=GreenL] (C2) -- (B2) -- (F2) -- (G2) -- (C2);% Front Face
\draw[fill=GreenL] (A2) -- (B2) -- (F2) -- (E2) -- cycle;% Top Face
%
\node[]at($(B2)!0.5!(G2)$){Router};

\begin{scope}[local bounding box=BB1,line width=0.5pt,inner sep=3.6pt]
\def\dx{0.25}
\def\dy{1.1}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}

%
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) coordinate(NB2)-- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}

\begin{scope}[local bounding box=BB1,shift={(0.25,0)}]
\def\dx{0.25}
\def\dy{0.6}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\begin{scope}[local bounding box=BB1,shift={(0.5,0)}]
\def\dx{0.25}
\def\dy{0.3}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
%
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\begin{scope}[local bounding box=BB1,shift={(0.75,0)}]
\def\dx{0.25}
\def\dy{0.15}
\def\dz{0.2}
%
\def\x{0}
\def\y{0.21}
\def\z{0}
%
\filldraw[fill=red!10, draw=black] (\x,\y+\dy,\z) -- (\x,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % gornja strana
\filldraw[fill=red!50, draw=black] (\x+\dx,\y,\z) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x+\dx,\y+\dy,\z) -- cycle; % desna strana
\filldraw[fill=red!60, draw=black] (\x,\y,\z+\dz) -- (\x+\dx,\y,\z+\dz) -- (\x+\dx,\y+\dy,\z+\dz) -- (\x,\y+\dy,\z+\dz) -- cycle; % prednja strana
\end{scope}
\end{scope}
%%%%%%%%%%%%%%%%%
%FFN left
\begin{scope}[local bounding box=SFFN1,line width=0.5pt,shift={($(NB1)+(-1.85,1.55)$)}]
\node[Box2](FFN1){FFN1};
\node[Box2,right=of FFN1,line width=1.5pt](FFN2){FFN2};
\node[Box2,right=of FFN2](FFN3){FFN3};
\node[Box2,right=of FFN3](FFN4){FFN4};
\end{scope}
%FFN right
\begin{scope}[local bounding box=SFFN2,line width=0.5pt,shift={($(SFFN1)+(3.9,0)$)}]
\node[Box2,line width=1.5pt](2FFN1){FFN1};
\node[Box2,right=of 2FFN1](2FFN2){FFN2};
\node[Box2,right=of 2FFN2](2FFN3){FFN3};
\node[Box2,right=of 2FFN3](2FFN4){FFN4};
\end{scope}
\node[draw,circle,line width=0.75pt, above=1.2 of $(FFN2)!0.5!(FFN3)$,cross,minimum width=6mm](CI1){};
\node[draw,circle,line width=0.75pt, above=1.2 of $(2FFN2)!0.5!(2FFN3)$,cross,minimum width=6mm](CI2){};
%
\draw[Line,-latex](NB1)--++(90:0.5)-|(FFN2);
\draw[Line,-latex](NB2)--++(90:0.5)-|(2FFN1);
\draw[Line,-latex,dashed,rounded corners=8pt](NB1)--
node[below,pos=0.5]{$p = 0.65$}++(180:2.7)|-(CI1.west);
\draw[Line,-latex,dashed,rounded corners=8pt](NB2)--
node[below,pos=0.5]{$p = 0.8$}++(0:3.2)|-(CI2.east);
\draw[Line,-latex](CI1)--(CI1|-P3.south);
\draw[Line,-latex](CI2)--(CI2|-P3.south);
\draw[Line,-latex](FFN2)--++(90:0.7)-|(CI1);
\draw[Line,-latex](2FFN1)--++(90:0.7)-|(CI2);
%%%
%fitting
\scoped[on background layer]
\node[draw=BackLine,inner xsep=7mm,inner ysep=3mm,
yshift=0mm,fill=BackColor!70,fit=(FFN1)(2FFN4)(CI2)(R1),line width=0.75pt](GBB2){};
%%below Router to Add +Normalize
\draw[Line,-latex]($(DPR1)!0.25!(PR1)$)--++(180:3.8)|-(P3);
\draw[Line,-latex]($(DPR2)!0.25!(PR2)$)--++(0:3.8)|-(P3);
%%%Above Add + Normalize
\begin{scope}[local bounding box=Y1,line width=0.5pt,shift={($(AN1)+(-1.2,0)$)}]
\def\side{0.4}
\foreach \i/\col in {0/white,1/green!40,2/white,3/green!40,4/white,5/green!40}{
    \draw[fill=\col,thick] (\i*\side,0) rectangle ++(\side,\side);
}
\end{scope}
\node[left=2pt of Y1]{$y_1$};
\begin{scope}[local bounding box=Y2,line width=0.5pt,shift={($(AN2)+(-1.2,0)$)}]
\def\side{0.4}
\foreach \i/\col in {0/white,1/green!40,2/white,3/green!40,4/white,5/green!40}{
    \draw[fill=\col,thick] (\i*\side,0) rectangle ++(\side,\side);
}
\end{scope}
\node[left=2pt of Y2]{$y_2$};
%%below Self-Attention
\draw[Line,latex-](P1.188)coordinate(GSA1)--++(270:0.7)coordinate(SA1);
\node[draw,circle,line width=0.75pt, below=0 of SA1,cross,minimum width=6mm](CI3){};
\draw[Line,latex-](P1.352)coordinate(GSA2)--++(270:0.7)coordinate(SA2);
\node[draw,circle,line width=0.75pt, below=0 of SA2,cross,minimum width=6mm](CI4){};
\draw[Line,latex-](CI3.south)coordinate(GDCI3)--++(270:0.7)coordinate(DCI3);
\draw[Line,latex-](CI4.south)coordinate(GDCI4)--++(270:0.7)coordinate(DCI4);%
%
\node[left=2pt of CI3,align=center]{Positional\\ embedding};
\node[left=2pt of CI4,align=center]{Positional\\ embedding};
%
\begin{scope}[local bounding box=X1,line width=0.5pt,shift={($(DCI3)+(-1.2,-0.4)$)}]
\def\side{0.4}
\foreach \i/\col in {0/white,1/green!40,2/white,3/green!40,4/white,5/green!40}{
    \draw[fill=\col,thick] (\i*\side,0) rectangle ++(\side,\side);
}
\end{scope}
\node[left=2pt of X1]{$x_1$};
\node[below=2pt of X1]{More};
\begin{scope}[local bounding box=X2,line width=0.5pt,shift={($(DCI4)+(-1.2,-0.4)$)}]
\def\side{0.4}
\foreach \i/\col in {0/white,1/green!40,2/white,3/green!40,4/white,5/green!40}{
    \draw[fill=\col,thick] (\i*\side,0) rectangle ++(\side,\side);
}
\end{scope}
\node[left=2pt of X2]{$x_2$};
\node[below=2pt of X2]{Parameters};
%
\draw[Line,-latex]($(GSA1)!0.5!(SA1)$)--++(180:3.8)|-(P2);
\draw[Line,-latex]($(GSA2)!0.5!(SA2)$)--++(0:3.8)|-(P2);
%%%%%%%%%%%%
%left diagram
\begin{scope}[local bounding box=LD,line width=0.5pt,shift={(-12,1.8)}]
\node[Box3,fill=RedL,draw=RedLine](2P1){Self-Attention};
\node[Box3,above=of 2P1,fill=BrownL,draw=BrownLine](2P2){Add + Normalize};
\node[Box3,above=of 2P2,fill=BackColor,draw=BackLine](2P3){Switching FFN Layer};
\node[Box3,above=of 2P3,fill=BrownL,draw=BrownLine](2P4){Add + Normalize};
%
\draw[Line,-latex](2P1)--(2P2);
\draw[Line,-latex](2P2)--(2P3);
\draw[Line,-latex](2P3)--(2P4);
\draw[Line,-latex](2P4)--++(90:1)node[above]{$y$};
\draw[Line,latex-](2P1)--++(270:1)node[below]{$x$};
\draw[blue,dashed,thick]($(2P1.south east)+(0.2,-0.2)$)--++(310:5);
\draw[blue,dashed,thick]($(2P4.north east)+(0.2,-0.2)$)--++(40:5);
\end{scope}
\end{tikzpicture}
```
:::

Gate-based conditional computation is effective for multi-task and transfer learning settings where inputs may benefit from specialized processing pathways. By enabling fine-grained control over model execution, such mechanisms allow for adaptive specialization across tasks while maintaining efficiency.

However, these benefits come at the cost of increased architectural complexity. The routing and gating operations themselves introduce additional overhead, both in terms of latency and memory access. Efficient deployment on hardware accelerators such as GPUs, TPUs, or edge devices requires careful attention to the scheduling and batching of expert activations [@lepikhin2020gshard].

##### Adaptive Inference {#sec-model-compression-adaptive-inference-9f20}

Early exit and conditional computation represent discrete choices: exit or continue, activate this expert or that one. Adaptive inference pushes this flexibility further by *continuously* modulating computational depth and resource allocation based on real-time confidence and task complexity [@yang2020resolution]. Rather than predefined exit points or discrete layer skipping, adaptive inference treats computation as a dial that can be turned up or down based on intermediate assessments of the input.

Fast Neural Networks (FNNs) exemplify this approach, adjusting the number of active layers based on real-time complexity estimation. If an input is straightforward, only a subset of layers is activated; if early layers produce low-confidence outputs, additional layers refine the prediction [@wu2019fast]. A related approach, dynamic layer scaling, progressively increases computational depth based on uncertainty estimates, useful for fine-grained classification tasks where some inputs require only coarse-grained processing while others need deeper feature extraction [@wang2021glam].

Adaptive inference excels in latency-sensitive applications where resource constraints fluctuate dynamically. In autonomous systems, for example, lane detection may require minimal computation while multi-object tracking in dense environments demands additional processing power. On hardware accelerators such as GPUs and TPUs, adaptive inference leverages parallel processing capabilities by distributing workloads dynamically, maximizing throughput while minimizing energy expenditure.

#### Implementation Challenges {#sec-model-compression-implementation-challenges-1184}

The efficiency gains from dynamic computation come at a price. Techniques like early exit and mixture of experts introduce architectural complexity that can undermine the very speedups they promise if not carefully managed. Dynamic computation introduces several practical challenges:

Training complexity poses the first obstacle: discrete gating decisions cannot be optimized with standard backpropagation, requiring reinforcement learning or continuous approximations. Because different inputs follow different paths, gradient updates become inconsistent and require careful regularization.

Overhead and latency variability present the second concern. Gating decisions add computational overhead that can offset savings from skipped computations. Variable inference times are problematic for real-time applications with strict latency requirements.

\index{Hardware Utilization!dynamic computation impact}
Hardware inefficiency compounds these issues. Dynamic computation patterns reduce hardware utilization because modern accelerators are optimized for regular, predictable operations. When inputs follow different paths, some hardware resources remain idle. See @sec-hardware-acceleration for hardware-aware strategies.

Generalization risks emerge when models learn to allocate insufficient computation to rare but important inputs, creating biased predictions. Dynamic models also introduce new adversarial attack vectors where attackers manipulate gating mechanisms.

Evaluation difficulty rounds out the challenge set. Standard benchmarks assume fixed computational budgets. FLOPs and latency metrics do not capture adaptive computation, and variable execution paths complicate reproducibility.

Despite these challenges, dynamic computation remains promising for efficiency optimization. Addressing these limitations requires robust training techniques, hardware-aware execution strategies, and evaluation frameworks that account for adaptive scaling.

### Sparsity Exploitation {#sec-model-compression-sparsity-exploitation-48a6}

\index{Sparsity!etymology}
Dynamic computation decides *whether* to perform certain operations, but many computations still involve multiplying by zero—a waste that sparsity exploitation\index{Sparsity!exploitation} directly addresses. Recall that pruning (from @sec-model-compression-pruning-d1cb) introduces zeros into weight matrices. Sparsity exploitation asks how to *accelerate* computation when those zeros are present. The distinction matters: pruning reduces what we store, while sparsity exploitation reduces what we compute. Sparsity\index{Sparsity!definition}[^fn-sparsity-etymology] in machine learning refers to the condition where a significant portion of the elements within a tensor, such as weight matrices or activation tensors, are zero or nearly zero.

[^fn-sparsity-etymology]: **Sparsity**: From Latin "sparsus" (scattered, spread out), past participle of "spargere" (to scatter). The mathematical sense dates to the 1950s-1960s when researchers studying linear systems noticed that many real-world matrices had mostly zero entries. In ML, sparsity became central with the development of L1 regularization (LASSO, 1996), which induces exact zeros in weights rather than just small values.

More formally, for a tensor $T \in \mathbb{R}^{m \times n}$ (or higher dimensions), the sparsity $S$ can be expressed as:
$$
S = \frac{\Vert \mathbf{1}_{\{T_{ij} = 0\}} \Vert_0}{m \times n}
$$
where $\mathbf{1}_{\{T_{ij} = 0\}}$ is an indicator function that yields 1 if $T_{ij} = 0$ and 0 otherwise, and $\Vert \cdot \Vert_0$ represents the L0 norm, which counts the number of non-zero elements.

Due to the nature of floating-point representations, we often extend this definition to include elements that are close to zero. This leads to:
$$
S_{\epsilon} = \frac{\Vert \mathbf{1}_{\{|T_{ij}| < \epsilon\}} \Vert_0}{m \times n}
$$
where $\epsilon$ is a small threshold value.

\index{L1 Regularization!sparsity induction}
Sparsity can emerge naturally during training, often as a result of regularization techniques, or be deliberately introduced through methods like pruning, where elements below a specific threshold are forced to zero. Effectively exploiting sparsity leads to significant computational efficiency, memory savings, and reduced power consumption, which prove valuable when deploying models on devices with limited resources, such as mobile phones, embedded systems, and edge devices.

#### Sparsity Types {#sec-model-compression-sparsity-types-793d}

Sparsity in neural networks falls into two broad categories: unstructured sparsity and structured sparsity.

\index{Sparsity!unstructured}
Unstructured sparsity occurs when individual weights are set to zero without any specific pattern, typically through magnitude-based pruning. While highly flexible, unstructured sparsity is less efficient on hardware because it lacks a predictable structure. Exploiting it requires specialized hardware or software optimizations.

\index{Sparsity!structured}
Structured sparsity involves removing entire components of the network, such as filters, neurons, or channels. Because these removals produce predictable memory access patterns, structured sparsity is more efficient on hardware accelerators like GPUs or TPUs. It is the preferred approach when deployment requires predictable computational resource usage.

#### Sparsity Utilization Methods {#sec-model-compression-sparsity-utilization-methods-04c3}

With the distinction between unstructured and structured sparsity patterns established, the critical question becomes: how do we translate theoretical zeros into actual speedup? The challenge lies in the gap between theoretical parameter reduction and realized performance: a sparse model with 90% of weights zeroed may still run at nearly full computational cost on hardware not designed for irregular memory access. The processor cannot skip a multiplication unless it *knows* the operand is zero—and discovering that requires loading the operand from memory in the first place. Bridging this gap requires specialized utilization methods and hardware support that can efficiently skip zero-valued computations [@hoefler2021sparsity]. Structured sparsity proves more hardware-efficient, enabling accelerators like GPUs and TPUs to fully exploit regular patterns [@Han2015].

The simplest utilization method is sparse matrix operations, which skip zero elements during computation to significantly reduce arithmetic operations. Consider the difference: multiplying a dense $4\times 4$ matrix with a vector typically requires 16 multiplications, while a sparse-aware implementation computes only the 6 nonzero operations:
$$
\begin{bmatrix}
2 & 0 & 0 & 1 \\
0 & 3 & 0 & 0 \\
4 & 0 & 5 & 0 \\
0 & 0 & 0 & 6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 2x_1 + x_4 \\ 3x_2 \\ 4x_1 + 5x_3 \\ 6x_4 \end{bmatrix}
$$

A third important technique for exploiting sparsity is low-rank approximation. In this approach, large, dense weight matrices are approximated by smaller, lower-rank matrices that capture the most important information while discarding redundant components. This reduces both the storage requirements and computational cost. For instance, a weight matrix of size $1000 \times 1000$ with one million parameters can be factorized into two smaller matrices, say $U$ (size $1000 \times 50$) and $V$ (size $50 \times 1000$), which results in only 100,000 parameters, much fewer than the original one million. This smaller representation retains the key features of the original matrix while significantly reducing the computational burden [@Denton2014].

Low-rank approximations, such as Singular Value Decomposition, are commonly used to compress weight matrices in neural networks. These approximations are widely applied in recommendation systems and natural language processing models to reduce computational complexity and memory usage without a significant loss in performance [@Joulin2017].

Sparsity-aware training complements these methods by helping models learn sparse representations during the training process itself. Sparse gradient descent, which updates only non-zero elements, reduces the number of active parameters throughout training rather than applying compression after the fact [@Bellec2018].

#### Sparsity Hardware Support {#sec-model-compression-sparsity-hardware-support-588b}

Achieving actual speedups from sparsity requires hardware that can efficiently skip zero-valued computations. The hardware acceleration principles in @sec-hardware-acceleration examine how different processor architectures handle sparse patterns with varying effectiveness. Software libraries can help bridge this gap by reformulating sparse computations into patterns that current hardware handles efficiently. For example, MegaBlocks [@gale2022megablocksefficientsparsetraining] reformulates sparse Mixture of Experts training into block-sparse operations, developing specialized kernels that maintain high accelerator utilization despite irregular sparsity patterns.

#### Structured Patterns {#sec-model-compression-structured-patterns-c3f4}

\index{cuSPARSE!block sparse operations}
Various sparsity formats have been developed, each with unique structural characteristics and implications. Two of the most prominent are block sparse matrices\index{Sparsity!block sparse} and N:M sparsity patterns. Block sparse matrices generally have isolated blocks of zero and non-zero dense submatrices such that a matrix operation on the large sparse matrix can be easily re-expressed as a smaller (overall arithmetic-wise) number of dense operations on submatrices. This sparsity allows more efficient storage of the dense submatricies while maintaining shape compatibility for operations like matrix or vector products. For example, @fig-block-sparse-gemm shows how NVIDIA's cuSPARSE [@nvidia_cusparse_block] library supports sparse block matrix operations and storage. Several other works, such as Monarch matrices [@dao2022monarchexpressivestructuredmatrices], have extended on this block-sparsity to strike an improved balance between matrix expressivity and compute/memory efficiency.

::: {#fig-block-sparse-gemm fig-env="figure" fig-pos="htb" fig-cap="**Block Sparse Representation**: NVIDIA's cusparse library efficiently stores block sparse matrices by exploiting dense submatrix structures, enabling accelerated matrix operations while maintaining compatibility with dense matrix computations through block indexing. This approach reduces memory footprint and arithmetic complexity for sparse linear algebra, important for scaling machine learning models. *Source: NVIDIA.*" fig-alt="Grid of 3x3 matrix blocks with varying shades indicating dense submatrices. Adjacent index array shows non-zero block positions. Gray blocks represent zeros, colored blocks represent dense submatrices stored separately."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{%
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
}
\definecolor{Blue1}{RGB}{23,68,150}
\definecolor{Blue2}{RGB}{84,131,217}
\definecolor{Blue3}{RGB}{145,177,237}
\def\columns{3}
\def\rows{3}
\def\cellsize{5mm}
\def\cellheight{5mm}

\begin{scope}[local bounding box=BL1]
\begin{scope}[local bounding box=matrica1]
\def\rowone{Blue2,Blue3,Blue2}
\def\rowtwo{Blue2,Blue1,Blue2}
\def\rowthree{Blue2,Blue2,Blue2}
\def\br{A}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%%%%%
\begin{scope}[shift={(1.5,0)}]
\def\rowone{Blue1,Blue2,Blue2}
\def\rowtwo{Blue3,Blue2,Blue1}
\def\rowthree{Blue2,Blue1,Blue2}
\def\br{B}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%%%%%
\begin{scope}[shift={(3.0,0)}]
\def\br{C}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%%%%%%%
%second row
\begin{scope}[shift={(0,-1.5)}]
\def\rowone{Blue1,Blue3,Blue3}
\def\rowtwo{Blue1,Blue2,Blue2}
\def\rowthree{Blue2,Blue1,Blue2}
\def\br{D}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(1.5,-1.5)}]
\def\br{E}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(3,-1.5)}]
\def\rowone{Blue1,Blue2,Blue2}
\def\rowtwo{Blue2,Blue1,Blue2}
\def\rowthree{Blue2,Blue2,Blue1}
\def\br{E}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%third row
\begin{scope}[shift={(0,-3)}]
\def\br{H}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(1.5,-3)}]
\def\rowone{Blue2,Blue1,Blue3}
\def\rowtwo{Blue2,Blue2,Blue1}
\def\rowthree{Blue1,Blue2,Blue2}
\def\br{E}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(3,-3)}]
\def\rowone{Blue2,Blue3,Blue2}
\def\rowtwo{Blue3,Blue1,Blue1}
\def\rowthree{Blue2,Blue3,Blue2}
\def\br{E}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\end{scope}
\node[below=0.2 of BL1,align=center]{Block sparse\\ weights};
\newcommand{\zeroentry}{%
    \tikz[baseline=0.8ex]{
\node[draw=black, line width=1.2pt,fill=black!10, minimum width=0.8*\cellsize,
                    minimum height=0.8*\cellheight] (cell-G)  {};
}}
\node[above=0.2 of BL1,align=center]{\zeroentry ~~= zero entry};

%%%%%%%%%%%%
%%right matrix
\begin{scope}[local bounding box=BL2,shift={(6,0)}]]
\begin{scope}
\def\rowone{Blue2,Blue3,Blue2}
\def\rowtwo{Blue2,Blue1,Blue2}
\def\rowthree{Blue2,Blue2,Blue2}
\def\br{A2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%%%%%
\begin{scope}[shift={(1.5,0)}]
\def\rowone{Blue1,Blue2,Blue2}
\def\rowtwo{Blue3,Blue2,Blue1}
\def\rowthree{Blue2,Blue1,Blue2}
\def\br{B2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%%%%%
%second row
\begin{scope}[shift={(0,-1.5)}]
\def\rowone{Blue1,Blue3,Blue3}
\def\rowtwo{Blue1,Blue2,Blue2}
\def\rowthree{Blue2,Blue1,Blue2}
\def\br{C2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(1.5,-1.5)}]
\def\rowone{Blue1,Blue2,Blue2}
\def\rowtwo{Blue2,Blue1,Blue2}
\def\rowthree{Blue2,Blue2,Blue1}
\def\br{D2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
%%third row
\begin{scope}[shift={(0,-3)}]
\def\rowone{Blue2,Blue1,Blue3}
\def\rowtwo{Blue2,Blue2,Blue1}
\def\rowthree{Blue1,Blue2,Blue2}
\def\br{E2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(1.5,-3)}]
\def\rowone{Blue2,Blue3,Blue2}
\def\rowtwo{Blue3,Blue1,Blue1}
\def\rowthree{Blue2,Blue3,Blue2}
\def\br{F2}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=black!10, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=white,line width=1pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}

%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\end{scope}

%%%%%%%%%%%
%%third matrix-other color
\begin{scope}[local bounding box=BL3,shift={(9.5,0.5)}]
\def\columns{2}
\def\rows{1}
\def\cellsize{5mm}
\def\cellheight{15mm}
\begin{scope}
\def\rowone{OrangeL,OrangeL}
\def\br{A3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=white, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(0,-1.5)}]
\def\rowone{OrangeL,OrangeL}
\def\br{B3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=white, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\begin{scope}[shift={(0,-3)}]
\def\rowone{OrangeL,OrangeL}
\def\br{C3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=white, line width=1pt,fill=white, minimum width=\cellsize,
                    minimum height=\cellheight, line width=0.5pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
%countur line
\draw[line width=2pt,black!80]
    (0.5*\cellsize,-0.5*\cellheight) rectangle
    (\columns*\cellsize+0.5*\cellsize,-\rows*\cellheight-0.5*\cellheight);
\end{scope}
\end{scope}
\node[below=0.2 of BL2,align=center](NZ){Non-zero\\ data values};
\node[below=0.2 of BL3,align=center](BI){Block\\ indices};
\scoped[on background layer]
\node[draw=none,inner xsep=0mm,inner ysep=0mm,
yshift=0mm,fill=none,fit=(NZ)(BI),line width=0.75pt](BB1){};
\node[below=2pt of BB1](IR){Internal representation};

\coordinate(XA)at($(cell-1-1A2.north west)+(0.5,0.5)$);
\coordinate(XA1)at($(cell-1-3E2.south west)+(0.5,-1.5)$);
\coordinate(XB)at($(cell-1-1A3.north east)+(0,0.5)$);
\coordinate(XB1)at($(cell-1-1C3.south east)+(0.5,-1.5)$);
%\fill[red](cell-1-1A2.north west)circle(2pt);
\draw[line width=3.5pt,violet!30,rounded corners=20pt](XA)--++(180:1.3)|-(XA1);
\draw[line width=3.5pt,violet!30,rounded corners=20pt](XB)--++(0:1.3)|-(XB1);
%
\coordinate(T1)at($(cell-1-1A2.north west)+(-0.2,0.2)$);
\coordinate(T2)at($(cell-2-1A3.north east)+(0.2,0.2)$);
\coordinate(T3)at($(cell-2-1A3.south east)+(0.2,-0.2)$);
\coordinate(T4)at($(cell-1-1A3.south west)+(-0.2,-0.2)$);
\coordinate(T5)at($(cell-1-1A2.south west)+(-0.2,-0.2)$);
\draw[line width=3.5pt,red](T1)-|(T3)--(T4)|-(T5)--(T1);
%\fill[blue](T5)circle(2pt);

%%%%%%%%%%%
%%third matrix-other color
\begin{scope}[local bounding box=BL4,shift={(13,0)}]
\def\columns{4}
\def\rows{9}
\begin{scope}
\def\br{A4}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[line width=2pt,draw=black!80, fill=GreenL, minimum width=\cellsize,
                    minimum height=\cellheight] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
\end{scope}
%%%%%%%%%%%
%%above matrix
\begin{scope}[local bounding box=BL5,shift={(13,6)}]
\def\columns{4}
\def\rows{9}
\begin{scope}
\def\br{A5}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[line width=2pt,draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
\end{scope}
\node[below=0.2 of BL4,align=center](OA){Output\\ activations};

\node[draw=red,inner xsep=1.5mm,inner ysep=1.5mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1A4),line width=3.5pt](BB2){};
\draw[red,line width=1.5pt](BB2)--++(135:2)node[above left]{Dot Product};
\node[draw=red,inner xsep=1.5mm,inner ysep=1.5mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1A5)(cell-1-9A5),line width=3.5pt](BB3){};
%
\node[single arrow, draw=black,thick, fill=VioletL,
      minimum width = 20pt, single arrow head extend=3pt,
      minimum height=10mm]at($(BL3)!0.52!(BL4)$) {};
\node[single arrow, draw=black,thick, fill=VioletL,
      minimum width = 20pt, single arrow head extend=3pt,
      minimum height=9mm,rotate=270]at($(BL5)!0.5!(BL4)$) {};
\node[left=2mm of BB3,align=center,red]{Input\\activations};
\end{tikzpicture}
```
:::

Similarly, the $N$:$M$ sparsity pattern\index{Sparsity!N:M patterns}\index{Sparsity!2:4 structured} is a structured sparsity format where, in every set of $M$ consecutive elements (e.g., weights or activations), exactly $N$ are non-zero, and the other two are zero [@zhou2021learningnmfinegrainedstructured]. This deterministic pattern facilitates efficient hardware acceleration, as it allows for predictable memory access patterns and optimized computations. By enforcing this structure, models can achieve a balance between sparsity-induced efficiency gains and maintaining sufficient capacity for learning complex representations. @fig-2-4-gemm below shows a comparison between accelerating dense versus 2:4 sparsity matrix multiplication, a common sparsity pattern used in model training. Later works like STEP [@lu2023steplearningnmstructured] have examined learning more general $N$:$M$ sparsity masks for accelerating deep learning inference under the same principles.

::: {#fig-2-4-gemm fig-env="figure" fig-pos="htb" fig-cap="**2:4 Structured Sparsity GEMM.** Left: standard dense matrix multiplication on Tensor Cores using full 8-element rows. Right: 2:4 sparse multiplication where each group of four elements retains only two non-zeros, with 2-bit indices selecting matching elements from the dense B matrix, halving compute. Source: PyTorch blog [@pytorch_sparsity_blog]." fig-alt="Side-by-side comparison of dense and 2:4 sparse GEMM on Tensor Cores. Left shows 8-element row multiplication. Right shows 4-element sparse row with 2-bit indices selecting matching elements from dense B matrix."}
```{.tikz}
\begin{tikzpicture}[line join=round,font=\usefont{T1}{phv}{m}{n}]
\tikzset{%
 mysnake/.style={postaction={draw,decorate,decoration={snake,amplitude=3pt,segment length=19pt}}},
helvetica/.style={align=flush center,font=\small\usefont{T1}{phv}{m}{n}},
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner sep=5pt,
    node distance=0.8,
    draw=VioletLine,
    line width=0.75pt,
    fill=VioletL2,
    text width=43mm,align=flush center,
    minimum width=43mm, minimum height=7mm
  },
do path picture/.style={%
    path picture={%
      \pgfpointdiff{\pgfpointanchor{path picture bounding box}{south west}}%
        {\pgfpointanchor{path picture bounding box}{north east}}%
      \pgfgetlastxy\x\y%
      \tikzset{x=\x/2,y=\y/2}%
      #1
    }
  },
  cross/.style={do path picture={
    \draw [line cap=round] (-1,-1) -- (1,1) (-1,1) -- (1,-1);
  }},
}
\definecolor{Blue1}{RGB}{23,68,150}
\definecolor{Blue2}{RGB}{84,131,217}
\definecolor{Blue3}{RGB}{145,177,237}
\def\columns{3}
\def\rows{3}
\def\cellsize{5mm}
\def\cellheight{5mm}

\begin{scope}[local bounding box=LEFT]
\node[draw,circle,line width=0.75pt,cross,minimum width=6mm](CI1){};
\node[draw=black, line width=1.2pt,fill=GreenL, minimum width=0.9*\cellsize,
                    minimum height=0.9*\cellheight,below=0.5 of CI1](AR) {};
\node[right=1mm of AR](AR1){Accumulator (result)};

\begin{scope}[local bounding box=M3,shift={(-0.2,1.5)}]
\def\columns{8}
\def\rows{1}
\def\br{M3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
\begin{scope}[local bounding box=M1,shift={(-4.5,1.5)}]
\def\columns{8}
\def\rows{1}
\def\br{M1}
\def\rowone{Blue1,Blue2,Blue3,Blue1,Blue3,Blue1,Blue2,Blue3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=1pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\end{scope}

\draw[Line,-latex](CI1)--(AR);
\draw[Line,-latex](M3)|-(CI1);
\draw[Line,-latex](M1)|-(CI1);
%%fitting
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=11mm,
yshift=8mm,fill=BackColor!50,fit=(AR1)(M1)(M3),line width=0.75pt](BB1){};
\node[anchor=north west,align=center]at(BB1.north west){Dense operation\\ on Tensor Core};

%%below matrix Blue
\begin{scope}[local bounding box=DM1,shift={(0.2,-2.5)}]
\def\columns{8}
\def\rows{8}
\def\br{DM1}
\def\rowone{Blue1,Blue2,Blue3,Blue1,Blue3,Blue1,Blue2,Blue3}
\def\rowtwo{Blue3,Blue2,Blue3,Blue2,Blue3,Blue3,Blue1,Blue1}
\def\rowthree{Blue2,Blue1,Blue2,Blue3,Blue3,Blue2,Blue2,Blue1}
\def\rowfour{Blue2,Blue3,Blue2,Blue3,Blue1,Blue2,Blue3,Blue3}
\def\rowfive{Blue2,Blue2,Blue3,Blue1,Blue3,Blue1,Blue2,Blue3}
\def\rowsix{Blue2,Blue3,Blue1,Blue3,Blue1,Blue3,Blue2,Blue2}
\def\rowseven{Blue3,Blue3,Blue2,Blue1,Blue2,Blue2,Blue3,Blue3}
\def\rowosam{Blue3,Blue2,Blue3,Blue2,Blue3,Blue1,Blue2,Blue1}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}
\foreach \color [count=\x] in \rowfour {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-4\br) {};
}
\foreach \color [count=\x] in \rowfive {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-5\br) {};
}
\foreach \color [count=\x] in \rowsix {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-6\br) {};
}
\foreach \color [count=\x] in \rowseven {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-7\br) {};
}
\foreach \color [count=\x] in \rowosam {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-8\br) {};
}
\end{scope}

%
\draw[|-|,thick]([yshift=-5.5]cell-1-8DM1.south west)--node[below=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{K}([yshift=-5.5]cell-8-8DM1.south east);
\node[left=1mm of DM1.west,rotate=90,anchor=south]{A matrix (Dense)};
\draw[|-|,thick]([xshift=7.5]cell-8-8DM1.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{M}([xshift=7.5]cell-8-1DM1.north east);
%

\node[below=22pt of DM1](SP){\textbf{Dense M $\times$ N $\times$ K GEMM}};
%%%last matrix Green
\begin{scope}[local bounding box=DM3,shift={(6,-2.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM3}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=GreenL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
%
%
\node[below=0.35 of DM3,align=center](CM){C matrix\\ (Dense)};
\draw[|-|,thick]([yshift=9.5]cell-1-1DM3.north west)--node[above=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{N}([yshift=9.5]cell-4-1DM3.north east);
\draw[|-|,thick]([xshift=7.5]cell-4-8DM3.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{M}([xshift=7.5]cell-4-1DM3.north east);
%

%
%fitting
\node[draw=red,inner xsep=1.2mm,inner ysep=1.2mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM1)(cell-8-1DM1),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south);
\node[draw=red,inner xsep=1.0mm,inner ysep=1.0mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM3),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south east);
%%%last upper matrix brown
\begin{scope}[local bounding box=DM4,shift={(6,3.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM4}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
%
\node[above=0.2 of DM4,align=center](BM){B matrix\\ (Dense)};
\draw[|-|,thick]([xshift=7.5]cell-4-8DM4.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{K}([xshift=7.5]cell-4-1DM4.north east);
%fitting
\node[draw=red,inner xsep=1.2mm,inner ysep=1.2mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM4)(cell-1-8DM4),line width=3.5pt](BB4){};
\draw[red,line width=1.5pt](BB4)--(BB1.east);
\node[draw=red,inner xsep=1.0mm,inner ysep=1.0mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM3),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south east);
\end{scope}
%%%%%%%%%%
%right part
%%%%%%%%%%%%
\begin{scope}[local bounding box=RIGHT,shift={(14.5,0)}]
\node[draw,circle,line width=0.75pt,cross,minimum width=6mm](CI1){};
\node[draw=black, line width=1.2pt,fill=GreenL, minimum width=0.9*\cellsize,
                    minimum height=0.9*\cellheight,below=0.5 of CI1](AR) {};
\node[right=1mm of AR](AR1){Accumulator (result)};

\begin{scope}[local bounding box=M3,shift={(1,1.5)}]
\def\columns{4}
\def\rows{1}
\def\br{M3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}

\begin{scope}[local bounding box=M2,shift={(-1.3,1.5)}]
\def\columns{4}
\def\rows{1}
\def\br{M2}
\def\cellsize{2.5mm}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=OrangeL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
\begin{scope}[local bounding box=M1,shift={(-3.8,1.5)}]
\def\columns{4}
\def\rows{1}
\def\br{M1}
\def\rowone{Blue2,Blue3,Blue1,Blue3}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=1pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\end{scope}
\node[Box,above=of M3)](CM){Choose matching K/2 elements out of K elements};
%%
\begin{scope}[local bounding box=M4,shift={($(CM.north)+(-2.3,1.5)$)}]
\def\columns{8}
\def\rows{1}
\def\br{M4}
\def\rowone{BrownL,BrownL!20,BrownL!20,BrownL,BrownL!20,BrownL,BrownL,BrownL!20}
%
\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=1pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\end{scope}
%\fill[red](cell-8-1M4)circle(2pt);
\foreach \x in{1,...,8}{
\draw[Line,-latex](cell-\x-1M4)--(cell-\x-1M4|-CM.north);
}
\foreach \x in{1,...,4}{
\draw[Line,latex-](cell-\x-1M3)--(cell-\x-1M3|-CM.south);
}
\draw[Line,-latex](CI1)--(AR);
\draw[Line,-latex](M3)|-(CI1);
\draw[Line,-latex](M1)|-(CI1);
\draw[Line,-latex](M2)|-node[left,pos=0.3]{Select}(CM);
%%fitting
\scoped[on background layer]
\node[draw=BackLine,inner xsep=3mm,inner ysep=3mm,
yshift=0mm,fill=BackColor!50,fit=(AR1)(M1)(M4)(CM),line width=0.75pt](BB1){};
\node[anchor=north west,align=center]at(BB1.north west){Sparse operation\\ on Tensor Core};

%%below matrix Blue
\begin{scope}[local bounding box=DM1,shift={(0.2,-2.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM1}
\def\rowone{Blue2,Blue3,Blue1,Blue3}
\def\rowtwo{Blue2,Blue3,Blue1,Blue2}
\def\rowthree{Blue2,Blue1,Blue3,Blue2}
\def\rowfour{Blue2,Blue3,Blue1,Blue3}
\def\rowfive{Blue3,Blue1,Blue1,Blue2}
\def\rowsix{Blue2,Blue2,Blue1,Blue3}
\def\rowseven{Blue3,Blue1,Blue2,Blue3}
\def\rowosam{Blue3,Blue2,Blue1,Blue2}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}

%
\foreach \color [count=\x] in \rowone {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
    minimum height=\cellheight] at (cell-\x-1\br) {};
}
\foreach \color [count=\x] in \rowtwo {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-2\br) {};
}
\foreach \color [count=\x] in \rowthree {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-3\br) {};
}
\foreach \color [count=\x] in \rowfour {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-4\br) {};
}
\foreach \color [count=\x] in \rowfive {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-5\br) {};
}
\foreach \color [count=\x] in \rowsix {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-6\br) {};
}
\foreach \color [count=\x] in \rowseven {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-7\br) {};
}
\foreach \color [count=\x] in \rowosam {
    \node[fill=\color,draw=black!80,line width=2pt, minimum size=\cellsize,
               minimum height=\cellheight] at (cell-\x-8\br) {};
}
\end{scope}
\node[below=0.7 of DM1,align=center](NZ){Non-zero data\\ values};
\draw[|-|,thick]([yshift=-5.5]cell-1-8DM1.south west)--node[below=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{K/2}([yshift=-5.5]cell-4-8DM1.south east);
\node[left=1mm of DM1.west,rotate=90,anchor=south]{A matrix (Sparse)};
\begin{scope}[local bounding box=DM2,shift={(3.4,-2.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM2}
\def\cellsize{2.5mm}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=OrangeL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
%
\node[below=0.7 of DM2,align=center](2B){2-bits\\ indices};
\draw[|-|,thick]([yshift=-5.5]cell-1-8DM2.south west)--node[below=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{K/2}([yshift=-5.5]cell-4-8DM2.south east);
\draw[|-|,thick]([xshift=9.5]cell-4-8DM2.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{M}([xshift=9.5]cell-4-1DM2.north east);
%
\node[draw=none,inner xsep=0mm,inner ysep=0mm,
yshift=0mm,fill=none,fit=(NZ)(2B),line width=0.75pt](BB2){};
\node[below=2pt of BB2](SP){\textbf{Sparse M $\times$ N $\times$ K GEMM}};
%%%last matrix Green
\begin{scope}[local bounding box=DM3,shift={(6,-2.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM3}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=GreenL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
%
%
\node[below=0.7 of DM3,align=center](CM){C matrix\\ (Dense)};
\draw[|-|,thick]([yshift=9.5]cell-1-1DM3.north west)--node[above=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{N}([yshift=9.5]cell-4-1DM3.north east);
\draw[|-|,thick]([xshift=7.5]cell-4-8DM3.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{M}([xshift=7.5]cell-4-1DM3.north east);
%
%fitting
\node[draw=red,inner xsep=1.2mm,inner ysep=1.2mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM1)(cell-4-1DM2),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south);
\node[draw=red,inner xsep=1.0mm,inner ysep=1.0mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM3),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south east);
%%%last upper matrix brown
\begin{scope}[local bounding box=DM4,shift={(6,3.5)}]
\def\columns{4}
\def\rows{8}
\def\br{DM4}

\foreach \x in {1,...,\columns}{
    \foreach \y in {1,...,\rows}{
        %
        \node[draw=black!80, fill=BrownL, minimum width=\cellsize,
                    minimum height=\cellheight, line width=2pt] (cell-\x-\y\br) at (\x*\cellsize,-\y*\cellheight) {};
    }
}
\end{scope}
%
\node[above=0.2 of DM4,align=center](BM){B matrix\\ (Dense)};
\draw[|-|,thick]([xshift=7.5]cell-4-8DM4.south east)--node[right=0pt,
                       font=\usefont{T1}{phv}{m}{n}\small]{K}([xshift=7.5]cell-4-1DM4.north east);
\node[left=1mm of DM1.west,rotate=90,anchor=south]{A matrix (Sparse)};
%fitting
\node[draw=red,inner xsep=1.2mm,inner ysep=1.2mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM4)(cell-1-8DM4),line width=3.5pt](BB4){};
\draw[red,line width=1.5pt](BB4)--(BB1.east);
\node[draw=red,inner xsep=1.0mm,inner ysep=1.0mm,outer sep=0pt,
yshift=0mm,fill=none,fit=(cell-1-1DM3),line width=3.5pt](BB3){};
\draw[red,line width=1.5pt](BB3)--(BB1.south east);
\end{scope}
\path[]($(RIGHT)!0.5!(LEFT)$)--++(90:6)coordinate(GO);
\path[]($(RIGHT)!0.5!(LEFT)$)--++(270:6)coordinate(DO);
\path[VioletLine!60,mysnake,line width=1pt](GO)--(DO);
\end{tikzpicture}
```
:::

Modern hardware accelerators provide specialized support for sparse operations, though the degree of acceleration depends on sparsity structure and hardware capabilities. @sec-hardware-acceleration examines these accelerators in depth; here we summarize their sparsity-specific features.

\index{Sparse Tensor Cores}
GPUs\index{Tensor Cores!sparse acceleration}\index{Hardware-Aware Optimization!sparse tensor cores} with Sparse Tensor Cores (NVIDIA Ampere and later) accelerate structured sparsity patterns like 2:4, achieving up to 2 $\times$ speedup by skipping zero multiplications [@NVIDIA2020]. However, this acceleration requires the sparsity pattern to match hardware expectations, and unstructured sparsity typically sees limited benefit [@hoefler2021sparsity]. TPUs\index{TPU!sparse matrix support} provide support for sparse weight matrices, though this capability has evolved across generations. The original TPU was designed primarily for dense operations; later versions and research adaptations (such as Sparse-TPU) have added sparse matrix support. The systolic array architecture can process non-zero elements efficiently when sparsity patterns are predictable, making this particularly beneficial for transformer models where large matrix multiplications dominate [@jouppi2021ten]. FPGAs\index{FPGA!custom sparsity patterns} offer the most flexibility: unlike GPUs and TPUs, they can be programmed to handle arbitrary sparse formats, making them suitable for unstructured pruning or application-specific patterns where general-purpose accelerators underperform.

Across all platforms, sparse operations reduce memory bandwidth requirements and energy consumption by accessing fewer elements[^fn-sparse-energy-savings]. This benefit compounds with quantization: a sparse INT8 model requires less memory traffic than either technique alone [@Gale2020].

[^fn-sparse-energy-savings]: **Sparse Energy Savings**: Sparse operations significantly reduce energy consumption by accessing fewer elements. For example, structured 2:4 sparsity patterns enable GPU acceleration while maintaining model accuracy. Energy savings vary by hardware platform and sparsity pattern.

#### Challenges and Limitations {#sec-model-compression-challenges-limitations-975f}

While sparsity offers significant efficiency advantages, several challenges limit its practical effectiveness. @tbl-sparsity-optimization summarizes these challenges.

| **Challenge**                          | **Description**                                                                                               | **Impact**                                                                                    |
|:---------------------------------------|:--------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|
| **Unstructured Sparsity Optimization** | Irregular sparse patterns make it difficult to exploit sparsity on hardware.                                  | Limited hardware acceleration and reduced computational savings.                              |
| **Algorithmic Complexity**             | Sophisticated pruning and sparse matrix operations require complex algorithms.                                | High computational overhead and algorithmic complexity for large models.                      |
| **Hardware Support**                   | Hardware accelerators are optimized for structured sparsity, making unstructured sparsity harder to optimize. | Suboptimal hardware utilization and lower performance for unstructured sparsity.              |
| **Accuracy Trade-off**                 | Aggressive sparsity may degrade model accuracy if not carefully balanced.                                     | Potential loss in performance, requiring careful tuning and validation.                       |
| **Energy Efficiency**                  | Overhead from sparse matrix storage and management can offset the energy savings from reduced computation.    | Power consumption may not improve if the overhead surpasses savings from sparse computations. |
| **Limited Applicability**              | Sparsity may not benefit all models or tasks, especially in domains requiring dense representations.          | Not all models or hardware benefit equally from sparsity.                                     |

: **Sparsity Optimization Challenges**: Unstructured sparsity, while reducing model size, hinders hardware acceleration due to irregular memory access patterns, limiting potential computational savings and requiring specialized hardware or software to realize efficiency gains. {#tbl-sparsity-optimization}

\index{Compressed Sparse Row!storage format}
The central challenge is the gap between theoretical and practical speedups. Unstructured pruning removes individual weights based on importance, creating irregular patterns that hardware accelerators struggle to exploit. Most GPUs and TPUs optimize for structured data; without regular patterns, they cannot skip zero elements efficiently. Pruning algorithms themselves introduce overhead, as determining which weights to prune requires sophisticated importance estimation that can be computationally expensive for large models. Even when sparsity is achieved, sparse matrix storage formats (such as Compressed Sparse Row; see @sec-algorithm-foundations-sparse-matrix-formats-0bd0) add indexing overhead that can offset computational savings—a rule of thumb is that sparsity typically must exceed 90–95% to be worthwhile for performance.

The accuracy-efficiency trade-off requires careful calibration. Aggressive sparsity can degrade accuracy beyond acceptable thresholds, and the relationship is often non-linear, as models may tolerate 70% sparsity with minimal impact but collapse at 80%. Finding the optimal operating point requires extensive experimentation.

Energy efficiency is not guaranteed. While sparse operations reduce arithmetic operations, the overhead of sparse indexing and irregular memory access can increase power consumption on hardware not optimized for sparse patterns. On edge devices with tight power budgets, these overheads may outweigh the benefits.

Finally, sparsity benefits vary by model type. Tasks requiring dense representations (image segmentation, some reinforcement learning) may not benefit from sparsity, and older hardware lacking sparse acceleration may see no improvement or even regression.

#### Combined Optimizations {#sec-model-compression-combined-optimizations-99fd}

The techniques examined throughout this chapter (pruning, quantization, operator fusion, dynamic computation, and sparsity) do not exist in isolation. Production deployments rarely apply a single technique; instead, they compose multiple approaches to achieve compression ratios impossible with any individual method. While sparsity offers significant efficiency advantages on its own, it achieves its full potential when combined with other optimization techniques. These combinations introduce coordination challenges that require careful management [@hoefler2021sparsity].

The interaction between sparsity and pruning is the most direct: pruning creates sparsity, but the *pattern* determines hardware efficiency. Structured pruning (entire filters or layers) produces regular sparsity that GPUs and TPUs accelerate efficiently. Unstructured pruning creates irregular patterns that may require specialized sparse matrix formats to realize speedups [@elsen2020fast; @gale2019state].

Combining sparsity with quantization yields multiplicative compression but introduces its own complexity. GPUs and TPUs with dedicated sparse tensor cores accelerate this combination effectively, while general-purpose CPUs often struggle with the combined overhead of sparse indexing and dequantization [@nagel2021white; @zhang2021learning].

The recurring theme across all combinations is hardware alignment. Efficient model designs (depthwise separable convolutions, dynamic computation) amplify sparsity benefits only when the target hardware supports the resulting operation patterns [@dettmers2019sparse]. Selecting technique combinations requires understanding target platform capabilities, as explored in @sec-hardware-acceleration.

The coordination challenges inherent in combining sparsity with other techniques point to a broader principle: optimization techniques rarely succeed in isolation, and their effectiveness depends on sequencing decisions and hardware alignment.

::: {.callout-perspective title="The Optimization Composition Problem"}

Unlike software functions that compose predictably, optimization techniques interact through shared physical resources: memory bandwidth, cache capacity, and arithmetic units. Pruning changes sparsity patterns that affect quantization's dynamic range. Quantization changes numerical precision that affects fusion's memory traffic assumptions. Operator fusion changes execution schedules that affect dynamic computation's branching decisions. Effective optimization therefore requires treating the model-hardware pair as a coupled system rather than optimizing each dimension independently. This is a systems engineering problem, not merely a machine learning one.
:::

With the three optimization dimensions now fully explored, practitioners need systematic guidance for translating this knowledge into deployment decisions.


## Technique Selection {#sec-model-compression-technique-selection-ba16}

An engineer deploying a transformer model faces a concrete decision: the model exceeds the target device's memory by 3 $\times$, inference latency is 4 $\times$ above the SLO, and the power budget allows no more than 2 W sustained. Should she quantize first, prune first, distill to a smaller architecture, or combine techniques? The answer depends on which constraint is binding, what accuracy loss is tolerable, and how much engineering time is available. This section provides structured guidance for navigating that decision.

### Mapping Constraints to Techniques {#sec-model-compression-mapping-constraints-techniques-ff2c}

Understanding how system constraints map to optimization dimensions guides practitioners toward the most relevant approaches. @tbl-constraint-opt-mapping maps system constraints to specific optimization dimensions, guiding technique selection based on deployment requirements.

| **System Constraint**      | **Model Representation** | **Numerical Precision** | **Architectural Efficiency** |
|:---------------------------|:-------------------------|:------------------------|:-----------------------------|
| **Computational Cost**     | ✗                        | ✓                       | ✓                            |
| **Memory and Storage**     | ✓                        | ✓                       | ✗                            |
| **Latency and Throughput** | ✓                        | ✗                       | ✓                            |
| **Energy Efficiency**      | ✗                        | ✓                       | ✓                            |
| **Scalability**            | ✓                        | ✗                       | ✓                            |

: **Optimization Dimensions**: System constraints drive optimization along three core dimensions: model representation, numerical precision, and architectural efficiency, each addressing different resource limitations and performance goals. {#tbl-constraint-opt-mapping}

Although each system constraint primarily aligns with one or more optimization dimensions, the relationships are not strictly one-to-one. Many optimization techniques affect multiple constraints simultaneously. Structuring model optimization along these three dimensions allows practitioners to analyze trade-offs more effectively and select optimizations that best align with deployment requirements.

### Decision Framework {#sec-model-compression-decision-framework-0d69}

The binding constraint of the deployment target determines which technique to reach for first, because each optimization addresses a different resource bottleneck.

When model size is the primary constraint, as with over-the-air updates or storage-limited devices, quantization provides the most direct reduction. INT8 post-training quantization delivers a 4 $\times$ size reduction with minimal accuracy loss and requires no retraining, making it the natural first choice. When further reduction is needed, INT4 quantization doubles the compression to 8 $\times$ at the cost of 1--3% typical accuracy degradation. For applications where accuracy is paramount, combining knowledge distillation to a smaller architecture with subsequent quantization preserves quality while still achieving substantial compression.

When inference latency is the bottleneck, the optimization must reduce the actual number of operations executed, not just the storage footprint. Structured pruning accomplishes this by removing entire channels or filters, directly cutting the FLOP count and producing dense sub-networks that run efficiently on commodity hardware. If the target hardware supports INT8 execution, adding quantization on top of structured pruning accelerates the arithmetic itself. For latency-critical applications with some accuracy flexibility, early-exit architectures offer an additional dimension by terminating computation early for easy inputs.

\index{Quantization!weight-only for LLMs}
LLM generation presents a distinct bottleneck: autoregressive decoding is dominated by memory bandwidth rather than compute, because each token generation loads the entire weight matrix but performs relatively little arithmetic. Weight-only quantization (INT4 or INT8 weights with FP16 activations) therefore provides nearly linear speedup by reducing the bytes that must traverse the memory hierarchy.

When energy and power consumption drive the optimization, quantization again leads because it reduces both compute energy (cheaper arithmetic) and memory energy (fewer bytes transferred). Structured pruning complements quantization by reducing the total operation count. Combining both techniques yields multiplicative energy savings that neither achieves alone.

These choices also depend on the available engineering budget. When fine-tuning is feasible, QAT replaces PTQ for better accuracy at the same precision level, knowledge distillation enables maximum accuracy preservation, and NAS can discover hardware-specific architectures that outperform manual designs. When rapid deployment is required, PTQ with a calibration dataset can be completed in hours rather than days, and magnitude-based pruning with brief fine-tuning offers a practical middle ground. Techniques demanding large search budgets, such as NAS or full QAT, are best reserved for production systems with longer optimization timelines.

This decision framework provides starting points for individual technique selection. Validating that a chosen technique actually achieves its intended goal requires systematic profiling and measurement, which @sec-model-compression-efficiency-measurement-2424 formalizes in detail. However, production deployments rarely rely on a single technique. Combining pruning with quantization, or distillation with hardware-aware design, introduces interaction effects that can either amplify benefits or create unexpected accuracy degradation. The following section addresses how to sequence and combine techniques effectively.


## Optimization Strategies {#sec-model-compression-optimization-strategies-f2f6}

The decision framework above guides individual technique selection, but the largest optimization gains emerge from combining multiple techniques. Because pruning, quantization, and architectural efficiency operate at different levels of the stack, they provide multiplicative benefits when sequenced appropriately.

Why do certain combinations work? Pruning and quantization create synergistic effects because pruning reduces parameter count while quantization reduces precision, yielding multiplicative compression\index{Model Compression!compression ratio}\index{Compression Ratio!multiplicative effects}. Applying pruning first concentrates important weights into a smaller parameter set, making subsequent quantization more effective and reducing the search space for optimal quantization strategies. This sequential approach achieves compression ratios exceeding either technique alone.

Knowledge distillation integrates effectively with quantization by mitigating accuracy loss from aggressive precision reduction. Training student models to match teacher behavior rather than just minimizing task loss is particularly effective for extreme quantization scenarios where direct quantization would cause unacceptable accuracy degradation.

Neural architecture search enables co-design approaches that optimize model structures specifically for quantization constraints, identifying architectures that maintain accuracy under low-precision operations. This co-design produces models inherently suited for subsequent optimization, improving the effectiveness of both quantization and pruning techniques.

@fig-compression-methods compares how different compression strategies exhibit varying trade-offs between model size and accuracy loss. Pruning combined with quantization (red circles) achieves high compression ratios with minimal accuracy loss, while quantization alone (yellow squares) provides a reasonable balance. In contrast, SVD (green diamonds) requires a larger model size to maintain accuracy, illustrating how different techniques impact compression effectiveness.

::: {#fig-compression-methods fig-env="figure" fig-pos="htb" fig-cap="**Combined Compression Effectiveness**: Pruning combined with quantization (red circles) achieves the highest compression ratio at near-zero accuracy loss, followed by pruning alone and quantization alone, while SVD (green diamonds) requires the largest model size to maintain accuracy. Source: [@han2016deep]." fig-alt="Line graph of accuracy loss versus model size ratio. Four curves show pruning plus quantization achieving smallest size at near-zero loss, followed by pruning only, quantization only, and SVD requiring largest size to maintain accuracy."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\definecolor{other}{HTML}{D7191C}
\definecolor{WeightGradient}{HTML}{FDAE61}
\definecolor{Optimization}{HTML}{ABDDA4}
\definecolor{Activation}{HTML}{2B83BA}
\begin{axis}[name=boundary,
 axis line style={draw=none},
  width=14cm,
  height=7cm,
  xlabel={Model Size Ratio after Compression},
  ylabel={Accuracy Loss},
  xmin=2, xmax=20,
  ymin=-4.5, ymax=0.5,
  xtick={2,5,8,11,14,17,20},
  xticklabel={\pgfmathprintnumber{\tick}\%},
  ytick={-4.5,-4,-3.5,-3,...,0.5},
  yticklabel={\pgfmathprintnumber{\tick}\%},
    legend style={
    at={(0.5,1.05)},
    anchor=south,
    legend columns=4,
    font=\footnotesize,
    /tikz/every even column/.append style={column sep=0.5cm}
  },
  axis line style={black},
  tick align=outside,
  tick label style={/pgf/number format/assume math mode=true},
  ticklabel style={font=\footnotesize\usefont{T1}{phv}{m}{n}},
  grid=both,
  grid style={line width=.4pt, draw=gray!80},
  %major grid style={line width=.4pt,draw=gray!50},
  clip=false,
  enlargelimits=false,
  legend style={fill=BrownL!40,draw=none,row sep=1.85pt,
 font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
  forget plot,
]

% Pruning + Quantization (red, circle)
    \addplot+[
      scatter,
      scatter src=explicit symbolic,
      line width=1.5pt,
      draw= Activation,
      smooth,
      mark size=2.5pt,
      %
      mark options={fill=white,draw= Activation},
      scatter/classes={
        a={mark=none},
        b={mark=*}
      },
    ]
    table[row sep=crcr, meta=class] {
  x      y        class\\
  2.74   -4.7     a\\
  2.71   -1.82    b\\
  2.75   -1.25    b\\
  2.90   -0.60    b\\
  3.11   -0.24    b\\
  3.32   -0.10    b\\
  3.69   -0.01    b\\
  4.25    0.010   b\\
  5.00    0.02    b\\
  5.70    0.02    b\\
  6.39    0.02    b\\
  7.37    0.02    b\\
    };
      \addlegendimage{
      Activation,
      line width=1.25pt,
      mark=*,
      mark options={fill=white,draw=Activation},
      mark size=2.5pt
    }
\addlegendentry{Pruning + Quantization}

% Pruning Only (purple, triangle)
    \addplot+[
      scatter,
      scatter src=explicit symbolic,
      line width=1.5pt,
      draw= green!70!black,
      smooth,
      mark size=3.5pt,
      %
      mark options={fill=white,draw= green!70!black,},
      scatter/classes={
        a={mark=none},
        b={mark=triangle*}
      },
    ]
    table[row sep=crcr, meta=class] {
  x      y        class\\
  4.05   -4.7     a\\
  4.25   -4.2     b\\
  5.29   -1.99    b\\
  6.25   -1.04    b\\
  7.15   -0.6     b\\
  8.35   -0.28    b\\
  10.01  -0.073   b\\
  11.15   0.005   b\\
  12.55   0.05    b\\
    };
      \addlegendimage{
      green!70!black,
      line width=1.25pt,
      mark=triangle*,
      mark options={fill=white,draw=green!70!black},
      mark size=3.5pt
    }
     \addlegendentry{Pruning Only}
% Quantization Only
    \addplot+[
      scatter,
      scatter src=explicit symbolic,
      line width=1.5pt,
      draw=orange,
      smooth,
      mark size=2.5pt,
      %
      mark options={fill=white,draw=orange},
      scatter/classes={
        a={mark=none},
        b={mark=square*}
      },
    ]
    table[row sep=crcr, meta=class] {
      x      y        class\\
      6.45   -4.7     a\\
      6.48   -3.66    b\\
      6.65   -2.21    b\\
      7.19   -1.06    b\\
      8.07   -0.58    b\\
      9.9    -0.3      b\\
      13.03  -0.13    b\\
      16.07  -0.05    b\\
      19.3    0.01    b\\
      20.4    0.02    a\\
    };
%
    \addlegendimage{
      orange,
      line width=1.25pt,
      mark=square*,
      mark options={fill=white,draw=orange},
      mark size=2.5pt
    }
\addlegendentry{Quantization Only}
% SVD
    \addplot+[
      scatter,
      scatter src=explicit symbolic,
      line width=1.5pt,
      draw=RedLine,
      smooth,
      mark size=3.5pt,
      %
      mark options={fill=white,draw=RedLine},
      scatter/classes={
        a={mark=none},
        b={mark=diamond*}
      },
    ]
    table[row sep=crcr, meta=class] {
      x      y       class\\
      14.19  -4.7    a\\
      15.09  -2.58   b\\
      15.7   -1.86   a\\
      16.9   -1.35   a\\
      19.62  -0.83   b\\
      20.4   -0.73   a\\
    };
    % Legend
    \addlegendimage{
      RedLine,
      line width=1.25pt,
      mark=diamond*,
      mark options={fill=white,draw=RedLine},
      mark size=3.5pt
    }
    \addlegendentry{SVD}
\end{axis}
\end{tikzpicture}
```
:::

\index{BERT!mobile deployment pipeline}
Sequencing critically impacts results, as the following example demonstrates.

::: {.callout-example title="BERT-Base Mobile Deployment Pipeline"}

Consider deploying BERT-Base on mobile devices through three stages. **Stage one** applies structured pruning, removing 30% of attention heads and 40% of intermediate FFN dimensions, resulting in 75% parameter reduction with accuracy dropping from 76.2% to 75.1%. **Stage two** uses knowledge distillation to recover accuracy to 75.9%. **Stage three** applies quantization-aware training with INT8 quantization, achieving 4 $\times$ additional memory reduction with final accuracy of 75.6%. The combined impact: 16 $\times$ memory reduction (440 MB to 28 MB), 12 $\times$ inference speedup on mobile CPU, and only 0.6% final accuracy loss versus 2.1% if quantization had been applied before pruning.
:::

This example illustrates why sequencing matters: pruning first concentrates important weights into smaller ranges, making subsequent quantization more effective. Applying quantization before pruning reduces numerical precision available for importance-based pruning decisions, degrading final accuracy. Effective combination requires understanding these dependencies and developing application sequences that maximize cumulative benefits.

With dozens of techniques across three optimization dimensions, rigorous measurement is essential for validating that optimizations achieve their intended goals. A practitioner who prunes, quantizes, and fuses without profiling the actual impact on target hardware is optimizing blindly.


## Efficiency Measurement {#sec-model-compression-efficiency-measurement-2424}

A model quantized to INT8 should be 4 $\times$ smaller and roughly 3 $\times$ faster, but does it actually achieve those gains on the target hardware? Theoretical compression ratios and measured deployment improvements often diverge, sometimes dramatically, because real speedups depend on memory hierarchy effects, kernel implementations, and hardware utilization patterns that theory alone cannot predict. Translating theoretical compression ratios into measurable deployment improvements therefore requires systematic profiling and evaluation.

This section addresses three critical questions: Where should optimization efforts focus? How do we measure whether optimizations achieve their intended goals? How do we validate that combined techniques deliver expected benefits?

### Profiling and Opportunity Analysis {#sec-model-compression-profiling-opportunity-analysis-477f}

Optimization begins with profiling\index{Profiling!model compression} to identify where computational resources are being consumed and which components offer the greatest optimization potential. A critical first step is determining whether model optimization will actually improve system performance, as model computation often represents only a fraction of total system overhead in production environments.

Modern machine learning models exhibit heterogeneous resource consumption: specific layers, operations, or data paths contribute disproportionately to memory usage, computational cost, or latency. Understanding these patterns is essential for prioritizing optimization efforts and achieving maximum impact with minimal accuracy degradation.

Effective profiling begins with establishing baseline measurements across relevant performance dimensions. Memory consumption — both static (model parameters and buffers) and dynamic allocation during inference — determines whether a model fits on the target device at all. Computational bottlenecks, measured in both FLOPs and actual wall-clock execution time, reveal which layers dominate the inference budget. For battery-powered and edge deployments, power consumption profiles determine operational feasibility: a model that drains a phone battery in an hour is unusable regardless of its accuracy. End-to-end latency measurements identify which operations contribute most to inference delay, often revealing that memory-bound operations like layer normalization consume disproportionate wall-clock time relative to their FLOP count.

A critical caveat applies when translating profiling metrics into optimization estimates.

::: {.callout-warning title="FLOPs Reduction ≠ Proportional Speedup"}

Reducing a model's FLOPs by 50% does not guarantee 50% latency reduction. Memory-bound operations (common in LLM inference and normalization layers) see minimal benefit from compute reduction because they are bottlenecked by data movement, not arithmetic. Critically, Amdahl's Law\index{Amdahl's Law!model compression} (@sec-machine-foundations-amdahls-law-gustafsons-law-b741) applies at the system level: if model inference accounts for only 20% of end-to-end latency (with the rest spent on data loading, pre-processing, and post-processing), then even perfect model optimization yields at most 1.25 $\times$ overall speedup. Always profile on target hardware before estimating optimization benefits.
:::

Consider profiling a Vision Transformer (ViT) for edge deployment. Using PyTorch Profiler reveals three key findings: attention layers consume 65% of total FLOPs (highly amenable to structured pruning), layer normalization consumes 8% of latency despite only 2% of FLOPs (a memory-bound operation), and the final classification head consumes 1% of computation but 15% of parameter memory. This profile suggests a clear priority ordering: first, apply magnitude-based pruning to attention layers for high FLOP reduction; second, quantize the classification head to INT8 for large memory savings with minimal accuracy impact; third, fuse layer normalization operations to reduce the memory bandwidth bottleneck.

Beyond these baseline measurements, modern optimization requires understanding model sensitivity to different types of modifications. Not all parameters contribute equally to accuracy. Layer-wise sensitivity analysis\index{Sensitivity Analysis!layer-wise} reveals which network components are most important for maintaining accuracy, guiding decisions about where to apply aggressive pruning or quantization and where to use conservative approaches.

### Measuring Optimization Effectiveness {#sec-model-compression-measuring-optimization-effectiveness-a3b2}

Optimization requires rigorous measurement frameworks that go beyond simple accuracy metrics to capture the full impact of optimization decisions. Effective measurement considers multiple objectives simultaneously: accuracy preservation, computational efficiency gains, memory reduction, latency improvement, and energy savings. Balancing these often-competing objectives requires careful trade-off analysis.

The measurement framework should establish clear baselines before applying any optimizations. Accuracy baselines include not only top-line metrics like classification accuracy but also calibration, fairness across demographic groups, and robustness to input variations. Efficiency baselines capture computational cost (FLOPs, memory bandwidth), execution time across hardware platforms, peak memory consumption, and energy consumption profiles.

When quantizing ResNet-50 from FP32 to INT8, baseline metrics show Top-1 accuracy of 76.1%, inference latency on V100 of 4.2 ms, model size of 98 MB, and energy per inference of 0.31 J. Post-quantization metrics reveal Top-1 accuracy of 75.8% (0.3% degradation), inference latency of 1.3 ms (3.2 $\times$ speedup), model size of 25 MB (3.9 $\times$ reduction), and energy per inference of 0.08 J (3.9 $\times$ improvement). Additional analysis shows per-class accuracy degradation ranging from 0.1% to 1.2% with highest impact on fine-grained categories, calibration error increasing from 2.1% to 3.4%, and INT8 quantization providing 3.2 $\times$ speedup on GPU but only 1.8 $\times$ on CPU, demonstrating hardware-dependent gains.

With these comprehensive baselines in place, the measurement framework must track optimization impact systematically. Rather than evaluating techniques in isolation, applying our three-dimensional framework requires understanding how different approaches interact when combined. Sequential application can lead to compounding benefits or unexpected interactions that diminish overall effectiveness. @sec-benchmarking provides additional structured evaluation methods for comprehensive performance assessment.

Rigorous measurement tells practitioners *whether* their optimizations succeeded, but the measurements themselves require tooling to perform. Profiling, quantization, pruning, and deployment all depend on software frameworks that automate otherwise prohibitively complex workflows. We turn now to the implementation tools that make these techniques practical.


## Implementation Tools {#sec-model-compression-implementation-tools-4990}

Understanding optimization techniques is necessary but not sufficient; practical implementation relies on robust software support. Without framework tooling, quantization would require manual modification of model definitions and careful insertion of quantization operations throughout the network, while pruning would demand direct manipulation of weight tensors. Both become prohibitively complex as models scale.

Modern machine learning frameworks provide high-level APIs and automated workflows that abstract away implementation complexity, making sophisticated optimization techniques accessible to practitioners. Frameworks address key challenges: providing pre-built modules for common optimization techniques, assisting with hyperparameter tuning (pruning schedules, quantization bit-widths), managing accuracy-compression trade-offs through automated evaluation, and ensuring hardware compatibility through device-specific code generation.

This software infrastructure transforms theoretical optimization techniques into practical tools readily applied in production environments. @sec-ml-operations details the operational considerations for these workflows, including model versioning strategies, monitoring optimization impact on data pipelines, managing optimization artifacts across development and deployment environments, and establishing rollback procedures when optimizations fail. This accessibility bridges the gap between academic research and industrial applications, enabling widespread deployment of efficient machine learning models.

### Model Optimization APIs and Tools {#sec-model-compression-model-optimization-apis-tools-1849}

Leading frameworks such as TensorFlow, PyTorch, and MXNet provide comprehensive APIs enabling practitioners to apply optimization techniques without implementing complex algorithms from scratch. @sec-ml-frameworks examines these frameworks in depth. Built-in optimizations enhance model efficiency while ensuring adherence to established best practices.

TensorFlow's Model Optimization Toolkit\index{Framework Toolkits!optimization support} facilitates quantization, pruning, and clustering. QAT converts floating-point models to lower-precision formats (INT8) while preserving accuracy, systematically managing both weight and activation quantization across diverse architectures. Pruning algorithms introduce sparsity by removing redundant connections at varying granularity levels, from individual weights to entire layers, allowing practitioners to tailor strategies to specific requirements. Weight clustering\index{Weight Clustering}\index{Model Compression!weight clustering} groups similar weights for compression while preserving functionality, providing multiple pathways for improving model efficiency.

Similarly, PyTorch offers optimization support through built-in modules for quantization and pruning. The `torch.quantization` package provides tools for converting models to lower-precision representations, supporting both post-training quantization and quantization-aware training. @lst-qat_example demonstrates PyTorch's quantization-aware training API:

::: {#lst-qat_example lst-cap="**Quantization-Aware Training**: Prepares a model to be trained in lower-precision formats, ensuring that quantization errors are accounted for during training."}
```{.python}
import torch
from torch.quantization import QuantStub, DeQuantStub,
     prepare_qat

# Define a model with quantization support
class QuantizedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = torch.nn.Conv2d(3, 64, 3)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        return self.dequant(x)

# Prepare model for quantization-aware training
model = QuantizedModel()
model.qconfig = torch.quantization.get_default_qat_qconfig()
model_prepared = prepare_qat(model)
```
:::

For pruning, PyTorch provides the `torch.nn.utils.prune` module, which supports both unstructured and structured pruning. @lst-pytorch_pruning illustrates both pruning approaches:

::: {#lst-pytorch_pruning lst-cap="**PyTorch Pruning APIs**: Applies unstructured and structured pruning techniques to reduce model complexity while maintaining performance. *Source: PyTorch Documentation*"}
```{.python}
import torch.nn.utils.prune as prune

# Apply unstructured pruning
module = torch.nn.Linear(10, 10)
prune.l1_unstructured(module, name="weight", amount=0.3)
# Prune 30% of weights

# Apply structured pruning
prune.ln_structured(module, name="weight", amount=0.5, n=2, dim=0)
```
:::

These tools integrate into PyTorch's training pipelines, enabling experimentation with different optimization strategies.

Built-in optimization APIs provide pre-tested, production-ready tools with standardized interfaces, reducing implementation complexity while ensuring consistent, reproducible results across model architectures and teams. These frameworks also bridge the gap between research and practice: as new optimization techniques emerge, framework maintainers incorporate them into APIs, making state-of-the-art methods accessible to practitioners. The result is rapid experimentation, where developers can test strategies, compare effectiveness, and iterate toward optimal configurations.

### Hardware-Specific Optimization Libraries {#sec-model-compression-hardwarespecific-optimization-libraries-3ab1}

\index{Inference Runtime!optimization}
\index{ML Compiler!hardware-specific optimization}
The optimization techniques in this chapter produce models ready for hardware-specific acceleration. Libraries like TensorRT, XLA[^fn-xla-compiler], OpenVINO, and TVM[^fn-tvm-compiler] translate these optimized models into efficient execution on target platforms. @sec-hardware-acceleration examines the hardware acceleration principles underlying how these tools exploit accelerator capabilities for pruned, quantized, and architecturally optimized models.

[^fn-xla-compiler]: **XLA (Accelerated Linear Algebra)**: Google's ML compiler that optimizes computational graphs for efficient execution on TPUs and GPUs, examined in @sec-hardware-acceleration.

[^fn-tvm-compiler]: **TVM (Tensor Virtual Machine)**: Apache's cross-platform ML compiler enabling deployment across diverse hardware from a single model definition, as discussed in @sec-hardware-acceleration.

Framework integration enables practitioners to apply optimizations without implementing hardware-specific code directly. For model representation optimizations like pruning, these libraries provide sparsity-aware acceleration through optimized kernels. For numerical precision optimization, they offer extensive support for both PTQ and QAT, implementing INT8 and INT4 quantization during model conversion. Architectural efficiency techniques integrate through operator-level tuning, including aggressive fusion and kernel reordering.

This integration of hardware optimization libraries with machine learning frameworks enables developers to effectively implement the optimization techniques covered in this chapter while ensuring optimal adaptation to target hardware. @sec-benchmarking and @sec-ml-operations detail deployment strategies that build on these optimization foundations.

Beyond APIs and hardware-specific libraries, visualization tools help practitioners understand how pruning, quantization, and other optimizations affect model behavior.

Quantization error histograms reveal whether quantization errors are Gaussian or contain problematic outliers. Activation visualizations help detect overflow and saturation issues. @fig-color-mapping shows color-mapped AlexNet kernels. TensorFlow's Quantization Debugger, PyTorch's FX Graph Mode, and TensorRT Inspector provide these capabilities.

![**Convolutional Kernel Weights**: Color mapping reveals learned feature patterns in convolutional filters. First-layer filters learn oriented edges, color blobs, and frequency patterns; analyzing weight distributions helps diagnose issues like dead or saturated filters.](images/svg/kernel_weights.svg){#fig-color-mapping width=70% fig-alt="Grid of 96 small color images showing AlexNet first-layer convolutional kernels. Patterns include oriented edges, color blobs, and Gabor-like filters learned from ImageNet training."}

Sparsity heat maps show sparsity distribution across layers (@fig-sparse-heat-map). Darker regions indicate higher sparsity. Trend plots track sparsity progression across pruning iterations. TensorBoard, Netron, and SparseML provide these tools.

![**Sparsity Distribution**: Darker shades indicate higher sparsity where more weights were removed. The heatmap reveals how pruning affects different layers non-uniformly, with later layers typically exhibiting higher sparsity than early feature-extraction layers.](images/svg/sparsity_heatmap.svg){#fig-sparse-heat-map fig-alt="Heatmap visualization of a pruned neural network with weight matrix blocks. Darker regions indicate higher sparsity where more weights have been removed. Lighter regions show retained weights."}

With the implementation tools and visualization capabilities established, the natural question is: how do these techniques compare when a practitioner must choose among them? Each optimization approach carries distinct trade-offs in accuracy, training cost, and hardware requirements, and a structured comparison clarifies which to reach for first.


## Technique Comparison {#sec-model-compression-technique-comparison-3142}

A comparative analysis across the three major approaches reveals how each addresses distinct aspects of the efficiency-accuracy trade-off. Pruning works best when sparse computation hardware is available and when reducing floating-point operations is critical. Quantization provides the most versatile approach with broad hardware support, making it ideal for diverse deployment scenarios. Knowledge distillation requires significant computational investment but produces consistently high-quality compressed models, making it the right choice when accuracy preservation is paramount. @tbl-optimization-comparison summarizes these trade-offs for systematic technique selection.

| **Technique**    | **Primary Goal**    | **Accuracy Impact** | **Training Cost**       | **Hardware Dependency** | **Best For**                          |
|:-----------------|:--------------------|:--------------------|:------------------------|:------------------------|:--------------------------------------|
| **Pruning**      | Reduce FLOPs/Size   | Moderate            | Low (fine-tuning)       | High (for sparse ops)   | Latency-critical apps                 |
| **Quantization** | Reduce Size/Latency | Low                 | Low (PTQ) / High (QAT)  | High (INT8 support)     | Edge/Mobile deployment                |
| **Distillation** | Reduce Size         | Low-Moderate        | High (student training) | Low                     | Creating smaller, high-quality models |

: **Optimization Technique Trade-offs**: Comparison of the three major optimization approaches across key performance dimensions, highlighting how each technique addresses different constraints and deployment scenarios. Pruning excels for computational reduction but requires sparse hardware support, quantization provides balanced size and speed improvements with wide hardware compatibility, while distillation produces high-quality compressed models at higher training cost. {#tbl-optimization-comparison}

\index{Model Compression!sequential application}
These techniques combine synergistically, with quantization often applied after pruning or distillation to achieve compound compression benefits. Production systems frequently employ sequential application: initial pruning reduces parameter count, quantization optimizes numerical representation, and fine-tuning through distillation principles recovers any accuracy loss. Sequential application enables compression ratios of 10–50 $\times$ while maintaining competitive accuracy across diverse deployment scenarios.

With the complete optimization toolkit now surveyed—from individual techniques through combination strategies—the most instructive lessons often come not from what works but from what fails. The following fallacies and pitfalls capture the most common mistakes engineers make when applying these techniques, each grounded in the quantitative trade-offs we have established throughout the chapter.


## Fallacies and Pitfalls {#sec-model-compression-fallacies-pitfalls-1b5e}

```{python}
#| label: fp-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ FALLACIES & PITFALLS CONSTANTS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacies and Pitfalls section of Model Compression chapter
# │
# │ Goal: Provide quantitative examples for common optimization misconceptions.
# │ Show: How pruning and quantization interact non-linearly to affect speed and accuracy.
# │ How: Pre-compute stats for bit-width degradation and dequantization overhead.
# │
# │ Imports: mlsys.formatting (fmt)
# │ Exports: int8_size_reduction_str, bert_fp32_mb_str, bert_int8_mb_str,
# │          pruning_target_str, param_removal_str, resnet_fp32_acc_str,
# │          resnet_int8_acc_str, resnet_binary_acc_str, dequant_overhead_str,
# │          expected_speedup_str, actual_speedup_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsys.formatting import fmt, check
from mlsys.constants import KIB_TO_BYTES

# ┌── P.I.C.O. ISOLATED SCENARIO ───────────────────────────────────────────────
class FallaciesAnalysis:
    """
    Namespace for Fallacies and Pitfalls.
    Scenario: Misinterpreting compression speedups.
    """

    # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
    # Quantization parameters
    expected_bits = 32
    target_bits = 8
    overhead_pct = 15  # dequantization overhead

    # Combined pruning + quantization scenario (Pitfall 5)
    prune_speedup = 2        # 50% structured sparsity = 2× theoretical
    actual_combined_pct = 28  # real-world end-to-end speedup (%)

    # ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
    quant_speedup = expected_bits / target_bits                    # 4× from INT8
    combined_expected = quant_speedup * prune_speedup              # 8× theoretical
    quant_after_overhead = quant_speedup * (1 - overhead_pct/100)  # 3.4× actual quant-only

    # ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
    check(quant_after_overhead < quant_speedup, "Actual speedup should be less than theoretical due to overhead.")
    check(actual_combined_pct < combined_expected * 100, "Real-world speedup must be less than theoretical.")

    # ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
    int8_size_reduction_str = f"{int(quant_speedup)}"
    expected_speedup_str = fmt(combined_expected, precision=0, commas=False)
    actual_speedup_str = fmt(actual_combined_pct, precision=0, commas=False)
    dequant_overhead_str = f"{overhead_pct}"

    bert_fp32_mb_str = "440"
    bert_int8_mb_str = "110"
    pruning_target_str = "70"
    param_removal_str = "40"
    resnet_fp32_acc_str = "76.2"
    resnet_int8_acc_str = "76.1"
    resnet_binary_acc_str = "51"

# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
int8_size_reduction_str = FallaciesAnalysis.int8_size_reduction_str
expected_speedup_str = FallaciesAnalysis.expected_speedup_str
actual_speedup_str = FallaciesAnalysis.actual_speedup_str
dequant_overhead_str = FallaciesAnalysis.dequant_overhead_str
bert_fp32_mb_str = FallaciesAnalysis.bert_fp32_mb_str
bert_int8_mb_str = FallaciesAnalysis.bert_int8_mb_str
pruning_target_str = FallaciesAnalysis.pruning_target_str
param_removal_str = FallaciesAnalysis.param_removal_str
resnet_fp32_acc_str = FallaciesAnalysis.resnet_fp32_acc_str
resnet_int8_acc_str = FallaciesAnalysis.resnet_int8_acc_str
resnet_binary_acc_str = FallaciesAnalysis.resnet_binary_acc_str
```

Model optimization involves counterintuitive interactions between techniques that appear independent. Engineers often assume strategies compose linearly and that theoretical metrics predict deployment performance. The following fallacies and pitfalls capture errors that waste optimization effort, degrade accuracy, or miss deployment requirements despite substantial investment.

**Fallacy:** *Optimization techniques can be applied independently without considering their interactions.*

Engineers assume optimization strategies compose additively: 50% pruning plus `{python} int8_size_reduction_str` $\times$ quantization yields combined benefits. In reality, techniques interact non-linearly and compound losses. A BERT model pruned to `{python} pruning_target_str`% parameters maintains 97.8% performance, but applying INT8 quantization afterward drops accuracy to 94.2%, while QAT on the pruned model achieves 96.5%. Knowledge distillation from heavily pruned teachers transfers degenerate attention patterns that reduce student accuracy by 3–5% compared to distilling from dense models. As @sec-model-compression-optimization-strategies-f2f6 demonstrates, successful optimization requires coordinated application where techniques are sequenced together. Organizations that apply aggressive combinations without measuring interactions waste weeks recovering lost accuracy.

**Pitfall:** *Optimizing for theoretical metrics rather than actual deployment performance.*

Teams reduce FLOPs by 60% and celebrate efficiency gains without profiling deployment hardware. A pruned model with `{python} param_removal_str`% fewer parameters shows irregular sparsity patterns that prevent vectorization, achieving only 12% latency reduction instead of the expected `{python} param_removal_str`% on ARM processors. INT8 quantization reduces a transformer from `{python} bert_fp32_mb_str` MB to `{python} bert_int8_mb_str` MB, but dequantization overhead on GPUs lacking low-precision acceleration increases latency by 15% despite the `{python} int8_size_reduction_str` $\times$ size reduction. As shown in @sec-model-compression-profiling-opportunity-analysis-477f, memory bandwidth, cache behavior, and instruction-level parallelism determine actual performance, not operation counts. Production deployments require measuring wall-clock latency on target hardware.

**Fallacy:** *Aggressive quantization maintains model performance with minimal accuracy loss.*

Engineers assume quantization scales uniformly: if INT8 loses 1%, then INT4 loses 2%. In practice, precision reduction exhibits threshold effects where models collapse catastrophically. ResNet-50 quantized to INT8 maintains `{python} resnet_int8_acc_str`% vs `{python} resnet_fp32_acc_str`% FP32 accuracy, but naive INT4 quantization drops accuracy by 5–15% depending on the method. Binary weights achieve only ~`{python} resnet_binary_acc_str`% on ImageNet. BERT with INT8 weights retains 99.1% of FP32 GLUE performance, but INT4 attention mechanisms cause numerical instability reducing scores by 8–12%. LayerNorm and Softmax require FP16 minimum precision; quantizing them to INT8 causes divergence. As @sec-model-compression-precision-reduction-strategies-db83 demonstrates, mixed-precision approaches maintain accuracy where uniform quantization fails.

**Pitfall:** *Using post-training optimization without considering training-aware alternatives.*

Teams apply post-training quantization (PTQ) to avoid retraining and achieve 96.8% of baseline BERT performance. However, quantization-aware training (QAT) retains 99.1%, recovering the 2.3-point gap through learned quantization parameters. Post-training pruning of ResNet-50 to `{python} pruning_target_str`% parameters drops accuracy to 73.8%, while gradual magnitude pruning during training maintains 75.6% at the same sparsity. Knowledge distillation during student training achieves 97% of teacher performance versus 92–94% when distilling post-hoc. As detailed in @sec-model-compression-precision-reduction-strategies-db83, the 2–4% accuracy improvements from training-aware methods often determine whether models meet production thresholds.

**Pitfall:** *Assuming compression ratios translate directly into proportional deployment gains.*

Teams achieve `{python} int8_size_reduction_str` $\times$ model size reduction through INT8 quantization and expect `{python} int8_size_reduction_str` $\times$ memory savings in deployment. In practice, runtime overhead erodes compression gains. Dequantization kernels add `{python} dequant_overhead_str`% latency overhead converting INT8 weights back to FP16. Pruned models with irregular sparsity achieve only 12% latency reduction despite `{python} param_removal_str`% parameter removal because hardware cannot skip zeroed weights efficiently. As @sec-model-compression-profiling-opportunity-analysis-477f demonstrates, a BERT model pruned to 50% sparsity and quantized to INT8 achieves `{python} actual_speedup_str`% end-to-end speedup rather than the expected `{python} expected_speedup_str` $\times$, because unstructured sparsity creates irregular memory access. Production workflows must profile *deployed* latency on target hardware, not extrapolate from compression ratios.


## Summary {#sec-model-compression-summary-8229}

Model compression is not a bag of tricks but an engineering discipline built on three complementary dimensions: *structural optimization* determines what the model computes, *precision optimization* determines how precisely it computes, and *architectural optimization* determines how efficiently those computations execute on physical hardware. The most important lesson of this chapter is that these dimensions compose multiplicatively. Pruning alone might achieve 2 $\times$ compression; quantization alone might achieve 4 $\times$; but pruning, distillation, and quantization applied together can achieve 16 $\times$ — as BERT's compression from 440 MB to 28 MB demonstrates. The second lesson is equally important: theoretical compression ratios lie. A 4 $\times$ reduction in parameters translates to 4 $\times$ latency improvement only when the optimization aligns with the hardware's execution model. Unstructured sparsity on hardware that lacks sparse kernels achieves almost nothing; INT8 quantization on hardware without INT8 units achieves even less. Profile on target hardware, not paper metrics.

Combined with the data selection techniques from @sec-data-selection, these model-centric optimizations complete the efficiency toolkit: data selection maximizes learning from available examples, while model compression minimizes resources required for deployment.

::: {.callout-takeaways title="From Benchmark Winner to Production Model"}

* **Three dimensions of optimization**: Structural (what to compute), precision (how precisely), architectural (how efficiently). Combine all three for maximum compression.
* **PTQ is a strong baseline**: INT8 post-training quantization requires no retraining and delivers 4 $\times$ compression. Use QAT or distillation when baseline accuracy is insufficient for the application.
* **For LLMs, weight-only quantization wins**: INT4 weights with FP16 activations dominate because generation is memory-bandwidth bound, not compute bound.
* **Structured pruning for commodity hardware**: Unstructured sparsity requires specialized accelerators. Structured pruning (channels, heads) delivers real latency gains on GPUs.
* **Order matters when combining techniques**: Pruning before quantization is more effective; architecture changes should align with quantization constraints. Distillation can mitigate quantization accuracy loss.
* **Profile on target hardware, not paper metrics**: FLOPs and parameter count often mispredict real-world performance. A 2 $\times$ FLOP reduction may yield only 1.2 $\times$ speedup.
* **Hardware-aware design turns compression into real speedups**: Align pruning structure, quantization format, and operator choices with the capabilities of the target accelerator to convert theoretical savings into measured latency gains.

:::

The optimization techniques explored here (pruning, quantization, distillation, and architecture search) transform models from research artifacts into deployable systems. But even a perfectly compressed model remains a mathematical abstraction until it meets silicon. The natural question becomes: what hardware features exist to exploit these optimizations, and how do accelerator architectures turn theoretical compression ratios into real throughput gains?

::: {.callout-chapter-connection title="From Math to Physics"}

We have compressed the model's logic, shaving off every unnecessary bit. Logic, however, must eventually run on physics. We turn next to @sec-hardware-acceleration, where we explore how GPUs, TPUs, and NPUs are designed to exploit these optimizations and execute compressed models at maximum throughput.

:::

::: { .quiz-end }
:::