---
title: "Quantization: Not a Free Lunch"
subtitle: "INT4 gives a 4x speedup for decode. For training, it gives almost none."
description: "Discover that quantization only helps when you are memory-bound. Compare the effect of INT4 on LLM decode (dramatic) vs. training (negligible). The regime determines whether the optimization works."
categories: ["algorithm", "intermediate"]
---

## The Question

INT4 quantization reduces model size by 4x relative to FP16. Intuitively, 4x fewer bytes
should mean 4x faster. Blogs claim massive speedups. **Does INT4 always give a 4x speedup?**

You already have the tools to predict the answer. Before reading further, think about
what Tutorial 1 taught you: *which ceiling determines performance?* If quantization
reduces bytes but not FLOPs, when would it help, and when would it do nothing?

This tutorial makes the prediction, runs the experiment, and shows that the answer
depends entirely on the regime.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 1: The Memory Wall](01_memory_wall.qmd) and
[Tutorial 2: Two Phases, One Request](02_two_phases.qmd). You should understand
memory-bound vs. compute-bound regimes and why LLM decode is memory-bound.
:::

::: {.callout-note}
## What You Will Learn

- **Predict** whether quantization will help a given workload based on its regime
- **Measure** the speedup from INT8 and INT4 for both memory-bound and compute-bound workloads
- **Explain** why the same optimization yields a 4x speedup in one case and no speedup in another
- **Evaluate** the accuracy-compression trade-off using `CompressionModel`
:::

::: {.callout-tip}
## Background: What Quantization Actually Does

Quantization reduces the number of bytes per parameter:

| Precision | Bytes/Param | Relative to FP16 |
|:----------|:------------|:-----------------|
| FP16      | 2           | 1x (baseline)    |
| INT8      | 1           | 0.5x             |
| INT4      | 0.5         | 0.25x            |

For **memory-bound** workloads, performance scales with bytes loaded from HBM per step.
Halving bytes halves latency. For **compute-bound** workloads, performance scales with
FLOP/s. Fewer bytes do not change the number of FLOPs, so latency stays the same.

The key question is always: *which ceiling am I hitting?*
:::

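The two cases in the callout above can be sketched with a first-order roofline estimate, where step latency is the larger of compute time and memory time. This is a minimal illustration in plain Python, not the `mlsysim` implementation: the function name `roofline_latency`, the device numbers, and both workloads are hypothetical values chosen only to make the contrast visible.

```python
def roofline_latency(flops, bytes_moved, peak_flops, bandwidth):
    """Return (latency in seconds, regime) under a first-order roofline model."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    if memory_time > compute_time:
        return memory_time, "memory-bound"
    return compute_time, "compute-bound"


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

# Hypothetical workloads: same FP16 traffic, very different amounts of arithmetic.
workloads = {
    "low intensity": {"flops": 2e10, "fp16_bytes": 2e10},   # 1 FLOP/byte
    "high intensity": {"flops": 2e13, "fp16_bytes": 2e10},  # 1000 FLOP/byte
}

# Quantization scales the bytes moved; the FLOP count is left unchanged.
BYTE_SCALE = {"fp16": 1.0, "int8": 0.5, "int4": 0.25}

results = {}
for name, w in workloads.items():
    for prec, scale in BYTE_SCALE.items():
        latency, regime = roofline_latency(
            w["flops"], w["fp16_bytes"] * scale, PEAK_FLOPS, BANDWIDTH
        )
        results[(name, prec)] = (latency, regime)
        print(f"{name:>14} {prec}: {latency * 1e3:7.2f} ms ({regime})")
```
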

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import sys, os, importlib.util

# Locate the repository root that contains the local mlsysim package.
current_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(current_dir, "../../../"))
if not os.path.exists(os.path.join(root_path, "mlsysim")):
    root_path = os.path.abspath("../../")
package_path = os.path.join(root_path, "mlsysim")
init_file = os.path.join(package_path, "__init__.py")

# Load mlsysim from source and register it so later imports resolve to it.
spec = importlib.util.spec_from_file_location("mlsysim", init_file)
mlsysim = importlib.util.module_from_spec(spec)
sys.modules["mlsysim"] = mlsysim
spec.loader.exec_module(mlsysim)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import ServingModel, SingleNodeModel, CompressionModel
```

---

## 2. Memory-Bound Case: LLM Decode at Batch 1

LLM decoding at batch 1 is the textbook memory-bound workload. Generating each token
requires reading all of the model weights from HBM, so fewer bytes per parameter means
fewer bytes to load and therefore lower inter-token latency (ITL):

```{python}
from mlsysim import ServingModel
from mlsysim.show import table, info

model = mlsysim.Models.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100
solver = ServingModel()

rows = []
baseline_itl = None  # FP16 inter-token latency, set on the first iteration
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(
        model=model, hardware=hardware,
        seq_len=2048, batch_size=1, precision=prec
    )
    itl_ms = r.itl.to("ms").magnitude
    if baseline_itl is None:
        baseline_itl = itl_ms
    speedup = baseline_itl / itl_ms  # relative to FP16
    rows.append([prec, r.itl.to('ms'), r.model_weights_size, f"{speedup:.1f}x"])

table(["Precision", "ITL", "Weights", "Speedup vs FP16"], rows)
```

INT8 gives roughly a 2x speedup. INT4 gives roughly 4x. The speedup tracks the byte
reduction almost exactly because the workload is memory-bound: every byte you eliminate
directly reduces the time to load the model weights.

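A back-of-envelope check shows why the speedup follows the byte count. The sketch below estimates only the time to stream the weights from HBM once per token. The parameter count (about 8 billion) and the bandwidth (about 3.35 TB/s, roughly an H100 SXM) are assumptions for illustration, and the estimate ignores KV-cache reads and kernel overhead, so it will not match the simulator's ITL exactly.

```python
PARAMS = 8e9             # assumed: about 8 billion weights
HBM_BANDWIDTH = 3.35e12  # assumed: bytes/s, roughly an H100 SXM
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_load_time_ms(precision):
    """Time to stream all weights from HBM once, in milliseconds."""
    weight_bytes = PARAMS * BYTES_PER_PARAM[precision]
    return weight_bytes / HBM_BANDWIDTH * 1e3


baseline = weight_load_time_ms("fp16")
estimates = {}
for prec in BYTES_PER_PARAM:
    t = weight_load_time_ms(prec)
    estimates[prec] = t
    size_gb = PARAMS * BYTES_PER_PARAM[prec] / 1e9
    print(f"{prec}: {size_gb:.0f} GB, {t:.2f} ms/token, {baseline / t:.1f}x vs FP16")
```
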

---

## 3. Compute-Bound Case: Training at Large Batch

Now try the same optimization on a compute-bound workload, ResNet-50 training
at batch size 256 on the A100:

```{python}
from mlsysim import SingleNodeModel

train_model = mlsysim.Models.ResNet50
train_hw = mlsysim.Hardware.Cloud.A100
train_solver = SingleNodeModel()

rows = []
baseline_lat = None  # FP16 step latency, set on the first iteration
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(
        model=train_model, hardware=train_hw,
        batch_size=256, precision=prec
    )
    lat_ms = p.latency.to("ms").magnitude
    if baseline_lat is None:
        baseline_lat = lat_ms
    # Guard against a zero latency before dividing.
    speedup = baseline_lat / lat_ms if lat_ms > 0 else 0
    rows.append([prec, p.latency.to('ms'), f"{p.throughput:.0f} img/s", f"{speedup:.1f}x"])

table(["Precision", "Latency", "Throughput", "Speedup"], rows)
```

The speedup is negligible. Why? Because at batch 256, ResNet-50 training is
**compute-bound**. The bottleneck is arithmetic throughput (FLOP/s), not memory bandwidth.
Reducing bytes per parameter does not change the number of FLOPs in the forward and backward
passes. The GPU is already saturated with compute, so loading weights faster does not help.

::: {.callout-warning}
## Nuance: INT8 Tensor Cores

In practice, GPUs whose Tensor Cores support low-precision integer math (such as INT8 on
the A100 and H100) also gain *higher compute throughput* at lower precision. For example,
the A100 peaks at 624 TOPS for dense INT8 vs. 312 TFLOP/s for dense FP16, a 2× compute
boost. This means quantization can change *both* the memory ceiling (fewer bytes) and the
compute ceiling (more integer ops/sec), provided the kernels actually run in integer
arithmetic. For workloads near the ridge point, this dual effect can shift the regime
classification itself. MLSys·im's first-order model captures the memory effect; the compute
boost is a second-order effect that depends on hardware-specific Tensor Core support.
:::

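To see how much the second-order effect could matter, the sketch below compares a memory-only model of INT8 with one that also doubles the compute ceiling. It uses the A100 peak rates quoted in the callout. The bandwidth figure (1.555 TB/s, the 40 GB A100) and the workload (1e15 ops, 1e12 bytes at FP16) are assumptions for illustration, and the sketch presumes every operation can run on the INT8 path, which real kernels may not satisfy.

```python
def step_time(ops, bytes_moved, peak_ops, bandwidth):
    """First-order roofline: the slower of compute and memory sets the step time."""
    return max(ops / peak_ops, bytes_moved / bandwidth)


A100_BANDWIDTH = 1.555e12                     # assumed: bytes/s (40 GB A100)
A100_PEAK = {"fp16": 312e12, "int8": 624e12}  # dense Tensor Core peak, ops/s

# The ridge point (ops per byte) doubles when the compute ceiling doubles.
ridge = {prec: peak / A100_BANDWIDTH for prec, peak in A100_PEAK.items()}

# Hypothetical compute-bound workload: 1000 ops per byte at FP16.
OPS, FP16_BYTES = 1e15, 1e12

fp16 = step_time(OPS, FP16_BYTES, A100_PEAK["fp16"], A100_BANDWIDTH)
# Memory effect only: half the bytes, same compute ceiling.
int8_memory_only = step_time(OPS, FP16_BYTES * 0.5, A100_PEAK["fp16"], A100_BANDWIDTH)
# Both effects: half the bytes and the INT8 compute ceiling.
int8_both = step_time(OPS, FP16_BYTES * 0.5, A100_PEAK["int8"], A100_BANDWIDTH)

print(f"ridge point: fp16 {ridge['fp16']:.0f}, int8 {ridge['int8']:.0f} ops/byte")
print(f"INT8 speedup, memory effect only: {fp16 / int8_memory_only:.1f}x")
print(f"INT8 speedup, both ceilings:      {fp16 / int8_both:.1f}x")
```
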

---

## 4. The Reveal: Same Optimization, Two Regimes

Let's put the results side by side to make the contrast stark:

```{python}
rows = []

# Memory-bound: LLM decode
decode_row = ["Llama-3 8B decode"]
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(model=model, hardware=hardware, seq_len=2048, batch_size=1, precision=prec)
    decode_row.append(r.itl.to('ms'))
decode_row.append("Memory-bound")
rows.append(decode_row)

# Compute-bound: training
train_row = ["ResNet-50 train bs=256"]
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(model=train_model, hardware=train_hw, batch_size=256, precision=prec)
    train_row.append(p.latency.to('ms'))
train_row.append("Compute-bound")
rows.append(train_row)

table(["Workload", "FP16", "INT8", "INT4", "Regime"], rows)
```

::: {.callout-important}
## Key Insight

**Quantization reduces bytes loaded from memory. If you are memory-bound, fewer bytes
means proportional speedup. If you are compute-bound, fewer bytes means nothing, because
compute, not memory, is the ceiling.** The regime determines whether the optimization works.
INT4 gives ~4x for LLM decode (memory-bound) and ~1x, meaning no speedup, for large-batch
training (compute-bound). The same technique, applied to different regimes, yields
completely different results. Always check the regime before choosing an optimization.
:::

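The insight can be written as a closed form under the same first-order roofline assumptions. The symbols here are introduced only for this sketch: $F$ is the FLOP count, $B$ the bytes moved at FP16, $P$ the peak compute rate, $W$ the memory bandwidth, and $k$ the byte-reduction factor ($k = 4$ for INT4 relative to FP16). It assumes all memory traffic shrinks by $k$ and that the compute ceiling is unchanged.

$$
T_{\text{FP16}} = \max\!\left(\frac{F}{P},\ \frac{B}{W}\right),
\qquad
T_{k} = \max\!\left(\frac{F}{P},\ \frac{B}{kW}\right)
$$

With arithmetic intensity $I = F/B$ and ridge point $I^{*} = P/W$, the speedup is

$$
\frac{T_{\text{FP16}}}{T_{k}} =
\begin{cases}
k, & I \le I^{*}/k \\
I^{*}/I, & I^{*}/k < I < I^{*} \\
1, & I \ge I^{*}
\end{cases}
$$

The full factor $k$ appears only when the workload stays memory-bound after quantization. Between $I^{*}/k$ and $I^{*}$ the workload crosses into the compute-bound regime partway, so the gain is partial.
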

---

## 5. The Accuracy Tax: CompressionModel

Quantization is not free: it trades accuracy for speed. The `CompressionModel` quantifies
this trade-off:

```{python}
from mlsysim import CompressionModel

comp_solver = CompressionModel()

rows = []
for bits in [16, 8, 4]:
    c = comp_solver.solve(
        model=model, hardware=hardware,
        method="quantization", target_bitwidth=bits
    )
    rows.append([
        bits,
        c.compressed_size_gb,
        f"{c.compression_ratio:.1f}x",
        f"{c.estimated_accuracy_delta:+.1%}",
        f"{c.memory_savings_pct:.1f}%"
    ])

table(["Bits", "Compressed", "Compression", "Accuracy", "Savings"], rows)
```

INT8 typically has minimal accuracy loss (< 1%). INT4 can degrade accuracy by 2-5%, depending
on the model and calibration method. The decision is not "always quantize." It is "quantize
when you are memory-bound AND the accuracy cost is acceptable for your application."

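The decision rule above can be written as a small predicate. This is a hypothetical helper, not part of `mlsysim`: the function name, its parameters, and the example thresholds are all assumptions made for illustration.

```python
def should_quantize(arithmetic_intensity, ridge_point,
                    expected_accuracy_drop, max_accuracy_drop):
    """Return True only if the workload is memory-bound AND the accuracy cost is tolerable."""
    memory_bound = arithmetic_intensity < ridge_point
    accuracy_ok = expected_accuracy_drop <= max_accuracy_drop
    return memory_bound and accuracy_ok


# Hypothetical cases: intensities in FLOP/byte, accuracy drops as fractions.
cases = {
    "memory-bound, cheap": should_quantize(5, 150, 0.01, 0.02),
    "compute-bound, cheap": should_quantize(500, 150, 0.01, 0.02),
    "memory-bound, costly": should_quantize(5, 150, 0.05, 0.02),
}
for label, decision in cases.items():
    print(f"{label}: {'quantize' if decision else 'do not quantize'}")
```

Only the first case passes both checks. The other two fail on a single condition each, which is why the rule uses AND and not OR.
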
::: {.callout-warning}
## When NOT to Quantize

- **Training**: You are compute-bound at large batch sizes. Quantization does not help and
  introduces gradient noise that can harm convergence.
- **High-accuracy applications**: Medical, financial, and safety-critical systems may not
  tolerate even 1% accuracy loss.
- **Already compute-bound inference**: If your inference workload runs at large batch sizes
  (e.g., offline batch processing), you are likely compute-bound already.
:::


---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
Llama-3 70B has roughly 9x as many parameters as Llama-3 8B, so every decode step at
batch 1 loads far more bytes. Before running code, predict: will INT4 give a *larger* or
*smaller* speedup for the 70B model compared to the 8B? Write your prediction, then verify
with `mlsysim.Models.Llama3_70B`. (Hint: think about what determines speedup in the
memory-bound regime.)

**Exercise 2: Find the crossover batch size.**
At some batch size, LLM inference transitions from memory-bound to compute-bound. At that
point, quantization stops helping. Sweep batch sizes from 1 to 256 for Llama-3 8B on the
H100 and compare FP16 vs. INT4 ITL. At what batch size does the INT4 speedup drop below
2x? Below 1.5x?

**Exercise 3: Accuracy-compression frontier.**
Use `CompressionModel` to compare quantization (INT8, INT4) vs. **pruning**, a technique
that removes parameters entirely (setting them to zero), reducing both model size and
computation. Try sparsity levels of 0.5, 0.75, and 0.9 for Llama-3 8B. Build a table
showing compression ratio vs. accuracy delta for each method. Which method gives the best
compression-to-accuracy trade-off?

**Self-check:** A workload has an arithmetic intensity of 5 FLOP/byte and the hardware ridge
point is 150 FLOP/byte. Is this workload memory-bound or compute-bound? Will quantization
help? (Answer: Memory-bound. Yes, quantization will help proportionally.)
:::


---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Quantization primarily reduces bytes loaded from memory**: it helps memory-bound workloads proportionally and compute-bound workloads negligibly (though dedicated INT8/INT4 Tensor Cores can also increase compute throughput)
- **LLM decode at batch 1** is the ideal case for quantization: ~2x for INT8, ~4x for INT4
- **Large-batch training** is compute-bound: quantization provides near-zero speedup
- **The regime determines the outcome**: always check whether you are memory-bound or compute-bound before applying quantization
- **Accuracy is the tax**: INT8 typically costs < 1% and INT4 2-5%, which is acceptable for some applications and not for others
:::


---

## Next Steps

- **[KV-Cache: The Hidden Tax](03_kv_cache.qmd)**: Quantization also shrinks the KV-cache, allowing more concurrent users
- **[The Memory Wall](01_memory_wall.qmd)**: Revisit the memory wall to see how quantization shifts the bandwidth bottleneck
- **[Starving the GPU](04_starving_the_gpu.qmd)**: Another case where the bottleneck is not where you expect
- **[Where to Invest](09_sensitivity.qmd)**: Quantify exactly how much quantization buys you compared to hardware upgrades