--- title: "Quantization: Not a Free Lunch" subtitle: "INT4 gives 4x speedup for decode. For training, it gives 0x." description: "Discover that quantization only helps when you are memory-bound. Compare the effect of INT4 on LLM decode (dramatic) vs. training (negligible). The regime determines whether the optimization works." categories: ["algorithm", "intermediate"] --- ## The Question INT4 quantization reduces model size by 4x. Intuitively, 4x fewer bytes should mean 4x faster. Blogs claim massive speedups. **Does INT4 always give 4x speedup?** You already have the tools to predict the answer. Before reading further, think about what Tutorial 1 taught you: *which ceiling determines performance?* If quantization reduces bytes but not FLOPs, when would it help — and when would it do nothing? This tutorial makes the prediction, runs the experiment, and reveals that the answer depends entirely on the regime. ::: {.callout-note} ## Prerequisites Complete [Tutorial 1: The Memory Wall](01_memory_wall.qmd) and [Tutorial 2: Two Phases, One Request](02_two_phases.qmd). You should understand memory-bound vs. compute-bound regimes and why LLM decode is memory-bound. ::: ::: {.callout-note} ## What You Will Learn - **Predict** whether quantization will help a given workload based on its regime - **Measure** the speedup from INT8 and INT4 for both memory-bound and compute-bound workloads - **Explain** why the same optimization yields 4x in one case and 0x in another - **Evaluate** the accuracy-compression trade-off using `CompressionModel` ::: ::: {.callout-tip} ## Background: What Quantization Actually Does Quantization reduces the number of bytes per parameter: | Precision | Bytes/Param | Relative to FP16 | |:----------|:------------|:-----------------| | FP16 | 2 | 1x (baseline) | | INT8 | 1 | 0.5x | | INT4 | 0.5 | 0.25x | For **memory-bound** workloads, performance scales with bytes loaded from HBM per step. Halving bytes halves latency. For **compute-bound** workloads, performance scales with FLOP/s. Fewer bytes does not change the number of FLOPs, so latency stays the same. The key question is always: *which ceiling am I hitting?* ::: --- ## 1. Setup ```{python} #| echo: false #| output: false import mlsysim # installed via `pip install mlsysim` (see workflow) Engine = mlsysim.Engine ``` ```python import mlsysim from mlsysim import ServingModel, SingleNodeModel, CompressionModel ``` --- ## 2. Memory-Bound Case: LLM Decode at Batch 1 LLM decoding at batch 1 is the textbook memory-bound workload. Each token generation must reload the entire model from HBM. Fewer bytes per parameter means fewer bytes to load means lower inter-token latency: ```{python} from mlsysim import ServingModel from mlsysim.show import table, info model = mlsysim.Models.Llama3_8B hardware = mlsysim.Hardware.Cloud.H100 solver = ServingModel() rows = [] baseline_itl = None for prec in ["fp16", "int8", "int4"]: r = solver.solve( model=model, hardware=hardware, seq_len=2048, batch_size=1, precision=prec ) itl_ms = r.itl.to("ms").magnitude if baseline_itl is None: baseline_itl = itl_ms speedup = baseline_itl / itl_ms rows.append([prec, r.itl.to('ms'), r.model_weights_size, f"{speedup:.1f}x"]) table(["Precision", "ITL", "Weights", "Speedup vs FP16"], rows) ``` INT8 gives roughly 2x speedup. INT4 gives roughly 4x. The speedup tracks the byte reduction almost exactly — because the workload is purely memory-bound. Every byte you eliminate directly reduces the time to load model weights. --- ## 3. 
---

## 3. Compute-Bound Case: Training at Large Batch

Now let's try the same optimization on a compute-bound workload — ResNet-50 training at batch 256 on the A100:

```{python}
from mlsysim import SingleNodeModel

train_model = mlsysim.Models.ResNet50
train_hw = mlsysim.Hardware.Cloud.A100
train_solver = SingleNodeModel()

rows = []
baseline_lat = None
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(
        model=train_model,
        hardware=train_hw,
        batch_size=256,
        precision=prec,
    )
    lat_ms = p.latency.to("ms").magnitude
    if baseline_lat is None:
        baseline_lat = lat_ms
    speedup = baseline_lat / lat_ms if lat_ms > 0 else 0
    rows.append([prec, p.latency.to('ms'), f"{p.throughput:.0f} img/s", f"{speedup:.1f}x"])

table(["Precision", "Latency", "Throughput", "Speedup"], rows)
```

The speedup is negligible. Why? Because at batch 256, ResNet-50 training is **compute-bound**. The bottleneck is arithmetic throughput (FLOP/s), not memory bandwidth. Reducing bytes per parameter does not change the number of FLOPs in the forward and backward passes. The GPU is already saturated with compute — loading weights faster does not help.

::: {.callout-warning}
## Nuance: INT8 Tensor Cores

In practice, GPUs with dedicated INT8/INT4 Tensor Cores (like A100 and H100) also gain *higher compute throughput* at lower precision — e.g., the A100 delivers 624 TOPS of INT8 vs. 312 TFLOP/s of FP16 Tensor Core throughput, a 2× compute boost. This means quantization simultaneously changes *both* the memory ceiling (fewer bytes) and the compute ceiling (more INT ops/sec). For workloads near the ridge point, this dual effect can shift the regime classification itself. MLSys·im's first-order model captures the memory effect; the compute boost is a second-order effect that depends on hardware-specific Tensor Core support.
:::

---

## 4. The Reveal: Same Optimization, Two Regimes

Let's put the results side by side to make the contrast stark:

```{python}
rows = []

# Memory-bound: LLM decode
decode_row = ["Llama-3 8B decode"]
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(model=model, hardware=hardware, seq_len=2048,
                     batch_size=1, precision=prec)
    decode_row.append(r.itl.to('ms'))
decode_row.append("Memory-bound")
rows.append(decode_row)

# Compute-bound: training
train_row = ["ResNet-50 train bs=256"]
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(model=train_model, hardware=train_hw,
                           batch_size=256, precision=prec)
    train_row.append(p.latency.to('ms'))
train_row.append("Compute-bound")
rows.append(train_row)

table(["Workload", "FP16", "INT8", "INT4", "Regime"], rows)
```

::: {.callout-important}
## Key Insight

**Quantization reduces bytes loaded from memory. If you are memory-bound, fewer bytes means proportional speedup. If you are compute-bound, fewer bytes means nothing — compute, not memory, is the ceiling.**

The regime determines whether the optimization works. INT4 gives ~4x for LLM decode (memory-bound) and ~0x for large-batch training (compute-bound). The same technique, applied to different regimes, yields completely different results. Always check the regime before choosing an optimization.
:::

---
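The decision rule in this callout is simple enough to write down directly. The helper below is a minimal sketch of that first-order reasoning; it is not part of the `mlsysim` API, and the arithmetic-intensity and ridge-point values in the usage lines are illustrative assumptions rather than measured figures.

```python
# Minimal sketch of the first-order rule (not part of mlsysim):
# compare arithmetic intensity to the hardware ridge point, then predict
# what byte reduction alone can buy.

def predicted_quant_speedup(arithmetic_intensity, ridge_point,
                            bytes_fp16=2.0, bytes_quant=0.5):
    """Memory-bound: speedup tracks the byte reduction.
    Compute-bound: fewer bytes change nothing, since the FLOPs stay the same."""
    if arithmetic_intensity < ridge_point:   # below the ridge: bandwidth is the ceiling
        return bytes_fp16 / bytes_quant
    return 1.0                               # above the ridge: compute is the ceiling

# Illustrative numbers only:
print(predicted_quant_speedup(arithmetic_intensity=2, ridge_point=300))    # decode-like: 4.0
print(predicted_quant_speedup(arithmetic_intensity=500, ridge_point=150))  # training-like: 1.0
```

Real hardware is less binary than this near the ridge point (and INT8/INT4 Tensor Cores raise the compute ceiling too, as noted above), but as a first cut it reproduces the comparison table: roughly 4x for decode, roughly 1x for large-batch training.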
## 5. The Accuracy Tax: CompressionModel

Quantization is not free — it trades accuracy for speed. The `CompressionModel` quantifies this trade-off:

```{python}
from mlsysim import CompressionModel

comp_solver = CompressionModel()

rows = []
for bits in [16, 8, 4]:
    c = comp_solver.solve(
        model=model,
        hardware=hardware,
        method="quantization",
        target_bitwidth=bits,
    )
    rows.append([
        bits,
        c.compressed_size_gb,
        f"{c.compression_ratio:.1f}x",
        f"{c.estimated_accuracy_delta:+.1%}",
        f"{c.memory_savings_pct:.1f}%",
    ])

table(["Bits", "Compressed Size", "Compression Ratio", "Accuracy Delta", "Memory Savings"], rows)
```

INT8 has minimal accuracy loss (< 1%). INT4 can degrade accuracy by 2-5% depending on the model and calibration method.

The decision is not "always quantize" — it is "quantize when you are memory-bound AND the accuracy cost is acceptable for your application."

::: {.callout-warning}
## When NOT to Quantize

- **Training**: You are compute-bound at large batch sizes. Quantization does not help and introduces gradient noise that can harm convergence.
- **High-accuracy applications**: Medical, financial, and safety-critical systems may not tolerate even 1% accuracy loss.
- **Already compute-bound inference**: If your inference workload runs at large batch sizes (e.g., offline batch processing), you are likely compute-bound already.
:::

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.** Llama-3 70B has nearly 9x more parameters than Llama-3 8B and is also memory-bound at batch 1. Before running code, predict: will INT4 give a *larger* or *smaller* speedup for the 70B model compared to the 8B? Write your prediction, then verify with `mlsysim.Models.Llama3_70B`. (Hint: think about what determines speedup in the memory-bound regime.)

**Exercise 2: Find the crossover batch size.** At some batch size, LLM inference transitions from memory-bound to compute-bound. At that point, quantization stops helping. Sweep batch sizes from 1 to 256 for Llama-3 8B on the H100 and compare FP16 vs. INT4 ITL. At what batch size does the INT4 speedup drop below 2x? Below 1.5x? (A starting point is sketched after this list.)

**Exercise 3: Accuracy-compression frontier.** Use `CompressionModel` to compare quantization (INT8, INT4) vs. **pruning** — a technique that removes parameters entirely (setting them to zero), reducing both model size and computation. Try sparsity levels of 0.5, 0.75, and 0.9 for Llama-3 8B. Build a table showing compression ratio vs. accuracy delta for each method. Which method gives the best compression-to-accuracy trade-off?

**Self-check:** A workload has arithmetic intensity of 5 FLOP/byte and the hardware ridge point is 150 FLOP/byte. Is this workload memory-bound or compute-bound? Will quantization help? (Answer: Memory-bound. Yes, quantization will help proportionally.)
:::
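If you want a starting point for Exercise 2, the sketch below reuses the same `solver.solve` call pattern from Section 2 and simply sweeps the batch size, printing the INT4 speedup at each point; where the crossover lands is left for you to interpret. It assumes the `solver`, `model`, and `hardware` objects defined earlier in this tutorial are still in scope.

```python
# Starting point for Exercise 2 (assumes solver, model, hardware from Section 2).
# Sweep the batch size and watch the FP16 -> INT4 speedup shrink as the
# workload crosses from memory-bound to compute-bound.
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    fp16 = solver.solve(model=model, hardware=hardware, seq_len=2048,
                        batch_size=bs, precision="fp16")
    int4 = solver.solve(model=model, hardware=hardware, seq_len=2048,
                        batch_size=bs, precision="int4")
    speedup = fp16.itl.to("ms").magnitude / int4.itl.to("ms").magnitude
    print(f"batch={bs:4d}  INT4 speedup vs FP16: {speedup:.2f}x")
```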
---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Quantization primarily reduces bytes loaded from memory**: it helps memory-bound workloads proportionally and compute-bound workloads negligibly (though dedicated INT8/INT4 Tensor Cores also increase compute throughput)
- **LLM decode at batch 1** is the ideal case for quantization: ~2x for INT8, ~4x for INT4
- **Large-batch training** is compute-bound: quantization provides near-zero speedup
- **The regime determines the outcome**: always check whether you are memory-bound or compute-bound before applying quantization
- **Accuracy is the tax**: INT8 costs < 1%, INT4 costs 2-5% — acceptable for some applications, not for others
:::

---

## Next Steps

- **[KV-Cache: The Hidden Tax](03_kv_cache.qmd)** — Quantization also shrinks the KV-cache, allowing more concurrent users
- **[The Memory Wall](01_memory_wall.qmd)** — Revisit the memory wall to see how quantization shifts the bandwidth bottleneck
- **[Starving the GPU](04_starving_the_gpu.qmd)** — Another case where the bottleneck is not where you expect
- **[Where to Invest](09_sensitivity.qmd)** — Quantify exactly how much quantization buys you compared to hardware upgrades