---
title: "Quantization - Reduced Precision for Efficiency"
description: "INT8 quantization fundamentals, calibration strategies, and accuracy-efficiency trade-offs"
difficulty: "⭐⭐⭐"
time_estimate: "5-6 hours"
prerequisites: ["Profiling"]
next_steps: ["Compression"]
learning_objectives:
- "Understand how quantization reduces memory by 4× through precision reduction from FP32 to INT8"
- "Implement symmetric and asymmetric quantization with scale and zero-point parameters"
- "Design calibration strategies using representative data to minimize accuracy degradation"
- "Measure the accuracy-efficiency frontier: when 1% accuracy loss justifies 4× memory savings"
- "Recognize quantization as educational foundation vs production INT8 hardware acceleration"
---

# 15. Quantization - Reduced Precision for Efficiency

**OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

## Overview

This module implements quantization fundamentals: converting FP32 tensors to INT8 representation to reduce memory by 4×. You'll build the mathematics of scale/zero-point quantization, implement quantized linear layers, and measure accuracy-efficiency trade-offs. CRITICAL HONESTY: you're implementing quantization math in Python, NOT actual hardware INT8 operations. This teaches the principles that enable TensorFlow Lite/PyTorch Mobile deployment, but real speedups require specialized hardware (Edge TPU, Neural Engine) or compiled frameworks with INT8 kernels. Your implementation will be 4× more memory-efficient but not faster; understanding WHY teaches you what production quantization frameworks must optimize.

## Learning Objectives

By the end of this module, you will be able to:

- **Quantization Mathematics**: Implement symmetric and asymmetric INT8 quantization with scale/zero-point parameter calculation
- **Calibration Strategies**: Design percentile-based calibration to minimize accuracy loss when selecting quantization parameters
- **Memory-Accuracy Trade-offs**: Measure when 4× memory reduction justifies 0.5-2% accuracy degradation for deployment
- **Production Reality**: Distinguish between educational quantization (Python simulation) vs production INT8 (hardware acceleration, kernel fusion)
- **When to Quantize**: Recognize deployment scenarios where quantization is mandatory (mobile/edge) vs optional (cloud serving)

## Build → Use → Optimize

This module follows TinyTorch's **Build → Use → Optimize** framework:

1. **Build**: Implement INT8 quantization/dequantization, calibration logic, QuantizedLinear layers
2. **Use**: Quantize trained models, measure accuracy degradation vs memory savings on MNIST/CIFAR
3. **Optimize**: Analyze the accuracy-efficiency frontier - when does quantization enable deployment vs hurt accuracy unacceptably?

## Implementation Guide

### Quantization Flow: FP32 → INT8

Quantization compresses weights by reducing precision, trading accuracy for memory efficiency:

```{mermaid}
graph LR
    A[FP32 Weight<br/>4 bytes<br/>-3.14159] --> B[Quantize<br/>scale + zero_point]
    B --> C[INT8 Weight<br/>1 byte<br/>-126]
    C --> D[Dequantize<br/>Inference]
    D --> E[FP32 Compute<br/>Result]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#ffe0b2
    style E fill:#f0fdf4
```

**Flow**: Original FP32 → Calibrate scale → Store as INT8 (4× smaller) → Dequantize for computation → FP32 result

### What You're Actually Building (Educational Quantization)

**Your Implementation:**
- Quantization math: FP32 → INT8 conversion with scale/zero-point
- QuantizedLinear: Store weights as INT8, compute in simulated quantized arithmetic
- Calibration: Find optimal scale parameters from representative data
- Memory measurement: Verify 4× reduction (32 bits → 8 bits)

**What You're NOT Building:**
- Actual INT8 hardware operations (requires CPU VNNI, ARM NEON, GPU Tensor Cores)
- Kernel fusion (eliminating quantize/dequantize overhead)
- Mixed-precision execution graphs (FP32 for sensitive ops, INT8 for matmul)
- Production deployment pipelines (TensorFlow Lite converter, ONNX Runtime optimization)

**Why This Matters:** Understanding quantization math is essential. But knowing that production speedups require hardware acceleration + compiler optimization prevents unrealistic expectations. Your 4× memory reduction is real; your lack of speedup teaches why TensorFlow Lite needs custom kernels.
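
As a quick sanity check on the memory claim, the snippet below (a minimal NumPy sketch, independent of the module's API) compares the storage footprint of the same number of parameters in FP32 and INT8:

```python
import numpy as np

# One million parameters stored at different precisions
fp32_weights = np.random.randn(1_000_000).astype(np.float32)
int8_weights = np.zeros(1_000_000, dtype=np.int8)  # placeholder for quantized values

print(f"FP32: {fp32_weights.nbytes / 1e6:.1f} MB")   # 4.0 MB
print(f"INT8: {int8_weights.nbytes / 1e6:.1f} MB")   # 1.0 MB
print(f"Reduction: {fp32_weights.nbytes / int8_weights.nbytes:.0f}×")  # 4×
```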

### Core Quantization Mathematics

**Symmetric Quantization (Zero-Point = 0)**

Assumes data is centered around zero (common after BatchNorm):

```python
import numpy as np

# Quantization: FP32 → INT8
scale = np.abs(tensor).max() / 127.0                                    # scale factor
quantized = np.clip(np.round(tensor / scale), -128, 127).astype(np.int8)

# Dequantization: INT8 → FP32
dequantized = quantized.astype(np.float32) * scale
```

- **Range**: INT8 is [-128, 127] (256 values)
- **Scale**: Maps the largest FP32 magnitude to 127
- **Zero-point**: Always 0 (symmetric around origin)
- **Use case**: Weights after normalization, activations after BatchNorm
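
Since the representable values are spaced `scale` apart, the worst-case rounding error per (unclipped) element follows directly from the scale formula; a worked bound:

$$
|x - \hat{x}| \;\le\; \frac{\text{scale}}{2} \;=\; \frac{\max|x|}{254}
$$

For example, if a weight tensor's largest magnitude is 3.0, then scale ≈ 0.0236 and the worst-case per-element error is about 0.012.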

**Asymmetric Quantization (With Zero-Point)**

Handles arbitrary data ranges (e.g., activations after ReLU: [0, max]):

```python
import numpy as np

# Quantization: FP32 → INT8
min_val, max_val = tensor.min(), tensor.max()
scale = (max_val - min_val) / 255.0                  # spread the range over 256 codes
zero_point = int(np.round(-128 - min_val / scale))   # INT8 code that represents FP32 zero
quantized = np.clip(np.round(tensor / scale + zero_point), -128, 127).astype(np.int8)

# Dequantization: INT8 → FP32
dequantized = (quantized.astype(np.float32) - zero_point) * scale
```

- **Range**: Uses the full [-128, 127] even if the data is [0, 5]
- **Scale**: Maps the data range to the INT8 range
- **Zero-point**: Offset ensuring FP32 zero maps to a specific INT8 value
- **Use case**: ReLU activations, input images, any non-centered data

**Trade-off:** Symmetric is simpler (no zero-point storage/computation); asymmetric uses the range more efficiently (better for skewed distributions).
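
To see the trade-off concretely, the sketch below (a self-contained NumPy experiment using the formulas above, not part of the module's API) quantizes ReLU-style non-negative data both ways and compares roundtrip error:

```python
import numpy as np

data = np.abs(np.random.randn(10_000).astype(np.float32)) * 2.0   # ReLU-like: all values ≥ 0

# Symmetric: zero-point fixed at 0, so the 128 negative codes go unused
s_scale = np.abs(data).max() / 127.0
s_q = np.clip(np.round(data / s_scale), -128, 127).astype(np.int8)
s_err = np.abs(data - s_q.astype(np.float32) * s_scale).mean()

# Asymmetric: scale/zero-point chosen so [min, max] spans all 256 codes
a_scale = (data.max() - data.min()) / 255.0
a_zp = int(np.round(-128 - data.min() / a_scale))
a_q = np.clip(np.round(data / a_scale) + a_zp, -128, 127).astype(np.int8)
a_err = np.abs(data - (a_q.astype(np.float32) - a_zp) * a_scale).mean()

print(f"Symmetric mean error:  {s_err:.5f}")   # larger: only ~128 codes cover the data
print(f"Asymmetric mean error: {a_err:.5f}")   # smaller: all 256 codes cover the data
```

With only non-negative data, symmetric quantization effectively discards the negative half of the range, so its step size (and error) is roughly twice as large.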

### Calibration - The Critical Step

Quantization quality depends entirely on scale/zero-point selection. Poor choices destroy accuracy.

**Naive Approach (Don't Do This):**
```python
# Use the global min/max of the data
scale = (data.max() - data.min()) / 255
# Problem: a single outlier wastes most of the INT8 range
# Example: data in [0, 5] but one outlier at 100 → scale = 100/255
# Result: 95% of the data maps to only ~13 INT8 values (5 / 100 * 255 ≈ 13)
```

**Calibration Approach (Correct):**
```python
import numpy as np

# Use percentile-based clipping
max_val = np.percentile(np.abs(calibration_data), 99.9)
scale = max_val / 127
# Clips the top 0.1% of outliers and uses the INT8 range efficiently
# The 99.9th percentile ignores rare outliers while preserving the typical range
```

**Calibration Process:**
1. Collect 100-1000 samples of representative data (validation set)
2. For each layer, record activation statistics during forward passes
3. Compute percentile-based min/max (typically the 99.9th percentile)
4. Calculate scale/zero-point from the clipped statistics
5. Quantize weights/activations using the calibrated parameters

**Why It Works:** Most activations follow roughly normal distributions. Outliers are rare but dominate the min/max. Clipping 0.1% of outliers uses the INT8 range 10-100× more efficiently with negligible accuracy loss.
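
A quick way to convince yourself: the sketch below (a self-contained NumPy experiment, not part of the module's API) injects a single outlier and compares roundtrip error under naive min/max scaling vs 99.9th-percentile clipping:

```python
import numpy as np

data = np.random.randn(100_000).astype(np.float32)
data[0] = 100.0                                    # one extreme outlier

def roundtrip_error(x, max_val):
    scale = max_val / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return np.abs(x - q.astype(np.float32) * scale).mean()

naive = roundtrip_error(data, np.abs(data).max())                    # scale dominated by the outlier
clipped = roundtrip_error(data, np.percentile(np.abs(data), 99.9))   # outlier saturates, rest is fine

print(f"Naive min/max error:     {naive:.4f}")
print(f"99.9th-percentile error: {clipped:.4f}")   # much smaller: the scale now matches the typical range
```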

### Per-Tensor vs Per-Channel Quantization

**Per-Tensor Quantization:**
- One scale/zero-point for the entire weight tensor
- Simple: store 2 parameters per layer
- Example: Conv2D with 64×3×3×3 weights uses 1 scale, 1 zero-point

**Per-Channel Quantization:**
- Separate scale/zero-point per output channel
- Better accuracy: each channel uses its natural range
- Example: Conv2D with 64 output channels uses 64 scales, 64 zero-points
- Overhead: 128 extra parameters (64 scales + 64 zero-points)

**When to Use Per-Channel** (a small sketch follows the table below):
- Weight magnitudes vary significantly across channels (common in Conv layers)
- Accuracy improvement (0.5-1.5%) justifies 0.1-0.5% memory overhead
- Production frameworks (PyTorch, TensorFlow Lite) default to per-channel for Conv/Linear

**Trade-off Table:**

| Quantization Scheme | Parameters | Accuracy | Complexity | Use Case |
|---------------------|------------|----------|------------|----------|
| Per-Tensor | 2 per layer | Baseline | Simple | Fast prototyping, small models |
| Per-Channel (Conv) | 2N (N=channels) | +0.5-1.5% | Medium | Production Conv layers |
| Per-Channel (Linear) | 2N (N=out_features) | +0.3-0.8% | Medium | Production Linear layers |
| Mixed (Conv per-channel, Linear per-tensor) | Hybrid | +0.4-1.2% | Medium | Balanced approach |
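
As a rough illustration of per-channel symmetric quantization (a NumPy sketch, not the module's required API), each output channel of a Conv2D weight tensor gets its own scale computed from that channel's maximum magnitude:

```python
import numpy as np

# Hypothetical Conv2D weight tensor: (out_channels, in_channels, kH, kW)
weights = np.random.randn(64, 3, 3, 3).astype(np.float32)

# One scale per output channel, computed from that channel's max |weight|
scales = np.abs(weights).max(axis=(1, 2, 3)) / 127.0          # shape: (64,)

# Broadcast each channel's scale across its own 3×3×3 block
q = np.clip(np.round(weights / scales[:, None, None, None]), -128, 127).astype(np.int8)
deq = q.astype(np.float32) * scales[:, None, None, None]
per_channel_err = np.abs(weights - deq).mean()

# Compare against a single per-tensor scale
t_scale = np.abs(weights).max() / 127.0
qt = np.clip(np.round(weights / t_scale), -128, 127).astype(np.int8)
per_tensor_err = np.abs(weights - qt.astype(np.float32) * t_scale).mean()

print(f"Per-tensor error:  {per_tensor_err:.5f}")
print(f"Per-channel error: {per_channel_err:.5f}")   # ≤ per-tensor; noticeably better when channel ranges differ
```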

### QuantizedLinear - Quantized Neural Network Layer

Replaces a regular Linear layer with a quantized equivalent:

```python
class QuantizedLinear:
    def __init__(self, linear_layer: Linear):
        # Quantize weights at initialization
        self.weights_int8, self.weight_scale, self.weight_zp = quantize_int8(linear_layer.weight)
        self.bias_int8, self.bias_scale, self.bias_zp = quantize_int8(linear_layer.bias)

        # Store original FP32 for accuracy comparison
        self.original_weight = linear_layer.weight

    def forward(self, x: Tensor) -> Tensor:
        # EDUCATIONAL VERSION: dequantize → compute in FP32
        # (Simulates the quantization math but doesn't speed up computation)
        weight_fp32 = dequantize_int8(self.weights_int8, self.weight_scale, self.weight_zp)
        bias_fp32 = dequantize_int8(self.bias_int8, self.bias_scale, self.bias_zp)

        # Compute in FP32 (not actually faster - just lower-precision storage)
        output = x @ weight_fp32.T + bias_fp32
        return output
```

**What Happens in Production (TensorFlow Lite, PyTorch Mobile):**

```python
# Production quantized matmul (conceptual - happens in C++/assembly)
def quantized_matmul_production(x_int8, weight_int8, x_scale, weight_scale, output_scale):
    # 1. INT8 x INT8 matmul using VNNI/NEON/Tensor Cores (FAST)
    accum_int32 = matmul_int8_hardware(x_int8, weight_int8)  # Specialized instruction

    # 2. Requantize accumulated INT32 → INT8 output
    combined_scale = (x_scale * weight_scale) / output_scale
    output_int8 = (accum_int32 * combined_scale).clip(-128, 127)

    # 3. Stay in INT8 for the next layer (no dequantization unless necessary)
    return output_int8
```

**Key Differences:**
- **Your implementation**: Dequantize → compute in FP32 (educational, slow)
- **Production**: INT8 → INT8 throughout, specialized hardware (4-10× speedup)

**Memory Savings (Real):** 4× reduction from storing INT8 instead of FP32
**Speed Improvement (Your Code):** ~0× (Python overhead dominates)
**Speed Improvement (Production):** 2-10× (hardware acceleration, kernel fusion)

### Model-Level Quantization

```python
def quantize_model(model, calibration_data=None):
    """
    Quantize all Linear layers in a model.

    Args:
        model: Neural network with Linear layers
        calibration_data: Representative samples for activation calibration

    Returns:
        quantized_layers: Layer list with Linear layers replaced by QuantizedLinear
        (scale/zero-point parameters live inside each QuantizedLinear)
    """
    quantized_layers = []
    for layer in model.layers:
        if isinstance(layer, Linear):
            q_layer = QuantizedLinear(layer)
            if calibration_data is not None:
                q_layer.calibrate(calibration_data)  # Find optimal scales
            quantized_layers.append(q_layer)
        else:
            quantized_layers.append(layer)  # Keep ReLU, Softmax in FP32

    return quantized_layers
```

**Calibration in Practice** (a sketch of a `calibrate` method follows this list):
1. Run 100-1000 samples through the original FP32 model
2. Record min/max activations for each layer
3. Compute percentile-clipped scales
4. Quantize weights with the calibrated parameters
5. Test accuracy on the validation set
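
The `calibrate` hook called in `quantize_model` is not prescribed here; one possible shape for it, assuming the calibration batch is a NumPy array and the `quantize_int8`/`dequantize_int8` helpers from earlier are available, is sketched below:

```python
import numpy as np

class QuantizedLinear:
    # ... __init__ and forward as above ...

    def calibrate(self, calibration_data):
        """Pick an activation scale from representative inputs (hypothetical sketch).

        Runs the calibration batch through the dequantized weights and derives a
        percentile-clipped symmetric scale for this layer's outputs.
        """
        weight_fp32 = dequantize_int8(self.weights_int8, self.weight_scale, self.weight_zp)
        activations = calibration_data @ weight_fp32.T           # FP32 forward pass
        max_val = np.percentile(np.abs(activations), 99.9)       # clip rare outliers
        self.activation_scale = max_val / 127.0                  # stored for later requantization
```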

## Getting Started

### Prerequisites

Ensure you've completed profiling fundamentals:

```bash
# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify prerequisite modules
tito test --module profiling
```

**Required Understanding:**
- Memory profiling (Module 14): Measuring memory consumption
- Tensor operations (Module 01): Understanding FP32 representation
- Linear layers (Module 03): Matrix multiplication mechanics

### Development Workflow

1. **Open the development file**: `modules/15_quantization/quantization_dev.py`
2. **Implement quantize_int8()**: FP32 → INT8 conversion with scale/zero-point calculation
3. **Implement dequantize_int8()**: INT8 → FP32 restoration
4. **Build QuantizedLinear**: Replace Linear layers with quantized versions
5. **Add calibration logic**: Percentile-based scale selection
6. **Implement quantize_model()**: Convert entire networks to quantized form
7. **Export and verify**: `tito module complete 15 && tito test --module quantization`

## Testing

### Comprehensive Test Suite

Run the full test suite to verify quantization functionality:

```bash
# TinyTorch CLI (recommended)
tito test --module quantization

# Direct pytest execution
python -m pytest tests/ -k quantization -v
```

### Test Coverage Areas

- ✅ **Quantization Correctness**: FP32 → INT8 → FP32 roundtrip error bounds (< 0.5% mean error)
- ✅ **Memory Reduction**: Verify 4× reduction in model size (weights + biases)
- ✅ **Symmetric vs Asymmetric**: Both schemes produce valid INT8 in [-128, 127]
- ✅ **Calibration Impact**: Percentile clipping reduces quantization error vs naive min/max
- ✅ **QuantizedLinear Equivalence**: Output matches FP32 Linear within tolerance (< 1% difference)
- ✅ **Model-Level Quantization**: Full network quantization preserves accuracy (< 2% degradation)

### Inline Testing & Quantization Analysis

The module includes comprehensive validation with real-time feedback:

```python
# Example inline test output
🔬 Unit Test: quantize_int8()...
✅ Symmetric quantization: range [-128, 127] ✓
✅ Scale calculation: max_val / 127 = 0.0234 ✓
✅ Roundtrip error: 0.31% mean error ✓
📈 Progress: quantize_int8() ✓

🔬 Unit Test: QuantizedLinear...
✅ Memory reduction: 145KB → 36KB (4.0×) ✓
✅ Output equivalence: 0.43% max difference vs FP32 ✓
📈 Progress: QuantizedLinear ✓
```

### Manual Testing Examples

```python
import numpy as np

from quantization_dev import quantize_int8, dequantize_int8, QuantizedLinear
from tinytorch.nn import Linear
# Tensor comes from your TinyTorch build - import it from wherever your package exposes it

# Test quantization on a random tensor
tensor = Tensor(np.random.randn(100, 100).astype(np.float32))
q_tensor, scale, zero_point = quantize_int8(tensor)

print(f"Original range: [{tensor.data.min():.2f}, {tensor.data.max():.2f}]")
print(f"Quantized range: [{q_tensor.data.min()}, {q_tensor.data.max()}]")
print(f"Scale: {scale:.6f}, Zero-point: {zero_point}")

# Dequantize and measure error
restored = dequantize_int8(q_tensor, scale, zero_point)
error = np.abs(tensor.data - restored.data).mean()
print(f"Roundtrip error: {error:.4f} ({error/np.abs(tensor.data).mean()*100:.2f}%)")

# Quantize a Linear layer
linear = Linear(128, 64)
q_linear = QuantizedLinear(linear)

print(f"\nOriginal weights: {linear.weight.data.nbytes} bytes")
print(f"Quantized weights: {q_linear.weights_int8.data.nbytes} bytes")
print(f"Reduction: {linear.weight.data.nbytes / q_linear.weights_int8.data.nbytes:.1f}×")
```

## Systems Thinking Questions

### Real-World Applications

- **Mobile ML Deployment**: TensorFlow Lite converts models to INT8 for Android/iOS. Without quantization, models exceed app size limits (100-200MB) and drain battery 4× faster. Google Photos, Translate, and Keyboard all run quantized models on-device.

- **Edge AI Devices**: Google Edge TPU (Coral), NVIDIA Jetson, and Intel Neural Compute Stick require INT8 models. The hardware is designed for quantized operations - FP32 either isn't supported or is roughly 10× slower.

- **Cloud Inference Optimization**: Accelerators such as AWS Inferentia and Google Cloud TPU serve quantized models. INT8 reduces memory bandwidth (the bottleneck for inference) and increases throughput by 2-4×. At scale (millions of requests/day), this saves millions in infrastructure costs.

- **Large Language Models**: LLaMA-65B is ~130GB in FP16 and doesn't fit on a single 80GB A100 GPU. INT8 quantization → ~65GB, which enables serving. GPTQ pushes to 4-bit (~33GB) with < 1% perplexity increase. Quantization is how enthusiasts run 70B models on consumer GPUs.

### Quantization Mathematics

- **Why INT8 vs INT4 or INT16?** INT8 is the sweet spot: 4× memory reduction with < 1% accuracy loss. INT4 gives 8× reduction but 2-5% accuracy loss (harder to deploy). INT16 gives only 2× reduction (not worth the complexity). Hardware acceleration (VNNI, NEON, Tensor Cores) standardized on INT8.

- **Symmetric vs Asymmetric Trade-offs**: Symmetric is simpler (no zero-point) but wastes range for skewed data. ReLU activations are [0, max] - symmetric centers around 0, wasting the negative range. Asymmetric uses the full INT8 range but costs extra zero-point storage and computation.

- **Calibration Data Requirements**: In theory, more data gives better statistics. In practice, returns diminish after 500-1000 samples because percentile estimates stabilize quickly. Critical requirement: calibration data MUST match the deployment distribution. If calibration uses ImageNet but deployment is medical images, quantization fails catastrophically.

- **Per-Channel Justification**: Conv2D with 64 output channels: per-channel stores 64 scales + 64 zero-points = 512 bytes. Total weights: 3×3×64×64 FP32 = 147KB. Overhead: 0.35%. Accuracy improvement: 0.5-1.5%. A clear win - which explains why production frameworks default to per-channel.

### Production Deployment Characteristics

- **Speed Reality Check**: INT8 matmul is theoretically 4× faster (4× less memory bandwidth). In practice: 2-3× on CPU (quantize/dequantize overhead), 4-10× on specialized hardware (Edge TPU, Neural Engine designed for pure INT8 graphs). Your Python implementation is 0× faster (simulation overhead exceeds bandwidth savings).

- **When Quantization is Mandatory**: Mobile deployment (app size limits, battery constraints, Neural Engine acceleration), edge devices (limited memory/compute), cloud serving at scale (cost optimization). Not negotiable - models either quantize or don't ship.

- **When to Avoid Quantization**: Accuracy-critical applications where 1% matters (medical diagnosis, autonomous vehicles), early research iteration (quantization adds complexity), models that are already tiny (< 10MB - the overhead isn't worth it), cloud serving with abundant resources (FP32 throughput is sufficient).

- **Quantization-Aware Training vs Post-Training**: PTQ (Post-Training Quantization) is fast (minutes) but loses 1-2% accuracy. QAT (Quantization-Aware Training) requires retraining (days/weeks) but loses < 0.5%. Choose PTQ for rapid iteration, QAT for production deployment. If you're using pretrained models you don't own (BERT, ResNet), PTQ is the only option.
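
For intuition on what QAT changes, its core trick is "fake quantization": the forward pass snaps values to the INT8 grid while the weights stay FP32 for gradient updates, so training learns to tolerate the rounding error. A minimal sketch of that operation (symmetric scheme, NumPy only, not the module's API):

```python
import numpy as np

def fake_quantize(x, scale):
    """Simulate INT8 rounding in FP32 so training 'sees' quantization error (illustrative sketch)."""
    q = np.clip(np.round(x / scale), -128, 127)   # snap to the INT8 grid
    return q * scale                               # immediately return to FP32

w = np.random.randn(256, 256).astype(np.float32)
scale = np.abs(w).max() / 127.0
w_for_forward = fake_quantize(w, scale)   # used in the QAT forward pass; the optimizer still updates FP32 w
```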

## Ready to Build?

You're about to implement the precision-reduction mathematics that make mobile ML deployment possible. Quantization is the difference between a model that exists in research and a model that ships in apps used by billions.

This module teaches honest quantization: you'll implement the math correctly, achieve 4× memory reduction, and understand precisely why your Python code isn't faster (hardware acceleration requires specialized silicon + compiled kernels). This clarity prepares you for production deployment, where TensorFlow Lite, PyTorch Mobile, and ONNX Runtime apply your quantization mathematics with real INT8 hardware operations.

Understanding quantization from first principles - implementing the scale/zero-point calculations yourself, calibrating with real data, measuring accuracy-efficiency trade-offs - gives you deep insight into the constraints that define production ML systems.

Choose your preferred way to engage with this module:

````{grid} 1 2 3 3

```{grid-item-card} Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/15_quantization/quantization_dev.ipynb
:class-header: bg-light

Run this module interactively in your browser. No installation required.
```

```{grid-item-card} Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.ipynb
:class-header: bg-light

Use Google Colab for GPU access and cloud compute power.
```

```{grid-item-card} View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.py
:class-header: bg-light

Browse the Python source code and understand the implementation.
```

````

```{admonition} Save Your Progress
:class: tip
Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work.
```

---

<div class="prev-next-area">
<a class="left-prev" href="../modules/14_profiling/ABOUT.html" title="previous page">← Module 14: Profiling</a>
<a class="right-next" href="../modules/16_compression/ABOUT.html" title="next page">Module 16: Compression →</a>
</div>