mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-03-09 21:12:00 -05:00

Files

Vijay Janapa Reddi 65c973fac1 Update module documentation: enhance ABOUT.md files across all modules

- Improve module descriptions and learning objectives
- Standardize documentation format and structure
- Add clearer guidance for students
- Enhance module-specific context and examples

2025-11-13 10:42:47 -05:00

20 KiB

Raw Permalink Blame History

title, description, difficulty, time_estimate, prerequisites, next_steps, learning_objectives

title

description

difficulty

time_estimate

prerequisites

next_steps

learning_objectives

Quantization - Reduced Precision for Efficiency

INT8 quantization fundamentals, calibration strategies, and accuracy-efficiency trade-offs

⭐⭐⭐

5-6 hours

Profiling

Compression

Understand how quantization reduces memory by 4× through precision reduction from FP32 to INT8

Implement symmetric and asymmetric quantization with scale and zero-point parameters

Design calibration strategies using representative data to minimize accuracy degradation

Measure the accuracy-efficiency frontier: when 1% accuracy loss justifies 4× memory savings

Recognize quantization as educational foundation vs production INT8 hardware acceleration

15. Quantization - Reduced Precision for Efficiency

OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

Overview

This module implements quantization fundamentals: converting FP32 tensors to INT8 representation to reduce memory by 4×. You'll build the mathematics of scale/zero-point quantization, implement quantized linear layers, and measure accuracy-efficiency trade-offs. CRITICAL HONESTY: You're implementing quantization math in Python, NOT actual hardware INT8 operations. This teaches the principles that enable TensorFlow Lite/PyTorch Mobile deployment, but real speedups require specialized hardware (Edge TPU, Neural Engine) or compiled frameworks with INT8 kernels. Your implementation will be 4× more memory-efficient but not faster - understanding WHY teaches you what production quantization frameworks must optimize.

Learning Objectives

By the end of this module, you will be able to:

Quantization Mathematics: Implement symmetric and asymmetric INT8 quantization with scale/zero-point parameter calculation
Calibration Strategies: Design percentile-based calibration to minimize accuracy loss when selecting quantization parameters
Memory-Accuracy Trade-offs: Measure when 4× memory reduction justifies 0.5-2% accuracy degradation for deployment
Production Reality: Distinguish between educational quantization (Python simulation) vs production INT8 (hardware acceleration, kernel fusion)
When to Quantize: Recognize deployment scenarios where quantization is mandatory (mobile/edge) vs optional (cloud serving)

Build → Use → Optimize

This module follows TinyTorch's Build → Use → Optimize framework:

Build: Implement INT8 quantization/dequantization, calibration logic, QuantizedLinear layers
Use: Quantize trained models, measure accuracy degradation vs memory savings on MNIST/CIFAR
Optimize: Analyze the accuracy-efficiency frontier - when does quantization enable deployment vs hurt accuracy unacceptably?

Implementation Guide

Quantization Flow: FP32 → INT8

Quantization compresses weights by reducing precision, trading accuracy for memory efficiency:

graph LR
    A[FP32 Weight<br/>4 bytes<br/>-3.14159] --> B[Quantize<br/>scale + zero_point]
    B --> C[INT8 Weight<br/>1 byte<br/>-126]
    C --> D[Dequantize<br/>Inference]
    D --> E[FP32 Compute<br/>Result]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#ffe0b2
    style E fill:#f0fdf4

Flow: Original FP32 → Calibrate scale → Store as INT8 (4× smaller) → Dequantize for computation → FP32 result

What You're Actually Building (Educational Quantization)

Your Implementation:

Quantization math: FP32 → INT8 conversion with scale/zero-point
QuantizedLinear: Store weights as INT8, compute in simulated quantized arithmetic
Calibration: Find optimal scale parameters from representative data
Memory measurement: Verify 4× reduction (32 bits → 8 bits)

What You're NOT Building:

Actual INT8 hardware operations (requires CPU VNNI, ARM NEON, GPU Tensor Cores)
Kernel fusion (eliminating quantize/dequantize overhead)
Mixed-precision execution graphs (FP32 for sensitive ops, INT8 for matmul)
Production deployment pipelines (TensorFlow Lite converter, ONNX Runtime optimization)

Why This Matters: Understanding quantization math is essential. But knowing that production speedups require hardware acceleration + compiler optimization prevents unrealistic expectations. Your 4× memory reduction is real; your lack of speedup teaches why TensorFlow Lite needs custom kernels.

Core Quantization Mathematics

Symmetric Quantization (Zero-Point = 0)

Assumes data is centered around zero (common after BatchNorm):

# Quantization: FP32 → INT8
scale = max(abs(tensor)) / 127.0  # Scale factor
quantized = round(tensor / scale).clip(-128, 127).astype(int8)

# Dequantization: INT8 → FP32
dequantized = quantized.astype(float32) * scale

Range: INT8 is [-128, 127] (256 values)
Scale: Maps largest FP32 value to 127
Zero-point: Always 0 (symmetric around origin)
Use case: Weights after normalization, activations after BatchNorm

Asymmetric Quantization (With Zero-Point)

Handles arbitrary data ranges (e.g., activations after ReLU: [0, max]):

# Quantization: FP32 → INT8
min_val, max_val = tensor.min(), tensor.max()
scale = (max_val - min_val) / 255.0
zero_point = round(-min_val / scale)
quantized = round(tensor / scale + zero_point).clip(-128, 127).astype(int8)

# Dequantization: INT8 → FP32
dequantized = (quantized.astype(float32) - zero_point) * scale

Range: Uses full [-128, 127] even if data is [0, 5]
Scale: Maps data range to INT8 range
Zero-point: Offset ensuring FP32 zero maps to specific INT8 value
Use case: ReLU activations, input images, any non-centered data

Trade-off: Symmetric is simpler (no zero-point storage/computation), asymmetric uses range more efficiently (better for skewed distributions).

Calibration - The Critical Step

Quantization quality depends entirely on scale/zero-point selection. Poor choices destroy accuracy.

Naive Approach (Don't Do This):

# Use global min/max from training data
scale = (tensor_max - tensor_min) / 255
# Problem: Single outlier wastes most INT8 range
# Example: data in [0, 5] but one outlier at 100 → scale = 100/255
# Result: 95% of data maps to only 13 INT8 values (5/100 * 255 = 13)

Calibration Approach (Correct):

# Use percentile-based clipping
max_val = np.percentile(np.abs(calibration_data), 99.9)
scale = max_val / 127
# Clips 0.1% outliers, uses INT8 range efficiently
# 99.9th percentile ignores rare outliers, preserves typical range

Calibration Process:

Collect 100-1000 samples of representative data (validation set)
For each layer, record activation statistics during forward passes
Compute percentile-based min/max (typically 99.9th percentile)
Calculate scale/zero-point from clipped statistics
Quantize weights/activations using calibrated parameters

Why It Works: Most activations follow normal-ish distributions. Outliers are rare but dominate min/max. Clipping 0.1% of outliers uses INT8 range 10-100× more efficiently with negligible accuracy loss.

Per-Tensor vs Per-Channel Quantization

Per-Tensor Quantization:

One scale/zero-point for entire weight tensor
Simple: store 2 parameters per layer
Example: Conv2D with 64×3×3×3 weights uses 1 scale, 1 zero-point

Per-Channel Quantization:

Separate scale/zero-point per output channel
Better accuracy: each channel uses its natural range
Example: Conv2D with 64 output channels uses 64 scales, 64 zero-points
Overhead: 128 extra parameters (64 scales + 64 zero-points)

When to Use Per-Channel:

Weight magnitudes vary significantly across channels (common in Conv layers)
Accuracy improvement (0.5-1.5%) justifies 0.1-0.5% memory overhead
Production frameworks (PyTorch, TensorFlow Lite) default to per-channel for Conv/Linear

Trade-off Table:

Quantization Scheme	Parameters	Accuracy	Complexity	Use Case
Per-Tensor	2 per layer	Baseline	Simple	Fast prototyping, small models
Per-Channel (Conv)	2N (N=channels)	+0.5-1.5%	Medium	Production Conv layers
Per-Channel (Linear)	2N (N=out_features)	+0.3-0.8%	Medium	Production Linear layers
Mixed (Conv per-channel, Linear per-tensor)	Hybrid	+0.4-1.2%	Medium	Balanced approach

QuantizedLinear - Quantized Neural Network Layer

Replaces regular Linear layer with quantized equivalent:

class QuantizedLinear:
    def __init__(self, linear_layer: Linear):
        # Quantize weights at initialization
        self.weights_int8, self.weight_scale, self.weight_zp = quantize_int8(linear_layer.weight)
        self.bias_int8, self.bias_scale, self.bias_zp = quantize_int8(linear_layer.bias)

        # Store original FP32 for accuracy comparison
        self.original_weight = linear_layer.weight

    def forward(self, x: Tensor) -> Tensor:
        # EDUCATIONAL VERSION: Dequantize → compute in FP32 → quantize result
        # (Simulates quantization math but doesn't speed up computation)
        weight_fp32 = dequantize_int8(self.weights_int8, self.weight_scale, self.weight_zp)
        bias_fp32 = dequantize_int8(self.bias_int8, self.bias_scale, self.bias_zp)

        # Compute in FP32 (not actually faster - just lower precision storage)
        output = x @ weight_fp32.T + bias_fp32
        return output

What Happens in Production (TensorFlow Lite, PyTorch Mobile):

# Production quantized matmul (conceptual - happens in C++/assembly)
def quantized_matmul_production(x_int8, weight_int8, x_scale, weight_scale, output_scale):
    # 1. INT8 x INT8 matmul using VNNI/NEON/Tensor Cores (FAST)
    accum_int32 = matmul_int8_hardware(x_int8, weight_int8)  # Specialized instruction

    # 2. Requantize accumulated INT32 → INT8 output
    combined_scale = (x_scale * weight_scale) / output_scale
    output_int8 = (accum_int32 * combined_scale).clip(-128, 127)

    # 3. Stay in INT8 for next layer (no dequantization unless necessary)
    return output_int8

Key Differences:

Your implementation: Dequantize → FP32 compute → quantize (educational, slow)
Production: INT8 → INT8 throughout, specialized hardware (4-10× speedup)

Memory Savings (Real): 4× reduction from storing INT8 instead of FP32 Speed Improvement (Your Code): ~0× (Python overhead dominates) Speed Improvement (Production): 2-10× (hardware acceleration, kernel fusion)

Model-Level Quantization

def quantize_model(model, calibration_data=None):
    """
    Quantize all Linear layers in model.

    Args:
        model: Neural network with Linear layers
        calibration_data: Representative samples for activation calibration

    Returns:
        quantized_model: Model with QuantizedLinear layers
        calibration_stats: Scale/zero-point parameters per layer
    """
    quantized_layers = []
    for layer in model.layers:
        if isinstance(layer, Linear):
            q_layer = QuantizedLinear(layer)
            if calibration_data:
                q_layer.calibrate(calibration_data)  # Find optimal scales
            quantized_layers.append(q_layer)
        else:
            quantized_layers.append(layer)  # Keep ReLU, Softmax in FP32

    return quantized_layers

Calibration in Practice:

Run 100-1000 samples through original FP32 model
Record min/max activations for each layer
Compute percentile-clipped scales
Quantize weights with calibrated parameters
Test accuracy on validation set

Getting Started

Prerequisites

Ensure you've completed profiling fundamentals:

# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify prerequisite modules
tito test --module profiling

Required Understanding:

Memory profiling (Module 14): Measuring memory consumption
Tensor operations (Module 01): Understanding FP32 representation
Linear layers (Module 03): Matrix multiplication mechanics

Development Workflow

Open the development file: modules/15_quantization/quantization_dev.py
Implement quantize_int8(): FP32 → INT8 conversion with scale/zero-point calculation
Implement dequantize_int8(): INT8 → FP32 restoration
Build QuantizedLinear: Replace Linear layers with quantized versions
Add calibration logic: Percentile-based scale selection
Implement quantize_model(): Convert entire networks to quantized form
Export and verify: tito module complete 15 && tito test --module quantization

Testing

Comprehensive Test Suite

Run the full test suite to verify quantization functionality:

# TinyTorch CLI (recommended)
tito test --module quantization

# Direct pytest execution
python -m pytest tests/ -k quantization -v

Test Coverage Areas

✅ Quantization Correctness: FP32 → INT8 → FP32 roundtrip error bounds (< 0.5% mean error)
✅ Memory Reduction: Verify 4× reduction in model size (weights + biases)
✅ Symmetric vs Asymmetric: Both schemes produce valid INT8 in [-128, 127]
✅ Calibration Impact: Percentile clipping reduces quantization error vs naive min/max
✅ QuantizedLinear Equivalence: Output matches FP32 Linear within tolerance (< 1% difference)
✅ Model-Level Quantization: Full network quantization preserves accuracy (< 2% degradation)

Inline Testing & Quantization Analysis

The module includes comprehensive validation with real-time feedback:

# Example inline test output
🔬 Unit Test: quantize_int8()...
✅ Symmetric quantization: range [-128, 127] ✓
✅ Scale calculation: max_val / 127 = 0.0234 ✓
✅ Roundtrip error: 0.31% mean error ✓
📈 Progress: quantize_int8() ✓

🔬 Unit Test: QuantizedLinear...
✅ Memory reduction: 145KB → 36KB (4.0×) ✓
✅ Output equivalence: 0.43% max difference vs FP32 ✓
📈 Progress: QuantizedLinear ✓

Manual Testing Examples

from quantization_dev import quantize_int8, dequantize_int8, QuantizedLinear
from tinytorch.nn import Linear

# Test quantization on random tensor
tensor = Tensor(np.random.randn(100, 100).astype(np.float32))
q_tensor, scale, zero_point = quantize_int8(tensor)

print(f"Original range: [{tensor.data.min():.2f}, {tensor.data.max():.2f}]")
print(f"Quantized range: [{q_tensor.data.min()}, {q_tensor.data.max()}]")
print(f"Scale: {scale:.6f}, Zero-point: {zero_point}")

# Dequantize and measure error
restored = dequantize_int8(q_tensor, scale, zero_point)
error = np.abs(tensor.data - restored.data).mean()
print(f"Roundtrip error: {error:.4f} ({error/np.abs(tensor.data).mean()*100:.2f}%)")

# Quantize a Linear layer
linear = Linear(128, 64)
q_linear = QuantizedLinear(linear)

print(f"\nOriginal weights: {linear.weight.data.nbytes} bytes")
print(f"Quantized weights: {q_linear.weights_int8.data.nbytes} bytes")
print(f"Reduction: {linear.weight.data.nbytes / q_linear.weights_int8.data.nbytes:.1f}×")

Systems Thinking Questions

Real-World Applications

Mobile ML Deployment: TensorFlow Lite converts all models to INT8 for Android/iOS. Without quantization, models exceed app size limits (100-200MB) and drain battery 4× faster. Google Photos, Translate, Keyboard all run quantized models on-device.
Edge AI Devices: Google Edge TPU (Coral), NVIDIA Jetson, Intel Neural Compute Stick require INT8 models. Hardware is designed exclusively for quantized operations - FP32 isn't supported or is 10× slower.
Cloud Inference Optimization: AWS Inferentia, Azure Inferentia, Google Cloud TPU serve quantized models. INT8 reduces memory bandwidth (bottleneck for inference) and increases throughput by 2-4×. At scale (millions of requests/day), this saves millions in infrastructure costs.
Large Language Models: LLaMA-65B is 130GB in FP16, doesn't fit on single 80GB A100 GPU. INT8 quantization → 65GB, enables serving. GPTQ pushes to 4-bit (33GB) with < 1% perplexity increase. Quantization is how enthusiasts run 70B models on consumer GPUs.

Quantization Mathematics

Why INT8 vs INT4 or INT16? INT8 is the sweet spot: 4× memory reduction with < 1% accuracy loss. INT4 gives 8× reduction but 2-5% accuracy loss (harder to deploy). INT16 only 2× reduction (not worth complexity). Hardware acceleration (VNNI, NEON, Tensor Cores) standardized on INT8.
Symmetric vs Asymmetric Trade-offs: Symmetric is simpler (no zero-point) but wastes range for skewed data. ReLU activations are [0, max] - symmetric centers around 0, wasting negative range. Asymmetric uses full INT8 range but costs extra zero-point storage and computation.
Calibration Data Requirements: Theory: more data → better statistics. Practice: diminishing returns after 500-1000 samples. Percentile estimates stabilize quickly. Critical requirement: calibration data MUST match deployment distribution. If calibration is ImageNet but deployment is medical images, quantization fails catastrophically.
Per-Channel Justification: Conv2D with 64 output channels: per-channel stores 64 scales + 64 zero-points = 512 bytes. Total weights: 3×3×64×64 FP32 = 147KB. Overhead: 0.35%. Accuracy improvement: 0.5-1.5%. Clear win - explains why production frameworks default to per-channel.

Production Deployment Characteristics

Speed Reality Check: INT8 matmul is theoretically 4× faster (4× less memory bandwidth). Practice: 2-3× on CPU (quantize/dequantize overhead), 4-10× on specialized hardware (Edge TPU, Neural Engine designed for pure INT8 graphs). Your Python implementation is 0× faster (simulation overhead > bandwidth savings).
When Quantization is Mandatory: Mobile deployment (app size limits, battery constraints, Neural Engine acceleration), Edge devices (limited memory/compute), Cloud serving at scale (cost optimization). Not negotiable - models either quantize or don't ship.
When to Avoid Quantization: Accuracy-critical applications where 1% matters (medical diagnosis, autonomous vehicles), Early research iteration (quantization adds complexity), Models already tiny (< 10MB - quantization overhead not worth it), Cloud serving with abundant resources (FP32 throughput sufficient).
Quantization-Aware Training vs Post-Training: PTQ (Post-Training Quantization) is fast (minutes) but loses 1-2% accuracy. QAT (Quantization-Aware Training) requires retraining (days/weeks) but loses < 0.5%. Choose PTQ for rapid iteration, QAT for production deployment. If using pretrained models you don't own (BERT, ResNet), PTQ is only option.

Ready to Build?

You're about to implement the precision reduction mathematics that make mobile ML deployment possible. Quantization is the difference between a model that exists in research and a model that ships in apps used by billions.

This module teaches honest quantization: you'll implement the math correctly, achieve 4× memory reduction, and understand precisely why your Python code isn't faster (hardware acceleration requires specialized silicon + compiled kernels). This clarity prepares you for production deployment where TensorFlow Lite, PyTorch Mobile, and ONNX Runtime apply your quantization mathematics with real INT8 hardware operations.

Understanding quantization from first principles - implementing the scale/zero-point calculations yourself, calibrating with real data, measuring accuracy-efficiency trade-offs - gives you deep insight into the constraints that define production ML systems.

Choose your preferred way to engage with this module:


```{grid-item-card} Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/15_quantization/quantization_dev.ipynb
:class-header: bg-light

Run this module interactively in your browser. No installation required.
```

```{grid-item-card} Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.ipynb
:class-header: bg-light

Use Google Colab for GPU access and cloud compute power.
```

```{grid-item-card} View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.py
:class-header: bg-light

Browse the Python source code and understand the implementation.
```

:class: tip
Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work.

← Module 14: Profiling Module 16: Compression →

20 KiB Raw Permalink Blame History Unescape Escape