| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Quantization - Reduced Precision for Efficiency | INT8 quantization, calibration, and mixed-precision strategies | 3 | 5-6 hours | | | |
15. Quantization
⚡ OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
Overview
Reduce model precision from FP32 to INT8 for 4× memory reduction and 2-4× inference speedup. This module implements quantization, calibration, and mixed-precision strategies used in production deployment.
Learning Objectives
By completing this module, you will be able to:
- Implement INT8 quantization for model weights and activations with scale/zero-point parameters
- Design calibration strategies using representative data to minimize accuracy degradation
- Apply mixed-precision training (FP16/FP32) for faster training with maintained accuracy
- Understand quantization-aware training vs post-training quantization trade-offs
- Measure memory and speed improvements while tracking accuracy impact
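The mixed-precision idea from the objectives can be sketched in a few lines of NumPy (the helper names here are illustrative, not this module's API): keep an FP32 "master" copy of the weights, cast to FP16 for the forward pass, and apply updates in FP32 so small gradient values are not rounded away.

```python
import numpy as np

# FP32 "master" weights retain full precision across updates
master_w = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4)

def forward(x, w_master):
    # Forward pass runs in FP16: half the memory, faster on FP16-capable hardware
    w16 = w_master.astype(np.float16)
    return (x.astype(np.float16) @ w16).astype(np.float32)

def sgd_step(w_master, grad_fp16, lr=0.01):
    # Update accumulates in FP32 so tiny lr * grad terms survive rounding
    return w_master - lr * grad_fp16.astype(np.float32)

x = np.ones((2, 4), dtype=np.float32)
y = forward(x, master_w)
new_w = sgd_step(master_w, np.ones_like(master_w, dtype=np.float16))
```

The design point is the split: compute in low precision, accumulate in high precision.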
Why This Matters
Production Context
Quantization is mandatory for edge deployment:
- TensorFlow Lite uses INT8 quantization for mobile deployment; 4× smaller models
- ONNX Runtime supports INT8 inference; 2-4× faster on CPUs
- Apple Core ML quantizes models for iPhone Neural Engine; enables on-device ML
- Google Edge TPU requires INT8; optimized hardware for quantized operations
Historical Context
- Pre-2017: FP32 standard; quantization for special cases only
- 2017-2019: INT8 post-training quantization; TensorFlow Lite adoption
- 2019-2021: Quantization-aware training; maintains accuracy better
- 2021+: INT4, mixed-precision, dynamic quantization; aggressive compression
Quantization enables deployment where FP32 models wouldn't fit or run fast enough.
Implementation Guide
Core Components
Symmetric INT8 Quantization
Quantization: x_int8 = round(x_fp32 / scale)
Dequantization: x_fp32 = x_int8 * scale
where scale = max(|x|) / 127
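The symmetric scheme above can be sketched directly in NumPy (an illustrative sketch, not this module's final API):

```python
import numpy as np

def quantize_symmetric(x):
    """Map FP32 values to INT8 with a single symmetric scale."""
    scale = float(np.abs(x).max()) / 127.0            # scale = max(|x|) / 127
    x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_int8, scale

def dequantize_symmetric(x_int8, scale):
    return x_int8.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
w_q, s = quantize_symmetric(w)
w_hat = dequantize_symmetric(w_q, s)
# round-trip error per element is bounded by scale / 2
```

Note that with this scale choice the largest-magnitude value maps exactly to ±127, so the clip only guards against rounding at the boundary.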
Asymmetric Quantization (with zero-point)
Quantization: x_int8 = round(x_fp32 / scale) + zero_point
Dequantization: x_fp32 = (x_int8 - zero_point) * scale
Calibration: Use representative data to find optimal scale/zero-point parameters
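The asymmetric formulas plus min-max calibration can be sketched as follows (hypothetical helper names; real calibrators often use percentiles or entropy instead of raw min/max):

```python
import numpy as np

def calibrate_asymmetric(samples):
    """Derive scale/zero-point from representative data (min-max calibration)."""
    lo, hi = float(samples.min()), float(samples.max())
    scale = (hi - lo) / 255.0              # INT8 spans 256 levels
    zero_point = int(round(-128 - lo / scale))  # maps lo -> -128, hi -> 127
    return scale, zero_point

def quantize_asymmetric(x, scale, zero_point):
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_asymmetric(x_int8, scale, zero_point):
    return (x_int8.astype(np.float32) - zero_point) * scale

# Example: ReLU activations occupy [0, 6]; asymmetric quantization uses the
# full INT8 range instead of wasting half of it on negative values.
acts = np.array([0.0, 3.0, 6.0], dtype=np.float32)
s, zp = calibrate_asymmetric(acts)
q = quantize_asymmetric(acts, s, zp)
acts_hat = dequantize_asymmetric(q, s, zp)
```

Because the zero-point shifts the range, one-sided distributions (like post-ReLU activations) keep all 256 levels, which symmetric quantization cannot do.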
Testing
tito export 15_quantization
tito test 15_quantization
Where This Code Lives
tinytorch/
├── quantization/
│ └── quantize.py
└── __init__.py
Systems Thinking Questions
- Accuracy vs Efficiency: INT8 loses precision. When is <1% accuracy drop acceptable? When must you use QAT?
- Per-Tensor vs Per-Channel: Per-channel quantization preserves accuracy better but increases complexity. When is it worth it?
- Quantized Operations: INT8 matmul is faster, but quantize/dequantize adds overhead. When does quantization win overall?
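The per-tensor vs per-channel trade-off above can be made concrete with a small NumPy experiment (illustrative only): when one output channel has much larger magnitudes than another, a single per-tensor scale rounds the small channel to zero, while per-channel scales preserve it.

```python
import numpy as np

w = np.array([[0.01, -0.02, 0.03],    # small-magnitude channel
              [10.0,  -8.0,  6.0]],   # large channel dominates per-tensor scale
             dtype=np.float32)

# Per-tensor: one scale for the whole matrix
s_tensor = np.abs(w).max() / 127.0
q_tensor = np.round(w / s_tensor) * s_tensor

# Per-channel: one scale per output row
s_chan = np.abs(w).max(axis=1, keepdims=True) / 127.0
q_chan = np.round(w / s_chan) * s_chan

small_err_tensor = np.abs(q_tensor[0] - w[0]).max()  # small channel collapses to 0
small_err_chan = np.abs(q_chan[0] - w[0]).max()      # scale fits the channel
```

The cost is one scale (and matching dequantize logic) per channel instead of one per tensor, which is the complexity the question asks you to weigh.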
Real-World Connections
- Mobile Deployment: TensorFlow Lite and Core ML use INT8 for on-device inference
- Cloud Serving: ONNX Runtime and TensorRT use INT8 for cost-effective serving
- Edge AI: INT8 is required for Coral Edge TPU and Jetson Nano deployment
What's Next?
In Module 16: Compression, you'll combine quantization with pruning:
- Remove unimportant weights (pruning)
- Quantize remaining weights (INT8)
- Achieve 10-50× compression with minimal accuracy loss
Ready to quantize models? Open modules/15_quantization/quantization_dev.py and start implementing.