Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-30 01:29:07 -05:00)
Adds callout-definition blocks to all Vol.2 chapters and fixes pre-commit hook errors
- Adds standardized callout-definition blocks with bold term + clear definition to all Vol.2 chapters (distributed training, inference, network fabrics, etc.)
- Fixes caption_inline_python errors: replaces Python inline refs in table captions with static text in responsible_engr, appendix_fleet, appendix_reliability, compute_infrastructure
- Fixes undefined_inline_ref errors: adds missing code fence for PlatformEconomics class in ops_scale.qmd; converts display math blocks with Python refs to prose
- Fixes render-pattern errors: moves inline Python outside $...$ math delimiters in conclusion, fleet_orchestration, inference, introduction, network_fabrics, responsible_ai, security_privacy, sustainable_ai, distributed_training
- Fixes dropcap errors: restructures drop-cap sentences in hw_acceleration and nn_architectures to not start with cross-references
- Fixes unreferenced-label errors: removes @ prefix from @sec-/@tbl- refs inside Python comment strings in training, model_compression, ml_systems
- Adds clientA to codespell ignore words (TikZ node label in edge_intelligence)
- Updates mlsys constants, hardware, models, and test_units for Vol.2 calculations
- Updates _quarto.yml and references.bib for two-volume structure
@@ -648,7 +648,7 @@ The core quantization mathematics: scale calculation, zero-point mapping, INT8 r
 
 To appreciate why quantization is critical for production ML, consider these deployment scenarios:
 
-- **Mobile AI**: iPhone has 6 GB RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
+- **Mobile AI**: Modern smartphones have 8 GB+ RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
 - **Edge computing**: IoT devices often have 512 MB RAM. Quantization enables on-device inference for privacy-sensitive applications (medical devices, security cameras).
 - **Data centers**: Serving 1000 requests/second requires multiple model replicas. With 4× memory reduction, you fit 4× more models per GPU, reducing serving costs by 75%.
 - **Battery life**: INT8 operations consume 2-4× less energy than FP32 on mobile processors. Quantized models drain battery slower, improving user experience.
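
For context on the hunk above: the "scale calculation, zero-point mapping, INT8 representation" it mentions is the standard asymmetric affine quantization scheme. Below is a minimal, self-contained sketch of that math — not the book's actual code; the function names and the uniform test tensor are illustrative assumptions.

```python
# Illustrative sketch (not the book's code): standard asymmetric affine
# INT8 quantization -- scale calculation, zero-point mapping, and the
# round-trip error it introduces. Function names here are hypothetical.
import numpy as np

QMIN, QMAX = -128, 127  # signed INT8 range

def int8_quant_params(x_min: float, x_max: float):
    """Map the observed FP32 range [x_min, x_max] onto [-128, 127]."""
    scale = (x_max - x_min) / (QMAX - QMIN)
    scale = scale if scale > 0 else 1e-8           # guard a degenerate range
    zero_point = int(round(QMIN - x_min / scale))  # integer representing FP32 0.0
    return scale, max(QMIN, min(QMAX, zero_point))

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """FP32 -> INT8: q = clip(round(x / scale) + zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """INT8 -> approximate FP32: x_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.uniform(-0.5, 0.5, size=(1024,)).astype(np.float32)
scale, zp = int8_quant_params(float(w.min()), float(w.max()))
q = quantize(w, scale, zp)
w_hat = dequantize(q, scale, zp)
print(f"max round-trip error: {np.abs(w - w_hat).max():.6f}")
print(f"memory: {w.nbytes} B FP32 vs {q.nbytes} B INT8 (4x reduction)")
```

The 4× factor the example prints (4 bytes per FP32 weight vs. 1 byte per INT8 weight) is the same ratio behind the 440 MB → 110 MB BERT figure and the "4× more models per GPU" claim in the bullets of the diff.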