Adds callout-definition blocks to all Vol.2 chapters and fixes pre-commit hook errors

- Adds standardized callout-definition blocks with bold term + clear definition
  to all Vol.2 chapters (distributed training, inference, network fabrics, etc.)
- Fixes caption_inline_python errors: replaces Python inline refs in table
  captions with static text in responsible_engr, appendix_fleet, appendix_reliability,
  compute_infrastructure
- Fixes undefined_inline_ref errors: adds missing code fence for PlatformEconomics
  class in ops_scale.qmd; converts display math blocks with Python refs to prose
- Fixes render-pattern errors: moves inline Python outside $...$ math delimiters
  in conclusion, fleet_orchestration, inference, introduction, network_fabrics,
  responsible_ai, security_privacy, sustainable_ai, distributed_training
- Fixes dropcap errors: restructures drop-cap sentences in hw_acceleration and
  nn_architectures to not start with cross-references
- Fixes unreferenced-label errors: removes @ prefix from @sec-/@tbl- refs inside
  Python comment strings in training, model_compression, ml_systems
- Adds clientA to codespell ignore words (TikZ node label in edge_intelligence)
- Updates mlsys constants, hardware, models, and test_units for Vol.2 calculations
- Updates _quarto.yml and references.bib for two-volume structure
Vijay Janapa Reddi
2026-03-01 10:44:33 -05:00
parent 69736d3bdb
commit bf9c402827
38 changed files with 2656 additions and 706 deletions

@@ -648,7 +648,7 @@ The core quantization mathematics: scale calculation, zero-point mapping, INT8 r
 To appreciate why quantization is critical for production ML, consider these deployment scenarios:
-- **Mobile AI**: iPhone has 6 GB RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
+- **Mobile AI**: Modern smartphones have 8 GB+ RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
 - **Edge computing**: IoT devices often have 512 MB RAM. Quantization enables on-device inference for privacy-sensitive applications (medical devices, security cameras).
 - **Data centers**: Serving 1000 requests/second requires multiple model replicas. With 4× memory reduction, you fit 4× more models per GPU, reducing serving costs by 75%.
 - **Battery life**: INT8 operations consume 2-4× less energy than FP32 on mobile processors. Quantized models drain battery slower, improving user experience.
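
The 110 MB / 440 MB and 4× / 75% figures in the hunk above follow from bytes-per-parameter arithmetic. The sketch below reproduces them under stated assumptions: BERT-base is taken to have roughly 110 million parameters, only weight storage is counted (activations and per-tensor scale/zero-point metadata are ignored), and the 16,000 MB device budget is a hypothetical value chosen for illustration. The helper names are not from the repository.

```python
# Assumption: BERT-base has roughly 110 million parameters.
BERT_BASE_PARAMS = 110_000_000

# Storage cost per weight for each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}


def model_size_mb(num_params: int, dtype: str) -> float:
    """Weight storage in MB (1 MB = 1e6 bytes); ignores activations and metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def replicas_per_device(device_memory_mb: float, model_mb: float) -> int:
    """How many whole model copies fit in a given memory budget."""
    return int(device_memory_mb // model_mb)


fp32_mb = model_size_mb(BERT_BASE_PARAMS, "fp32")
int8_mb = model_size_mb(BERT_BASE_PARAMS, "int8")

# Memory reduction factor and the matching per-replica cost reduction,
# assuming serving cost scales with the memory each replica occupies.
compression = fp32_mb / int8_mb
cost_reduction = 1 - 1 / compression

# Hypothetical 16,000 MB budget, purely for illustration.
DEVICE_BUDGET_MB = 16_000
fp32_replicas = replicas_per_device(DEVICE_BUDGET_MB, fp32_mb)
int8_replicas = replicas_per_device(DEVICE_BUDGET_MB, int8_mb)

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB")
print(f"Compression: {compression:.0f}x, cost reduction: {cost_reduction:.0%}")
print(f"Replicas in budget: FP32={fp32_replicas}, INT8={int8_replicas}")
```

The 75% figure holds only if cost is proportional to per-replica memory; in practice compute throughput and batching can also bound replica count, so the real saving may differ.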