Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-30 01:29:07 -05:00)
Adds callout-definition blocks to all Vol.2 chapters and fixes pre-commit hook errors
- Adds standardized callout-definition blocks with bold term + clear definition to all Vol.2 chapters (distributed training, inference, network fabrics, etc.)
- Fixes caption_inline_python errors: replaces Python inline refs in table captions with static text in responsible_engr, appendix_fleet, appendix_reliability, compute_infrastructure
- Fixes undefined_inline_ref errors: adds missing code fence for PlatformEconomics class in ops_scale.qmd; converts display math blocks with Python refs to prose
- Fixes render-pattern errors: moves inline Python outside $...$ math delimiters in conclusion, fleet_orchestration, inference, introduction, network_fabrics, responsible_ai, security_privacy, sustainable_ai, distributed_training
- Fixes dropcap errors: restructures drop-cap sentences in hw_acceleration and nn_architectures to not start with cross-references
- Fixes unreferenced-label errors: removes @ prefix from @sec-/@tbl- refs inside Python comment strings in training, model_compression, ml_systems
- Adds clientA to codespell ignore words (TikZ node label in edge_intelligence)
- Updates mlsys constants, hardware, models, and test_units for Vol.2 calculations
- Updates _quarto.yml and references.bib for two-volume structure
@@ -648,7 +648,7 @@ The core quantization mathematics: scale calculation, zero-point mapping, INT8 r
 
 To appreciate why quantization is critical for production ML, consider these deployment scenarios:
 
-- **Mobile AI**: iPhone has 6 GB RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
+- **Mobile AI**: Modern smartphones have 8 GB+ RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
 - **Edge computing**: IoT devices often have 512 MB RAM. Quantization enables on-device inference for privacy-sensitive applications (medical devices, security cameras).
 - **Data centers**: Serving 1000 requests/second requires multiple model replicas. With 4× memory reduction, you fit 4× more models per GPU, reducing serving costs by 75%.
 - **Battery life**: INT8 operations consume 2-4× less energy than FP32 on mobile processors. Quantized models drain battery slower, improving user experience.
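
For context on the hunk above: the "scale calculation, zero-point mapping, INT8 representation" it mentions is the standard asymmetric affine quantization scheme. Below is a minimal, self-contained sketch of that math — not the book's actual code; the function names and the uniform test tensor are illustrative assumptions.

```python
# Illustrative sketch (not the book's code): standard asymmetric affine
# INT8 quantization -- scale calculation, zero-point mapping, and the
# round-trip error it introduces. Function names here are hypothetical.
import numpy as np

QMIN, QMAX = -128, 127  # signed INT8 range

def int8_quant_params(x_min: float, x_max: float):
    """Map the observed FP32 range [x_min, x_max] onto [-128, 127]."""
    scale = (x_max - x_min) / (QMAX - QMIN)
    scale = scale if scale > 0 else 1e-8           # guard a degenerate range
    zero_point = int(round(QMIN - x_min / scale))  # integer representing FP32 0.0
    return scale, max(QMIN, min(QMAX, zero_point))

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """FP32 -> INT8: q = clip(round(x / scale) + zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """INT8 -> approximate FP32: x_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.uniform(-0.5, 0.5, size=(1024,)).astype(np.float32)
scale, zp = int8_quant_params(float(w.min()), float(w.max()))
q = quantize(w, scale, zp)
w_hat = dequantize(q, scale, zp)
print(f"max round-trip error: {np.abs(w - w_hat).max():.6f}")
print(f"memory: {w.nbytes} B FP32 vs {q.nbytes} B INT8 (4x reduction)")
```

The 4× factor the example prints (4 bytes per FP32 weight vs. 1 byte per INT8 weight) is the same ratio behind the 440 MB → 110 MB BERT figure and the "4× more models per GPU" claim in the bullets of the diff.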