diff --git a/.codespell-ignore-words.txt b/.codespell-ignore-words.txt
index 2ecfd914b..a71170ff9 100644
--- a/.codespell-ignore-words.txt
+++ b/.codespell-ignore-words.txt
@@ -38,3 +38,4 @@ Clos
 Marz
 Pease
 nd
+ure
diff --git a/book/quarto/contents/vol1/backmatter/appendix_math_foundations.qmd b/book/quarto/contents/vol1/backmatter/appendix_math_foundations.qmd
index 36d5641be..ff57c3a57 100644
--- a/book/quarto/contents/vol1/backmatter/appendix_math_foundations.qmd
+++ b/book/quarto/contents/vol1/backmatter/appendix_math_foundations.qmd
@@ -19,16 +19,16 @@ Before diving in, here is a map of what this appendix covers and where each conc
 +-------------------------------+--------------------------------------------+-------------------------------------------+
 | **Little's Law** | Size serving infrastructure for target QPS | @sec-serving |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
-| **Memory Hierarchy** | Optimize data movement and cache usage | @sec-ai-acceleration, @sec-efficient-ai |
+| **Memory Hierarchy** | Optimize data movement and cache usage | @sec-ai-acceleration, @sec-model-compression |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
-| **Numerical Formats** | Choose precision for training vs inference | @sec-model-optimizations, @sec-serving |
+| **Numerical Formats** | Choose precision for training vs inference | @sec-model-compression, @sec-serving |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
 | **GEMM Operations** | Understand the core computation in | @sec-dl-primer, @sec-ai-acceleration |
 | | neural networks | |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
 | **Backpropagation** | Debug gradient issues and memory usage | @sec-dl-primer, @sec-ai-training |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
-| **Sparse Formats (CSR)** | Work with recommendation systems and | @sec-model-optimizations |
+| **Sparse Formats (CSR)** | Work with recommendation systems and | @sec-model-compression |
 | | pruned models | |
 +-------------------------------+--------------------------------------------+-------------------------------------------+
 | **Computational Graphs** | Understand compiler optimizations | @sec-ai-frameworks, @sec-ai-acceleration |
diff --git a/book/quarto/contents/vol1/conclusion/conclusion.qmd b/book/quarto/contents/vol1/conclusion/conclusion.qmd
index 8cae910f2..7392585db 100644
--- a/book/quarto/contents/vol1/conclusion/conclusion.qmd
+++ b/book/quarto/contents/vol1/conclusion/conclusion.qmd
@@ -82,7 +82,7 @@ You now have theoretical understanding and the conceptual foundation for profess
 | **3. Optimize the Bottleneck** | What limits performance? | Memory bandwidth utilization | @sec-ai-acceleration |
 | **4. Plan for Failure** | What happens when components fail? | MTBF, recovery time | @sec-ml-operations |
 | **5. Design Cost-Consciously** | What is the TCO? | $/FLOP, $/inference | @sec-introduction |
-| **6. Co-Design for Hardware** | Does algorithm match hardware? | TOPS/W, arithmetic intensity | @sec-model-optimizations |
+| **6. Co-Design for Hardware** | Does algorithm match hardware? | TOPS/W, arithmetic intensity | @sec-model-compression |
 +--------------------------------+------------------------------------+--------------------------------+--------------------------+
 
 : **Six Core Systems Engineering Principles**: These principles provide enduring guidance regardless of how specific technologies evolve. Each principle connects to a core question engineers must answer, key metrics for measurement, and chapter examples demonstrating application. {#tbl-six-principles}
@@ -145,7 +145,7 @@ Building on this understanding, frameworks and training systems embody both scal
 
 @sec-ai-training then revealed how these frameworks scale beyond single machines. Data parallelism strategies that transform weeks of training into hours, model parallelism that enables architectures too large for any single device, mixed precision techniques that double effective throughput, and gradient compression that reduces communication overhead all demonstrate Principle 2 in action: designing for ten times scale beyond current needs while maintaining hardware alignment.
 
-@sec-efficient-ai demonstrates that efficiency determines whether AI moves beyond laboratories to resource-constrained deployment. Neural compression algorithms including pruning, quantization, and knowledge distillation, detailed in @sec-model-optimizations, address bottlenecks in memory, compute, and energy while maintaining performance. This multidimensional optimization requires identifying the limiting factor and addressing it directly rather than pursuing isolated improvements.
+Model efficiency determines whether AI moves beyond laboratories to resource-constrained deployment. Neural compression algorithms including pruning, quantization, and knowledge distillation, detailed in @sec-model-compression, address bottlenecks in memory, compute, and energy while maintaining performance. This multidimensional optimization requires identifying the limiting factor and addressing it directly rather than pursuing isolated improvements.
 
 ### Engineering for Performance at Scale {#sec-conclusion-engineering-performance-scale-a99a}
@@ -153,7 +153,7 @@ Technical foundations provide the substrate for ML systems, but a recommendation
 
 #### Model Architecture and Optimization {#sec-conclusion-model-architecture-optimization-4e0b}
 
-@sec-dnn-architectures traced your journey from understanding simple perceptrons, where you first grasped how weighted inputs produce decisions, through convolutional networks that exploit spatial structure for parameter efficiency, to Transformer architectures whose attention mechanisms enabled the language understanding powering today's AI assistants. Yet architectural innovation alone proves insufficient for production deployment; @sec-model-optimizations bridges research architectures and production constraints through optimization techniques that make deployment feasible.
+@sec-dnn-architectures traced your journey from understanding simple perceptrons, where you first grasped how weighted inputs produce decisions, through convolutional networks that exploit spatial structure for parameter efficiency, to Transformer architectures whose attention mechanisms enabled the language understanding powering today's AI assistants. Yet architectural innovation alone proves insufficient for production deployment; @sec-model-compression bridges research architectures and production constraints through optimization techniques that make deployment feasible.
 
 Following the hardware co-design principles outlined earlier, three complementary compression approaches address bottleneck optimization. Pruning removes redundant parameters while maintaining accuracy, quantization reduces precision requirements for 4x memory reduction, and knowledge distillation transfers capabilities to compact networks for resource-constrained deployment.
@@ -175,7 +175,7 @@ Completing the performance engineering feedback loop, @sec-benchmarking-ai estab
 
 ### Navigating Production Reality {#sec-conclusion-navigating-production-reality-c406}
 
-The third pillar addresses production deployment realities where all six principles converge under the constraint that systems must serve users reliably, securely, and responsibly. This convergence begins with a fundamental shift: the transition from training to inference inverts the optimization objectives that governed model development. Where training maximizes throughput over days of computation, inference optimizes latency per request under strict time constraints measured in milliseconds. @sec-benchmarking-ai explored this inversion through the MLPerf inference scenarios, revealing how different deployment contexts (SingleStream for mobile, Server for cloud APIs, Offline for batch processing) require fundamentally different optimization strategies. The benchmarking techniques target percentile latencies rather than aggregate throughput. The quantization methods that @sec-model-optimizations taught must be validated not just for accuracy preservation but for calibration with production traffic. Faster hardware does not automatically mean faster inference; as Amdahl's Law demonstrates, preprocessing and postprocessing often dominate latency. Production systems report preprocessing consuming 60–70% of total request time when inference runs on optimized accelerators.
+The third pillar addresses production deployment realities where all six principles converge under the constraint that systems must serve users reliably, securely, and responsibly. This convergence begins with a fundamental shift: the transition from training to inference inverts the optimization objectives that governed model development. Where training maximizes throughput over days of computation, inference optimizes latency per request under strict time constraints measured in milliseconds. @sec-benchmarking-ai explored this inversion through the MLPerf inference scenarios, revealing how different deployment contexts (SingleStream for mobile, Server for cloud APIs, Offline for batch processing) require fundamentally different optimization strategies. The benchmarking techniques target percentile latencies rather than aggregate throughput. The quantization methods that @sec-model-compression taught must be validated not just for accuracy preservation but for calibration with production traffic. Faster hardware does not automatically mean faster inference; as Amdahl's Law demonstrates, preprocessing and postprocessing often dominate latency. Production systems report preprocessing consuming 60–70% of total request time when inference runs on optimized accelerators.
 
 The measurement principle becomes critical for production systems. Tracking p50, p95, and p99 latencies reveals how systems perform across the full range of requests, since mean latency tells little about user experience when one in a hundred users waits 40 times longer than average. @sec-benchmarking-ai established that systems performing well under light load can suddenly violate service level objectives when traffic increases, making tail latency monitoring essential for production deployment.
@@ -203,9 +203,9 @@ The six principles you have learned will guide future development across three e
 
 As ML systems move beyond research labs, three deployment paradigms test different combinations of our established principles: resource-abundant cloud environments, resource-constrained edge devices, and emerging generative systems.
 
-Cloud deployment prioritizes throughput and scalability, achieving high GPU utilization through kernel fusion, mixed precision training, and gradient compression. @sec-model-optimizations and @sec-ai-training explored these techniques, demonstrating how they combine to balance performance optimization with cost efficiency at scale.
+Cloud deployment prioritizes throughput and scalability, achieving high GPU utilization through kernel fusion, mixed precision training, and gradient compression. @sec-model-compression and @sec-ai-training explored these techniques, demonstrating how they combine to balance performance optimization with cost efficiency at scale.
 
-In contrast, mobile and edge systems face stringent power, memory, and latency constraints that demand sophisticated hardware-software co-design. @sec-model-optimizations introduced efficiency techniques including depthwise separable convolutions, neural architecture search, and quantization that enable deployment on devices with 100–1000x less computational power than data centers. Systems that cannot run on billions of edge devices cannot achieve global impact, making edge deployment essential for AI democratization[^fn-ai-democratization].
+In contrast, mobile and edge systems face stringent power, memory, and latency constraints that demand sophisticated hardware-software co-design. @sec-model-compression introduced efficiency techniques including depthwise separable convolutions, neural architecture search, and quantization that enable deployment on devices with 100–1000x less computational power than data centers. Systems that cannot run on billions of edge devices cannot achieve global impact, making edge deployment essential for AI democratization[^fn-ai-democratization].
 
 [^fn-ai-democratization]: **AI Democratization**: Making AI accessible beyond a small number of well-resourced organizations through efficient systems engineering. Mobile-optimized models and cloud APIs can widen access, but doing so sustainably requires systematic optimization across hardware, algorithms, and infrastructure to maintain quality at scale.
@@ -257,7 +257,7 @@ The intelligent systems that will define the coming century await your engineeri
 
 ## Continuing beyond this book {#sec-conclusion-continuing-volume-ii}
 
-This book has deliberately focused on single-machine systems to establish principles you can directly observe and experiment with. Understanding bottlenecks on one machine, whether memory bandwidth limitations, CPU-GPU data transfer overhead, or preprocessing inefficiencies, enables recognition of when and why scaling to multiple machines becomes necessary. @sec-benchmarking-ai introduced latency analysis and the MLPerf scenarios that characterize deployment contexts from mobile to datacenter. @sec-model-optimizations demonstrated optimization techniques that produce identical compression ratios regardless of deployment scale. By mastering these foundations on systems you can instrument completely, you develop the diagnostic intuition required for distributed systems where visibility becomes fragmented across nodes.
+This book has deliberately focused on single-machine systems to establish principles you can directly observe and experiment with. Understanding bottlenecks on one machine, whether memory bandwidth limitations, CPU-GPU data transfer overhead, or preprocessing inefficiencies, enables recognition of when and why scaling to multiple machines becomes necessary. @sec-benchmarking-ai introduced latency analysis and the MLPerf scenarios that characterize deployment contexts from mobile to datacenter. @sec-model-compression demonstrated optimization techniques that produce identical compression ratios regardless of deployment scale. By mastering these foundations on systems you can instrument completely, you develop the diagnostic intuition required for distributed systems where visibility becomes fragmented across nodes.
 
 Each principle transforms at scale. The roofline model must now account for network bandwidth, adding a third ceiling to the analysis. Failure planning shifts from "if" to "when": with large clusters, mean time between failures can drop dramatically, demanding fundamentally different architectures. Volume II provides the quantitative tools for these analyses.
diff --git a/book/quarto/contents/vol1/data_efficiency/data_efficiency.qmd b/book/quarto/contents/vol1/data_efficiency/data_efficiency.qmd
index 491b766c8..6f2e86566 100644
--- a/book/quarto/contents/vol1/data_efficiency/data_efficiency.qmd
+++ b/book/quarto/contents/vol1/data_efficiency/data_efficiency.qmd
@@ -1,5 +1,4 @@
 ---
-bibliography: data_efficiency.bib
 quiz: data_efficiency_quizzes.json
 concepts: data_efficiency_concepts.yml
 glossary: data_efficiency_glossary.json
@@ -143,9 +142,9 @@ This chapter equips you with the systems engineer's toolkit: techniques to minim
 
 ## The Information-Compute Ratio {#sec-data-efficiency-info-compute}
 
-The Optimize Principles (@sec-optimize-principles) established that optimization is not a single objective but a search for the **Pareto Frontier**—the boundary where improving one metric (accuracy) necessarily degrades another (efficiency). We introduced three pillars of efficiency: data, model, and hardware. This chapter tackles the first and often most powerful lever.
+The Optimize Principles established that optimization is not a single objective but a search for the **Pareto Frontier**—the boundary where improving one metric (accuracy) necessarily degrades another (efficiency). We introduced three pillars of efficiency: data, model, and hardware. This chapter tackles the first and often most powerful lever.
 
-While model compression (@sec-model-compression) and hardware acceleration (@sec-hw-acceleration) focus on the *execution* of the math, **Data Efficiency** reduces the *amount* of math required by optimizing what enters the training pipeline.
+While model compression (@sec-model-compression) and hardware acceleration (@sec-ai-acceleration) focus on the *execution* of the math, **Data Efficiency** reduces the *amount* of math required by optimizing what enters the training pipeline.
 
 Data engineering (@sec-data-engineering) ensures that data is clean, accessible, and correctly formatted. Data efficiency asks a different question: *how much information does each sample contribute to the model's learning per unit of computation?*
 
@@ -660,6 +659,7 @@ Self-supervised pre-training fundamentally changes the economics of the data pip
 - **Total time**: 1-2 weeks per task after pre-training
 
 **The ROI**:
+
 - **100× reduction in labeling cost** (\$1M → \$10k)
 - **10× reduction in per-task compute** (amortized over tasks)
 - **20-50× faster time-to-deployment** per task
@@ -892,7 +892,7 @@ When synthetic data is generated by ML models (not simulators), there's a risk o
 :::
 
 ::: {.callout-example title="Lighthouse Example: Keyword Spotting Data Efficiency"}
-**Scenario**: Our **Keyword Spotting Lighthouse model** (@sec-dnn-architectures)—a DS-CNN with **200 K** parameters—represents the extreme end of data efficiency challenges. You're building a wake-word detector ("Hey Device") for a microcontroller with 256KB SRAM (see @sec-tinyml for hardware constraints). The model must be tiny (~50KB quantized), but you need 10,000+ labeled audio samples to train it—samples that don't exist yet.
+**Scenario**: Our **Keyword Spotting Lighthouse model** (@sec-dnn-architectures)—a DS-CNN with **200 K** parameters—represents the extreme end of data efficiency challenges. You're building a wake-word detector ("Hey Device") for a microcontroller with 256KB SRAM (see @sec-ml-systems-tiny-ml-ubiquitous-sensing-scale-51d8 for hardware constraints). The model must be tiny (~50KB quantized), but you need 10,000+ labeled audio samples to train it—samples that don't exist yet.
 
 **The Data Collection Problem**:
@@ -1112,6 +1112,7 @@ Where $T_{selection}$ is the time spent scoring the pool and $T_{train}$ is the
 - Train on full 1M dataset for 100 epochs: 1M × 100 × 0.01s = **1,000,000 seconds** (278 hours)
 
 **Analysis**:
+
 - Option A saves 247 hours vs. baseline (89% reduction) ✓
 - Option B saves 250 hours vs. baseline (90% reduction) ✓
 - Option B beats Option A by 2.2 hours—proxy selection has better ROI
@@ -1125,8 +1126,6 @@
 
 **Problem**: You are using active learning to select the best 10% of samples for training. Your selection algorithm requires running the full model on the unlabeled pool. Is this efficient?
-
-
 
 **The Math**:
 1. **Full Training**: 100 epochs. Total cost = $100 \times C_{epoch}$.
 2. **Selection (Full Model)**: Scoring the full dataset is equivalent to **1 epoch** of training. $T_{selection} = 1 \times C_{epoch}$.
@@ -1330,6 +1329,7 @@ For any technique, there exists a **break-even point** where the investment equa
 **Example: Active Learning Break-Even**
 
 Suppose labeling costs \$10/sample and active learning requires:
+
 - Initial labeled set: 1,000 samples (\$10,000)
 - Oracle queries per round: 100 samples
 - Inference cost per round: \$50 (scoring unlabeled pool)
@@ -1423,7 +1423,7 @@ A coordinator node performs selection on the full dataset, then distributes sele
 ```
 Coordinator: score_all_samples() → selected_indices
-Broadcast: selected_indices → all workers
+Broadcast: selected_indices → all workers
 Workers: train on subset(local_shard, selected_indices)
 ```
 
@@ -1490,6 +1490,7 @@ Active learning is particularly challenging in distributed settings because the
 5. **Broadcast**: Selected indices distributed to all workers for training
 
 **Performance** (measured on 8× A100 cluster):
+
 - Embedding: 20 minutes (parallel)
 - Deduplication: 15 minutes (distributed hash join)
 - Scoring: 30 minutes (parallel, 5 epochs proxy training)
@@ -1526,7 +1527,7 @@ Data efficiency and model compression are therefore *complementary*. The techniq
 
 ### Data Efficiency × Hardware Acceleration
 
-Hardware acceleration (@sec-hw-acceleration) increases throughput through specialized accelerators, kernel optimization, and parallelization. Data efficiency affects which hardware bottlenecks dominate:
+Hardware acceleration (@sec-ai-acceleration) increases throughput through specialized accelerators, kernel optimization, and parallelization. Data efficiency affects which hardware bottlenecks dominate:
 
 | Scenario | Likely Bottleneck | Hardware Optimization |
 |----------|-------------------|----------------------|
@@ -1551,7 +1552,7 @@ The distributed selection challenges discussed in @sec-data-efficiency-distribut
 
 The full optimization stack, from data to deployment, can be visualized as a pipeline where each stage amplifies or attenuates the effects of others:
 
 ```
-Raw Data → [Data Efficiency] → Curated Data → [Training] → Model
+Raw Data → [Data Efficiency] → Curated Data → [Training] → Model
     → [Compression] → Compact Model → [Hardware] → Deployed System
 ```
@@ -1836,7 +1837,6 @@ d) Calculate the cost-per-accuracy-point for each strategy ($/% success rate).
 
 :::
 
-
 Data efficiency transforms how we think about the ML development lifecycle. Rather than treating data as a static input to be collected and labeled, this chapter has reframed it as a dynamic resource to be optimized—and critically, as a *systems* problem where the goal is minimizing total cost (compute, storage, labeling, energy, time) rather than just maximizing accuracy.
 
 We explored the three-stage optimization pipeline: **Static Pruning** removes redundancy before training through coreset selection and deduplication; **Dynamic Selection** focuses compute on informative examples during training through curriculum and active learning; and **Synthetic Generation** creates data where none exists through augmentation, simulation, and distillation. Together, these strategies address the "Data Wall"—the fundamental asymmetry between exponentially growing compute and slowly growing high-quality data.
@@ -1874,4 +1874,3 @@ With a high-efficiency data pipeline in place, we have minimized the "math requi
 
 ::: { .quiz-end }
 :::
-
diff --git a/book/quarto/contents/vol1/data_engineering/data_engineering.qmd b/book/quarto/contents/vol1/data_engineering/data_engineering.qmd
index 6b82aefe2..be0226ba8 100644
--- a/book/quarto/contents/vol1/data_engineering/data_engineering.qmd
+++ b/book/quarto/contents/vol1/data_engineering/data_engineering.qmd
@@ -72,8 +72,6 @@ If your labels are only 90% accurate, your model’s ceiling is 90% accuracy, re
 ::: {.callout-perspective title="Napkin Math: The Deduplication Dividend"}
 **Problem**: You have a 10TB dataset of web-scraped images. Training takes 2 weeks on your GPU cluster ($5,000). You suspect 30% of the images are near-duplicates. Is it worth the engineering time to remove them?
-
-
 
 This compilation metaphor reflects a broader paradigm shift in how ML practitioners approach system development. Traditional ML development fixes the dataset early, then iterates on model architectures and hyperparameters. Research benchmarks reinforce this pattern by providing static datasets where progress is measured purely through algorithmic innovation. Production systems face a different reality: datasets continuously evolve, data quality varies across sources and time, and model improvements often plateau while data improvements continue yielding gains. This realization has catalyzed what Andrew Ng termed the shift from model-centric to **data-centric AI** [@ng2021datacentric].
 
 ::: {.callout-definition title="Data-Centric AI"}
@@ -198,7 +196,7 @@ But what makes this particular framework the right one? Understanding why these
 
 Why four pillars? Why not three or five? Unlike arbitrary categorizations, these four pillars derive naturally from fundamental constraints that govern all data-intensive systems. Understanding this derivation provides both memorability and defensibility—you can reconstruct the framework from first principles rather than memorizing an arbitrary taxonomy.
 
-**Quality emerges from statistical learning theory.** Machine learning assumes that training data and production data are drawn from the same distribution: $P_{train}(X, Y) \approx P_{serve}(X, Y)$. When this assumption breaks—through labeling errors, sampling bias, or data corruption—the learned function $f(x)$ fails to generalize. Quality is not a preference but a *mathematical prerequisite* for learning. The cascading failures documented in @sec-data-engineering-data-cascades-need-systematic-foundations-e6f5 are direct consequences of violating this distributional assumption.
+**Quality emerges from statistical learning theory.** Machine learning assumes that training data and production data are drawn from the same distribution: $P_{train}(X, Y) \approx P_{serve}(X, Y)$. When this assumption breaks—through labeling errors, sampling bias, or data corruption—the learned function $f(x)$ fails to generalize. Quality is not a preference but a *mathematical prerequisite* for learning. The cascading failures documented in @sec-data-engineering-data-cascades are direct consequences of violating this distributional assumption.
 
 **Reliability emerges from distributed systems reality.** Any system with more than one component experiences partial failures. Networks partition. Disks fail. Services crash. The CAP theorem formalizes this: distributed systems cannot simultaneously guarantee Consistency, Availability, and Partition tolerance. Data pipelines spanning multiple services, storage systems, and processing stages must assume failures will occur. Reliability is not about preventing failures but *surviving them gracefully*—through redundancy, idempotent operations, and graceful degradation.
@@ -212,7 +210,7 @@ These derivations reveal why the pillars are *necessary and sufficient*. Necessa
 
 Every data engineering decision, from choosing storage formats to designing ingestion pipelines, should be evaluated against four principles. @fig-four-pillars illustrates how these pillars interact, with each contributing to system success through systematic decision-making.
 
-Data quality provides the foundation for system success. Quality issues compound throughout the ML lifecycle through "Data Cascades" (@sec-data-engineering-data-cascades-need-systematic-foundations-e6f5), where early failures propagate and amplify downstream. Quality includes accuracy, completeness, consistency, and fitness for the intended ML task. The mathematical foundations of this relationship appear in @sec-dl-primer and @sec-dnn-architectures.
+Data quality provides the foundation for system success. Quality issues compound throughout the ML lifecycle through "Data Cascades" (@sec-data-engineering-data-cascades), where early failures propagate and amplify downstream. Quality includes accuracy, completeness, consistency, and fitness for the intended ML task. The mathematical foundations of this relationship appear in @sec-dl-primer and @sec-dnn-architectures.
 
 ML systems require consistent, predictable data processing that handles failures gracefully. Reliability means building systems that continue operating despite component failures, data anomalies, or unexpected load patterns. This includes error handling, monitoring, and recovery mechanisms throughout the data pipeline.
@@ -3518,7 +3516,7 @@ user_age = int(record["age"]) # Works until upstream sends "25 years"
 
 **Fallacy:** _More data always leads to better model performance._
 
-Learning theory emphasizes sample complexity, creating the intuition that larger datasets reduce overfitting. However, this assumes constant quality as quantity scales. Research on ImageNet found label errors in 3.4% of validation set examples despite extensive curation [@northcutt2021pervasive], meaning a 10-million-example dataset contains 340,000 mislabeled instances. Empirical studies show 10% label noise reduces model accuracy by 5 to 8 percentage points for typical vision tasks, with the degradation compounding during training. As @sec-data-engineering-data-quality-metrics establishes, a curated 100,000-example dataset covering diverse demographics and edge cases outperforms a haphazard 1-million-example dataset skewed toward common cases. Training on 10x more data requires 10x more GPU hours while potentially achieving worse accuracy, meaning teams that prioritize curation over collection achieve better performance at lower infrastructure costs.
+Learning theory emphasizes sample complexity, creating the intuition that larger datasets reduce overfitting. However, this assumes constant quality as quantity scales. Research on ImageNet found label errors in 3.4% of validation set examples despite extensive curation [@northcutt2021pervasive], meaning a 10-million-example dataset contains 340,000 mislabeled instances. Empirical studies show 10% label noise reduces model accuracy by 5 to 8 percentage points for typical vision tasks, with the degradation compounding during training. As @sec-data-engineering-data-quality-as-code establishes, a curated 100,000-example dataset covering diverse demographics and edge cases outperforms a haphazard 1-million-example dataset skewed toward common cases. Training on 10x more data requires 10x more GPU hours while potentially achieving worse accuracy, meaning teams that prioritize curation over collection achieve better performance at lower infrastructure costs.
 
 **Pitfall:** _Treating data labeling as a simple mechanical task that can be outsourced without oversight._
diff --git a/book/quarto/contents/vol1/dl_primer/dl_primer.qmd b/book/quarto/contents/vol1/dl_primer/dl_primer.qmd
index a680fcf00..d6377499a 100644
--- a/book/quarto/contents/vol1/dl_primer/dl_primer.qmd
+++ b/book/quarto/contents/vol1/dl_primer/dl_primer.qmd
@@ -44,7 +44,7 @@ When a neural network fails in production, engineers cannot debug the problem by
 
 As established in the **Silicon Contract** (@sec-build-principles), every model architecture makes an implicit agreement with the hardware. To honor that contract, a systems engineer must understand the mathematical operators that determine the terms of the agreement. This chapter examines the foundations of deep learning—not just as an algorithmic technique, but as a **computational workload** that dictates the movement of bits and the consumption of energy.
 
-The preceding chapters established critical foundations: the ML workflow (@sec-workflow) showed how projects progress from problem definition through deployment, and data engineering (@sec-data-engineering) covered how to prepare the raw material that neural networks consume. Now we examine what happens inside the model itself—the mathematical operations that transform data into predictions and drive the computational demands that shape every system design decision.
+The preceding chapters established critical foundations: the ML workflow (@sec-ai-workflow) showed how projects progress from problem definition through deployment, and data engineering (@sec-data-engineering) covered how to prepare the raw material that neural networks consume. Now we examine what happens inside the model itself—the mathematical operations that transform data into predictions and drive the computational demands that shape every system design decision.
 
 Identifying objects in photographs, such as cats, illustrates the shift from rule-based programming to machine learning. Using traditional programming, you would need to write explicit rules: look for triangular ears, check for whiskers, verify the presence of four legs, examine fur patterns, and handle countless variations in lighting, angles, poses, and breeds. Each edge case demands additional rules, creating increasingly complex decision trees that still fail when encountering unexpected variations. This limitation, the impossibility of manually encoding all patterns for complex real-world problems, drove the evolution from rule-based programming to machine learning.
diff --git a/book/quarto/contents/vol1/dnn_architectures/dnn_architectures.qmd b/book/quarto/contents/vol1/dnn_architectures/dnn_architectures.qmd
index 3a888fa59..6bf64c0eb 100644
--- a/book/quarto/contents/vol1/dnn_architectures/dnn_architectures.qmd
+++ b/book/quarto/contents/vol1/dnn_architectures/dnn_architectures.qmd
@@ -45,7 +45,7 @@ Neural network architectures are not interchangeable implementations of the same
 
 The **Deep Learning Primer** (@sec-dl-primer) established the mathematical operators—matrix multiplication, activation functions, gradient computation—that form the "verbs" of neural networks. This chapter examines how these operators assemble into **architectures**: specialized structures optimized for specific data types and computational constraints. As defined in the **Silicon Contract** (@sec-build-principles), every architecture makes an implicit agreement with hardware, trading computational patterns for efficiency on particular problem classes.
 
-Machine learning systems face a fundamental engineering trade-off: **representational power versus computational efficiency**. In the context of the **Iron Law of ML Systems** (@sec-intro-iron-law), architectural choice is the primary determinant of the **Ops** term. A transformer’s attention mechanism enables global relationships but scales as $O(N^2)$ operations; a CNN exploits spatial locality to reduce operations to $O(N)$. Navigating this trade-off—choosing the right inductive biases for your data while setting a manageable "Ops budget"—defines the practice of neural architecture selection.
+Machine learning systems face a fundamental engineering trade-off: **representational power versus computational efficiency**. In the context of the **Iron Law of ML Systems** (@sec-ai-training-iron-law), architectural choice is the primary determinant of the **Ops** term. A transformer’s attention mechanism enables global relationships but scales as $O(N^2)$ operations; a CNN exploits spatial locality to reduce operations to $O(N)$. Navigating this trade-off—choosing the right inductive biases for your data while setting a manageable "Ops budget"—defines the practice of neural architecture selection.
 
 Five architectural families define modern neural computation, each optimized for different data characteristics:
diff --git a/book/quarto/contents/vol1/frameworks/frameworks.qmd b/book/quarto/contents/vol1/frameworks/frameworks.qmd
index ba4da3f3b..37059460a 100644
--- a/book/quarto/contents/vol1/frameworks/frameworks.qmd
+++ b/book/quarto/contents/vol1/frameworks/frameworks.qmd
@@ -995,7 +995,7 @@ For ImageNet (1.28M training images), compilation pays off within the first epoc
 
 ### The Dispatch Overhead Law {#sec-ai-frameworks-dispatch-overhead-law}
 
-A second principle emerges from the Framework Performance Equation (@eq-framework-performance): when does framework overhead—rather than compute or memory—dominate execution time?
+A second principle emerges from the Dispatch Overhead Equation (@eq-dispatch-overhead): when does framework overhead—rather than compute or memory—dominate execution time?
:::{.callout-important title="The Dispatch Overhead Law"} Framework overhead dominates when operations are small relative to dispatch cost: @@ -3283,16 +3283,10 @@ JAX takes a fundamentally different approach, embracing functional programming p [^pure-function]: **Pure Function**: Has no side effects and always returns the same output for the same inputs. Pure functions enable mathematical reasoning about code behavior and safe program transformations. -[^fn-cuda]: **CUDA (Compute Unified Device Architecture)**: NVIDIA's parallel computing platform launched in 2007 that transformed ML by enabling general-purpose GPU computing. GPUs can execute thousands of threads simultaneously, often providing order-of-magnitude speedups (e.g., \(\approx 10\) to \(100\times\)) for large, well-optimized matrix operations compared to CPU execution, fundamentally changing how ML frameworks approach computation scheduling. - [^fn-xla]: **XLA (Accelerated Linear Algebra)**: Google's domain-specific compiler optimizing tensor operations across CPUs, GPUs, and TPUs. Achieves 3–10$\times$ speedups through operation fusion (reducing memory traffic), layout optimization (matching hardware preferences), and hardware-specific codegen. Powers JAX compilation and TensorFlow's tf.function, demonstrating that ML benefits from specialized compiler infrastructure. [^fn-onnx]: **ONNX (Open Neural Network Exchange)**: Industry standard for representing ML models that enables interoperability between frameworks. Supported by Microsoft, Facebook, AWS, and others, ONNX allows models trained in PyTorch to be deployed in TensorFlow Serving or optimized with TensorRT, solving the framework fragmentation problem. -[^fn-tensorrt]: **TensorRT**: NVIDIA's inference optimization library maximizing throughput and minimizing latency on NVIDIA GPUs. 
Achieves 2–5$\times$ speedup through layer fusion (combining operations), INT8 calibration (4$\times$ memory reduction), and kernel auto-tuning selecting optimal implementations per layer. Critical for production deployment, TensorRT powers NVIDIA Triton inference servers handling billions of daily predictions. - -[^fn-horovod]: **Horovod**: Uber's distributed training framework (2017) providing unified API for TensorFlow, PyTorch, and MXNet. Named after a traditional Russian folk dance symbolizing coordination. Ring-allreduce implementation achieves 85--95% theoretical bandwidth utilization, scaling to thousands of GPUs. Widely adopted before native framework distribution support matured, influencing distributed training patterns industry-wide. - ### Framework Design Philosophy {#sec-ai-frameworks-framework-design-philosophy-571b} The three fundamental problems—execution, differentiation, and abstraction—admit no universally optimal solution. Each framework makes deliberate trade-offs that reflect their creators' priorities and intended use cases. These philosophical commitments shape everything from API design to performance characteristics, and understanding them helps developers choose frameworks that align with their project requirements and working styles. 
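The Dispatch Overhead Law quoted in the hunks above lends itself to a quick numerical sketch. The model below is illustrative only: the 5 µs dispatch cost and the compute times are assumed figures, not measurements of any framework.

```python
# A minimal model of the Dispatch Overhead Law: each framework-level op
# pays a fixed dispatch cost t_d before any compute runs, so the
# overhead fraction is t_d / (t_d + t_compute). All numbers assumed.

def overhead_fraction(dispatch_us: float, compute_us: float) -> float:
    return dispatch_us / (dispatch_us + compute_us)

# Small op: 5 us of dispatch wrapping 1 us of compute -> overhead dominates.
small = overhead_fraction(5.0, 1.0)
# Fused op: the same 5 us amortized over 500 us of compute.
fused = overhead_fraction(5.0, 500.0)

print(f"small op: {small:.0%} overhead")   # ~83% spent in the framework
print(f"fused op: {fused:.1%} overhead")   # ~1.0%
```

The crossover point, where fusing more work under a single dispatch stops paying off, is exactly what the law asks engineers to estimate.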
@@ -3607,7 +3601,7 @@ When `loss.backward()` executes: ### Phase 3: Memory Traffic Analysis (The Physics at Work) -Applying the Framework Performance Equation (@eq-framework-performance) to this step: +Applying the Dispatch Overhead Equation (@eq-dispatch-overhead) to this step: +---------------------------+----------------------+--------------------+--------------------------+ | **Component** | **FLOPs** | **Memory Traffic** | **Arithmetic Intensity** | @@ -3662,7 +3656,7 @@ Following the Patterson & Hennessy textbook tradition, this section identifies f **Fallacy:** _"All frameworks provide equivalent performance for the same model architecture."_ -Engineers assume that since ResNet-50 is mathematically identical across frameworks, performance must be equivalent. @tbl-framework-efficiency-matrix disproves this: PyTorch achieves 52ms inference at 32% hardware utilization, while TensorRT delivers 3ms at 88% utilization—a **17× performance gap** on identical hardware. The difference arises from kernel fusion (@eq-fusion-speedup), graph optimization depth, and memory access patterns. Organizations assuming framework equivalence routinely miss production targets by 5-10×. +Engineers assume that since ResNet-50 is mathematically identical across frameworks, performance must be equivalent. @tbl-framework-efficiency-matrix disproves this: PyTorch achieves 52ms inference at 32% hardware utilization, while TensorRT delivers 3ms at 88% utilization—a **17× performance gap** on identical hardware. The difference arises from kernel fusion, graph optimization depth, and memory access patterns. Organizations assuming framework equivalence routinely miss production targets by 5-10×. 
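The 17× figure in the rewritten fallacy above follows directly from the quoted numbers; a short napkin check, using only values from the passage:

```python
# Napkin check of the framework gap: the same ResNet-50 runs in 52 ms
# under eager PyTorch and 3 ms under TensorRT on identical hardware.
pytorch_ms, tensorrt_ms = 52.0, 3.0
speedup = pytorch_ms / tensorrt_ms
print(f"{speedup:.1f}x")  # ~17.3x -- the quoted "17x performance gap"

# Utilization (88% vs 32%) accounts for only part of the gap; the rest
# comes from kernel fusion, graph optimization, and memory access patterns.
util_ratio = 88 / 32
print(util_ratio)  # 2.75
```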
**Pitfall:** _Choosing frameworks based on popularity rather than project requirements._ @@ -3670,7 +3664,7 @@ Engineers assume that since ResNet-50 is mathematically identical across framewo **Fallacy:** _"Framework abstractions eliminate the need for systems knowledge."_ -The Roofline Model (@fig-roofline-model) disproves this fallacy. Element-wise operations like ReLU achieve arithmetic intensity of 0.125 FLOPS/byte—utilizing **under 0.1%** of an A100's peak compute regardless of framework. Understanding when operations are memory-bound vs. compute-bound (@tbl-arithmetic-intensity) and which sequences can be fused (@eq-fusion-speedup) separates efficient implementations from those leaving 80-90% of hardware capacity unused. No framework can optimize away fundamental physics. +The Roofline Model (@fig-roofline-model) disproves this fallacy. Element-wise operations like ReLU achieve arithmetic intensity of 0.125 FLOPS/byte—utilizing **under 0.1%** of an A100's peak compute regardless of framework. Understanding when operations are memory-bound vs. compute-bound and which sequences can be fused separates efficient implementations from those leaving 80-90% of hardware capacity unused. No framework can optimize away fundamental physics. **Pitfall:** _Ignoring vendor lock-in from framework-specific formats._ @@ -3682,7 +3676,7 @@ Framework-infrastructure mismatches impose substantial operational overhead. Ten **Fallacy:** _"Larger batch sizes always improve throughput."_ -The Framework Performance Equation (@eq-framework-performance) reveals why this fails. A 7B parameter model in FP16 consumes 14GB, leaving 66GB on an A100-80GB. But increasing batch size from 8 to 32 quadruples activation memory for transformers (due to attention's $O(S^2)$ scaling), potentially triggering gradient checkpointing that **reduces throughput by 20-30%** despite the larger batch. 
The optimal batch size balances compute saturation (70-80% utilization), activation memory, and framework allocator overhead. Blindly maximizing batch size often achieves *lower* throughput than smaller batches. +The Dispatch Overhead Equation (@eq-dispatch-overhead) reveals why this fails. A 7B parameter model in FP16 consumes 14GB, leaving 66GB on an A100-80GB. But increasing batch size from 8 to 32 quadruples activation memory for transformers (due to attention's $O(S^2)$ scaling), potentially triggering gradient checkpointing that **reduces throughput by 20-30%** despite the larger batch. The optimal batch size balances compute saturation (70-80% utilization), activation memory, and framework allocator overhead. Blindly maximizing batch size often achieves *lower* throughput than smaller batches. **Pitfall:** _Treating compilation overhead as negligible._ @@ -3715,6 +3709,5 @@ These problems are interconnected and constrained by physics—particularly the Framework development continues evolving toward greater developer productivity, broader hardware support, and more flexible deployment options. The trend toward hybrid execution models (write eager, deploy optimized) reflects the industry's recognition that development flexibility and production performance need not be mutually exclusive. We have established the software substrate of ML—the frameworks that translate abstract architectures into executable kernels. The control room is ready, and the dials are tuned. But a control room without a source of energy is just a room with glowing lights. To transform these frameworks into intelligent systems, we need to scale them across massive datasets and hardware fleets. **The control room is ready. Now we need the power plant.** We turn next to the systems of scale: **AI Training** (@sec-ai-training). 
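The memory arithmetic in the batch-size fallacy can be reproduced in a few lines. The weight math comes straight from the passage; the activation footprint at batch 8 is a hypothetical figure chosen for illustration.

```python
# Weight memory for a 7B-parameter model in FP16 (2 bytes per parameter),
# and the remaining headroom on an A100-80GB, as quoted in the passage.
params = 7e9
weight_gb = params * 2 / 1e9
print(weight_gb)       # 14.0 GB of weights
print(80 - weight_gb)  # 66.0 GB of headroom

# Activations grow linearly with batch size (and O(S^2) with sequence
# length for attention), so batch 8 -> 32 quadruples activation memory.
act_gb_batch8 = 12.0   # hypothetical activation footprint at batch 8
act_gb_batch32 = act_gb_batch8 * (32 / 8)
print(act_gb_batch32)  # 48.0 GB -- approaching the checkpointing cliff
```

Under these assumed numbers, the larger batch consumes most of the headroom, which is how gradient checkpointing (and its 20-30% throughput penalty) gets triggered.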
- ::: { .quiz-end } ::: diff --git a/book/quarto/contents/vol1/hw_acceleration/hw_acceleration.qmd b/book/quarto/contents/vol1/hw_acceleration/hw_acceleration.qmd index a5fddf701..dd72a0cac 100644 --- a/book/quarto/contents/vol1/hw_acceleration/hw_acceleration.qmd +++ b/book/quarto/contents/vol1/hw_acceleration/hw_acceleration.qmd @@ -42,7 +42,7 @@ The fundamental surprise of modern computing is that arithmetic is nearly free w ## AI Hardware Acceleration Fundamentals {#sec-ai-acceleration-ai-hardware-acceleration-fundamentals-2096} -Hardware acceleration is the engineering discipline of maximizing the **Throughput** and **Bandwidth** denominators in the **Iron Law of ML Systems** (@sec-intro-iron-law). While architectural choices (@sec-dnn-architectures) set the "Ops budget," the hardware determines the theoretical floor for execution time. However, as the Iron Law reveals, increasing peak throughput only improves performance if **Utilization** remains high. This chapter examines the hardware-software co-design required to reach those theoretical ceilings. +Hardware acceleration is the engineering discipline of maximizing the **Throughput** and **Bandwidth** denominators in the **Iron Law of ML Systems** (@sec-ai-training-iron-law). While architectural choices (@sec-dnn-architectures) set the "Ops budget," the hardware determines the theoretical floor for execution time. However, as the Iron Law reveals, increasing peak throughput only improves performance if **Utilization** remains high. This chapter examines the hardware-software co-design required to reach those theoretical ceilings. The Purpose section posed a deceptively simple question: why does moving data cost more than computing on it? This chapter answers that question by examining how specialized accelerators are architected to minimize data movement while maximizing compute throughput. 
Every architectural innovation examined here, from systolic arrays to tensor cores to memory hierarchies, represents a different strategy for addressing the fundamental asymmetry between cheap arithmetic and expensive memory access. @@ -272,8 +272,6 @@ This historical progression reveals a crucial insight: each wave of hardware spe What distinguishes AI acceleration from earlier specialization waves extends beyond the hardware itself to the scale of integration required. AI accelerators must work seamlessly with frameworks like TensorFlow, PyTorch, and JAX. They require sophisticated compiler support for graph-level transformations, kernel fusion, and memory scheduling. And they must deploy across environments from data centers to mobile devices, each with distinct performance and efficiency requirements. This creates a system-level transformation that requires tight hardware-software coupling—the focus of most of this chapter. -[^fn-heterogeneous]: **Heterogeneous Computing**: Computing systems that combine different types of processors (CPUs, GPUs, TPUs, FPGAs) to optimize performance for diverse workloads. Modern data centers mix x86 CPUs for control tasks, GPUs for training, and TPUs for inference. Programming heterogeneous systems requires frameworks like OpenCL or CUDA that can coordinate execution across different architectures, but offers 10-100$\times$ efficiency gains by matching each task to optimal hardware. - But first, we must understand _what_ bottleneck AI accelerators are designed to solve. The answer determines every subsequent architectural decision. 
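The "fundamental asymmetry between cheap arithmetic and expensive memory access" has well-known numbers behind it. The figures below are the oft-cited 45 nm per-operation energy estimates (Horowitz) and should be read as order-of-magnitude guides, not specifications for any particular chip.

```python
# Approximate energy per 32-bit operation at 45 nm, in picojoules
# (order-of-magnitude estimates only):
fp32_add_pj = 0.9     # one floating-point add
sram_read_pj = 5.0    # one read from small on-chip SRAM
dram_read_pj = 640.0  # one read from off-chip DRAM

# Moving one word from DRAM costs hundreds of adds' worth of energy,
# which is why accelerators are organized around minimizing data movement.
print(round(dram_read_pj / fp32_add_pj))   # ~711 adds per DRAM access
print(round(dram_read_pj / sram_read_pj))  # ~128x SRAM vs DRAM
```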
### The Integration Bottleneck: Why AI Needs Specialized Hardware {#sec-ai-acceleration-integration-bottleneck} diff --git a/book/quarto/contents/vol1/introduction/introduction.qmd b/book/quarto/contents/vol1/introduction/introduction.qmd index e81d11325..7e0b4e6be 100644 --- a/book/quarto/contents/vol1/introduction/introduction.qmd +++ b/book/quarto/contents/vol1/introduction/introduction.qmd @@ -865,8 +865,6 @@ Given the three components of the AI Triad, which should we prioritize to advanc Richard Sutton's 2019 essay "The Bitter Lesson" formalizes the historical pattern we just traced [@sutton2019bitter]. Looking back at seven decades of research, Sutton observed that general methods which can leverage increasing computation consistently outperform approaches that encode human expertise. He writes: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." -[^fn-richard-sutton]: **Richard Sutton**: A leading researcher in reinforcement learning. Sutton co-authored *Reinforcement Learning: An Introduction* with Andrew Barto and developed foundational algorithms including temporal difference learning and policy gradient methods. In 2024, Sutton and Barto received the ACM Turing Award for contributions to reinforcement learning. Sutton's essay "The Bitter Lesson" summarizes a recurring pattern in AI history: methods that scale with computation tend to dominate approaches that encode domain-specific human knowledge [@sutton2019bitter]. - @tbl-ai-evolution-performance provides quantitative validation of this principle. The shift from expert systems to statistical learning to deep learning has dramatically improved performance on representative tasks, with each transition enabled by increased computational scale rather than cleverer encoding of human knowledge. 
+----------------------------------+--------------------------------+-------------------------+---------------------+--------------------------------+ @@ -1123,24 +1121,14 @@ As the triangle illustrates, no single element can function in isolation. Algori This triangular dependency means that advancing any single component in isolation provides limited benefit. Improved algorithms cannot realize their potential without sufficient data and computational capacity. Larger datasets become burdensome without algorithms capable of extracting meaningful patterns and infrastructure capable of processing them efficiently. More powerful hardware accelerates computation but cannot compensate for poor data quality or unsuitable algorithmic approaches. Machine learning systems demand careful orchestration of all three components, with each constraining and enabling the others. - - ### Lighthouse Archetypes: The Systems Detectives {#sec-intro-lighthouse} - - To ground the abstract interdependencies of the Iron Law in concrete practice, this textbook employs five recurring **Lighthouse Archetypes**. We do not use these models merely as examples; they are **Systems Detectives**—canonical workloads that we will use in every chapter to "interrogate" the Silicon Contract. - - Each archetype represents a distinct extreme of the Iron Law. For instance, **ResNet-50** allows us to investigate the **Compute Term** in its purest form, while **GPT-2/Llama** acts as our primary probe for **Memory Bandwidth** bottlenecks. By following these same workloads from data engineering through to edge deployment, you will see how a single architectural choice propagates physical and economic constraints across the entire system. - - @tbl-lighthouse-examples summarizes the quantitative characteristics that make each archetype a unique diagnostic tool for ML systems engineering. 
- - +----------------------+--------------------+----------------+--------------------------+------------------------+-----------------------+ | **Archetype** | **Parameters** | **Model Size** | **Arithmetic Intensity** | **Primary Constraint** | **Deployment Target** | @@ -1167,12 +1155,8 @@ Each archetype represents a distinct extreme of the Iron Law. For instance, **Re +----------------------+--------------------+----------------+--------------------------+------------------------+-----------------------+ - - : **Lighthouse Archetype Specifications**: These workloads recur throughout subsequent chapters, acting as "Detectives" that reveal how different systems principles (e.g., quantization, batching, or pruning) affect performance differently across various architectural constraints. {#tbl-lighthouse-examples} - - Each archetype manifests different constraints within the AI Triad, ensuring that the principles developed throughout this text are tested against the diversity of real-world systems engineering challenges. Later in this chapter, we complement these technical workloads with three deployment case studies—Waymo, FarmBeats, and AlphaFold—that illustrate how the same core challenges manifest in production systems under radically different constraints. diff --git a/book/quarto/contents/vol1/ops/ops.qmd b/book/quarto/contents/vol1/ops/ops.qmd index 75fcdb0bb..b4581e825 100644 --- a/book/quarto/contents/vol1/ops/ops.qmd +++ b/book/quarto/contents/vol1/ops/ops.qmd @@ -50,7 +50,7 @@ ML monitoring must answer: *Is the model still accurate? Have input distribution The operational discipline of MLOps exists to close this observability gap, transforming statistical health into actionable signals before business metrics reveal the damage. ::: -If Benchmarking (@sec-benchmarking-ai) provides the *sensors* for our deployment control loop, then **MLOps** is the complete *control system*. 
Benchmarking tells us how a model performs at a point in time; MLOps ensures that performance is maintained over the model's operational lifetime. Together, they close the **Verification Gap** (@sec-deploy-principles) by continuously recalibrating the model against a changing world. In this final functional layer, we manage the "Entropy" of the machine, ensuring that the **Iron Law** remains satisfied even as data distributions drift. +If Benchmarking (@sec-benchmarking-ai) provides the *sensors* for our deployment control loop, then **MLOps** is the complete *control system*. Benchmarking tells us how a model performs at a point in time; MLOps ensures that performance is maintained over the model's operational lifetime. Together, they close the **Verification Gap** (@sec-introduction) by continuously recalibrating the model against a changing world. In this final functional layer, we manage the "Entropy" of the machine, ensuring that the **Iron Law** remains satisfied even as data distributions drift. Machine Learning Operations (MLOps)[^fn-mlops-emergence] provides the disciplinary framework that synthesizes monitoring, automation, and governance into coherent production architectures. This operational discipline addresses the challenge of translating experimental success into *sustained* system performance—not just deployment, but deployment that remains effective. @@ -1356,6 +1356,8 @@ Once trained and validated, a model must be integrated into a production environ Teams must properly package, test, and track ML models for reliable production deployment. One common approach involves containerizing models using containerization technologies[^fn-containerization-orchestration], ensuring smooth portability across environments. +[^fn-containerization-orchestration]: **Containerization and Orchestration**: Docker (2013) pioneered application containerization, packaging code with dependencies into portable units. 
Kubernetes (2014, open-sourced by Google) emerged as the dominant orchestration platform, managing container deployment, scaling, and networking across clusters. + Production deployment requires frameworks that handle model packaging, versioning, and integration with serving infrastructure. Tools like MLflow and model registries manage these deployment artifacts, while serving-specific frameworks (detailed in the Inference Serving section) handle the runtime optimization and scaling requirements. Before full-scale rollout, teams deploy updated models to staging or QA environments[^fn-tensorflow-serving-origins] to rigorously test performance. diff --git a/book/quarto/contents/vol1/parts/optimize_principles.qmd b/book/quarto/contents/vol1/parts/optimize_principles.qmd index 15625daae..2fefcf1ea 100644 --- a/book/quarto/contents/vol1/parts/optimize_principles.qmd +++ b/book/quarto/contents/vol1/parts/optimize_principles.qmd @@ -56,8 +56,8 @@ Think of ML optimization as a compiler pipeline. Just as a compiler transforms s 2. **Model Compression (@sec-model-compression)**: Optimizing the *Binary*. After training completes, we optimize what comes out—pruning unnecessary weights, quantizing to lower precision, and distilling knowledge into smaller architectures. This reduces the cost of every inference. -3. **Hardware Acceleration (@sec-hw-acceleration)**: Optimizing the *Execution*. Finally, we optimize how the model runs—designing silicon (GPUs, TPUs, NPUs) and software kernels that match the mathematical patterns of modern AI. This maximizes throughput on the target platform. +3. **Hardware Acceleration (@sec-ai-acceleration)**: Optimizing the *Execution*. Finally, we optimize how the model runs—designing silicon (GPUs, TPUs, NPUs) and software kernels that match the mathematical patterns of modern AI. This maximizes throughput on the target platform. 4. **Benchmarking (@sec-benchmarking-ai)**: Validating the *Claims*. Optimization without measurement is guesswork. 
We learn to measure performance reliably, distinguish laboratory results from production reality, and avoid "optimization theater" where benchmark numbers improve but deployment doesn't. -Each stage compounds: a well-curated dataset trains a model that compresses cleanly and runs efficiently on optimized hardware. Skip any stage, and you leave performance on the table. \ No newline at end of file +Each stage compounds: a well-curated dataset trains a model that compresses cleanly and runs efficiently on optimized hardware. Skip any stage, and you leave performance on the table. diff --git a/book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd b/book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd index 5f09319b2..68279a7c1 100644 --- a/book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd +++ b/book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd @@ -212,7 +212,7 @@ This engineering-centered approach does not diminish the importance of diverse p ![**Responsible AI Governance Layers**. Responsible engineering operates within nested governance structures. At the center, engineering teams implement technical practices (audit trails, bias testing, explainable UIs). These exist within organizational safety culture and management strategies. Industry certification and external reviews provide independent oversight. Government regulation establishes the outermost boundary. 
Engineers must understand where their work fits in this ecosystem: technical excellence at the center enables compliance with requirements flowing from outer layers.](images/png/human_centered_ai.png){#fig-governance-layers fig-alt="Nested oval diagram showing governance layers from innermost to outermost: Team (reliable systems, software engineering), Organization (safety culture, organizational design), Industry (trustworthy certification, external reviews), and Government Regulation."} ::: {.callout-perspective title="The Full Cost of the Iron Law"} -The **Iron Law of ML Systems** established in @sec-intro-iron-law holds that system performance depends on the interaction between data, compute, and system overhead. We have spent previous chapters optimizing each term—compressing models (@sec-model-compression), accelerating hardware (@sec-ai-acceleration), and automating operations (@sec-ml-operations). But every optimization has costs beyond those captured in benchmarks. +The **Iron Law of ML Systems** established in @sec-ai-training-iron-law holds that system performance depends on the interaction between data, compute, and system overhead. We have spent previous chapters optimizing each term—compressing models (@sec-model-compression), accelerating hardware (@sec-ai-acceleration), and automating operations (@sec-ml-operations). But every optimization has costs beyond those captured in benchmarks. A model quantized for edge deployment consumes less energy—but also produces outputs that may differ across demographic groups. A recommendation system optimized for engagement maximizes a business metric—but may amplify harmful content. Responsible engineering extends our accounting to include these broader impacts: the carbon cost of computation, the fairness cost of optimization choices, and the societal cost of deployment at scale. The Iron Law governs *how fast* our systems run; responsible engineering governs *how well* they serve. 
::: @@ -452,22 +452,14 @@ The 30 percentage point TPR disparity far exceeds common industry thresholds of Several mitigation approaches exist, each with distinct tradeoffs. Threshold adjustment lowers the approval threshold for Group B to equalize TPR but may increase false positives for that group. Reweighting increases the weight of Group B samples during training to give the model stronger signal about this population but may reduce overall accuracy. Adversarial debiasing trains with an adversary that prevents the model from learning group membership but adds training complexity. The choice among these approaches requires stakeholder input about which tradeoffs are acceptable in the specific application context. - - #### The Fairness-Accuracy Pareto Frontier {#sec-responsible-engineering-pareto-frontier} - - In systems engineering, we rarely find a "free lunch." Just as optimizing for Power often sacrifices Performance, optimizing for Fairness often sacrifices aggregate Accuracy. To surpass the qualitative discussions of ethics, you must treat this tradeoff as a **Pareto Frontier**—the set of optimal configurations where you cannot improve one metric without degrading another. - - ::: {.callout-perspective title="Napkin Math: The Price of Fairness"} **The Problem**: Your stakeholders demand that you eliminate a 20% True Positive Rate (TPR) disparity in a hiring model. What is the "Price of Fairness" in terms of hiring quality? - - **The Physics**: You can equalize TPRs by adjusting the classification threshold ($\tau$) for the disadvantaged group. * **Original State**: Group A (TPR=90%), Group B (TPR=70%). Aggregate Accuracy = 85%. @@ -476,8 +468,6 @@ In systems engineering, we rarely find a "free lunch." Just as optimizing for Po * **The Cost**: Lowering the threshold increases **False Positives** (hiring candidates who don't meet the bar). - - **The Calculation**: 1. To close the 20% TPR gap, you must accept a **5% increase** in False Positives. 
@@ -489,14 +479,10 @@ In systems engineering, we rarely find a "free lunch." Just as optimizing for Po * In this scenario, closing the gap reduces the system's **Total Utility by 3%**. - - **The Systems Conclusion**: The "Price of Fairness" in this system is a 3% utility tax. This is not a "bug"; it is a **System Constraint**. Your job is not to find a "fair" model, but to present the **Pareto Curve** to stakeholders so they can choose the Utility/Fairness tradeoff that aligns with organizational values. ::: - - Quantifying disparities is necessary but not sufficient. When a loan applicant receives a rejection, stating that "the model's true positive rate for your demographic group is 60% compared to 90% for other groups" provides no actionable information. The applicant needs to know: Why was *my* application rejected? What could I change? These questions require explainability, which is the ability to articulate which input features drove specific predictions. diff --git a/book/quarto/contents/vol1/serving/serving.qmd b/book/quarto/contents/vol1/serving/serving.qmd index 534813c32..8493e0c36 100644 --- a/book/quarto/contents/vol1/serving/serving.qmd +++ b/book/quarto/contents/vol1/serving/serving.qmd @@ -42,7 +42,7 @@ Training and serving want opposite things. Training maximizes throughput—large ## The Serving Paradigm {#sec-serving-paradigm} -Serving represents the critical transition from model development to production deployment, where the **Iron Law of ML Systems** (@sec-intro-iron-law) undergoes a fundamental shift. In training, we optimized for high Throughput and Bandwidth to process massive datasets. In serving, the **Latency term** ($\text{Latency}_{\text{fixed}}$)—the irreducible overhead of request scheduling, network round-trips, and system orchestration—suddenly becomes the dominant constraint. This chapter explores how to re-engineer the system to minimize this fixed 'tax' on every prediction. 
+Serving represents the critical transition from model development to production deployment, where the **Iron Law of ML Systems** (@sec-ai-training-iron-law) undergoes a fundamental shift. In training, we optimized for high Throughput and Bandwidth to process massive datasets. In serving, the **Latency term** ($\text{Latency}_{\text{fixed}}$)—the irreducible overhead of request scheduling, network round-trips, and system orchestration—suddenly becomes the dominant constraint. This chapter explores how to re-engineer the system to minimize this fixed 'tax' on every prediction. Serving introduces a fundamental inversion that transforms everything established in prior chapters. Training optimizes for samples processed per hour over days of computation. Serving must deliver predictions within milliseconds under unpredictable load. @sec-benchmarking-ai established techniques for measuring throughput and accuracy under controlled conditions; production serving faces traffic patterns that no benchmark could anticipate. @sec-model-compression provided quantization methods that reduced model size; serving must validate that those optimizations preserve accuracy under real traffic distributions. This inversion from throughput to latency, from controlled to unpredictable, from offline to real time defines the serving challenge. 
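The inversion described above can be made concrete with a toy Iron-Law decomposition. All numbers below are illustrative assumptions (a 5 ms fixed overhead and round throughput figures), not measurements of any real system:

```python
def request_time_ms(fixed_latency_ms, work_flops, throughput_flops_per_ms):
    """Iron-Law-style split: total time = fixed overhead + useful compute time."""
    return fixed_latency_ms + work_flops / throughput_flops_per_ms

# Training-style large batch: compute dominates, the fixed 'tax' is noise (~5%).
batch = request_time_ms(fixed_latency_ms=5.0, work_flops=1e12,
                        throughput_flops_per_ms=1e10)   # 105 ms total
# Serving-style single request: the same 5 ms tax is now ~98% of the total.
single = request_time_ms(fixed_latency_ms=5.0, work_flops=1e9,
                         throughput_flops_per_ms=1e10)  # 5.1 ms total
```

Note the asymmetry: the work per request shrank 1000x, but the response time only improved about 20x, because the fixed latency term does not shrink with the workload.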
diff --git a/book/quarto/contents/vol2/distributed_training/distributed_training.qmd b/book/quarto/contents/vol2/distributed_training/distributed_training.qmd index 2388aa6d2..36ef6fb47 100644 --- a/book/quarto/contents/vol2/distributed_training/distributed_training.qmd +++ b/book/quarto/contents/vol2/distributed_training/distributed_training.qmd @@ -816,9 +816,9 @@ Production teams typically benchmark communication patterns and scaling efficien Having established the mathematical foundation that makes data parallelism theoretically sound, where gradient averaging preserves the statistical properties of SGD, we now examine how these principles translate into concrete implementation steps. Each step in the implementation corresponds to a phase in the gradient averaging process formalized above, from distributing data subsets to synchronizing the computed gradients. -The process of data parallelism can be broken into a series of distinct steps, each with its role in ensuring the system operates efficiently. Consider @fig-train-data-parallelism: it traces the complete workflow from dataset splitting through gradient aggregation, showing how each GPU processes its assigned batch before synchronization brings all gradients together for parameter updates. +The process of data parallelism can be broken into a series of distinct steps, each with its role in ensuring the system operates efficiently. Consider @fig-dist-train-data-parallelism: it traces the complete workflow from dataset splitting through gradient aggregation, showing how each GPU processes its assigned batch before synchronization brings all gradients together for parameter updates. -::: {#fig-train-data-parallelism fig-env="figure" fig-pos="htb" fig-cap="**Data Parallelism Implementation Pipeline**. 
The five-stage workflow for data parallel training: (1) split input data into non-overlapping subsets, (2) assign batches to GPUs, (3) compute forward and backward passes independently, (4) synchronize gradients via AllReduce, and (5) update parameters uniformly across all devices. This approach contrasts with model parallelism, where the model itself is partitioned rather than replicated." fig-alt="Flowchart showing 5-stage data parallelism: Input Data splits into 4 batches assigned to GPUs 1-4, each performs forward and backward pass, gradients synchronize and aggregate, then model updates."} +::: {#fig-dist-train-data-parallelism fig-env="figure" fig-pos="htb" fig-cap="**Data Parallelism Implementation Pipeline**. The five-stage workflow for data parallel training: (1) split input data into non-overlapping subsets, (2) assign batches to GPUs, (3) compute forward and backward passes independently, (4) synchronize gradients via AllReduce, and (5) update parameters uniformly across all devices. This approach contrasts with model parallelism, where the model itself is partitioned rather than replicated." fig-alt="Flowchart showing 5-stage data parallelism: Input Data splits into 4 batches assigned to GPUs 1-4, each performs forward and backward pass, gradients synchronize and aggregate, then model updates."} ```{.tikz} \begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}] \tikzset{Line/.style={line width=0.75pt,black!50,text=black @@ -1314,9 +1314,9 @@ Device coordination follows a specific pattern during training. In the forward p ### Model Parallelism Implementation {#sec-distributed-training-model-parallelism-implementation} -Model parallelism divides neural networks across multiple computing devices, with each device computing a distinct portion of the model's operations. This division allows training of models whose parameter counts exceed single-device memory capacity. 
The technique encompasses device coordination, data flow management, and gradient computation across distributed model segments. @fig-model-parallelism captures this bidirectional data flow: input data propagates forward through sequentially assigned model partitions while gradients flow backward to update parameters, with intermediate results transferring across device boundaries at each stage. +Model parallelism divides neural networks across multiple computing devices, with each device computing a distinct portion of the model's operations. This division allows training of models whose parameter counts exceed single-device memory capacity. The technique encompasses device coordination, data flow management, and gradient computation across distributed model segments. @fig-dist-model-parallelism captures this bidirectional data flow: input data propagates forward through sequentially assigned model partitions while gradients flow backward to update parameters, with intermediate results transferring across device boundaries at each stage. -::: {#fig-model-parallelism fig-env="figure" fig-pos="htb" fig-cap="**Model Parallelism Data Flow**. Sequential distribution of model partitions across three devices: input data flows forward through each partition in order (top path), while gradients propagate backward during the update phase (bottom path). Each device handles a distinct portion of the model, with intermediate activations and gradients transferring at partition boundaries. This approach enables training of models exceeding single-device memory at the cost of sequential dependencies that reduce hardware utilization." fig-alt="Linear flow diagram showing model parallelism: Input Data flows through Model Parts 1-3 on Devices 1-3 to Predictions via forward pass arrows above, with gradient update arrows returning below."} +::: {#fig-dist-model-parallelism fig-env="figure" fig-pos="htb" fig-cap="**Model Parallelism Data Flow**. 
Sequential distribution of model partitions across three devices: input data flows forward through each partition in order (top path), while gradients propagate backward during the update phase (bottom path). Each device handles a distinct portion of the model, with intermediate activations and gradients transferring at partition boundaries. This approach enables training of models exceeding single-device memory at the cost of sequential dependencies that reduce hardware utilization." fig-alt="Linear flow diagram showing model parallelism: Input Data flows through Model Parts 1-3 on Devices 1-3 to Predictions via forward pass arrows above, with gradient update arrows returning below."} ```{.tikz} \begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}] \tikzset{Line/.style={line width=1.0pt,black!50,text=black @@ -1400,9 +1400,9 @@ Model parallelism can be implemented through different strategies for dividing t #### Layer-wise Partitioning {#sec-distributed-training-layerwise-partitioning} -Layer-wise partitioning assigns distinct model layers to separate computing devices. In transformer architectures, this translates to specific devices managing defined sets of attention and feed-forward blocks. @fig-layers-blocks demonstrates this partitioning for a 24-layer transformer: six consecutive blocks reside on each of four devices, with forward activations flowing left-to-right and backward gradients propagating right-to-left across the device boundaries. +Layer-wise partitioning assigns distinct model layers to separate computing devices. In transformer architectures, this translates to specific devices managing defined sets of attention and feed-forward blocks. @fig-dist-layers-blocks demonstrates this partitioning for a 24-layer transformer: six consecutive blocks reside on each of four devices, with forward activations flowing left-to-right and backward gradients propagating right-to-left across the device boundaries. 
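The block assignment described above (24 transformer layers, six consecutive blocks per GPU) is plain contiguous chunking. A minimal sketch, with a hypothetical helper name:

```python
def partition_layers(num_layers, num_devices):
    """Assign consecutive layer indices to devices, balancing block sizes."""
    base, extra = divmod(num_layers, num_devices)
    assignment, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)  # earlier devices absorb any remainder
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment

parts = partition_layers(24, 4)
# parts[0] holds layers 0-5 (GPU 1), ..., parts[3] holds layers 18-23 (GPU 4)
```

Keeping blocks contiguous means only the activations at partition boundaries cross devices, matching the forward and backward arrows at the device boundaries in the figure.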
-::: {#fig-layers-blocks fig-env="figure" fig-pos="htb" fig-cap="**Layer-Wise Model Parallelism**. A 24-layer transformer distributed across four GPUs, with six consecutive transformer blocks assigned to each device. Forward activations (black arrows) flow left-to-right through device boundaries, while backward gradients (red arrows) propagate right-to-left during parameter updates. This partitioning reduces per-GPU memory from the full model size to 1/4, but introduces sequential dependencies where downstream devices wait for upstream computation to complete." fig-alt="24-layer transformer split across 4 devices: Blocks 1-6 on GPU 1, 7-12 on GPU 2, 13-18 on GPU 3, 19-24 on GPU 4. Black arrows show forward activation flow; red arrows show backward gradient propagation."} +::: {#fig-dist-layers-blocks fig-env="figure" fig-pos="htb" fig-cap="**Layer-Wise Model Parallelism**. A 24-layer transformer distributed across four GPUs, with six consecutive transformer blocks assigned to each device. Forward activations (black arrows) flow left-to-right through device boundaries, while backward gradients (red arrows) propagate right-to-left during parameter updates. This partitioning reduces per-GPU memory from the full model size to 1/4, but introduces sequential dependencies where downstream devices wait for upstream computation to complete." fig-alt="24-layer transformer split across 4 devices: Blocks 1-6 on GPU 1, 7-12 on GPU 2, 13-18 on GPU 3, 19-24 on GPU 4. 
Black arrows show forward activation flow; red arrows show backward gradient propagation."} ```{.tikz} \begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}\small] \tikzset{Line/.style={line width=1.0pt,black!50 diff --git a/book/quarto/contents/vol2/edge_intelligence/edge_intelligence.qmd b/book/quarto/contents/vol2/edge_intelligence/edge_intelligence.qmd index 0025ed4f6..8b89968e7 100644 --- a/book/quarto/contents/vol2/edge_intelligence/edge_intelligence.qmd +++ b/book/quarto/contents/vol2/edge_intelligence/edge_intelligence.qmd @@ -2047,7 +2047,7 @@ $$ This reduces total rounds by $6.25\times$ at the cost of $2.5\times$ more communication per unit of local computation. The total client-epochs become $900 \times 10 \times 2 = 18,000$, a $15.6\times$ reduction from the $E = 5$ non-IID case. This illustrates why adaptive local epoch selection based on estimated data heterogeneity significantly improves federated learning efficiency. -**Communication-Computation Trade-off.** The interaction between local epochs and communication rounds creates a fundamental design trade-off visualized in @fig-convergence-tradeoff. More local epochs reduce communication frequency but increase client drift, while fewer local epochs maintain tighter synchronization at higher communication cost. +**Communication-Computation Trade-off.** The interaction between local epochs and communication rounds creates a fundamental design trade-off. More local epochs reduce communication frequency but increase client drift, while fewer local epochs maintain tighter synchronization at higher communication cost. ::: {.callout-note title="Figure: Communication-Computation Trade-off" collapse="false"} ```{.tikz} @@ -2824,7 +2824,7 @@ We established a three-pillar framework for navigating these constraints: **Mode * **Heterogeneity is the Default**: Unlike uniform datacenter racks, the edge is a fragmented ecosystem of microcontrollers and SoCs. 
Federated learning must account for "stragglers" and participation bias generated by these physical disparities. ::: -Part III has now examined deployment at scale across the full spectrum: from centralized inference systems serving millions of requests per second (@sec-inference-at-scale) to the advanced optimization tricks (@sec-optimization-at-scale) and the massively distributed edge fleets explored here. +Part III has now examined deployment at scale across the full spectrum: from centralized inference systems serving millions of requests per second (@sec-inference-at-scale), through the advanced serving optimization techniques covered in that chapter, to the massively distributed edge fleets explored here. However, deploying a model, whether to a GPU cluster or a billion smartphones, is only the beginning. Scaling these services creates operational complexity that grows exponentially with every model version, device type, and deployment region. How do we ensure a model update does not silently fail on 10% of budget Android phones? How do we monitor drift across a million private, local datasets? diff --git a/book/quarto/contents/vol2/frontiers/frontiers.qmd index f4bf186b6..53245b549 100644 --- a/book/quarto/contents/vol2/frontiers/frontiers.qmd +++ b/book/quarto/contents/vol2/frontiers/frontiers.qmd @@ -1640,7 +1640,7 @@ Operational practices (CI/CD, monitoring, incident response) remain essential; A
-Consider data flow through an integrated compound system serving a complex user query. Novel data engineering pipelines from @sec-agi-systems-data-engineering-scale-91a0 continuously generate synthetic training examples, curate web-scale corpora, and enable self-play learning that produce specialized training datasets for different components. These datasets feed into dynamic architectures from @sec-agi-systems-dynamic-architectures-compound-systems-fca0 where mixture-of-experts models route different aspects of queries to specialized components: mathematical reasoning to quantitative experts, creative writing to language specialists, code generation to programming-focused modules. Each expert was trained using methodologies from @sec-agi-systems-training-methodologies-compound-systems-e3fa including RLHF alignment, constitutional AI self-improvement, and continual learning that adapts to user feedback. Optimization techniques from @sec-optimization-at-scale enable deploying these components efficiently through quantization reducing memory footprints, pruning eliminating redundant parameters, and distillation transferring knowledge to smaller deployment models. This optimized model ensemble runs on heterogeneous hardware from @sec-frontiers-emerging-hardware combining GPU clusters for transformer inference, neuromorphic chips for event-driven perception, and specialized accelerators for symbolic reasoning. Finally, evolved MLOps from @sec-agi-systems-operations-continuous-system-evolution-ed9b monitors this complex deployment through semantic validation, handles component failures gracefully, and supports continuous learning updates without service interruption. +Consider data flow through an integrated compound system serving a complex user query. 
Novel data engineering pipelines from @sec-agi-systems-data-engineering-scale-91a0 continuously generate synthetic training examples, curate web-scale corpora, and enable self-play learning, producing specialized training datasets for different components. These datasets feed into dynamic architectures from @sec-agi-systems-dynamic-architectures-compound-systems-fca0 where mixture-of-experts models route different aspects of queries to specialized components: mathematical reasoning to quantitative experts, creative writing to language specialists, code generation to programming-focused modules. Each expert was trained using methodologies from @sec-agi-systems-training-methodologies-compound-systems-e3fa including RLHF alignment, constitutional AI self-improvement, and continual learning that adapts to user feedback. Optimization techniques from @sec-model-compression enable deploying these components efficiently through quantization that reduces memory footprints, pruning that eliminates redundant parameters, and distillation that transfers knowledge to smaller deployment models. This optimized model ensemble runs on heterogeneous hardware from @sec-frontiers-emerging-hardware combining GPU clusters for transformer inference, neuromorphic chips for event-driven perception, and specialized accelerators for symbolic reasoning. Finally, evolved MLOps from @sec-agi-systems-operations-continuous-system-evolution-ed9b monitors this complex deployment through semantic validation, handles component failures gracefully, and supports continuous learning updates without service interruption.
This creates a tightly coupled design space where co-optimization across building blocks often yields greater improvements than optimizing any single component. diff --git a/book/quarto/contents/vol2/orchestration/orchestration.qmd b/book/quarto/contents/vol2/orchestration/orchestration.qmd index 4d251c96f..1a56be430 100644 --- a/book/quarto/contents/vol2/orchestration/orchestration.qmd +++ b/book/quarto/contents/vol2/orchestration/orchestration.qmd @@ -387,7 +387,7 @@ For inference workloads, we analyzed autoscaling based on custom metrics, resour With the fleet built—compute nodes defined, networks connected, storage configured, and schedulers running—the machine is ready. But a training cluster is only a means to an end: producing models that serve users. -**Part III: Deployment at Scale** begins with @sec-inference, where we examine the challenges of serving these models to millions of users with low latency and high throughput. +**Part III: Deployment at Scale** begins with @sec-inference-at-scale, where we examine the challenges of serving these models to millions of users with low latency and high throughput. ```{=latex} \part{key:vol2_deployment} diff --git a/book/quarto/contents/vol2/parts/infrastructure_principles.qmd b/book/quarto/contents/vol2/parts/infrastructure_principles.qmd index e71675ab2..f79a056d5 100644 --- a/book/quarto/contents/vol2/parts/infrastructure_principles.qmd +++ b/book/quarto/contents/vol2/parts/infrastructure_principles.qmd @@ -36,6 +36,6 @@ A cluster with 10,000 GPUs is useless for training if they are connected by a 1 This section builds the physical substrate: 1. **Infrastructure (@sec-infrastructure)**: The silicon, power, and cooling systems of the AI Supercomputer. -2. **Networking (@sec-networking)**: The "Gradient Bus" (InfiniBand, RoCE) that connects the fleet. +2. **Communication (@sec-communication)**: The "Gradient Bus" (InfiniBand, RoCE) that connects the fleet. 3. 
**Storage (@sec-storage)**: The data hierarchy (NVMe to Object Store) that feeds the beast. 4. **Orchestration (@sec-orchestration)**: The "Brain" that manages resources and scheduling. diff --git a/book/quarto/contents/vol2/responsible_ai/responsible_ai.qmd index 2754c6701..761853150 100644 --- a/book/quarto/contents/vol2/responsible_ai/responsible_ai.qmd +++ b/book/quarto/contents/vol2/responsible_ai/responsible_ai.qmd @@ -1712,9 +1712,9 @@ While posthoc methods are flexible and broadly applicable, they come with limita In contrast, inherently interpretable models are transparent by design. Examples include decision trees, rule lists, linear models with monotonicity constraints, and k-nearest neighbor classifiers. These models expose their reasoning structure directly, enabling stakeholders to trace predictions through a set of interpretable rules or comparisons. In regulated or safety-critical domains such as recidivism prediction or medical triage, inherently interpretable models may be preferred, even at the cost of some accuracy [@rudin2019stop]. However, these models generally do not scale well to high-dimensional or unstructured data, and their simplicity can limit performance in complex tasks. -@fig-interpretability-spectrum visualizes the relative interpretability of different model types along a spectrum: decision trees and linear regression offer transparency by design, whereas more complex architectures like neural networks and convolutional models require external techniques to explain their behavior. This distinction is central to choosing an appropriate model for a given application, particularly in settings where regulatory scrutiny or stakeholder trust is paramount.
+@fig-vol2-interpretability-spectrum visualizes the relative interpretability of different model types along a spectrum: decision trees and linear regression offer transparency by design, whereas more complex architectures like neural networks and convolutional models require external techniques to explain their behavior. This distinction is central to choosing an appropriate model for a given application, particularly in settings where regulatory scrutiny or stakeholder trust is paramount. -::: {#fig-interpretability-spectrum fig-env="figure" fig-pos="htb" fig-cap="**Model Interpretability Spectrum**: Inherently interpretable models, such as linear regression and decision trees, offer transparent reasoning, while complex models like neural networks require posthoc explanation techniques to understand their predictions. This distinction guides model selection based on application needs, prioritizing transparency in regulated domains or when stakeholder trust is important." fig-alt="Horizontal spectrum from More Interpretable to Less Interpretable with 6 models: decision trees, linear regression, logistic regression, random forest, neural network, CNN. Brace marks first three as intrinsically interpretable."} +::: {#fig-vol2-interpretability-spectrum fig-env="figure" fig-pos="htb" fig-cap="**Model Interpretability Spectrum**: Inherently interpretable models, such as linear regression and decision trees, offer transparent reasoning, while complex models like neural networks require posthoc explanation techniques to understand their predictions. This distinction guides model selection based on application needs, prioritizing transparency in regulated domains or when stakeholder trust is important." fig-alt="Horizontal spectrum from More Interpretable to Less Interpretable with 6 models: decision trees, linear regression, logistic regression, random forest, neural network, CNN. 
Brace marks first three as intrinsically interpretable."} ```{.tikz} \begin{tikzpicture}[line join=round,font=\small\usefont{T1}{phv}{m}{n}] diff --git a/book/tools/scripts/utilities/validate_part_keys.py b/book/tools/scripts/utilities/validate_part_keys.py index b55a6cefb..ef7014f62 100755 --- a/book/tools/scripts/utilities/validate_part_keys.py +++ b/book/tools/scripts/utilities/validate_part_keys.py @@ -19,42 +19,49 @@ from pathlib import Path from typing import Dict, List, Set, Tuple def load_part_summaries() -> Dict: - """Load part summaries from YAML file.""" + """Load part summaries from YAML file(s).""" # Try multiple possible paths to handle being run from root or book/ + # Support both single-volume and two-volume structures possible_paths = [ Path("quarto/contents/parts/summaries.yml"), - Path("book/quarto/contents/parts/summaries.yml") + Path("book/quarto/contents/parts/summaries.yml"), + Path("quarto/contents/vol1/parts/summaries.yml"), + Path("book/quarto/contents/vol1/parts/summaries.yml"), + Path("quarto/contents/vol2/parts/summaries.yml"), + Path("book/quarto/contents/vol2/parts/summaries.yml") ] - - yaml_path = None - for p in possible_paths: - if p.exists(): - yaml_path = p - break - - if not yaml_path: + + # Collect all existing paths (for two-volume structure) + found_paths = [p for p in possible_paths if p.exists()] + + if not found_paths: print("❌ Error: summaries.yml not found in expected locations") return {} - try: - with open(yaml_path, 'r') as f: - data = yaml.safe_load(f) - if 'parts' not in data: - print("❌ Error: No 'parts' section in summaries.yml") - return {} + # Aggregate summaries from all found files + summaries = {} + for yaml_path in found_paths: + try: + with open(yaml_path, 'r') as f: + data = yaml.safe_load(f) + if 'parts' not in data: + print(f"⚠️ Warning: No 'parts' section in {yaml_path}") + continue - # Create a mapping of normalized keys to entries - summaries = {} - for part in data['parts']: - if 'key' in part: - key 
= part['key'].lower().replace('_', '').replace('-', '') - summaries[key] = part + # Create a mapping of normalized keys to entries + for part in data['parts']: + if 'key' in part: + key = part['key'].lower().replace('_', '').replace('-', '') + summaries[key] = part + except Exception as e: + print(f"❌ Error loading {yaml_path}: {e}") - return summaries - except Exception as e: - print(f"❌ Error loading summaries.yml: {e}") + if not summaries: + print("❌ Error: No parts found in any summaries.yml") return {} + return summaries + def find_qmd_files() -> List[Path]: """Find all .qmd files in the quarto directory.""" qmd_files = []
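One behavior the refactor above preserves and that is worth pinning down in review: part keys are normalized by lowercasing and stripping `_`/`-` before lookup, so differently styled keys collide intentionally. A quick standalone check (not part of the script):

```python
def normalize_key(key: str) -> str:
    """Mirror of the key normalization in load_part_summaries."""
    return key.lower().replace('_', '').replace('-', '')

# Snake_case, kebab-case, and mixed case all map to the same lookup key.
assert normalize_key("Vol2_Deployment") == "vol2deployment"
assert normalize_key("vol2-deployment") == normalize_key("VOL2_DEPLOYMENT")
```

A review caveat: because the new loop aggregates the vol1 and vol2 files into one `summaries` dict, a key that normalizes identically in both volumes will be silently overwritten by whichever file loads last.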