cs249r_book/book/tools/scripts/figure_narrative_audit.md
Vijay Janapa Reddi 2390c3ab31 Refactor: consolidate Quarto config layers and content reorganization.
Unifies Quarto metadata into shared base/format/volume fragments while carrying through chapter path, asset, and tooling updates to keep the repository consistent and easier to maintain.
2026-02-12 15:38:55 -05:00


Figure Narrative Audit Report

backmatter/appendix_algorithm.qmd

@fig-backprop-graph (Line 143)

::: How to trace the computation. @fig-backprop-graph shows a simple two-layer network. Practice tracing both passes to understand what happens during training: Forward pass (black arrows, left to right): Start at x, your input. Multiply by W_1 to get hidden activation h. Cache h because you will need it later. Multiply h by W_2 to get output y. Cache y. Compare y to the target label to compute loss L.
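The trace above can be written out as a few lines of Python. The scalar weights and the squared-error loss are illustrative assumptions (the excerpt only labels the loss L); the point is that the backward pass consumes the cached values h and y:

```python
# Scalar sketch of the two passes, assuming one unit per layer and a
# squared-error loss (an illustrative assumption; the figure labels it L).
x, W1, W2, target = 2.0, 0.5, -1.5, 1.0

# Forward pass (black arrows, left to right): cache h and y.
h = W1 * x                     # hidden activation, cached
y = W2 * h                     # output, cached
L = 0.5 * (y - target) ** 2    # loss against the target label

# Backward pass (right to left) reuses the cached values.
dL_dy = y - target             # dL/dy
dL_dW2 = dL_dy * h             # needs cached h
dL_dh = dL_dy * W2
dL_dW1 = dL_dh * x             # needs the original input x
```

Note that dL_dW2 cannot be formed without the cached h, which is exactly why the forward pass stores it.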


backmatter/appendix_data.qmd

@fig-row-vs-col (Line 22)

Row-Oriented (CSV, JSON): Data is stored record-by-record. To read just the age column, you must scan every byte of every row. This is efficient for writing (appending a log) but inefficient for analytics (training on specific features). Column-Oriented (Parquet, Arrow): Data is stored column-by-column. To read the age column, the disk head seeks to that column's block and reads it sequentially. This enables projection pushdown (reading only the bytes you need) and vectorized processing (SIMD operations on columns). @fig-row-vs-col contrasts these two storage layouts. ::: {#fig-row-vs-col fig-env="figure" fig-pos="htb" fig-cap="Storage Layouts: Row-oriented formats pack data together by record (good for transactions). Column-oriented formats pack data by feature (good for analytics)." fig-alt="Diagram contrasting Row Store vs Column Store. Row store shows Record 1 [ID, Name, Age] followed by Record 2. Column store shows Column 1 [ID1, ID2...] followed by Column 2 [Name1, Name2...]."}
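A toy sketch of the contrast, using three hypothetical records: the row layout must touch every record to answer a single-column query, while the column layout hands back the age array directly.

```python
# Row-oriented: one object per record, as in CSV/JSON.
rows = [
    {"id": 1, "name": "Ada",   "age": 36},
    {"id": 2, "name": "Grace", "age": 45},
    {"id": 3, "name": "Alan",  "age": 41},
]
# Reading just `age` still iterates over every record.
ages_from_rows = [r["age"] for r in rows]

# Column-oriented: one array per feature, as in Parquet/Arrow.
columns = {
    "id":   [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "age":  [36, 45, 41],
}
# Projection pushdown: only the `age` array is read.
ages_from_columns = columns["age"]

assert ages_from_rows == ages_from_columns == [36, 45, 41]
```

The column layout also keeps values of one type contiguous in memory, which is what makes SIMD (vectorized) processing possible.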


backmatter/appendix_machine.qmd

@fig-roofline (Line 20)

The Roofline Model [@williams2009roofline] answers a deceptively simple question: how fast can this workload possibly run on this hardware? The answer depends on whether you run out of compute or memory bandwidth first. Every operation has an arithmetic intensity: the ratio of computations performed to bytes moved from memory. Matrix multiplication has high arithmetic intensity because you can reuse each loaded element many times. Element-wise operations like ReLU have low intensity because you load a number, do one operation, and write it back. @fig-roofline illustrates how workloads are bounded by either memory bandwidth or compute throughput. ::: {#fig-roofline fig-env="figure" fig-pos="htb" fig-cap="The Roofline Model: Performance ceiling for a hypothetical accelerator. The sloped line represents memory bandwidth limits; the horizontal line represents peak compute. Every workload can be plotted on this diagram to determine its optimization strategy." fig-alt="A plot with arithmetic intensity on the x-axis and performance on the y-axis. Two lines form a roofline shape: a diagonal line rising from the origin labeled Memory Bound, and a horizontal line labeled Compute Bound. They meet at the Ridge Point."}
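The ceiling is simply the minimum of the two roofs. A minimal sketch, assuming a hypothetical accelerator with 100 GFLOP/s peak compute and 25 GB/s memory bandwidth (invented numbers for illustration):

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Hypothetical accelerator: 100 GFLOP/s peak compute, 25 GB/s bandwidth.
peak, bw = 100.0, 25.0
ridge = peak / bw  # 4 FLOPs/byte: the ridge point where the two roofs meet

assert attainable_gflops(1.0, peak, bw) == 25.0    # ReLU-like: memory bound
assert attainable_gflops(16.0, peak, bw) == 100.0  # matmul-like: compute bound
```

A workload left of the ridge point gains nothing from more FLOPs; only higher arithmetic intensity or more bandwidth raises its ceiling.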

@fig-memory-hierarchy (Line 219)

The Memory Hierarchy

Computer systems use a hierarchy because no single technology provides both high capacity and low latency, as shown in @fig-memory-hierarchy. Every technique that keeps data higher in the pyramid (registers/cache) directly improves performance. ::: {#fig-memory-hierarchy fig-env="figure" fig-pos="htb" fig-cap="The Memory Hierarchy: Performance depends on data proximity. Accessing HBM is ~100x slower than registers; accessing SSD is ~100,000x slower." fig-alt="Pyramid showing Registers at top, followed by Cache, HBM/DRAM, and Storage at bottom."}
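A back-of-envelope average access time shows why proximity dominates. Only the rough latency ratios come from the caption; the cache latency and the access fractions below are invented for illustration:

```python
# Average memory access time in register-latency units, assuming
# register = 1 unit, HBM ~100x, SSD ~100,000x (ratios from the caption);
# the ~5x cache latency and the access fractions are illustrative guesses.
levels = [
    ("cache", 5.0, 0.900),     # (name, latency units, fraction of accesses)
    ("HBM", 100.0, 0.099),
    ("SSD", 100_000.0, 0.001),
]
amat = sum(latency * fraction for _, latency, fraction in levels)
# The 0.1% of accesses that fall to SSD contribute ~100 of the ~114
# total units: the bottom of the pyramid dominates the average.
```

This is why keeping data higher in the pyramid pays off so disproportionately: shaving the rare slow accesses matters more than speeding up the common fast ones.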


benchmarking/benchmarking.qmd

@fig-imagenet-gpus (Line 447)

::: System benchmarks evaluate performance across scales, from single-chip configurations to large distributed systems, and across AI workloads spanning both training and inference tasks. This evaluation approach ensures that benchmarks accurately reflect real-world deployment scenarios and deliver insights that inform both hardware selection decisions and system architecture design. @fig-imagenet-gpus reveals the striking correlation between GPU adoption and ImageNet classification error rates from 2010 to 2014: as GPU entries surged from 0 to 110, error rates plummeted from 28.2% to 7.3%, demonstrating how hardware capabilities and algorithmic advances drive progress in tandem. ::: {#fig-imagenet-gpus fig-env="figure" fig-pos="htb" fig-cap="GPU Adoption and Error Reduction: As GPU entries in ImageNet surged from 0 to 110 between 2010 and 2014, top-5 error rates dropped from 28.2% to 7.3%, demonstrating the co-evolution of hardware capabilities and algorithmic advances." fig-alt="Dual-axis chart with blue line showing top-5 error rate declining from 28% to 7% and green bars showing GPU entries rising from 0 to 110 between 2010 and 2014."}

@fig-granularity (Line 577)

The optimization techniques from Part III operate at different granularities (kernel fusion targets micro performance, pruning affects macro model behavior, data curation determines end-to-end generalization) and validation must match. A micro benchmark might show kernel speedup while a macro benchmark reveals memory bottlenecks that negate the gain; an end-to-end benchmark might expose data pipeline stalls invisible at any other level. @fig-granularity breaks down the ML system stack into three distinct evaluation layers, each revealing different performance characteristics. At the application level, end-to-end benchmarks assess overall system performance including data preprocessing, model training, and inference. At the model layer, benchmarks evaluate efficiency and accuracy, measuring how well models generalize to new data and their computational efficiency during training and inference. At the infrastructure layer, benchmarking examines individual hardware and software components like GPUs or TPUs. ::: {#fig-granularity fig-env="figure" fig-pos="htb" fig-cap="Benchmarking Granularity: Four-panel block diagram showing micro, model, application, and end-to-end evaluation layers. Each panel maps a distinct scope of assessment, from isolated kernel operations through full-system deployment, enabling targeted optimization at every level of the ML stack." fig-alt="Block diagram showing three evaluation layers: neural network nodes on left, model components in center, and end-to-end application with compute nodes on right, connected by dashed lines."}

@fig-benchmark-tradeoffs (Line 789)

: Benchmarking Granularity Levels. Different benchmark scopes target distinct stages of ML system development. Micro-benchmarks isolate individual operations for low-level optimization, macro-benchmarks evaluate complete models to guide architectural choices, and end-to-end benchmarks assess full system performance in production environments. {#tbl-benchmark-comparison} @fig-benchmark-tradeoffs visualizes the core trade-off between diagnostic power and real-world representativeness across benchmark granularity levels. This relationship illustrates why comprehensive ML system evaluation requires multiple benchmark types: micro-benchmarks provide precise optimization guidance for isolated components, while end-to-end benchmarks capture the complex interactions that emerge in production systems. The optimal benchmarking strategy combines insights from all three levels to balance detailed component analysis with realistic system-wide assessment. ::: {#fig-benchmark-tradeoffs fig-env="figure" fig-pos="htb" fig-cap="Isolation vs. Representativeness: The core trade-off in benchmarking granularity. Micro-benchmarks provide high diagnostic precision but limited real-world relevance, while end-to-end benchmarks capture realistic system behavior but offer less precise component-level insights. Effective ML system evaluation requires strategic combination of all three levels." fig-alt="Scatter plot with three labeled points along diagonal: micro-benchmarks at high isolation, macro-benchmarks at medium, and end-to-end benchmarks at high representativeness."}

@fig-benchmark-components (Line 836)

An AI benchmark provides this structured framework for evaluating artificial intelligence systems. While individual benchmarks vary significantly in their specific focus and granularity, they share common implementation components that enable consistent evaluation and comparison across different approaches. The essential components interconnect to form a complete evaluation pipeline. @fig-benchmark-components illustrates how task definition, dataset selection, model selection, and evaluation metrics build upon each other, creating a progression from problem specification through deployment assessment. ::: {#fig-benchmark-components fig-env="figure" fig-pos="htb" fig-cap="Anomaly Detection Pipeline: Nine-stage benchmark workflow applied to an industrial audio anomaly detection task. The pipeline progresses from problem definition through dataset selection, model training, quantization, and ARM embedded deployment, illustrating how each benchmark component feeds the next." fig-alt="Workflow diagram showing nine stages from problem definition through deployment, with detailed views of anomaly detection system, model training, quantization, and ARM embedded implementation."}

@fig-benchmark-components (Line 1015)

Problem Definition

A benchmark implementation begins with a formal specification of the machine learning task and its evaluation criteria. In machine learning, tasks represent well-defined problems that AI systems must solve. Consider the anomaly detection system depicted in @fig-benchmark-components, which processes audio signals to identify deviations from normal operation patterns. This industrial monitoring application exemplifies how formal task specifications translate into practical implementations. The formal definition of any benchmark task encompasses both the computational problem and its evaluation framework. While the specific tasks vary significantly by domain, well-established categories have emerged across major fields of AI research. Natural language processing tasks, for example, include machine translation, question answering [@hirschberg2015advances], and text classification. Computer vision similarly employs standardized tasks such as object detection, image segmentation, and facial recognition [@everingham2010pascal].

@fig-benchmark-components (Line 1027)

Standardized Datasets

Building directly upon the problem definition established in the previous phase, standardized datasets provide the essential foundation for training and evaluating models. These carefully curated collections ensure all models undergo testing under identical conditions, enabling direct comparisons across different approaches and architectures. In @fig-benchmark-components, the audio anomaly detection example demonstrates how waveform data serves as the standardized input for evaluating detection performance. In computer vision, datasets such as ImageNet [@imagenet_website] [@deng2009imagenet], COCO [@coco_website] [@lin2014microsoft], and CIFAR-10 [@cifar10_website][^fn-cifar-10] [@krizhevsky2009learning] serve as reference standards. For natural language processing, collections such as SQuAD [@squad_website][^fn-squad] [@rajpurkar2016squad], GLUE [@glue_website] [@wang2018glue], and WikiText [@wikitext_website] [@merity2016pointer] fulfill similar functions. These datasets encompass a range of complexities and edge cases to thoroughly evaluate machine learning systems.

@fig-benchmark-components (Line 1037)

@fig-benchmark-components (Line 1045)

Model Selection

Following dataset specification, the benchmark process advances systematically to model architecture selection and implementation. This critical phase establishes performance baselines and determines the optimal modeling approach for the specific task at hand. The selection process directly builds upon the architectural foundations established in @sec-dnn-architectures and must account for the framework-specific considerations discussed in @sec-ai-frameworks. Examine the model selection stage in @fig-benchmark-components to see how this phase connects dataset specification to training code development. Baseline models serve as the reference points for evaluating novel approaches. These span from basic implementations, including linear regression for continuous predictions and logistic regression for classification tasks, to advanced architectures with proven success in comparable domains. The choice of baseline depends critically on the deployment framework: a PyTorch implementation may exhibit different performance characteristics than its TensorFlow equivalent due to framework-specific optimizations and operator implementations. In natural language processing applications, advanced language models like BERT[^fn-bert] have emerged as standard benchmarks for comparative analysis. The architectural details of transformers and their performance characteristics are thoroughly covered in @sec-dnn-architectures.

@fig-benchmark-components (Line 1250)

Example Benchmark

To illustrate how these components work together in practice, a complete benchmark run evaluates system performance by synthesizing multiple components under controlled conditions to produce reproducible measurements. @fig-benchmark-components illustrates this integration through an audio anomaly detection system. It shows how performance metrics are systematically measured and reported within a framework that encompasses problem definition, datasets, model selection, evaluation criteria, and standardized run rules. The benchmark measures several key performance dimensions. For computational resources, the system reports a model size of 270 K parameters and requires 10.4 milliseconds per inference. For task effectiveness, it achieves a detection accuracy of 0.86 AUC (Area Under Curve) in distinguishing normal from anomalous audio patterns. For operational efficiency, it consumes 516 µJ of energy per inference.
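The reported numbers also yield useful derived metrics. The arithmetic below is ours; the measurements (10.4 ms and 516 µJ per inference) come from the benchmark run described above:

```python
# Derived quantities from the reported measurements.
latency_s = 10.4e-3   # 10.4 ms per inference
energy_j = 516e-6     # 516 µJ per inference

throughput = 1.0 / latency_s        # ~96 inferences per second
avg_power_w = energy_j / latency_s  # ~50 mW average power while inferring

assert 96 < throughput < 97
assert round(avg_power_w * 1000) == 50  # milliwatts
```

Average power during inference (energy divided by latency) is often the more actionable number for battery-powered deployments than either measurement alone.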

@fig-mlperf-training-improve (Line 1397)

From a systems perspective, training machine learning models represents a computationally intensive process that requires careful optimization of resources. Training benchmarks serve as essential tools for evaluating system efficiency, identifying bottlenecks, and ensuring that machine learning systems can scale effectively. They provide a standardized approach to measuring how various system components, including hardware accelerators, memory, storage, and network infrastructure, affect training performance. By evaluating these factors, training benchmarks allow researchers and engineers to push the state of the art, optimize configurations, improve scalability, and reduce overall resource consumption. @fig-mlperf-training-improve demonstrates that performance improvements in progressive versions of MLPerf Training benchmarks have consistently outpaced Moore's Law, with ResNet training speedups exceeding 30x over five years. This exponential improvement shows that what gets measured gets improved, showcasing the rapid evolution of ML computing through standardized benchmarking. ::: {#fig-mlperf-training-improve fig-env="figure" fig-pos="htb" fig-cap="MLPerf Training Progress: Standardized benchmarks reveal that machine learning training performance consistently surpasses Moore's Law, indicating substantial gains from systems-level optimizations. These trends emphasize how focused measurement and iterative improvement drive rapid advancements in ML training efficiency and scalability. Source: [@tschand2024mlperf]." fig-alt="Line chart with nine model benchmarks from 2018 to 2024 showing relative performance gains up to 48x for Mask R-CNN, all exceeding the Moore's Law baseline of 6.6x."}
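The comparison against Moore's Law can be made concrete under the simplifying assumption that transistor-driven performance doubles every two years:

```python
# Rough Moore's-Law baseline, assuming performance doubles every
# two years (a simplifying assumption about Moore's Law).
years = 5.0
moore_baseline = 2 ** (years / 2)  # ~5.7x expected over five years
resnet_speedup = 30.0              # figure cited in the text

assert resnet_speedup > moore_baseline  # benchmark gains outpace silicon alone
```

The gap between ~5.7x and 30x is the contribution of systems-level work (software, algorithms, scale-out) on top of silicon improvements.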

@fig-power-differentials (Line 1888)

@fig-power-diagram (Line 2352)

Power Measurement Boundaries

To address these measurement challenges, we must understand how power consumption is measured at different system scales, from TinyML devices to full-scale data center inference nodes. @fig-power-diagram illustrates distinct measurement boundaries for each scenario: components shown in green indicate what is included in energy accounting, while components shown with red dashed outlines are excluded from power measurements. ::: {#fig-power-diagram fig-env="figure" fig-pos="htb" fig-cap="Power Measurement Boundaries: MLPerf defines system boundaries for power measurement, ranging from single-chip devices to full data center nodes, to enable fair comparisons of energy efficiency across diverse hardware platforms. These boundaries delineate which components' power consumption is included in reported metrics, impacting the interpretation of performance results. Source: [@tschand2024mlperf]." fig-alt="System diagram showing four measurement boundaries: Tiny SoC with compute units, Inference SoC with accelerators and DRAM, Inference Node with cooling and NIC, and Training Rack with compute nodes."}

The MLPerf Power methodology applies the standardized evaluation principles discussed earlier, adapting to various hardware architectures from general-purpose CPUs to specialized AI accelerators. This ensures meaningful cross-platform comparisons while maintaining measurement integrity across different computing scales. The benchmark has accumulated thousands of reproducible measurements submitted by industry organizations, demonstrating their latest hardware capabilities and the sector-wide focus on energy-efficient AI technology. @fig-power-trends traces energy efficiency evolution across system scales through successive MLPerf versions, revealing critical performance trends in datacenter, edge, and tiny deployments. ::: {#fig-power-trends fig-env="figure" fig-pos="htb" fig-cap="Energy Efficiency Gains: Successive MLPerf inference benchmark versions show energy efficiency (samples per watt) improving up to 378x for datacenter workloads and 1070x for tinyML deployments across successive releases. Standardized measurement protocols enable meaningful cross-platform comparisons, driving sector-wide progress toward sustainable AI. Source: [@tschand2024mlperf]." fig-alt="Three line charts showing normalized energy efficiency across MLPerf versions: datacenter models up to 378x gain, edge models up to 4x, and tiny models up to 1070x improvement."}

@fig-hw-lottery (Line 2955)

@fig-sciml-graph (Line 3183)

This evolution is evident in the history of AI benchmarks. Early model benchmarks, for instance, focused heavily on image classification and object detection, as these were some of the first widely studied deep learning tasks. However, as AI expanded into natural language processing, recommendation systems, and generative AI, it became clear that these early benchmarks no longer reflected the most important challenges in the field. In response, new benchmarks emerged to measure language understanding [@wang2018glue; @wang2019superglue] and generative AI [@liang2022helm]. Benchmark evolution extends beyond the addition of new tasks to encompass new dimensions of performance measurement. While traditional AI benchmarks emphasized accuracy and throughput, modern applications demand evaluation across multiple criteria: fairness, robustness, scalability, and energy efficiency. @fig-sciml-graph illustrates this complexity through scientific applications, which span orders of magnitude in their performance requirements. For instance, Large Hadron Collider sensors must process data at rates approaching 10$^{14}$ bytes per second (equivalent to about 100 terabytes per second) with nanosecond-scale computation times, while mobile applications operate at 10$^{4}$ bytes per second with longer computational windows. This range of requirements necessitates specialized benchmarks. For example, edge AI applications require benchmarks like MLPerf that specifically evaluate performance under resource constraints, and scientific application domains need their own "Fast ML for Science" benchmarks [@duarte2022fastml]. ::: {#fig-sciml-graph fig-env="figure" fig-pos="htb" fig-cap="Performance Spectrum: Scientific applications and edge devices demand vastly different computational resources, spanning multiple orders of magnitude in data rates and latency requirements. Consequently, traditional benchmarks focused solely on accuracy are insufficient; specialized evaluation metrics and benchmarks like MLPerf become essential for optimizing AI systems across diverse deployment scenarios. Source: [@duarte2022fastml]." fig-alt="Log-scale scatter plot of data rate versus computation time, showing scientific applications from LHC sensors at 10^14 B/s and nanoseconds to mobile devices at 10^4 B/s and seconds."}

@fig-imagenet-challenge (Line 3309)

Model benchmarks validate whether compression techniques from @sec-model-compression preserved the properties that matter for deployment. This extends beyond top-line accuracy: a pruned model might maintain ImageNet accuracy while losing robustness to adversarial inputs; a quantized model might preserve average-case performance while degrading on rare but critical edge cases; a distilled model might match the teacher's accuracy while losing calibration. Historically, benchmarks focused almost exclusively on accuracy, but compression makes multi-dimensional evaluation essential. @fig-imagenet-challenge traces the dramatic reduction in error rates on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [@ilsvrc_website] classification task, from 25.8% in 2010 to a mere 3.57% by 2015. Starting from the baseline models in 2010 and 2011, the introduction of AlexNet in 2012 marked an improvement, reducing the error rate from 25.8% to 16.4%. Subsequent models like ZFNet, VGGNet, GoogLeNet, and ResNet[^fn-bench-resnet] continued this trend, with ResNet achieving an error rate of 3.57% by 2015 [@russakovsky2015imagenet]. This progression established the baselines against which model compression techniques are evaluated—a pruned ResNet must demonstrate how much accuracy it sacrifices for a given efficiency gain.

@fig-model-vs-data (Line 3479)

Shift detection methods identify when production distributions diverge from training distributions. Statistical tests (KS test, MMD) on input features can detect covariate shift. Monitoring model confidence distributions can detect when the model encounters unfamiliar inputs. Early detection enables intervention before performance degrades catastrophically. These distribution alignment challenges highlight a fundamental tension in ML development: should we fix the data and iterate on models, or fix the model and iterate on data? @fig-model-vs-data contrasts model-centric and data-centric AI approaches. The model-centric paradigm treats datasets as fixed while iterating on architectures; the data-centric approach fixes architectures while systematically improving data quality. Research increasingly demonstrates that methodical dataset enhancement can yield superior performance gains compared to model refinements alone—challenging the conventional emphasis on architectural innovation. ::: {#fig-model-vs-data fig-env="figure" fig-pos="htb" fig-cap="Development Paradigms: Model-centric AI prioritizes architectural innovation with fixed datasets, while data-centric AI systematically improves dataset quality (annotations, diversity, and bias) with consistent model architectures to achieve performance gains. Modern research indicates that strategic data enhancement often yields greater improvements than solely refining model complexity." fig-alt="Side-by-side diagrams: model-centric AI shows data cylinders feeding CPU with feedback loop to model, data-centric AI shows feedback loop to data instead. Double arrow indicates complementary approaches."}
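A minimal sketch of such a statistical test, here a hand-rolled two-sample KS statistic on a single input feature (in practice one would use a library implementation such as scipy.stats.ks_2samp); the synthetic data and the 0.1 decision threshold are illustrative assumptions:

```python
import bisect
import random

def ks_statistic(a, b):
    """Largest gap between the two samples' empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(2000)]  # training distribution
prod = [random.gauss(0.5, 1.0) for _ in range(2000)]   # shifted production data

assert ks_statistic(train, train) == 0.0
assert ks_statistic(train, prod) > 0.1  # a 0.5-sigma mean shift stands out
```

Run per feature on a sliding window of production traffic, a statistic exceeding a calibrated threshold becomes an alert that the model is seeing data it was not trained on.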

@fig-dataset-saturation (Line 3633)

@fig-dataset-saturation (Line 3731)

Dataset Saturation and Dynamic Benchmarks. @fig-dataset-saturation raises a fundamental methodological question: when models surpass human performance on benchmarks, does this reflect genuine capability advances or optimization to static evaluation sets? MNIST illustrates the concern: certain test images, though nearly illegible to humans, were assigned specific labels during dataset creation in 1994. Models correctly predicting these labels may be memorizing dataset artifacts rather than learning digit recognition. The question "Are we done with ImageNet?" [@beyer2020we] generalizes this concern. Dynamic benchmarking approaches like Dynabench[^fn-dynabench] [@kiela2021dynabench] address saturation by continuously evolving test data based on model performance, ensuring that benchmarks remain challenging as capabilities improve. However, dynamic benchmarks complement rather than replace the coverage, quality, and distribution metrics described above: they prevent saturation but do not diagnose its causes.

@fig-ai-triad (Line 3750)

  • Model success + Data failure: High accuracy on held-out test set, but distribution shift in production causes silent degradation. This interdependence is precisely the AI Triad introduced in @sec-introduction (@fig-ai-triad): System corresponds to Machine, Model corresponds to Algorithm, and Data remains Data. Holistic evaluation requires not just passing benchmarks in each dimension, but verifying that assumptions made in one dimension hold across the others. The Part III optimization pipeline (data → model → hardware) creates implicit dependencies that benchmarking must validate explicitly. As AI continues to evolve, benchmarking methodologies must advance in tandem. Evaluating AI performance through the lens of systems, models, and data ensures that benchmarks drive improvements not just in accuracy, but also in efficiency, fairness, and robustness. This holistic perspective provides essential validation before deployment. The DAM Taxonomy provides a diagnostic framework for identifying which component limits performance. @tbl-dam-bottleneck formalizes this approach by crossing each AI Triad component with the three fundamental bottleneck types.

data_engineering/data_engineering.qmd

@fig-data-quality-multiplier (Line 89)

::: The implications of this ceiling are visualized in @fig-data-quality-multiplier, which contrasts the scaling laws of clean versus noisy data. While clean data allows accuracy to improve according to a power law, noisy data causes performance to plateau, making data cleaning a higher-leverage activity than mere data accumulation.
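The contrast can be sketched with hypothetical scaling-law forms; the exponent and noise floor below are made-up constants, not fitted values, but they reproduce the qualitative behavior of power-law improvement versus plateau:

```python
# Hypothetical scaling-law forms: error falls as a power law with
# dataset size n for clean data, but label noise adds an irreducible
# floor. Constants a, b, and floor are illustrative assumptions.
def clean_error(n, a=1.0, b=0.3):
    return a * n ** -b           # power-law improvement

def noisy_error(n, a=1.0, b=0.3, floor=0.12):
    return a * n ** -b + floor   # plateaus at the noise floor

for n in (1e3, 1e5, 1e7):
    assert clean_error(n) < noisy_error(n)

# Past some scale, adding more noisy data barely helps:
assert noisy_error(1e7) - noisy_error(1e9) < 0.01
```

Under these assumed forms, growing a noisy dataset 100x buys less than a one-point error reduction, while cleaning (lowering the floor) pays off at every scale.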

@fig-cascades (Line 252)

@fig-four-pillars (Line 357)

The Four Foundational Pillars

Every data engineering decision, from choosing storage formats to designing ingestion pipelines, should be evaluated against four principles. @fig-four-pillars illustrates how these pillars interact, with each contributing to system success through systematic decision-making. Data quality provides the foundation for system success. Quality issues compound throughout the ML lifecycle through "Data Cascades" (@sec-data-engineering-ml-data-cascades-systematic-foundations-matter-2efe), where early failures propagate and amplify downstream. Quality includes accuracy, completeness, consistency, and fitness for the intended ML task. The mathematical foundations of this relationship appear in @sec-deep-learning-systems-foundations and @sec-dnn-architectures.

@fig-ds-time (Line 673)

Having examined how each pillar addresses specific concerns and how they manifest across pipeline stages, we now turn to how they function together as a system. Understanding each pillar individually is necessary but not sufficient; effective data engineering requires recognizing their interconnections. These four pillars are not independent components but interconnected aspects of a unified system where decisions in one area affect all others. Quality improvements must account for scalability constraints, reliability requirements influence governance implementations, and governance policies shape quality metrics. This systems perspective guides our exploration of data engineering, examining how each technical topic supports and balances these principles while managing their tensions. The practical stakes of this integrated framework are substantial. According to industry surveys, data scientists spend an estimated 60--80% of their time on data preparation tasks[^fn-data-quality-stats], with data cleaning alone consuming up to 60% of practitioners' effort (see @fig-ds-time in @sec-ai-development-workflow). This imbalance reflects ad-hoc rather than systematic data engineering practices. Applying the four-pillar framework consistently can reduce data preparation time while producing more reliable and maintainable systems. This time allocation motivates knowing the key constants that govern data engineering costs and timelines.

@fig-keywords (Line 783)

@fig-misalignment (Line 979)

Context matters as much as content. Popular benchmarks like ImageNet invite overfitting that inflates performance metrics [@beyer2020we], and curated datasets frequently fail to reflect real-world deployment distributions [@venturebeat_datasets]. Central to these contextual concerns, a key consideration for ML systems is how well pre-existing datasets reflect real-world deployment conditions. Relying on standard datasets can create a concerning disconnect between training and production environments. @fig-misalignment visualizes this risk: when multiple ML systems train on identical datasets, they propagate shared biases and limitations throughout an entire ecosystem of deployed models. ::: {#fig-misalignment fig-env="figure" fig-pos="htb" fig-cap="Shared Dataset Bias Propagation: Five models (A through E) all train on a single central dataset repository. Arrows show how shared limitations, biases, and blind spots propagate from the common dataset to every downstream model, leading to correlated failures across the ecosystem." fig-alt="Five model boxes labeled A through E at center all connect upward to one central training dataset repository. Arrows downward show shared limitations, biases, and blind spots propagating to all models."}

@fig-traffic-light (Line 1041)

While quality-focused approaches excel at creating accurate, well-curated datasets, they face inherent scaling limitations. When scale requirements dominate—needing millions or billions of examples that manual curation cannot economically provide—web scraping and synthetic generation offer paths to massive datasets. Scalability requires understanding the economic models underlying different acquisition strategies: cost per labeled example, throughput limitations, and how these scale with data volume. What proves cost-effective at thousand-example scale often becomes prohibitive at million-example scale, while approaches that require high setup costs amortize favorably across large volumes. Web scraping enables dataset construction at scales that manual curation cannot match. Major vision datasets like ImageNet [@imagenet_website] and OpenImages [@openimages_website] were built through systematic scraping, and large language models depend on web-scale text corpora [@groeneveld2024olmo]. Targeted scraping of domain-specific sources, such as code repositories [@chen2021evaluating], further demonstrates the approach's versatility. However, production systems that rely on continuous scraping face pipeline reliability challenges: website structure changes break extractors, rate limiting throttles collection throughput, and dynamic content introduces inconsistencies that degrade model performance. Scraped data can also contain unexpected noise, such as historical images appearing in contemporary searches (@fig-traffic-light), requiring systematic validation and cleaning stages. Legal and ethical constraints further bound what scraping can achieve. Not all websites permit scraping, and ongoing litigation around training data usage illustrates the consequences of non-compliance [@harvard_law_chatgpt]. Teams must document data provenance, ensure compliance with terms of service and copyright law, and apply anonymization procedures when scraping user-generated content.
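The amortization argument can be sketched with a toy cost model; every dollar figure below is a hypothetical assumption, not industry data:

```python
# Toy acquisition-cost model: manual labeling scales linearly with
# volume, while scraping pays a fixed setup cost (extractors, legal
# review, validation pipeline) plus a small marginal cost per example.
def manual_cost(n, per_label=0.08):
    return per_label * n

def scraping_cost(n, setup=5_000.0, per_example=0.001):
    return setup + per_example * n

# At small scale manual labeling wins; at large scale scraping does.
assert manual_cost(10_000) < scraping_cost(10_000)
assert manual_cost(1_000_000) > scraping_cost(1_000_000)
```

The crossover point (here around 63,000 examples under these assumed constants) is what teams should estimate for their own task before committing to either strategy.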

@fig-synthetic-data (Line 1049)

Crowdsourcing distributes annotation tasks across a global workforce, enabling rapid labeling at scales that in-house teams cannot match. Platforms like Amazon Mechanical Turk [@mturk_website] demonstrated this at landmark scale with ImageNet, where distributed contributors categorized millions of images into thousands of classes [@krizhevsky2012imagenet]. The approach's primary systems advantage is twofold: scalability through parallel microtask distribution, and diversity through the range of perspectives, cultural contexts, and linguistic variations that a global contributor pool introduces. This diversity directly improves model generalization across populations. Tasks can also be adjusted dynamically based on initial results, enabling iterative refinement of collection strategies as quality gaps emerge. Moving beyond human-generated data entirely, synthetic data generation represents the ultimate scalability solution, creating unlimited training examples through algorithmic generation rather than manual collection. This approach changes the economics of data acquisition by removing human labor from the equation. @fig-synthetic-data shows how synthetic data combines with historical datasets to create larger, more diverse training sets that would be impractical to collect manually. ::: {#fig-synthetic-data fig-env="figure" fig-pos="htb" fig-cap="Synthetic Data Augmentation: A four-node pipeline where historical data and simulation outputs feed into a synthetic data generation process, producing an expanded combined training dataset with greater size and diversity than either source alone. Source: AnyLogic [@anylogic_synthetic]." fig-alt="Diagram showing historical data icon and simulation cloud icon both feeding into synthetic data generation process, producing an expanded combined training dataset."}

@fig-misalignment (Line 1394)

The challenge of edge case collection becomes apparent in autonomous vehicle development. While normal driving conditions are easy to capture through test fleet operation, near-accidents, unusual pedestrian behavior, or rare weather conditions occur infrequently. Synthetic data generation helps address this by simulating rare scenarios, but validating that synthetic examples accurately represent real edge cases requires careful engineering. Some organizations employ targeted data collection where test drivers deliberately create edge cases or where engineers identify scenarios from incident reports that need better coverage. Dataset convergence represents another reliability challenge. @fig-misalignment illustrates how multiple systems training on identical datasets inherit identical blind spots and biases. An entire ecosystem of models may fail on the same edge cases because all trained on data with the same coverage gaps. This systemic risk motivates diverse data sourcing strategies where each organization collects supplementary data beyond common benchmarks, ensuring their models develop different strengths and weaknesses rather than shared failure modes. For our KWS system, reliability manifests as consistent wake word detection across acoustic environments from quiet bedrooms to noisy streets, across accents from various geographic regions, and across age ranges from children to elderly speakers. The data sourcing strategy explicitly addresses these diversity requirements: web scraping captures natural speech variation from diverse video sources, crowdsourcing targets underrepresented demographics and environments, and synthetic data systematically explores the parameter space of acoustic conditions. Without this deliberate diversity in sourcing, the system might achieve high accuracy on test sets while failing unreliably in production deployment.

@fig-pipeline-flow (Line 1530)

::: @fig-pipeline-flow breaks down ML data pipelines into several distinct layers: data sources, ingestion, processing, labeling, storage, and ML training. Each layer plays a specific role in the data preparation workflow. Selecting appropriate technologies requires understanding how our four framework pillars manifest at each stage. Quality requirements at one stage affect scalability constraints at another, reliability needs shape governance implementations, and the pillars interact to determine overall system effectiveness. Data pipeline design is constrained by storage hierarchies and I/O bandwidth limitations rather than CPU capacity. Understanding these constraints enables building efficient systems for modern ML workloads. Storage hierarchy trade-offs, ranging from high-latency object storage (ideal for archival) to low-latency in-memory stores (essential for real-time serving), and bandwidth limitations (spinning disks at 100-200 MB/s versus RAM at 50-200 GB/s) shape every pipeline decision. @sec-data-engineering-ml-strategic-storage-architecture-1a6b covers detailed storage architecture considerations.

@fig-dataloader-choke-point (Line 1790)

If your data pipeline cannot decode images fast enough to keep the GPU busy, your expensive accelerator sits idle. This phenomenon creates a "Choke Point" where adding more GPUs yields zero speedup. The following plot (@fig-dataloader-choke-point) visualizes this Dataloader Choke Point, showing the "Starvation Region" where the CPU limits performance.
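To make the choke point concrete, here is a minimal sketch (all timings hypothetical, chosen only for illustration) of why adding compute yields zero speedup while the GPU is starved:

```python
def effective_step_time(gpu_step_s, loader_step_s):
    """Per-step time is bounded by the slower of compute and data supply."""
    return max(gpu_step_s, loader_step_s)

# Hypothetical timings: the GPU needs 50 ms of math per batch, but the
# CPU decode pipeline delivers a batch only every 120 ms.
gpu_step = 0.050
loader_step = 0.120

step = effective_step_time(gpu_step, loader_step)   # data-bound: 0.120 s
utilization = gpu_step / step                       # GPU busy ~42% of the time

# In the starvation region, doubling GPU throughput changes nothing:
step_2x = effective_step_time(gpu_step / 2, loader_step)
```

Until the loader is faster than the GPU, `step_2x` equals `step`: the only lever that helps is accelerating the data path.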

@fig-etl-vs-elt (Line 1872)

ETL and ELT Comparison

Beyond choosing ingestion patterns based on timing requirements, designing effective data ingestion pipelines requires understanding the differences between Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT)[^fn-elt] approaches. @fig-etl-vs-elt contrasts these two paradigms, showing how ETL transforms data before loading while ELT loads raw data first and transforms within the target system. These paradigms determine when data transformations occur relative to the loading phase, significantly impacting the flexibility and efficiency of ML pipelines. The choice between ETL and ELT affects where computational resources are consumed, how quickly data becomes available for analysis, and how easily transformation logic can evolve as requirements change.
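A toy sketch makes the trade-off tangible (the list-as-warehouse and the temperature schema are invented for illustration): ETL fixes the schema at load time, while ELT preserves raw rows so transformations can be replayed later.

```python
raw_rows = [
    {"ts": "2024-01-01", "temp_f": 68.0},
    {"ts": "2024-01-02", "temp_f": 50.0},
]

def to_celsius(row):
    # the transformation under consideration
    return {"ts": row["ts"], "temp_c": round((row["temp_f"] - 32) * 5 / 9, 2)}

# ETL: transform *before* loading. The warehouse only ever holds the
# transformed schema; changing the logic later means re-extracting.
etl_warehouse = [to_celsius(r) for r in raw_rows]

# ELT: load raw first, transform inside the target system. Raw rows are
# preserved, so new transformations can be replayed over history anytime.
elt_warehouse = list(raw_rows)                     # load raw
elt_view = [to_celsius(r) for r in elt_warehouse]  # transform on demand
```

The two approaches produce the same view, but only ELT retains the raw `temp_f` values needed to recompute features when requirements change.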

@fig-cascades (Line 2229)

Applying these processing concepts to our KWS system, the audio recordings flowing through our ingestion pipeline, whether from crowdsourcing, synthetic generation, or real-world captures, require careful cleaning to ensure reliable wake word detection. Raw audio data often contains imperfections that our problem definition anticipated: background noise from various environments (quiet bedrooms to noisy industrial settings), clipped signals from recording level issues, varying volumes across different microphones and speakers, and inconsistent sampling rates from diverse capture devices. The cleaning pipeline must standardize these variations while preserving the acoustic characteristics that distinguish wake words from background speech—a quality-preservation requirement that directly impacts our 98% accuracy target. Quality assessment for KWS extends the general principles with audio-specific metrics. Beyond checking for null values or schema conformance, our system tracks background noise levels (signal-to-noise ratio above 20 decibels), audio clarity scores (frequency spectrum analysis), and speaking rate consistency (wake word duration within 500-800 milliseconds). The quality assessment pipeline automatically flags recordings where background noise would prevent accurate detection, where wake words are spoken too quickly or unclearly for the model to distinguish them, or where clipping or distortion has corrupted the audio signal. This automated filtering ensures only high-quality samples reach model development. Recall how @fig-cascades demonstrated the compounding effects of early data quality failures; this filtering prevents precisely those cascade failures by catching issues at the source. Transforming audio data for KWS involves converting raw waveforms into formats suitable for ML models while maintaining training-serving consistency. 
@fig-spectrogram-example visualizes this transformation pipeline, showing how raw audio waveforms convert into standardized feature representations, typically Mel-frequency cepstral coefficients (MFCCs) or spectrograms,[^fn-spectrogram] that emphasize speech-relevant characteristics while reducing noise and variability across different recording conditions.
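The thresholds above translate directly into a quality gate. The sketch below is illustrative only: the clip dictionaries and field names (`snr_db`, `wake_word_ms`, `clipped`) are assumptions, not the book's actual pipeline schema.

```python
def passes_quality_gate(clip, min_snr_db=20.0, duration_ms=(500, 800)):
    """Flag recordings that would prevent reliable wake-word detection."""
    lo, hi = duration_ms
    return (clip["snr_db"] >= min_snr_db          # background noise check
            and lo <= clip["wake_word_ms"] <= hi  # speaking-rate consistency
            and not clip["clipped"])              # no signal distortion

clips = [
    {"id": "a", "snr_db": 27.5, "wake_word_ms": 640, "clipped": False},  # good
    {"id": "b", "snr_db": 12.0, "wake_word_ms": 650, "clipped": False},  # too noisy
    {"id": "c", "snr_db": 25.0, "wake_word_ms": 430, "clipped": False},  # spoken too fast
    {"id": "d", "snr_db": 30.0, "wake_word_ms": 700, "clipped": True},   # distorted
]
accepted = [c["id"] for c in clips if passes_quality_gate(c)]
```

Only clip `a` survives; the others fail on exactly the failure modes the text enumerates.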

@fig-spectrogram-example (Line 2231)

Quality assessment for KWS extends the general principles with audio-specific metrics. Beyond checking for null values or schema conformance, our system tracks background noise levels (signal-to-noise ratio above 20 decibels), audio clarity scores (frequency spectrum analysis), and speaking rate consistency (wake word duration within 500-800 milliseconds). The quality assessment pipeline automatically flags recordings where background noise would prevent accurate detection, where wake words are spoken too quickly or unclearly for the model to distinguish them, or where clipping or distortion has corrupted the audio signal. This automated filtering ensures only high-quality samples reach model development. Recall how @fig-cascades demonstrated the compounding effects of early data quality failures; this filtering prevents precisely those cascade failures by catching issues at the source. Transforming audio data for KWS involves converting raw waveforms into formats suitable for ML models while maintaining training-serving consistency. @fig-spectrogram-example visualizes this transformation pipeline, showing how raw audio waveforms convert into standardized feature representations, typically Mel-frequency cepstral coefficients (MFCCs) or spectrograms,[^fn-spectrogram] that emphasize speech-relevant characteristics while reducing noise and variability across different recording conditions.
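As a rough illustration of the waveform-to-spectrogram step, here is a hand-rolled log-magnitude STFT in NumPy. This is not the production MFCC pipeline; the 25 ms frame and 10 ms hop are typical speech-processing defaults, assumed for the sketch.

```python
import numpy as np

def log_spectrogram(wave, sr, frame_ms=25, hop_ms=10):
    """Short-time Fourier magnitudes on a log scale (a crude spectrogram)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    frames = [wave[i:i + frame] * window
              for i in range(0, len(wave) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log1p(mags)  # log compression roughly mimics perceived loudness

sr = 16_000
t = np.arange(sr) / sr                 # one second of audio
wave = np.sin(2 * np.pi * 440 * t)     # a pure 440 Hz tone as test input
spec = log_spectrogram(wave, sr)       # shape: (time frames, frequency bins)
```

With a 400-sample frame, the FFT bin spacing is 40 Hz, so the 440 Hz tone lights up bin 11 in every frame; real MFCC front ends add a Mel filterbank and a DCT on top of exactly this representation.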

@fig-tfx-pipeline-example (Line 2322)

Beyond tool selection, effective pipeline design involves considerations such as modularity, scalability, and version control. Modular pipelines allow for easy updates and maintenance of individual processing steps. Each transformation stage should be implemented as an independent module with clear inputs and outputs, enabling testing in isolation and replacement without affecting other stages. Version control for pipelines is equally important, ensuring that changes in data processing can be tracked and correlated with changes in model performance. When model accuracy drops, version control enables identifying whether processing changes contributed to the degradation. @fig-tfx-pipeline-example illustrates this modular breakdown using TensorFlow Extended, tracing the complete flow from initial data ingestion through to final model deployment. Data flows through validation, transformation, and feature engineering stages before reaching model training, with each component independently versioned, tested, and scaled while maintaining overall system consistency. ::: {#fig-tfx-pipeline-example fig-env="figure" fig-pos="htb" fig-cap="TFX End-to-End Pipeline: A TensorFlow Extended pipeline traces the complete flow from data ingestion through validation, transformation, training, evaluation, and deployment. Each component is independently versioned, tested, and scaled." fig-alt="Linear flow diagram showing TensorFlow Extended pipeline: data ingestion, validation, transformation, training, evaluation, and deployment stages connected by arrows from left to right."}

@fig-labels (Line 2444)

Data Annotation Granularity: Three versions of the same street scene show increasing annotation detail: a simple classification label, bounding boxes around vehicles and pedestrians, and pixel-level semantic segmentation with distinct colors. Each level increases labeling cost and storage requirements while providing richer training signal.{#fig-labels width=90% fig-alt="Three versions of same street scene showing increasing annotation detail: simple classification label, bounding boxes around vehicles and pedestrians, and pixel-level semantic segmentation with distinct colors."} @fig-labels visualizes these increasing complexity levels, from simple classification through bounding boxes to pixel-level segmentation. The choice of label format depends heavily on our system requirements and resource constraints [@10.1109/ICRA.2017.7989092]. While classification labels might suffice for simple traffic counting, autonomous vehicles need detailed segmentation maps to make precise navigation decisions. Leading autonomous vehicle companies often maintain hybrid systems that store multiple label types for the same data, allowing flexible use across different applications. A single camera frame might have classification labels (scene type: highway, urban, rural), bounding boxes (vehicles and pedestrians for obstacle detection), and segmentation masks (road surface for path planning), with each label type serving distinct downstream models. Extending beyond these basic label types, production systems must also handle rich metadata essential for maintaining data quality and debugging model behavior. The Common Voice dataset [@ardila2020common] exemplifies sophisticated metadata management in speech recognition: tracking speaker demographics for model fairness, recording quality metrics for data filtering, validation status for label reliability, and language information for multilingual support. 
If our traffic monitoring system performs poorly in rainy conditions, weather condition metadata during data collection helps identify and address the issue. Modern labeling platforms have built sophisticated metadata management systems that efficiently index and query this metadata alongside primary labels, enabling filtering during training data selection and post-hoc analysis when model failures are discovered.

@fig-hard-labels (Line 2454)

In the labeling domain, quality takes on unique challenges centered on ensuring label accuracy despite the inherent subjectivity and ambiguity in many labeling tasks. Even with clear guidelines and careful system design, some fraction of labels will inevitably be incorrect [@northcutt2021pervasive, @thyagarajan2023multilabel]. The challenge is not eliminating labeling errors entirely—an impossible goal—but systematically measuring and managing error rates to keep them within bounds that do not degrade model performance. Labeling failures arise from two distinct sources requiring different engineering responses. @fig-hard-labels presents concrete examples of both failure modes. The first is data quality: the underlying data is genuinely ambiguous or corrupted, as with the blurred frog image where even expert annotators cannot determine the species with certainty. The second is missing expertise: the correct label is determinable, but only by annotators with specialized domain knowledge, as with the black stork identification. These different failure modes drive architectural decisions about annotator qualification, task routing, and consensus mechanisms. Labeling Ambiguity: How subjective or difficult examples, such as blurry images or rare species, can introduce errors during data labeling, highlighting the need for careful quality control and potentially expert annotation. Source: [@northcutt2021pervasive].{#fig-hard-labels fig-alt="Grid of example images showing labeling challenges: blurred animal photos where species is unclear, rare specimens requiring expert knowledge, and ambiguous object boundaries causing annotator disagreement."}

@fig-weak-supervision (Line 2580)

Scaling with AI-Assisted Labeling

As labeling demands grow exponentially with modern ML systems, scalability becomes critical. The scalability pillar drives AI assistance as a force multiplier for human labeling rather than a replacement. Manual annotation alone cannot keep pace with modern ML systems' data needs, while fully automated labeling lacks the nuanced judgment that humans provide. AI-assisted labeling finds the sweet spot: using automation to handle clear cases and accelerate annotation while preserving human judgment for ambiguous or high-stakes decisions. @fig-weak-supervision illustrates several paths AI assistance offers to scale labeling operations, each requiring careful system design to balance speed, quality, and resource usage. ::: {#fig-weak-supervision fig-env="figure" fig-pos="htb" fig-cap="AI-Augmented Labeling Decision Hierarchy: A top-level question about obtaining labeled data branches into four paths: traditional supervision, semi-supervised learning, weak supervision, and transfer learning, with active learning as a cost-saving alternative. Lower-cost strategies trade labeling precision for throughput. Source: Stanford AI Lab." fig-alt="Hierarchical diagram with question about getting labeled data at top. Four branches: traditional supervision, semi-supervised, weak supervision, and transfer learning. Active learning branches as cost-saving alternative."}

@fig-mswc (Line 2729)

This scale directly reflects our framework pillars in practice. Achieving our quality target of 98% accuracy across diverse environments requires millions of training examples covering acoustic variations we identified during problem definition. Reliability demands representation across varied acoustic conditions—different background noises, speaking styles, and recording environments. Scalability necessitates automation rather than manual labeling: 23.4 million examples at even 10 seconds per label amounts to roughly 65,000 hours, over 30 person-years of effort, making manual annotation economically infeasible. Governance requirements mandate transparent sourcing and language diversity, ensuring voice-activated technology serves speakers of many languages rather than concentrating on only the most commercially valuable markets. @fig-mswc depicts this automated system, which begins with paired sentence audio recordings and corresponding transcriptions from projects like Common Voice or multilingual captioned content platforms, processing these inputs through forced alignment to identify precise word boundaries within continuous speech.

@fig-data-card (Line 3144)

Data debt manifests across four distinct categories, each requiring different detection and remediation strategies. Documentation Debt accumulates when data provenance, meaning, and quality characteristics go unrecorded. Datasets without data cards (@fig-data-card), unlabeled columns, and undocumented transformations create debt that compounds when original authors leave organizations. A survey of production ML systems found that 40% of data quality incidents traced to misunderstanding data semantics due to missing documentation [@sambasivan2021everyone]. Documentation debt manifests as: unknown column meanings requiring reverse-engineering, missing provenance preventing compliance audits, undocumented assumptions causing silent failures when assumptions change, and absent quality metrics preventing informed dataset selection. Schema Debt emerges from accumulated schema workarounds and migrations. When upstream systems change data formats, quick fixes (string parsing instead of proper type handling, NULL coercion instead of error handling) accumulate into fragile transformation logic. Schema debt indicators include: multiple date format handlers for the same logical field, defensive null checks scattered throughout pipeline code, version-specific parsing branches that grow with each upstream change, and undocumented enum value mappings that break when new values appear.
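Schema debt is easiest to see in code. This hypothetical parser (the formats and the commented history are invented for illustration) shows the pattern the text describes: defensive branches accumulating for the same logical field instead of a proper migration.

```python
from datetime import datetime

def parse_event_date(raw: str) -> datetime:
    # Each branch is a patch for one upstream format change
    # (the history here is hypothetical).
    for fmt in ("%Y-%m-%d",    # the original schema
                "%m/%d/%Y",    # a later vendor export format
                "%d %b %Y"):   # a later mobile-client format
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

# Three wire formats, one logical field -- classic schema debt:
d1 = parse_event_date("2024-03-05")
d2 = parse_event_date("03/05/2024")
d3 = parse_event_date("05 Mar 2024")
```

Each new format silently widens what the pipeline accepts, and nothing documents which upstream change introduced which branch; that undocumented coupling is the debt.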

@fig-debug-flowchart (Line 3216)

The concepts throughout this chapter (cascading failures, the four pillars, training-serving consistency, drift detection, labeling quality, and data debt) converge when systems exhibit problems in production. Data debt accumulates silently, but its effects eventually surface as system underperformance, manifesting as model accuracy degradation, pipeline failures, or subgroup performance disparities that prove difficult to attribute to specific causes. Effective debugging requires applying the diagnostic principles established earlier: the data cascade pattern (@sec-data-engineering-ml-data-cascades-systematic-foundations-matter-2efe) reminds us that root causes often lie upstream of symptoms, training-serving skew (@sec-data-engineering-ml-ensuring-trainingserving-consistency-c683) explains many deployment failures, and drift detection (@sec-data-engineering-ml-detecting-responding-data-drift-509a) surfaces gradual degradation. When symptoms appear, systematic diagnosis prevents wasted effort debugging the wrong component. @fig-debug-flowchart synthesizes these concepts into an actionable diagnostic sequence, working through the most common failure modes in order of frequency: ::: {#fig-debug-flowchart fig-env="figure" fig-pos="htb" fig-cap="Data Pipeline Debugging Flowchart: Four sequential decision nodes guide root cause diagnosis: (1) accuracy degrades over time leads to Data Drift, (2) training accuracy exceeds validation leads to Overfitting, (3) validation exceeds production accuracy leads to Training-Serving Skew, and (4) subgroup inconsistency leads to Bias. If all answers are no, the issue points to Model Architecture." fig-alt="Vertical flowchart with four blue diamond decision nodes and red result boxes. Top diamond asks if accuracy degrades over time, leading to Data Drift result. Second asks if training accuracy exceeds validation, leading to Overfitting. Third asks if validation exceeds production accuracy, leading to Training-Serving Skew. 
Fourth asks about subgroup inconsistency, leading to Bias. Gray box at bottom shows Model Architecture issue if all answers are no."}
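The flowchart's decision sequence can be sketched as a diagnostic function. The function signature and the 2% accuracy-gap tolerance are illustrative assumptions, not a prescribed API.

```python
def diagnose(train_acc, val_acc, prod_acc,
             degrades_over_time, subgroup_gap, tol=0.02):
    """Walk the flowchart's four decision nodes in order of frequency."""
    if degrades_over_time:                 # (1) accuracy decays in production
        return "Data Drift"
    if train_acc - val_acc > tol:          # (2) training beats validation
        return "Overfitting"
    if val_acc - prod_acc > tol:           # (3) validation beats production
        return "Training-Serving Skew"
    if subgroup_gap:                       # (4) inconsistent across subgroups
        return "Bias"
    return "Model Architecture"            # all answers no

# A model that looks fine offline but underperforms in production:
cause = diagnose(train_acc=0.95, val_acc=0.94, prod_acc=0.80,
                 degrades_over_time=False, subgroup_gap=False)
```

Here the 14-point validation-to-production gap points to training-serving skew before any architecture debugging begins.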

@fig-four-pillars (Line 3372)

Summary

Data engineering provides the foundational infrastructure that transforms raw information into the basis of machine learning systems, determining model performance, system reliability, ethical compliance, and long-term maintainability. The Four Pillars framework (@fig-four-pillars) and the cascading nature of data quality failures (@fig-cascades) reveal why every stage of the data pipeline requires careful engineering decisions. The task of "getting data ready" encompasses complex trade-offs quantified throughout this chapter: the TCDO cost model for budgeting, storage performance hierarchies (@tbl-storage-performance, @tbl-ml-latencies), and drift detection thresholds that guide production operations. The technical architecture of data systems demonstrates how engineering decisions compound across the pipeline to create either reliable, scalable foundations or brittle, maintenance-heavy technical debt. Data acquisition strategies must navigate the reality that perfect datasets rarely exist in nature, requiring sophisticated approaches ranging from crowdsourcing and synthetic generation to careful curation and active learning. Storage architectures from traditional databases to modern data lakes and feature stores represent fundamental choices about how data flows through the system, affecting everything from training speed to serving latency.

@fig-cascades (Line 3372)

Summary

Data engineering provides the foundational infrastructure that transforms raw information into the basis of machine learning systems, determining model performance, system reliability, ethical compliance, and long-term maintainability. The Four Pillars framework (@fig-four-pillars) and the cascading nature of data quality failures (@fig-cascades) reveal why every stage of the data pipeline requires careful engineering decisions. The task of "getting data ready" encompasses complex trade-offs quantified throughout this chapter: the TCDO cost model for budgeting, storage performance hierarchies (@tbl-storage-performance, @tbl-ml-latencies), and drift detection thresholds that guide production operations. The technical architecture of data systems demonstrates how engineering decisions compound across the pipeline to create either reliable, scalable foundations or brittle, maintenance-heavy technical debt. Data acquisition strategies must navigate the reality that perfect datasets rarely exist in nature, requiring sophisticated approaches ranging from crowdsourcing and synthetic generation to careful curation and active learning. Storage architectures from traditional databases to modern data lakes and feature stores represent fundamental choices about how data flows through the system, affecting everything from training speed to serving latency.


data_selection/data_selection.qmd

@fig-running-out-of-human-data (Line 45)

@fig-optimization-triad (Line 167)

Data engineering (@sec-data-engineering-ml) ensures that data is clean, accessible, and correctly formatted. Data selection asks a different question: how much information does each sample contribute to the model's learning per unit of computation? In the optimization triad (@fig-optimization-triad), data selection plays the role of Input Optimization. While model compression minimizes the math per parameter and hardware acceleration maximizes the math per second, data selection minimizes the total math required to reach convergence. ::: {#fig-optimization-triad fig-cap="The Optimization Triad: Machine learning performance relies on three pillars: Algorithms (models), Systems (hardware/software), and Data Selection. While algorithms and systems have traditionally received the most attention, optimizing data selection (Input Optimization) offers a third, powerful lever for scaling performance." fig-alt="A triangular diagram with three nodes: Algorithms (Model), Systems (Hardware), and Data Selection. Bidirectional arrows connect all three with edge labels: Compute Bound between Algorithms and Systems, I/O Bound between Systems and Data Selection, and Sample Efficiency between Data Selection and Algorithms. Data Selection is highlighted with a bold border. ML Scale appears at the center."}

@fig-data-selection-pipeline (Line 196)

The Iron Law Connection: In the Iron Law of Training Performance ($T = \frac{Ops}{P \cdot \eta}$), the Total Operations ($Ops$) term is usually treated as a fixed constant determined by model architecture and dataset size. Data Selection turns $Ops$ into a variable. By maximizing ICR, we reduce the total FLOPs required to reach a target performance level, directly shrinking the numerator of the Iron Law equation. A 2x improvement in ICR is mathematically equivalent to a 2x improvement in hardware Peak Throughput ($P$), but often much cheaper to achieve. A random batch of raw data often has low ICR because it contains redundant examples, noisy samples, or "easy" examples the model has already mastered. Training on such a batch wastes GPU cycles on zero-information updates. High-efficiency data pipelines (@fig-data-selection-pipeline) filter, order, and synthesize data to maximize ICR, ensuring that every FLOP contributes to learning. To illustrate, consider computing ICR on a concrete coreset selection task. Later in this chapter, @sec-data-selection-measuring-data-selection-7957 provides the complete measurement framework for evaluating these efficiency gains, including the Data Roofline model that diagnoses whether a system is data-bound or compute-bound. ::: {.callout-checkpoint title="Data Selection Efficiency" collapse="false"}
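A quick numerical sketch of this equivalence (all magnitudes hypothetical) confirms that halving Ops via ICR and doubling P yield identical training time:

```python
def training_time(ops, peak_flops, efficiency):
    """Iron Law of Training Performance: T = Ops / (P * eta)."""
    return ops / (peak_flops * efficiency)

# Hypothetical magnitudes, chosen only to illustrate the equivalence.
ops = 1e21       # FLOPs required to reach the target accuracy
peak = 1e15      # 1 PFLOP/s of hardware
eta = 0.4        # achieved fraction of peak throughput

baseline = training_time(ops, peak, eta)
better_data = training_time(ops / 2, peak, eta)   # 2x ICR halves Ops
better_hw = training_time(ops, peak * 2, eta)     # 2x faster hardware
```

Both interventions cut training time in half; the difference is that better data selection usually costs far less than doubling the cluster.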

@fig-coreset-selection (Line 387)

: Coreset Selection Algorithm Comparison. N = dataset size, K = coreset size. Gradient-based methods generally outperform geometry-based methods but require proxy model training. {#tbl-coreset-comparison .striped .hover} @fig-coreset-selection illustrates the core insight behind coreset methods: samples near the decision boundary (high uncertainty) are more informative than samples deep within class regions (low uncertainty). Random sampling wastes budget on redundant "easy" examples. ::: {#fig-coreset-selection fig-env="figure" fig-pos="htb" fig-cap="Coreset Selection Strategy: Random sampling (left) selects uniformly, wasting budget on easy samples far from the decision boundary. Coreset selection (right) prioritizes samples near the boundary where the model is uncertain, capturing more information per sample." fig-alt="Two scatter plots with a diagonal decision boundary. Left plot shows random dots selected. Right plot highlights dots near the boundary as selected."}
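A minimal sketch of this intuition, assuming a known diagonal boundary and using distance to it as a stand-in for model uncertainty (real coreset methods estimate this from a proxy model's predictions or gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))   # synthetic 2-D dataset

# Distance to the diagonal boundary x1 + x2 = 0 stands in for model
# uncertainty: small margin = hard, informative sample.
margin = np.abs(X[:, 0] + X[:, 1])

K = 100
random_idx = rng.choice(len(X), size=K, replace=False)  # uniform baseline
coreset_idx = np.argsort(margin)[:K]     # the K samples nearest the boundary

mean_margin_random = float(margin[random_idx].mean())
mean_margin_coreset = float(margin[coreset_idx].mean())
```

The coreset's mean margin is an order of magnitude smaller than random sampling's: the selection budget concentrates on the uncertain region instead of redundant easy examples.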

@fig-active-learning-loop (Line 616)

Active Learning: Human-in-the-Loop

Curriculum learning optimizes the order in which samples are presented but assumes all samples are already labeled. This assumption breaks down in specialized fields such as medical diagnosis, autonomous driving, and scientific research, where labeling requires domain expertise and can cost $5 to $100 or more per sample. Rather than labeling everything upfront, active learning [@settles2009active; @ren2021survey] shifts the optimization target: instead of choosing which labeled samples to train on, it chooses which unlabeled samples are worth labeling at all (@fig-active-learning-loop).
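A compact sketch of the query loop, using a fixed logistic scorer as a stand-in for the retrained model (the pool size, weights, and per-round budget are illustrative assumptions; a real loop would retrain between rounds):

```python
import numpy as np

rng = np.random.default_rng(1)
pool = rng.uniform(-1, 1, size=(500, 2))   # unlabeled pool (features only)
w = np.array([1.0, 1.0])                   # current, imperfect model weights

def predict_proba(x):
    # stand-in for the trained model's positive-class probability
    return 1.0 / (1.0 + np.exp(-(x @ w)))

budget_per_round = 10
labeled_idx = []

for _ in range(3):  # three query rounds
    p = predict_proba(pool)
    uncertainty = -np.abs(p - 0.5)    # probability near 0.5 = most uncertain
    uncertainty[np.array(labeled_idx, dtype=int)] = -np.inf  # no re-queries
    picks = np.argsort(uncertainty)[-budget_per_round:]
    labeled_idx.extend(int(i) for i in picks)  # send these to annotators

queried = pool[labeled_idx]   # the 30 samples worth paying experts to label
```

Every queried point sits close to the model's decision boundary, which is exactly where a $5-to-$100 expert label buys the most information.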

@fig-active-learning-multiplier (Line 673)

::: The following plot (@fig-active-learning-multiplier) quantifies this advantage, showing how Active Learning shifts the learning curve to the left, achieving target accuracy with exponentially fewer samples than random selection.

@fig-amortization-comparison (Line 785)

This explains why the fine-tuning paradigm dominates production ML. The pre-training cost is high but amortized across many downstream applications, while fine-tuning cost remains low on a per-task basis. @fig-amortization-comparison visualizes this cost structure. Training from scratch (left) incurs the full cost for each task independently. The foundation model approach (right) pays a large upfront pre-training cost but then fine-tunes each task at a fraction of the per-task cost. ::: {#fig-amortization-comparison fig-env="figure" fig-pos="htb" fig-cap="Cost Amortization in Foundation Models: Training from scratch (left) requires 1,000 GPU-hours per task (10,000 total for 10 tasks). The foundation model approach (right) pays 10,000 GPU-hours upfront for pre-training but reduces each subsequent task to just 50 GPU-hours. At 10 tasks the totals are comparable (10,000 vs 10,500), but the per-task marginal cost drops by 20x, and the crossover favoring the foundation model occurs around 11 tasks." fig-alt="Two bar charts side by side. Left (Train from Scratch) shows 10 equal bars of 1,000 GPU-hours each, totaling 10,000 hours. Right (Foundation Model) shows one tall pre-training bar of 10,000 GPU-hours followed by 10 short fine-tuning bars of 50 GPU-hours each, totaling 10,500 hours. The per-task marginal cost drops dramatically from 1,000 to 50 GPU-hours."}
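The crossover arithmetic can be checked directly; the costs below are taken from the figure caption.

```python
SCRATCH_PER_TASK = 1_000    # GPU-hours per task, from the figure
PRETRAIN = 10_000           # one-time foundation-model pre-training cost
FINETUNE_PER_TASK = 50      # per-task fine-tuning cost

def scratch_total(n_tasks):
    return SCRATCH_PER_TASK * n_tasks

def foundation_total(n_tasks):
    return PRETRAIN + FINETUNE_PER_TASK * n_tasks

# First task count at which the foundation approach is strictly cheaper:
crossover = next(n for n in range(1, 1000)
                 if foundation_total(n) < scratch_total(n))
```

At 10 tasks the totals are 10,000 versus 10,500 GPU-hours, and the foundation model pulls ahead from task 11 onward, matching the caption.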

@fig-domain-gap (Line 926)

Synthetic data's greatest limitation is the domain gap: the statistical difference between generated and real-world data. A model trained on perfectly rendered simulation images may fail on blurry, poorly-lit real camera footage. This gap can negate the efficiency gains of synthetic data if not addressed. @fig-domain-gap illustrates the problem. Synthetic data (left distribution) and real data (right distribution) occupy different regions of feature space. A model trained only on synthetic data learns a decision boundary that does not transfer to real deployment. ::: {#fig-domain-gap fig-env="figure" fig-pos="htb" fig-cap="The Domain Gap Problem: Synthetic data (blue) and real data (orange) have different distributions. A model trained on synthetic data alone learns a boundary that fails on real data. Domain adaptation techniques aim to align these distributions or learn domain-invariant features." fig-alt="Two overlapping bell curves representing synthetic and real data distributions, with a decision boundary that works for synthetic but misses real data."}

@fig-technique-decision-tree (Line 1058)

Decision Framework: Choosing the Right Technique

With many techniques available, practitioners need a systematic approach to selection. @fig-technique-decision-tree provides a visual flowchart that captures the key branching points: ::: {#fig-technique-decision-tree fig-env="figure" fig-pos="htb" fig-cap="Data Selection Technique Selection Tree: Start at the top by identifying your primary bottleneck, then follow the branches to find the most appropriate technique. Leaf nodes show recommended methods. Multiple paths may apply; combine techniques as needed." fig-alt="A decision tree flowchart with diamond decision nodes and rectangular technique recommendations. Starts with bottleneck identification and branches to specific techniques."}

@fig-selection-inequality (Line 1280)

The Failure Condition: If the cost of selecting data exceeds the cost of training on the discarded data, you have failed. The goal is to spend compute to save more compute. The Fix: Use a Proxy Model (10\times smaller) for selection. T_{selection} = 0.1 \times C_{epoch}. Total time = 0.1 + 10 = 10.1. You preserve the speedup, as @fig-selection-inequality illustrates. :::
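The inequality can be sketched in units of C_epoch (the cost of one full-data training epoch). The 20-epoch full-data baseline is an assumed illustrative value, and `total_time` is a name invented here:

```python
# Selection-cost accounting in C_epoch units (illustrative values).
BASELINE = 20.0  # assumed: epochs of full-data training with no selection

def total_time(selection_cost: float, training_cost: float) -> float:
    """Selection overhead plus post-selection training, in C_epoch units."""
    return selection_cost + training_cost

# Full-model selection: scoring every sample costs roughly one epoch pass.
naive = total_time(selection_cost=1.0, training_cost=10.0)
# Proxy-model selection: a 10x smaller model scores data at ~0.1 epoch cost.
proxy = total_time(selection_cost=0.1, training_cost=10.0)

print(naive, proxy)      # 11.0 vs 10.1 epoch-equivalents
print(proxy < BASELINE)  # the speedup over full-data training survives
```

The proxy keeps total cost at 10.1 epoch-equivalents, matching the 0.1 + 10 = 10.1 arithmetic in the text; spending a full epoch on selection instead would eat into the savings.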

@fig-data-roofline (Line 1797)

The Data Roofline Model

Just as the compute Roofline model diagnoses whether a system is compute-bound or memory-bound, we can construct a Data Roofline to diagnose whether training is data-bound or compute-bound (@fig-data-roofline). ::: {#fig-data-roofline fig-env="figure" fig-pos="htb" fig-cap="The Data Roofline Model: Analogous to the compute Roofline, this diagnostic tool shows two regimes. Below the diagonal (data-bound), adding more data improves performance, so invest in data collection. Above the diagonal (compute-bound), more data will not help without more training compute, so invest in GPUs. The optimal operating point is at the knee where data and compute are balanced. Data selection techniques move you along the diagonal by extracting more value per sample." fig-alt="A log-log plot with Data Quality on x-axis and Model Performance on y-axis. A diagonal line separates data-bound (lower) and compute-bound (upper) regimes. Points show system positions."}

@fig-ppd-curve (Line 1863)

Using the Roofline for diagnosis: If your training run is performing below expectations, compute your effective ICR (performance gain per training FLOP) and plot your position. If you're in the data-bound region, the techniques in this chapter (coreset selection, curriculum learning, deduplication) will move you right along the diagonal. If you're compute-bound, focus on hardware acceleration or distributed training instead. @fig-ppd-curve illustrates diminishing returns visually. A data-efficient selection strategy (blue) reaches the performance plateau much faster than random sampling (gray). The gap between the curves at any dataset size represents the efficiency opportunity: compute that could be saved by smarter data curation. ::: {#fig-ppd-curve fig-cap="Diminishing Returns of Data: Random sampling (gray) versus data-efficient selection (blue). The efficient strategy achieves higher performance with less data, reaching the convergence plateau much earlier. The red arrow shows the efficiency gap at a fixed dataset size." fig-alt="A plot with X-axis 'Dataset Size' and Y-axis 'Performance'. Two curves start at 0. The 'Random' curve rises slowly. The 'Efficient' curve rises steeply and plateaus early."}


nn_computation/nn_computation.qmd

@fig-ai-ml-dl (Line 88)

::: Deep learning's defining contribution is this shift from feature engineering to architecture engineering. Classical machine learning required human experts to design feature extractors for each new problem, a labor-intensive process that encoded domain knowledge into handcrafted representations. Deep learning eliminates this bottleneck by learning representations directly from raw data through hierarchical layers of nonlinear transformations. @fig-ai-ml-dl places deep learning within the broader hierarchy of artificial intelligence and machine learning, showing how neural networks form the computational core of this paradigm. AI Hierarchy: Neural networks form a core component of deep learning within machine learning and artificial intelligence by modeling patterns in large datasets. Machine learning algorithms enable systems to learn from data as a subset of the broader AI field.{#fig-ai-ml-dl fig-alt="Nested circles diagram showing AI as outermost circle containing Machine Learning, which contains Deep Learning, which contains Neural Networks at the center. Arrows indicate progression from broad AI concepts to specific neural network implementations."}

@fig-breakout (Line 106)

From Explicit Logic to Learned Patterns

Traditional programming requires developers to explicitly define rules that tell computers how to process inputs and produce outputs. Consider a simple game like Breakout[^fn-breakout-game]. The program needs explicit rules for every interaction: when the ball hits a brick, the code must specify that the brick should be removed and the ball's direction should be reversed (@fig-breakout). While this approach works effectively for games with clear physics and limited states, it hits a wall when dealing with the messy, unstructured data of the real world. ::: {#fig-breakout fig-env="figure" fig-pos="htb" fig-cap="Breakout Collision Rules: The game program uses explicit if-then rules for collision detection, specifying ball direction reversal and brick removal upon contact. While effective for a game with clear physics and limited states, this approach illustrates how rule-based systems must anticipate every possible scenario." fig-alt="Breakout game grid with 3 rows of 5 colored bricks at top, brown paddle at bottom, and ball with trajectory arrow. Code snippet shows explicit if-then rules for collision detection: removeBrick, update ball velocity."}
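The explicit if-then rules the figure describes can be sketched as follows. All names (`Ball`, `step`, the grid coordinates) are hypothetical illustrations, not from a real game engine:

```python
# Hypothetical sketch of explicit rule-based collision handling.
from dataclasses import dataclass

@dataclass
class Ball:
    x: int
    y: int
    vy: int  # vertical velocity: +1 moving down, -1 moving up

def step(ball: Ball, bricks: set) -> None:
    """One update tick: every interaction needs its own hand-written rule."""
    ball.y += ball.vy
    if (ball.x, ball.y) in bricks:       # rule: ball touches a brick...
        bricks.remove((ball.x, ball.y))  # ...remove the brick
        ball.vy = -ball.vy               # ...and reverse the ball's direction

bricks = {(2, 0), (3, 0)}
ball = Ball(x=2, y=2, vy=-1)
for _ in range(3):
    step(ball, bricks)

print(len(bricks), ball.vy)  # one brick removed, direction reversed
```

Every new behavior (paddle bounces, wall ricochets, power-ups) would require another explicit branch, which is exactly the scaling problem the text goes on to describe.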

@fig-traditional (Line 169)

::: Beyond individual applications, this rule-based paradigm extends to all traditional programming. @fig-traditional illustrates this fundamental pattern: the program takes both rules for processing and input data to produce outputs. Early artificial intelligence research explored whether this approach could scale to solve complex problems by encoding sufficient rules to capture intelligent behavior. ::: {#fig-traditional fig-env="figure" fig-pos="htb" fig-cap="Traditional Programming Flow: Rules and data serve as inputs to a traditional program, which produces answers as output. This input-output pattern formed the basis for early AI systems but lacks the adaptability needed for complex pattern recognition tasks." fig-alt="Flow diagram with three boxes: Rules and Data as inputs flowing into central Traditional Programming box, which outputs Answers. Arrows show data flow direction from inputs to output."}

@fig-activity-rules (Line 201)

::: Despite their apparent simplicity, rule-based limitations surface quickly with complex real-world tasks. Recognizing human activities illustrates the challenge. Classifying movement below 4 mph as walking seems straightforward until real-world complexity intrudes. Speed variations, transitions between activities, and boundary cases each demand additional rules, creating unwieldy decision trees (@fig-activity-rules). Computer vision tasks compound these difficulties. Detecting cats requires rules about ears, whiskers, and body shapes while accounting for viewing angles, lighting, occlusions, and natural variations. Early systems achieved success only in controlled environments with well-defined constraints. Activity Classification Decision Tree: A rule-based decision tree classifies human activity by branching on speed thresholds, with values below 4 mph mapped to walking, 4 to 15 mph to running, and above 15 mph to biking. Real-world edge cases and transitions between activities demand increasingly complex branching logic.{#fig-activity-rules fig-alt="Decision tree flowchart for activity classification. Branches split on conditions like speed less than 4 mph leading to walking, 4-15 mph to running, greater than 15 mph to biking. Additional branches handle edge cases and transitions."}

@fig-hog (Line 215)

To address the scalability barriers of rule-based systems, researchers began exploring approaches that could learn from data. Machine learning offered a promising direction: instead of writing rules for every situation, researchers wrote programs that identified patterns in examples. The success of these methods, however, still depended heavily on human insight to define relevant patterns, a process known as feature engineering. This approach introduced feature engineering by transforming raw data into representations that expose patterns to learning algorithms. The Histogram of Oriented Gradients (HOG) [@dalal2005histograms][^fn-hog-method] method exemplifies this approach, identifying edges where brightness changes sharply, dividing images into cells, and measuring edge orientations within each cell (@fig-hog). This transforms raw pixels into shape descriptors robust to lighting variations and small positional changes. HOG Method: Identifies edges in images to create a histogram of gradients, transforming pixel values into shape descriptors that are invariant to lighting changes.{#fig-hog fig-alt="Three-panel image showing HOG feature extraction: original grayscale photo of person on left, gradient magnitude visualization in center, and HOG descriptor grid overlay on right showing edge orientation histograms per cell."}
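The gradients-to-cell-histograms idea can be sketched minimally. This is a simplified illustration, not the full Dalal-Triggs pipeline (it omits block normalization and gradient interpolation), and `hog_cells` is a name invented here:

```python
# Minimal sketch of the HOG idea: gradients -> per-cell orientation histograms.
import numpy as np

def hog_cells(image: np.ndarray, cell: int = 4, bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of gradient orientations per cell."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in the original method.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180
    h, w = image.shape
    feats = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist, _ = np.histogram(orientation[sl], bins=bins,
                                   range=(0, 180), weights=magnitude[sl])
            feats[i, j] = hist
    return feats

# A vertical step edge yields horizontal gradients (orientation near 0 deg).
img = np.zeros((8, 8)); img[:, 4:] = 1.0
feats = hog_cells(img)
print(feats.shape)  # (2, 2, 9): one 9-bin histogram per 4x4 cell
```

The descriptor depends on edge orientations rather than absolute pixel values, which is why it tolerates lighting shifts and small translations.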

@fig-deeplearning (Line 233)

Neural networks represent a shift in how we approach problem solving with computers. Rather than following explicit rules, the system learns from data. This shift becomes particularly evident in tasks like computer vision, where identifying objects in images defied decades of rule-based attempts. Deep learning differs by learning directly from raw data. Traditional programming, as we saw earlier, required both rules and data as inputs to produce answers. Machine learning inverts this relationship: instead of writing rules, we provide examples (data) and their correct answers to discover the underlying rules automatically. @fig-deeplearning captures this inversion, showing how data and answers become the inputs while rules emerge as the output. This shift eliminates the need for humans to specify what patterns are important. ::: {#fig-deeplearning fig-env="figure" fig-pos="htb" fig-cap="Data-Driven Rule Discovery: The flow diagram inverts the traditional programming pattern: data and answers serve as inputs to the machine learning process, which produces learned rules as output. This inversion eliminates the need for manually specified rules and enables automated feature extraction from raw inputs." fig-alt="Flow diagram with three boxes: Answers and Data as inputs flowing into central Machine Learning box, which outputs Rules. Arrows show inverted flow compared to traditional programming, with rules as output rather than input."}

@fig-bio_nn2ai_nn (Line 412)

The Artificial Neuron as a Computing Primitive

The basic unit of neural computation, the artificial neuron (or node), serves as a simplified mathematical abstraction designed for efficient digital implementation. This building block enables complex networks to emerge from simple components working together through four functional stages: input reception, weighted modulation, signal aggregation, and nonlinear activation. As illustrated in @fig-bio_nn2ai_nn, this computational model abstracts biological complexity into a standardized processing unit. @tbl-neuron_structure formalizes these structural components, mapping the mathematical functions to their role in the overall processing pipeline.

Computational Growth: Log-scale scatter plot showing training compute in FLOPS from 1952 to 2022. Computational power grew at a 1.4x rate from 1952 to 2010, then accelerated to a doubling every 3.4 months from 2012 to 2022. Large-scale models after 2015 followed an even faster 10-month doubling cycle, addressing the historical bottleneck of training complex neural networks.{#fig-trends fig-pos='htb' fig-alt="Log-scale scatter plot showing training compute in FLOPS from 1950 to 2022. Points represent AI models, with different colors for pre-deep-learning era, deep learning era, and large-scale models. Trend lines show 1.4x growth before 2010 and 3.4-month doubling after 2012."} While the preceding sections established the technical foundations of deep learning, the term itself gained prominence in the 2010s, coinciding with advances in computational power and data accessibility. Two remarkable trends emerge from @fig-trends: computational capabilities measured in floating-point operations per second (FLOPS) initially followed a 1.4\times improvement pattern from 1952 to 2010, then accelerated to a 3.4-month doubling cycle from 2012 to 2022. The emergence of large-scale models between 2015 and 2022 (not explicitly shown or easily seen in @fig-trends) is perhaps more striking still: these scaled 2 to 3 orders of magnitude faster than the general trend, following an aggressive 10-month doubling cycle. @tbl-historical-performance grounds these trends in concrete systems, showing how parameters, compute, and hardware co-evolved across four decades of neural network development.


@fig-virtuous-cycle (Line 622)

::: Parallel advances across three dimensions drove these evolutionary trends: data availability, algorithmic innovations, and computing infrastructure. @fig-virtuous-cycle captures this reinforcing cycle: more powerful computing infrastructure enabled processing larger datasets, larger datasets drove algorithmic innovations, and better algorithms demanded more sophisticated computing systems. This reinforcing cycle continues to drive progress today. ::: {#fig-virtuous-cycle fig-env="figure" fig-pos="htb" fig-cap="Deep Learning Virtuous Cycle: Three mutually reinforcing factors, data availability, algorithmic innovations, and computing infrastructure, form a self-reinforcing loop where breakthroughs in one area create opportunities in the others." fig-alt="Three connected boxes in a cycle: green Data Availability flows to blue Algorithmic Innovations, which flows to red Computing Infrastructure, which loops back to Data Availability. Yellow background box labeled Key Breakthroughs contains all three elements."}

@fig-perceptron (Line 802)

::: @fig-perceptron illustrates how weighted inputs combine with activation functions to enable pattern recognition: each input x_i multiplies by its corresponding weight w_{ij}, the products sum with a bias term, and the activation function produces the final output. Scaling beyond individual units, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates deep learning models capable of tackling increasingly sophisticated tasks, from image recognition to natural language processing. Breaking down the computational mechanics, each input x_i has a corresponding weight w_{ij}, and the perceptron simply multiplies each input by its matching weight. The intermediate output, z, is computed as the weighted sum of inputs:
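The weighted-sum-plus-activation computation above can be sketched directly. The `perceptron` helper and input values are invented for illustration, with a simple step function standing in for the activation:

```python
# Minimal sketch of a single perceptron: z = sum_i x_i * w_i + b, then f(z).
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Weighted sum of inputs plus bias, passed through a step activation."""
    z = np.dot(x, w) + b           # intermediate output z
    return 1.0 if z > 0 else 0.0   # nonlinear activation produces the output

x = np.array([0.5, -1.0, 2.0])    # inputs x_i
w = np.array([0.4,  0.3, 0.2])    # one weight per input
out = perceptron(x, w, b=-0.1)
print(out)  # fires because z = 0.2 > 0
```

The `np.dot` call is the summation the text describes; stacking many such units, and feeding their outputs forward as the next layer's inputs, yields the layered networks discussed below.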

@fig-activation-functions (Line 814)

This mathematical formulation directly drives the hardware requirements we discussed earlier. The summation requires accumulator units, the multiplications demand high-throughput arithmetic units, and the memory accesses necessitate high-bandwidth memory systems. Understanding this connection between mathematical operations and hardware requirements is crucial for designing efficient ML systems. Activation functions are critical nonlinear transformations that enable neural networks to learn complex patterns by converting linear weighted sums into nonlinear outputs. Without activation functions, multiple linear layers would collapse into a single linear transformation, severely limiting the network's expressive power. Four commonly used activation functions each exhibit distinct mathematical characteristics that shape their effectiveness in different contexts (@fig-activation-functions). ::: {#fig-activation-functions fig-env="figure" fig-pos="htb" fig-cap="Common Activation Functions: Four nonlinear activation functions plotted with their output ranges. Sigmoid maps inputs to (0,1) with smooth gradients, tanh provides zero-centered outputs in (-1,1), ReLU introduces sparsity by outputting zero for negative inputs, and softmax converts logits into probability distributions." fig-alt="Four plots arranged in 2x2 grid. Top-left: Sigmoid S-curve from 0 to 1. Top-right: Tanh S-curve from -1 to 1. Bottom-left: ReLU showing zero for negative x, linear for positive x. Bottom-right: Softmax showing exponential curve approaching small positive values."}
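The four activation functions in the figure follow directly from their standard definitions; a minimal sketch:

```python
# The four activations from the figure, per their standard definitions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                 # zero-centered, maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negatives -> sparsity

def softmax(z):
    e = np.exp(z - np.max(z))         # shift for numerical stability
    return e / e.sum()                # logits -> probability distribution

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(0.0))      # 0.5, the midpoint of (0, 1)
print(relu(z))           # negatives clipped to zero
print(softmax(z).sum())  # probabilities sum to 1
```

Note that sigmoid, tanh, and ReLU act elementwise, while softmax couples all elements of its input vector through the normalizing sum.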

@fig-activation-functions (Line 981)

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}

For a vector of K values (often called logits), softmax transforms them into K probabilities that sum to 1. One component of the softmax output appears in @fig-activation-functions (bottom-right); in practice, softmax processes entire vectors where each element's output depends on all input values.

@fig-nonlinear (Line 1069)

\hat{y} = \sigma(z) = \sigma\left(\sum (x_i \cdot w_{ij}) + b\right)

@fig-nonlinear demonstrates this principle visually: the left panel shows a linear decision boundary that fails to separate the two classes, while the right panel shows how nonlinear activation functions enable the network to learn a curved boundary that correctly classifies the data. The universal approximation theorem[^fn-universal-approximation] establishes that neural networks with activation functions can approximate arbitrary functions. This theoretical foundation, combined with the computational and optimization characteristics of specific activation functions like ReLU and sigmoid, explains neural networks' practical effectiveness in complex tasks.

@fig-layers (Line 1087)

  1. Output Layer: Produces the final prediction or decision. @fig-layers visualizes this hierarchical processing: data enters at the input layer, flows through multiple hidden layers that progressively extract more abstract features, and emerges at the output layer as a prediction. Each successive layer transforms the data representation, building increasingly complex and abstract features. This hierarchical processing gives deep neural networks their ability to learn complex patterns. ::: {#fig-layers fig-env="figure" fig-pos="htb" fig-cap="Layered Network Architecture: Deep neural networks transform data through successive layers, enabling the extraction of increasingly complex features and patterns. Each layer applies non-linear transformations to the outputs of the previous layer, ultimately mapping raw inputs to desired outputs." fig-alt="Neural network diagram showing input layer on left with multiple nodes, two hidden layers in middle with interconnected nodes, and output layer on right. Arrows show data flow from left to right through fully connected layers."}

@fig-connections (Line 1185)

In the simplest and most common case, each neuron in a layer is connected to every neuron in the previous layer, forming what we call a "dense" or "fully-connected" layer. This pattern means that each neuron has the opportunity to learn from all available features from the previous layer. Fully-connected layers establish foundational principles, but alternative connectivity patterns (explored in @sec-dnn-architectures) can dramatically improve efficiency for structured data by restricting connections based on problem characteristics. @fig-connections illustrates this dense connectivity pattern with explicit weight values: for a network with layers of sizes (n_1, n_2, n_3), every input connects to every hidden neuron via distinct weights (ihWeight), and every hidden neuron connects to every output (hoWeight). The weight matrices have these dimensions:

  • Between first and second layer: \mathbf{W}^{(1)} \in \mathbb{R}^{n_1 \times n_2}
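The weight-matrix shapes can be made concrete with a small sketch. The layer sizes (4, 3, 2) and the random values are illustrative placeholders:

```python
# Dense connectivity as matrix shapes for layer sizes (n1, n2, n3) = (4, 3, 2).
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 3, 2
W1 = rng.normal(size=(n1, n2))  # input-to-hidden weights ("ihWeight")
W2 = rng.normal(size=(n2, n3))  # hidden-to-output weights ("hoWeight")

x = rng.normal(size=n1)
h = x @ W1  # every input reaches every hidden neuron via a distinct weight
y = h @ W2  # every hidden neuron reaches every output via a distinct weight

print(W1.shape, W2.shape, y.shape)  # (4, 3) (3, 2) (2,)
```

The n1 x n2 and n2 x n3 matrices hold one distinct weight per connection, which is exactly what "fully-connected" means at the implementation level.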

@fig-mnist-topology-1 (Line 1348)

Feedforward Network Architecture

Applying the three-layer architecture to MNIST reveals how data characteristics and task requirements constrain network design. @fig-mnist-topology-1 illustrates both perspectives on this architecture: panel (a) shows a 28\times 28 pixel grayscale image of a handwritten digit connected to the hidden and output layers, while panel (b) demonstrates how the 2D image flattens into a 784-dimensional vector. The input layer's width is directly determined by our data format. For a 28\times 28 pixel image, each pixel becomes an input feature, requiring {python} mnist_input_str input neurons (28\times 28 = {python} mnist_input_str). We can think of this either as a 2D grid of pixels or as a flattened vector of 784 values, where each value represents the intensity of one pixel.
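The two views of the input described above, a 2D pixel grid versus a flattened vector, can be sketched with a placeholder image:

```python
# Two views of an MNIST input: a 28x28 grid vs. the flattened 784-vector
# that the input layer actually consumes (image values are placeholders).
import numpy as np

image = np.zeros((28, 28))  # placeholder grayscale digit
image[10, 5] = 0.8          # one "ink" pixel at row 10, column 5

vector = image.reshape(-1)  # flatten row by row into 784 values
print(vector.shape)         # (784,)
# Pixel (row, col) lands at index row*28 + col in the flat vector.
print(vector[10 * 28 + 5])  # 0.8
```

Flattening discards the 2D arrangement; the network sees only the 784-dimensional vector, which is why the input layer's width is fixed by the data format.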

@fig-forward-propagation (Line 1847)

Forward Pass Computation

Forward propagation is the core computational process in a neural network: input data flows through the network's layers to generate predictions. @fig-forward-propagation traces the complete process. Inputs enter from the left, pass through weighted connections to hidden layers, generate a prediction that is compared against the true value, and produce a loss score that drives parameter updates through the optimizer. This process underlies both inference and training. We examine how it works using our MNIST digit recognition example. ::: {#fig-forward-propagation fig-env="figure" fig-pos="htb" fig-cap="Training Loop Architecture: Complete neural network training flow showing forward propagation through layers to generate prediction, comparison with true value via loss function, and backward propagation of gradients through optimizer to update weights and biases." fig-alt="Neural network training diagram. Left side shows input X flowing through blue, red, and green node layers via forward propagation (red arrow). Right side shows prediction and true value boxes feeding into loss function, which outputs loss score to optimizer, which updates weights and biases. Orange arrow shows backward propagation path."}
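The flow the figure traces, inputs through weighted layers to a prediction and a loss score, can be sketched end to end. The layer sizes (784 to 32 to 10), the weight scale, and the random values are assumptions for illustration, not the book's actual MNIST configuration:

```python
# One forward pass plus loss for a tiny two-layer network (illustrative).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = 0.05 * rng.normal(size=(784, 32)), np.zeros(32)  # hidden layer
W2, b2 = 0.05 * rng.normal(size=(32, 10)), np.zeros(10)   # output layer

x = rng.normal(size=784)          # flattened input image
h = relu(x @ W1 + b1)             # hidden activations
probs = softmax(h @ W2 + b2)      # prediction: class probabilities

true_label = 3                    # placeholder true value
loss = -np.log(probs[true_label]) # cross-entropy loss score
print(probs.shape, loss > 0)
```

In training, this loss score would drive the backward pass and optimizer update; in inference, the pass stops at `probs`.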

@fig-training-vs-inference (Line 2488)

Operational Phase Differences

Neural network operation divides into two distinct phases with markedly different computational requirements. Training requires both forward and backward passes to compute gradients and update weights; inference involves only the forward pass. @fig-training-vs-inference contrasts these phases: the top panel shows training with bidirectional data flow (forward for prediction, backward for error correction), while the bottom panel shows inference with only the streamlined forward path. This simplification means each layer performs only one set of operations during inference, transforming inputs to outputs using learned weights without tracking intermediate values for gradient computation. These computational differences manifest directly in hardware requirements and deployment strategies. Training clusters typically employ high-memory GPUs[^fn-training-gpu-specs] with substantial cooling infrastructure. Inference deployments prioritize latency and energy efficiency across diverse platforms: mobile devices utilize low-power neural processors (typically 2-4 W), edge servers deploy specialized inference accelerators[^fn-edge-accelerators], and cloud services employ inference-optimized instances with reduced numerical precision for increased throughput[^fn-inference-precision]. Production inference systems serving millions of requests daily require sophisticated infrastructure including load balancing, auto-scaling, and failover mechanisms typically unnecessary in training environments.

@fig-usps-digit-examples (Line 2812)

The United States Postal Service (USPS) processes over 100 million pieces of mail daily, each requiring accurate routing based on handwritten ZIP codes. In the early 1990s, human operators primarily performed this task, making it one of the largest manual data entry operations worldwide. Automating this process through neural networks represented an early, successful large-scale deployment of artificial intelligence. The complexity of this task becomes evident: a ZIP code recognition system must process images of handwritten digits captured under varying conditions. @fig-usps-digit-examples displays representative samples from this challenging dataset, showing the wide variation in writing styles, pen types, stroke thickness, and character formation that the system must handle. The system must make accurate predictions within milliseconds to maintain mail processing speeds, yet errors in recognition can lead to significant delays and costs from misrouted mail. This real-world constraint meant the system needed not just high accuracy, but also reliable measures of prediction confidence to identify when human intervention was necessary. Handwritten Digit Variability: Real-world handwritten digits exhibit significant variations in stroke width, slant, and character formation, posing challenges for automated recognition systems like those used by the USPS. These examples demonstrate the need for effective feature extraction and model generalization to achieve high accuracy in optical character recognition (OCR) tasks.{#fig-usps-digit-examples width=90% fig-alt="Grid of handwritten digit samples from USPS dataset showing digits 0-9 in multiple rows. Each digit appears in several variations demonstrating different handwriting styles, stroke widths, slants, and character formations that OCR systems must recognize."}

@fig-usps-inference-pipeline (Line 2832)

Production System Architecture

Following a single piece of mail through the USPS recognition system illustrates how the concepts in this chapter integrate into a complete solution. The journey from physical mail to sorted letter demonstrates the interplay between traditional computing, neural network inference, and physical machinery. @fig-usps-inference-pipeline visualizes this hybrid architecture, with the neural network operating as one component within a broader system of conventional preprocessing and post-processing stages. ::: {#fig-usps-inference-pipeline fig-env="figure" fig-pos="htb" fig-cap="USPS Inference Pipeline: The mail sorting pipeline combines traditional computing stages (green) with neural network inference (blue). Raw envelope images undergo preprocessing, including thresholding, segmentation, and normalization, before the neural network classifies individual digits. Post-processing applies confidence thresholds and formats sorting instructions for the physical sorting machinery." fig-alt="Linear pipeline with 6 boxes connected by arrows. From left: Raw Input and Pre-processing in green Traditional Computing section, Neural Network in orange Deep Learning section, then Raw Output, Post-processing, and Final Output in green Traditional Computing section."}


nn_architectures/nn_architectures.qmd

@fig-efficiency-frontier (Line 97)

Throughout this book, we will use five specific model architectures as recurring Lighthouse Examples. These serve as consistent reference points to ground abstract concepts in concrete systems reality. We introduce them here with both their qualitative roles and quantitative characteristics, then define their architectures in detail within their respective sections. These examples are concrete implementations of the Workload Archetypes (Compute Beast, Bandwidth Hog, etc.) introduced in @sec-ml-system-architecture. To understand why these specific models were chosen, it is helpful to look at the history of model evolution through the lens of the Efficiency Frontier (@fig-efficiency-frontier).

@fig-mlp (Line 294)

The pattern processing needs established above demand an architecture capable of relating any input to any output. MLPs solve this with complete connectivity between all nodes. This connectivity requirement manifests through a series of fully-connected layers, where each neuron connects to every neuron in adjacent layers, the "dense" connectivity pattern introduced in @sec-deep-learning-systems-foundations. This architectural principle translates the dense connectivity pattern into matrix multiplication operations, establishing the mathematical foundation that makes MLPs computationally tractable. @fig-mlp illustrates how each layer transforms its input through the fundamental operation introduced in @sec-deep-learning-systems-foundations:

@fig-cnn-spatial-processing (Line 583)

This hierarchical spatial pattern processing appears across many domains. In computer vision, local pixel patterns form edges and textures that combine into recognizable objects. Speech processing relies on patterns across nearby time segments to identify phonemes and words. Sensor networks analyze correlations between physically proximate sensors to understand environmental patterns. Medical imaging depends on recognizing tissue patterns that indicate biological structures. This hierarchical approach succeeds not because it mimics the brain, but because it mirrors the compositional structure of the data itself. Focusing on image processing to illustrate these principles, if we want to detect a cat in an image, certain spatial patterns must be recognized: the triangular shape of ears, the round contours of the face, the texture of fur. Importantly, these patterns maintain their meaning regardless of where they appear in the image. A cat is still a cat whether it appears in the top-left or bottom-right corner. This indicates two key requirements for spatial pattern processing: the ability to detect local patterns and the ability to recognize these patterns regardless of their position. @fig-cnn-spatial-processing shows convolutional neural networks achieving this through hierarchical feature extraction, where simple patterns compose into increasingly complex representations at successive layers.

@fig-cnn-spatial-processing (Line 801)

::: @fig-cnn-spatial-processing illustrates how the CNN architecture introduced earlier in this chapter implements these spatial processing principles in practice. As pioneered by Yann LeCun and @lecun1989backpropagation, the key innovations that make this possible are parameter sharing[^fn-parameter-sharing], local connectivity, and translation invariance[^fn-translation-invariance].

@fig-cnn (Line 1055)

The architectural efficiency of CNNs allows further optimization through specialized techniques like depthwise separable convolutions and pruning, detailed in @sec-model-compression. These optimization strategies build on spatial locality principles, with @sec-ai-acceleration detailing how modern processors exploit convolution's inherent data reuse patterns. @fig-cnn visualizes the core convolution operation: a small filter slides over the input image to generate a feature map, capturing local structures while maintaining translation invariance. For an interactive visual exploration of convolutional networks, the CNN Explainer [@cnn_explainer] project provides an insightful demonstration of how these networks are constructed.
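The sliding-filter operation can be sketched naively. As a note on terminology, deep learning frameworks implement "convolution" as cross-correlation (no kernel flip), which is what this hypothetical `conv2d` does; the input and kernel values are illustrative:

```python
# Naive sliding-window "convolution": a small filter scans the input
# to produce a feature map.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid (no-padding) 2D cross-correlation."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The SAME kh*kw weights are reused at every position
            # (parameter sharing), so a pattern is detected wherever
            # it appears (translation invariance).
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.zeros((5, 5)); image[:, 3:] = 1.0  # vertical edge at column 3
kernel = np.array([[-1, 0, 1]] * 3)           # 3x3 vertical-edge detector
fmap = conv2d(image, kernel)
print(fmap.shape)  # (3, 3) feature map
```

The feature map responds strongly only where the edge pattern occurs, illustrating how local connectivity plus shared weights yields position-independent pattern detection.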

@fig-rnn (Line 1284)

$$\mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$


where $\mathbf{h}_t$ denotes the hidden state at time $t$, $\mathbf{x}_t$ denotes the input at time $t$, $\mathbf{W}_{hh}$ contains the recurrent weights, and $\mathbf{W}_{xh}$ contains the input weights. @fig-rnn visualizes the unfolded network structure, making explicit the temporal dependencies that this recurrence creates.
In word sequence processing, each word may be represented as a 100-dimensional vector ($\mathbf{x}_t$), with a hidden state of 128 dimensions ($\mathbf{h}_t$). At each time step, the network combines the current input with its previous state to update its sequential understanding, establishing a memory mechanism capable of capturing patterns across time steps.
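The recurrence above, with the 100-dimensional inputs and 128-dimensional hidden state from the example, can be sketched directly; the tanh nonlinearity and small random weight scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 100, 128                       # dims from the text
W_xh = rng.normal(0, 0.01, (d_h, d_in))    # input weights
W_hh = rng.normal(0, 0.01, (d_h, d_h))     # recurrent weights
b_h = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One step of h_t = f(W_hh h_{t-1} + W_xh x_t + b_h), with f = tanh."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h = np.zeros(d_h)                          # initial hidden state
for t in range(5):                         # unroll five time steps
    x_t = rng.normal(size=d_in)            # stand-in for a word embedding
    h = rnn_step(h, x_t)                   # state carries memory forward
```

The loop makes the sequential dependency explicit: each `h` depends on the previous one, which is exactly why RNN steps cannot be parallelized across time.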

@fig-transformer-attention-visualized (Line 1481)

Expanding beyond language, this requirement for dynamic processing appears across many domains. In protein structure prediction, interactions between amino acids depend on their chemical properties and spatial arrangements. In graph analysis, node relationships vary based on graph structure and node features. In document analysis, connections between different sections depend on semantic content rather than just proximity. Synthesizing these requirements, dynamic processing demands specific capabilities: the system must compute relationships between all pairs of elements, weigh these relationships based on content, and use the resulting weights to selectively combine information. Unlike previous architectures with fixed connectivity patterns, dynamic processing requires the flexibility to modify its computation graph based on the input itself. These capabilities lead to the attention mechanism, which serves as the foundation for the Transformer architecture examined in detail in the following sections. @fig-transformer-attention-visualized shows attention enabling this dynamic information flow. ::: {#fig-transformer-attention-visualized fig-env="figure" fig-pos="htb" fig-cap="Attention Weights Visualization: Attention head (layer 4, head 2) resolving the pronoun 'they' in the sentence. Line thickness indicates attention weight magnitude: 'student', 'The', and 'finish' receive equally strong attention (bold connections), demonstrating that attention learns to link pronouns with their referents across arbitrary distances. This dynamic routing replaces RNN sequential processing with O(1) information flow depth, enabling parallel computation across all 12 positions simultaneously." fig-alt="Sentence tokens listed vertically with cyan attention lines from highlighted word they connecting to all other tokens. Thick lines to student and finish show high attention weights. Demonstrates pronoun-referent linking across arbitrary distances."}

@fig-attention (Line 1548)

@fig-attention-weightcalc (Line 1668)

::: Unlike the fixed weight matrices found in previous architectures, attention weights are computed dynamically for each input. @fig-attention-weightcalc demonstrates this dynamic computation, showing how the model adapts its processing based on the specific content. ::: {#fig-attention-weightcalc fig-env="figure" fig-pos="htb" fig-cap="QKV Projection Computation: The embedding matrix (6 \times 768) multiplies with QKV weight matrices (768 \times 2304) plus bias to produce combined projections (6 \times 2304). The 2304 output dimension contains concatenated query, key, and value projections (each 768-dimensional). This single batched matrix multiplication, requiring 6 \times 768 \times 2304 = 10.6 million MACs, replaces three separate projection operations for efficiency. Source: Transformer Explainer [@transformer_explainer]." fig-alt="Matrix multiplication: 6x768 embedding times 768x2304 QKV weights plus 2304 bias equals 6x2304 output. Blue and red regions show concatenated query, key, value projections. Token labels Data, visualization, em, powers, users, to."}
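The batched projection in the figure reduces to a single matrix multiply. This sketch reproduces only the shapes and MAC count from the caption; the weight values are random placeholders:

```python
import numpy as np

seq_len, d_model = 6, 768
d_qkv = 3 * d_model                            # 2304: concatenated Q, K, V

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))        # token embeddings (6 x 768)
W_qkv = rng.normal(size=(d_model, d_qkv)) * 0.02
b_qkv = np.zeros(d_qkv)

qkv = X @ W_qkv + b_qkv                        # one batched projection (6 x 2304)
Q, K, V = np.split(qkv, 3, axis=1)             # slice out the three views

macs = seq_len * d_model * d_qkv               # multiply-accumulates for the GEMM
```

Fusing the three projections into one GEMM is a throughput optimization: one large matrix multiply keeps the hardware's compute units busier than three smaller ones.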

@fig-transformer (Line 1985)

Self-attention learns dynamic activation patterns across the input sequence. Unlike CNNs which apply fixed filters or RNNs which use fixed recurrence patterns, attention learns which elements should activate together based on their content. This creates a form of adaptive connectivity where the effective network topology changes for each input. Recent research has shown that attention heads in trained models often specialize in detecting specific linguistic or semantic patterns [@clark2019what], suggesting that the mechanism naturally discovers interpretable structural regularities in data. The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. @fig-transformer illustrates this complete architecture, showing how these components combine to process input sequences in parallel while capturing complex dependencies without sequential computation. Transformers have demonstrated significant effectiveness across a wide range of tasks, from natural language processing to computer vision, transforming deep learning architectures across domains. ::: {#fig-transformer fig-env="figure" fig-pos="htb" fig-cap="Transformer Architecture (Encoder-Decoder): Complete architecture from Vaswani et al. The encoder (left, repeated N times) consists of multi-head attention followed by feed-forward layers, each with residual connections (arrows bypassing blocks) and layer normalization. The decoder (right) adds masked attention to prevent attending to future tokens during autoregressive generation. Positional encodings (sine waves) inject sequence order information absent from the permutation-invariant attention operation. This design enables training parallelism across all positions while the decoder maintains autoregressive causality during inference. Source: Vaswani et al. [@vaswani2017attention]." fig-alt="Encoder-decoder architecture. Encoder: multi-head attention, add-norm, feed-forward, add-norm, repeated Nx. Decoder adds masked attention. Positional encoding sine waves at inputs. Skip connections bypass sublayers. Linear and softmax at top."}

@fig-context-explosion (Line 2143)

Fifth, self-attention generates memory-intensive intermediate results. The attention weights matrix ($N\times N$) and per-head intermediate results create large memory requirements, especially for long sequences, requiring careful memory management for deployment on constrained devices. The parallel nature of Transformers makes them well-suited for modern hardware, but quadratic complexity with sequence length constrains long-sequence processing. Much research has therefore focused on optimization techniques such as sparse attention patterns and low-rank approximations. Each technique presents its own trade-offs between computational efficiency and model expressiveness, a balance that must be considered in practical applications. As shown in @fig-context-explosion, this architectural capability has driven an exponential growth in context windows, enabling models to reason over entire books or codebases in a single pass.
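A back-of-envelope calculation shows how quickly the quadratic attention matrices dominate memory at long context. The 12-head, fp16 configuration below is a hypothetical example, not a specific model:

```python
def attn_weights_bytes(seq_len, n_heads=12, dtype_bytes=2):
    """Memory for the per-head N x N attention matrices (fp16 assumed)."""
    return n_heads * seq_len * seq_len * dtype_bytes

short = attn_weights_bytes(1_024)      # ~25 MB: easily fits on-chip
long_ = attn_weights_bytes(128_000)    # ~393 GB: the quadratic wall
```

Quadrupling the sequence length multiplies this term by sixteen, which is why techniques like sparse attention target the $N\times N$ matrix specifically.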

@fig-example-skip-connection (Line 2488)

Modern Architectures: Synthesis and Unification

The progression from perceptrons to attention mechanisms shows how each architectural generation contributed building blocks that modern designs combine and refine. Transformers, in particular, represent a sophisticated synthesis of these fundamental components. Rather than introducing entirely new patterns, they innovate through strategic combination and refinement of existing components. The Transformer architecture exemplifies this approach: at its core, MLP-style feedforward networks process features between attention layers. The attention mechanism itself builds on sequence model concepts while eliminating recurrent connections, instead employing position embeddings inspired by CNN intuitions. Skip connections inherited from ResNets (@fig-example-skip-connection) enable gradient flow through deep stacks, while layer normalization, evolved from CNN batch normalization, stabilizes optimization [@ba2016layer]. The transition to Transformers represents a fundamental engineering shift from sequential to parallel state management. While RNNs processed tokens one by one, creating a computational bottleneck that limited hardware parallelism, Transformers introduced the "Attention" mechanism as an architectural solution to a systems constraint. By replacing time-step dependencies with global, data-dependent routing, we moved from O(n) sequential complexity to O(1) depth for information flow, enabling full use of the massive parallel processing power of modern accelerators.

@fig-im2col-diagram (Line 2580)

Three operations serve as the fundamental building blocks for all deep learning computations: matrix multiplication, sliding window operations, and dynamic computation. These operations are primitive because they cannot be further decomposed without losing their essential computational properties and efficiency characteristics. Matrix multiplication represents the basic form of transforming sets of features. Multiplying a matrix of inputs by a matrix of weights computes weighted combinations, the core operation of neural networks. For example, in our MNIST network, each 784-dimensional input vector multiplies with a $784\times 100$ weight matrix. This pattern appears everywhere: MLPs use it directly for layer computations, CNNs reshape convolutions into matrix multiplications (@fig-im2col-diagram shows this transformation of a $3\times 3$ convolution into a matrix operation), and Transformers use it extensively in their attention mechanisms.
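As a minimal sketch of that MNIST-sized layer, one GEMM computes every weighted combination for a whole batch at once; the batch size of 32 and random values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 784))       # 32 flattened 28x28 images
W1 = rng.normal(size=(784, 100)) * 0.01  # first-layer weights

hidden = batch @ W1                      # one GEMM: (32x784) @ (784x100) -> (32x100)
```

Each output element is the dot product of one input row with one weight column, so the single `@` call replaces 3,200 explicit dot products.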

Computational Building Blocks

@fig-im2col-diagram (Line 2697)

::: The im2col (image to column) technique accomplishes matrix reshaping by unfolding overlapping image patches into columns of a matrix (@fig-im2col-diagram). Each sliding window position in the convolution becomes a column in the transformed matrix, while the filter kernels are arranged as rows. This allows the convolution operation to be expressed as a standard GEMM (General Matrix Multiply) operation. The transformation trades memory consumption (duplicating data where windows overlap) for computational efficiency, enabling CNNs to leverage decades of BLAS optimizations and achieving 5-10x speedups on CPUs. In modern systems, these matrix multiplications map to specific hardware and software implementations. Datacenter accelerators can deliver on the order of hundreds of TFLOPS on mixed-precision matrix operations, and software frameworks like PyTorch and TensorFlow automatically map these high-level operations to optimized matrix libraries (for example, NVIDIA cuBLAS and Intel oneMKL) that exploit available hardware capabilities.
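A naive single-channel im2col can be written in a few lines; this sketch ignores stride, padding, and channels, and the 4×4 input is purely illustrative:

```python
import numpy as np

def im2col(image, k):
    """Unfold every k x k window of a 2D image into one column."""
    H, W = image.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Overlapping windows are duplicated: memory for speed.
            cols[:, i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
cols = im2col(img, 3)                            # (9, 4): 9 taps x 4 positions
fmap = (kernel.ravel() @ cols).reshape(2, 2)     # convolution as one GEMM
```

With multiple filters, the kernel rows stack into a matrix and the whole layer becomes a single matrix-matrix multiply, which is exactly what BLAS libraries optimize.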

@fig-collective-comm (Line 2793)

Memory access patterns describe where data resides, but a complementary question determines system performance: how does information flow between components? Data movement primitives characterize these flows. These patterns are key because data movement often consumes more time and energy than computation itself, as moving data from off-chip memory typically requires 100-1000$\times$ more energy than performing a floating-point operation. Four data movement patterns are prevalent in deep learning architectures: broadcast, scatter, gather, and reduction. @fig-collective-comm visualizes these fundamental patterns and their relationships, showing how data flows between processing elements in each case. Broadcast operations send the same data to multiple destinations simultaneously. In matrix multiplication with batch size 32, each weight must be broadcast to process different inputs in parallel. Modern hardware supports this through specialized interconnects and hardware multicast capabilities, with bandwidth on the order of hundreds of GB/s in high-end accelerator interconnects, while some accelerators also use dedicated on-chip broadcast fabrics. Software frameworks optimize broadcasts by restructuring computations (like matrix tiling) to maximize data reuse. ::: {#fig-collective-comm fig-env="figure" fig-pos="htb" fig-cap="Data Movement Primitives: Four fundamental patterns govern information flow in neural network computation. Broadcast (top-left) replicates a single value to all destinations, used when sharing weights across batch elements. Scatter (top-right) distributes distinct elements to different destinations, enabling work partitioning. Gather (bottom-left) collects distributed values to a single location, as in attention pooling. Reduction (bottom-right) combines multiple values through aggregation (sum, max), appearing in gradient synchronization and attention scoring. Moving data typically costs 100-1000x more energy than computation, making these patterns critical optimization targets." fig-alt="Four diagrams with nodes and arrows. Broadcast: one red square to four nodes. Scatter: four colored squares to four nodes. Gather: four nodes with colored squares to one. Reduction: four colored nodes combine through aggregation to one."}
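The four patterns map onto familiar array operations; this toy sketch uses a 4-element vector and arbitrary index permutations chosen for illustration:

```python
import numpy as np

x = np.array([1., 2., 3., 4.])

# Broadcast: one value replicated to every destination lane.
b = np.broadcast_to(np.array(5.0), (4,)).copy()

# Scatter: distinct values placed at distinct destinations.
s = np.zeros(4)
s[[3, 0, 2, 1]] = x           # element i of x lands at destination index[i]

# Gather: collect values from scattered source locations.
g = x[[2, 0, 3, 1]]

# Reduction: combine many values into one (sum, max, ...).
r = x.sum()
```

Gradient synchronization in data-parallel training is a reduction (summing per-device gradients) followed by a broadcast (sharing the result), which is why AllReduce combines exactly these two primitives.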

@fig-dnn-fm-framework (Line 3188)

::: The decision flowchart in @fig-dnn-fm-framework guides systematic architecture selection by first matching data characteristics to architectural strengths, then validating against practical constraints. The process is inherently iterative: resource limitations or performance gaps often necessitate reconsidering earlier choices, ensuring consideration of all relevant factors while avoiding selection based on novelty or perceived sophistication. The framework applies through four key steps. First, data analysis: pattern types in data provide the strongest initial signal. Spatial data naturally aligns with CNNs, sequential data with RNNs. Second, progressive constraint validation: each constraint check (memory, computational budget, inference speed) acts as a filter. Failing any constraint requires either scaling down the current architecture or considering a fundamentally different approach.


frameworks/frameworks.qmd

@fig-mlfm-timeline (Line 130)

  1. Automatic Differentiation (2015–present): While NumPy required engineers to manually derive and code gradients, modern frameworks like TensorFlow [@abadi2016tensorflow] and PyTorch automated this through the computational graph. This architectural shift, separating the definition of the model from the computation of its derivatives, enabled the scaling of deep learning. This evolution highlights a critical engineering lesson: scaling ML development required turning the mathematical chain rule into a software primitive. The transition from manual gradients to static graphs (Theano [@al2016theano], TensorFlow 1.x), and eventually to dynamic graphs (PyTorch), reflects the industry's search for the optimal balance between performance and developer velocity. @fig-mlfm-timeline visualizes this progression from foundational libraries to modern frameworks. ::: {#fig-mlfm-timeline fig-env="figure" fig-pos="htb" fig-cap="Computational Library Evolution: Modern machine learning frameworks build upon decades of numerical computing advancements, transitioning from low-level routines like BLAS and LAPACK to high-level abstractions in NumPy, SciPy,[^fn-scipy-date] and finally to deep learning frameworks such as Theano [@bergstra2010theano], TensorFlow, and PyTorch." fig-alt="Horizontal timeline from 1979 to 2018 with colored boxes marking key years. Dashed arrows connect to milestones below: 1979 BLAS introduced, 1992 LAPACK extends BLAS, 2006 NumPy becomes Python's numerical backbone, 2007 SciPy and Theano introduce computational graphs, 2015 TensorFlow revolutionizes distributed ML, 2016 PyTorch introduces dynamic graphs, 2018 JAX introduces functional paradigms."}

@fig-comp-graph (Line 206)

The Computational Graph

At the heart of this problem is a representation called the computational graph, a directed acyclic graph (DAG) where nodes represent operations and edges represent data dependencies. This graph is the framework's internal model of your computation. @fig-comp-graph illustrates a simple example: computing $z = x \times y$ requires two input nodes (x and y), one operation node (multiplication), and one output node (z). The execution problem asks: when is this graph constructed, and when is it executed? ::: {#fig-comp-graph fig-env="figure" fig-pos="htb" fig-cap="Simple Computational Graph. A directed acyclic graph representing the computation z = x \\times y, where nodes define operations and edges specify the flow of data between them." fig-alt="Simple directed graph with nodes x and y flowing into function f(x,y) which outputs z."}
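A toy define-then-execute interpreter makes the separation concrete. The node-dictionary representation here is a deliberately minimal stand-in for a real framework's graph IR:

```python
# A minimal DAG for z = x * y: nodes are operations, edges are data deps.
# The dict is written in topological order (Python dicts preserve insertion order).
graph = {
    "x":   {"op": "input",  "deps": []},
    "y":   {"op": "input",  "deps": []},
    "mul": {"op": "mul",    "deps": ["x", "y"]},
    "z":   {"op": "output", "deps": ["mul"]},
}

def execute(graph, feeds):
    """Walk the graph in dependency order, computing each node's value."""
    values = {}
    for name, node in graph.items():
        if node["op"] == "input":
            values[name] = feeds[name]
        elif node["op"] == "mul":
            a, b = (values[d] for d in node["deps"])
            values[name] = a * b
        elif node["op"] == "output":
            values[name] = values[node["deps"][0]]
    return values["z"]

z = execute(graph, {"x": 3.0, "y": 4.0})
```

Because the graph exists as data before `execute` runs, a framework can analyze it, fuse nodes, or place them on devices before any value flows through, which is precisely the leverage static execution exploits.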

@fig-mlfm-comp-graph (Line 236)

::: Real machine learning models require much more complex graph structures. @fig-mlfm-comp-graph shows how a neural network computation graph involves interconnected operation nodes with system-level interactions including memory management and device placement, demonstrating how the graph representation enables pre-execution analysis and resource allocation. ::: {#fig-mlfm-comp-graph fig-env="figure" fig-pos="htb" fig-cap="Computation Graph with System Interactions. A neural network represented as a directed acyclic graph (left), with system components including memory management and device placement (right) that interact with the graph to optimize resource allocation before execution." fig-alt="Left side shows computational graph with 6 operation nodes connected by data flow edges. Right side shows system components box with Memory Management and Device Placement nodes that interact with the computational graph."}

@fig-mlfm-dynamic-graph-flow (Line 383)

After backward() completes, the autograd tape is destroyed to free memory. The next forward pass builds a completely new tape. This design enables memory-efficient training: you only pay for gradient computation storage during the backward pass. @fig-mlfm-dynamic-graph-flow illustrates this "define-by-run" execution model: each operation is defined, executed, and completed before moving on to define the next operation. This contrasts sharply with static graphs, where all operations must be defined upfront. When an operation is defined, it is immediately executed, and its results become available for subsequent operations or for inspection during debugging. ::: {#fig-mlfm-dynamic-graph-flow fig-env="figure" fig-pos="htb" fig-cap="Dynamic Graph Execution Flow: In eager execution, each operation is defined and immediately executed before the next operation begins. This define-by-run model enables natural debugging and data-dependent control flow at the cost of optimization opportunities." fig-alt="Flow diagram showing Start to Operation 1 to Operation 1 Executed to Operation 2 to Operation 2 Executed to End. Above arrows show Define Operation, Execute Operation, Define Next Operation, Execute Operation, Repeat Until Done."}
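The define-by-run contrast is easiest to see with data-dependent control flow, which plain Python expresses naturally but a static graph must encode specially; the function and threshold below are invented for illustration:

```python
def forward(x):
    """Define-by-run: the 'graph' is just Python executing line by line."""
    h = x * 2.0            # op 1: defined and executed immediately
    if h > 1.0:            # control flow can depend on the data itself
        h = h - 1.0        # this op only exists in the trace for some inputs
    return h * h           # executed as soon as it is defined

# Two inputs trace two different computation graphs:
a = forward(0.3)           # branch not taken: (0.6)^2
b = forward(2.0)           # branch taken: (4.0 - 1.0)^2
```

Each call builds and discards its own trace, mirroring the per-iteration autograd tape the text describes: you pay for gradient bookkeeping only while it is needed.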

@fig-mlfm-static-graph (Line 482)

::: @fig-mlfm-static-graph illustrates this two-phase approach: first, the complete computational graph is constructed and optimized; then, during the execution phase, actual data flows through the graph to produce results. This separation enables the framework to perform thorough analysis and optimization of the entire computation before any execution begins. ::: {#fig-mlfm-static-graph fig-env="figure" fig-pos="htb" fig-cap="Static Graph: Define then Execute. The two phases of static graph execution. The definition phase (left) declares operations and builds the graph. The execution phase (right) loads data, runs the optimized graph, and produces results." fig-alt="Flow diagram showing two phases. Definition Phase: Define Operations, Declare Variables, Build Graph. Execution Phase: Load Data, Run Graph, Get Results. Arrows connect boxes left to right."}

@fig-python-tax (Line 742)

::: @fig-python-tax demonstrates this tax in action. Eager execution (top) suffers from "gaps" where the GPU sits idle while Python dispatches the next kernel. Compilation (bottom) fuses these operations into a single kernel launch, eliminating the gaps.

@fig-compilation-continuum (Line 1057)

Production inference ($N_{\text{dev}} \approx 0$, $N_{\text{prod}} \rightarrow \infty$): Maximize compilation. With no development iterations and potentially millions of requests, every optimization matters. Use mode='max-autotune' despite hour-long compilation; the cost is amortized over the deployment lifetime. @fig-compilation-continuum visualizes the decision space: ::: {#fig-compilation-continuum fig-cap="The Compilation Continuum: Optimal execution strategy depends on development-to-production ratio. Left region (high dev iterations): eager mode dominates. Right region (high prod executions): compilation dominates. The crossover point depends on compilation cost and per-execution speedup." fig-alt="Graph with x-axis 'Production Executions' (log scale) and y-axis 'Total Time'. Three lines: Eager (steep slope), JIT (moderate slope with offset), Static (gentle slope with larger offset). Lines cross at different points showing when compilation becomes beneficial."}
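The crossover in the figure follows from a simple cost model; the compile cost and per-step times below are hypothetical numbers chosen only to make the arithmetic concrete:

```python
def total_time(n_exec, compile_cost, per_exec):
    """Total wall-clock time for a strategy over n_exec runs (seconds)."""
    return compile_cost + n_exec * per_exec

# Hypothetical costs: eager has no compile step but slower steps.
eager = lambda n: total_time(n, 0.0, 0.050)
compiled = lambda n: total_time(n, 120.0, 0.010)

# Compilation pays off after compile_cost / (eager_step - compiled_step) runs.
crossover = 120.0 / (0.050 - 0.010)     # 3000 executions under these numbers
```

Below the crossover, eager mode wins outright; far above it, the 120-second compile is noise next to the cumulative per-step savings, which is why serving workloads favor `max-autotune`.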

@fig-tensor-data-structure-a (Line 1888)

::: Tensor Structure and Dimensions. A tensor generalizes scalars, vectors, and matrices to arbitrary dimensions. The hierarchy is straightforward: a scalar is a rank-0 tensor (single value), a vector is rank-1 (sequence of values), and a matrix is rank-2 (rows and columns). Higher ranks extend this pattern through nesting, so a rank-3 tensor is a stack of matrices, as @fig-tensor-data-structure-a illustrates. ::: {#fig-tensor-data-structure-a fig-env="figure" fig-pos="htb" fig-cap="Tensor Rank Hierarchy. Four shapes illustrating tensor ranks from left to right: a single value (rank 0, scalar), a column of values (rank 1, vector), a grid of values (rank 2, matrix), and a cube of values (rank 3, three-dimensional tensor)." fig-alt="Four shapes showing tensor ranks left to right: single box labeled Rank 0, vertical column of numbers labeled Rank 1, 2D grid of numbers labeled Rank 2, and 3D cube labeled Rank 3."}
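The rank hierarchy maps one-to-one onto array dimensionality; shown here with NumPy (PyTorch tensors behave identically via `.ndim`):

```python
import numpy as np

scalar = np.array(7.0)            # rank 0: a single value
vector = np.array([1., 2., 3.])   # rank 1: a sequence of values
matrix = np.zeros((2, 3))         # rank 2: rows and columns
cube = np.zeros((4, 2, 3))        # rank 3: a stack of four matrices

ranks = [t.ndim for t in (scalar, vector, matrix, cube)]
```

Each rank is built by nesting the previous one, exactly as the figure's progression from box to column to grid to cube suggests.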

@fig-tensor-data-structure-b (Line 1937)

::: This rank hierarchy maps directly onto ML data. A color image is a rank-3 tensor: height × width × 3 channels (red, green, blue), as @fig-tensor-data-structure-b illustrates. Stacking a batch of N images adds a fourth dimension, producing a rank-4 tensor of shape [N, 3, H, W]. Every convolutional layer in a vision model consumes and produces tensors of exactly this shape, which is why the tensor abstraction is so central to framework design. ::: {#fig-tensor-data-structure-b fig-env="figure" fig-pos="htb" fig-cap="Image as RGB Tensor. Three stacked grids representing the red, green, and blue color channels of an image, with dimension labels showing width, height, and channel depth forming a rank-3 tensor. Credit: Niklas Lang https://towardsdatascience.com/what-are-tensors-in-machine-learning-5671814646ff." fig-alt="Three stacked 3x3 grids in red, green, and blue representing RGB color channels. Dimension labels show width 3 pixels, height 3 pixels, and 3 color channels forming a 3D tensor for image data."}
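Slicing a rank-4 NCHW batch peels off one rank at a time, mirroring the channel decomposition in the figure; the batch size of 32 and 224×224 resolution are illustrative:

```python
import numpy as np

N, C, H, W = 32, 3, 224, 224                         # batch, channels, height, width
batch = np.zeros((N, C, H, W), dtype=np.float32)     # rank-4 NCHW tensor

one_image = batch[0]          # rank 3: (3, 224, 224), one RGB image
red_channel = batch[0, 0]     # rank 2: (224, 224), a single channel
```

Every convolutional layer in the pipeline consumes and produces tensors of this `[N, C, H, W]` shape, so shape bookkeeping like this is the first thing frameworks validate at dispatch time.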

@fig-tensor-memory-layout (Line 1991)

Framework tensors carry more than raw numbers. Each tensor stores metadata that the runtime uses to validate operations and select fast execution paths: a shape tuple (e.g., [64, 3, 224, 224] for a batch of images), a dtype (float32, float16, int8), and a device tag (CPU, cuda:0). A matrix multiplication, for instance, checks shape compatibility at dispatch time and uses the dtype to route to the correct hardware kernel, whether a standard FP32 GEMM or a Tensor Core FP16 path. Memory layout implementation introduces distinct challenges in tensor design. While tensors provide an abstraction of multi-dimensional data, physical computer memory remains linear. Stride patterns address this disparity by creating mappings between multi-dimensional tensor indices and linear memory addresses. These patterns significantly impact computational performance by determining memory access patterns during tensor operations. @fig-tensor-memory-layout demonstrates this concept using a 2×3 tensor, showing both row-major and column-major memory layouts with their corresponding stride calculations. ::: {#fig-tensor-memory-layout fig-env="figure" fig-pos="htb" fig-cap="Tensor Memory Layout: A 2×3 tensor can be stored in linear memory using either row-major (C-style) or column-major (Fortran-style) ordering. Strides define the number of elements to skip in each dimension when moving through memory, enabling frameworks to calculate memory addresses for tensor[i,j] as base_address + i×stride[0] + j×stride[1]. The choice of memory layout significantly impacts cache performance and computational efficiency." fig-alt="Left: 2x3 tensor grid with values 1-6. Right: two linear arrays showing row-major layout (1,2,3,4,5,6) and column-major layout (1,4,2,5,3,6). Below: stride calculations for row-major [3,1] and column-major [1,2]."}
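NumPy exposes both layouts and their strides directly; it reports strides in bytes, so this sketch divides by the item size to recover the element strides shown in the figure:

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 6]])                    # the 2x3 tensor from the figure

row_major = t                                 # C order (NumPy's default)
col_major = np.asfortranarray(t)              # Fortran (column-major) order

# Convert byte strides to element strides:
rm = tuple(s // t.itemsize for s in row_major.strides)   # (3, 1)
cm = tuple(s // t.itemsize for s in col_major.strides)   # (1, 2)

# Address arithmetic from the caption: flat = i*stride[0] + j*stride[1]
i, j = 1, 2
flat_rm = i * rm[0] + j * rm[1]               # index 5 in row-major memory
```

The same index arithmetic also explains why operations like `transpose` can be free: swapping the stride tuple reinterprets the same memory without copying it.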

@fig-3d-parallelism (Line 2476)

When training scales beyond a single machine, these same abstractions extend to manage process groups and communication primitives. Frameworks use constructs like ProcessGroup (PyTorch) or Mesh (JAX) to define how devices communicate, maintaining state for collective operations such as AllReduce that synchronize gradients across thousands of GPUs. This includes partitioning computational graphs, synchronizing gradients, and redistributing data as needed. We introduce these concepts here because they shape framework API design even for single-node code. The implementation details of distributed training, including gradient compression, communication topologies, and fault tolerance, are covered in Volume II (Distributed Training Systems). @fig-3d-parallelism provides an overview of how large-scale training distributes computation across multiple dimensions [@mcmahan2023communicationefficient]. ::: {#fig-3d-parallelism fig-env="figure" fig-pos="htb" fig-cap="3D Parallelism. A grid of eight accelerator clusters arranged in two rows and four columns, each containing stacked computational units. Distinct colors encode the three parallelism dimensions: data parallelism across columns, pipeline parallelism across rows, and model parallelism within each cluster." fig-alt="Grid of 8 GPU clusters in 2 rows and 4 columns. Each cluster contains 4 stacked cubes. Colors vary: blue, red, green, orange in bottom row; olive, yellow, brown, pink in top row."}

@fig-mlfm-core-ops (Line 2585)

Core Operations

When you write y = torch.matmul(x, w), what actually happens between Python and the GPU? The answer involves three distinct layers working in coordination. @fig-mlfm-core-ops illustrates how hardware abstraction operations manage computing platform complexity, basic numerical operations implement mathematical computations, and system-level operations coordinate resources and execution. ::: {#fig-mlfm-core-ops fig-env="figure" fig-pos="htb" fig-cap="Core Operations Stack. Three grouped layers showing how frameworks bridge Python code to hardware. The top layer contains system-level operations (scheduling, memory management, resource optimization), the middle layer holds numerical operations (GEMM, BLAS, element-wise), and the bottom layer provides hardware abstraction (kernel management, memory abstraction, execution control)." fig-alt="Three grouped boxes connected by arrows. System-Level: Scheduling, Memory Management, Resource Optimization. Numerical: GEMM, BLAS, Element-wise Operations. Hardware: Kernel Management, Memory Abstraction, Execution Control."}

@fig-tensorflow-architecture (Line 2903)

TensorFlow's architecture reflects a comprehensive solution to the Abstraction Problem: how do you target diverse hardware—from cloud TPUs to microcontrollers—using a single interface? Its design philosophy is built on the Static Graph (or "Define-and-Run") principle. By requiring the model to be represented as a complete computational graph before execution, TensorFlow enables ahead-of-time (AOT) compilation and optimization. This approach prioritizes the Deployment Spectrum. Because the framework sees the entire graph, it can perform aggressive optimizations like constant folding, operator fusion, and memory layout optimization before the first byte of data is processed. This is why TensorFlow remains the standard for complex production ecosystems, as illustrated in @fig-tensorflow-architecture. ::: {#fig-tensorflow-architecture fig-env="figure" fig-pos="htb" fig-cap="TensorFlow Training-to-Deployment Pipeline. Two-column diagram showing the training path (left) from data preprocessing through tf.keras and distribution strategy across CPU, GPU, and TPU, and the deployment path (right) from SavedModel export to TensorFlow Serving, Lite, JS, and language bindings. Source: TensorFlow." fig-alt="Two-column diagram. Training: data preprocessing, tf.keras, TensorFlow Hub, Premade Estimators, Distribution Strategy across CPU/GPU/TPU. Deployment via SavedModel to TensorFlow Serving, Lite, JS, and language bindings."}

@fig-onnx (Line 3193)

: Framework Selection by Deployment Target. Recommended frameworks, optimization techniques, and key constraints for each deployment tier, from cloud servers to microcontrollers. {#tbl-deployment-frameworks} The table above reveals a fragmented landscape: different deployment targets favor different frameworks. This fragmentation creates a practical problem when organizations train in one framework but deploy on a target best served by another. Interoperability through ONNX. The Open Neural Network Exchange (ONNX)[^fn-onnx] format addresses this by enabling model portability across frameworks: train in PyTorch, deploy via TensorFlow Lite or ONNX Runtime. This standardization eliminates manual conversion when moving between development and production environments. @fig-onnx illustrates this hub-and-spoke interoperability model. Detailed deployment optimization (quantization, pruning, hardware-specific compilation) appears in @sec-model-compression and @sec-model-serving-systems. ::: {#fig-onnx fig-env="figure" fig-pos="htb" width="70%" fig-cap="Framework Interoperability: ONNX enables model portability across frameworks, allowing training in one framework and deployment in another." fig-alt="Hub diagram with ONNX logo at center. Left side: PyTorch, TensorFlow, Keras with arrows pointing inward. Right side: TF Lite, ONNX Runtime with arrows outward."}


hw_acceleration/hw_acceleration.qmd

@fig-iron-law-heatmap (Line 129)

::: Amdahl's Law is not merely theoretical: it explains why many GPU upgrades disappoint in practice. The following heatmap (@fig-iron-law-heatmap) visualizes this "Acceleration Wall," showing that unless your workload is highly parallelizable (P > 0.99), investing in faster hardware yields diminishing returns.
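Amdahl's Law is a one-line formula; the sketch below evaluates it for a hypothetical 100× accelerator to show the acceleration wall the heatmap visualizes:

```python
def amdahl_speedup(p, s):
    """Overall speedup when fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even an infinitely fast accelerator is capped at 1 / (1 - p):
modest = amdahl_speedup(0.90, 100)    # about 9.2x overall, not 100x
extreme = amdahl_speedup(0.99, 100)   # about 50x overall
```

With 10% of the workload left serial, a 100× accelerator delivers under 10× end-to-end, which is exactly why GPU upgrades disappoint when the surrounding pipeline is not parallelized.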

@fig-timeline (Line 242)

This progression from specialization to integration has shaped modern computing. Each domain (graphics, signal processing, machine learning) introduced specialized architectures that were later absorbed into general-purpose platforms. @fig-timeline traces this trajectory: each era produced accelerators addressing the dominant computational bottleneck of the period. The capabilities enabling today's real-time translation, recommendations, and on-device inference build directly on principles established in earlier specialization waves. ::: {#fig-timeline fig-env="figure" fig-pos="htb" fig-cap="Hardware Specialization Timeline. Computing architectures progressively incorporate specialized accelerators to address emerging performance bottlenecks, from floating-point units to graphics processors and machine learning accelerators. Each era produced hardware tailored to the dominant computational patterns of its period." fig-alt="Timeline spanning 1980s to 2020s showing hardware evolution: floating-point units, GPUs with hardware transform and lighting, media codecs, TPUs with tensor cores, and application-specific AI engines."}

@fig-accelerator-anatomy (Line 441)

  1. Tolerance for Reduced Precision: Neural networks typically remain robust even when using 8-bit or 4-bit integers instead of 64-bit floating-point numbers. This flexibility allows architects to fit 10$\times$ more compute units in the same silicon area. The primary engineering challenge is no longer "how fast can we calculate?" but "how close can we keep the data to the calculation?" In modern accelerators, accessing data from external memory (DRAM) can consume 100$\times$ more energy than the actual arithmetic operation. This disparity drives the "Anatomy of an Accelerator" (@fig-accelerator-anatomy), prioritizing high-bandwidth memory (HBM) and large on-chip scratchpads to minimize data movement.
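The precision tolerance can be illustrated with a minimal symmetric int8 quantization sketch (the tensor is synthetic and the per-tensor scaling scheme is one of several used in practice):

```python
import numpy as np

# Sketch: symmetric per-tensor int8 quantization, illustrating why
# reduced precision is tolerable: the round-trip error is bounded by
# half the quantization step while storage shrinks 4x versus fp32.

def quantize_int8(x: np.ndarray):
    scale = np.max(np.abs(x)) / 127.0  # map the fp32 range onto [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # synthetic weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by ~scale/2
print("bytes: fp32 =", w.nbytes, ", int8 =", q.nbytes)
```

The 4x storage reduction translates directly into more weights per byte of on-chip memory and per unit of DRAM bandwidth, which is where the energy savings actually accrue.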

@fig-accelerator-anatomy (Line 455)

The host interface establishes connectivity between the specialized accelerator and the broader computing system, enabling coordination between general-purpose CPUs that manage program control flow and the accelerator that executes computationally intensive neural network operations. This architectural partitioning reflects specialization at the system level: CPUs address control flow, conditional logic, and system coordination, while accelerators focus on the regular, massively parallel arithmetic operations that dominate neural network execution. @fig-accelerator-anatomy shows how specialized compute units, hierarchical memory subsystems, and host connectivity integrate to form a system optimized for AI workloads. ::: {#fig-accelerator-anatomy fig-env="figure" fig-pos="htb" fig-cap="Anatomy of a Modern AI Accelerator: AI accelerators integrate specialized processing elements containing tensor cores, vector units, and special function units, supported by a hierarchical memory system from high-bandwidth memory down to local caches. This architecture maximizes data reuse and parallel execution while minimizing energy-intensive data movement, forming the foundation for 100-1000× performance improvements over general-purpose processors." fig-alt="Block diagram showing AI accelerator architecture: CPU connects to DRAM stacks and processing element grid containing tensor cores, vector units, and local caches in hierarchical arrangement."}

@fig-accelerator-anatomy (Line 634)

AI Compute Primitives

The accelerator architecture presented in @fig-accelerator-anatomy raises an immediate question: why these specific components? The tensor cores, vector units, and hierarchical memory exist not by accident but because neural network computations repeatedly invoke a small set of operations. Understanding these computational patterns, which we call compute primitives, reveals why specialized hardware achieves 100-1000x improvements over general-purpose processors. The transition from CPUs achieving approximately 100 GFLOPS to accelerators delivering 100,000+ GFLOPS reflects architectural optimization for these specific patterns, which appear repeatedly across all neural network architectures regardless of application domain or model size. These patterns manifest in a small number of core computational operations. Regardless of the layer type, whether fully connected, convolutional, or attention-based layers, the underlying operation typically involves multiplying input values by learned weights and accumulating the results. This repeated multiply-accumulate process dominates neural network execution and defines the arithmetic foundation of AI workloads. The regularity and frequency of these operations have led to the development of AI compute primitives: hardware-level abstractions optimized to execute these core computations with high efficiency.

@fig-ai-performance (Line 1081)

@fig-sparse-formats (Line 1111)

  1. Result: This effectively doubles the FLOPs/byte ratio, providing a theoretical 2$\times$ speedup over dense matrix multiplication with minimal accuracy loss, provided the model is fine-tuned to respect the 2:4 constraint. To understand why "Structured" patterns are required for hardware speedup, consider how sparse matrices are actually stored in memory. As shown in @fig-sparse-formats, a sparse format (like CSR or Block Sparse) must store indices alongside values. If the sparsity is random, the index overhead and irregular access kill performance. Structured sparsity—whether at the Large Block scale or the Fine-Grained N:M scale—makes this indexing predictable and compact, allowing hardware to fetch data efficiently. ::: {#fig-sparse-formats fig-env="figure" fig-pos="htb" fig-cap="Sparse Storage Formats: Hardware efficiency depends on how sparse matrices are stored. Dense storage (top left) is simple but wasteful for zeros. Block Sparse (top right) and CSR (bottom) compress the matrix by storing only non-zero values and their indices. Structured sparsity (like N:M or Blocks) makes this indexing predictable, allowing hardware to fetch data and skip zeros efficiently." fig-alt="Grid of 3x3 matrix blocks. Top left: Dense Matrix. Top right: Block Sparse Matrix showing dense sub-blocks. Bottom: Sparse Matrix (CSR) and Block Sparse (BSR) representations showing values and index arrays."}
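The CSR layout and the 2:4 constraint can both be made concrete in a few lines. A minimal sketch (the example matrix is arbitrary; real kernels use library CSR implementations such as `scipy.sparse`):

```python
import numpy as np

# Sketch: CSR storage (values + column indices + row pointers) and a
# 2:4 structured-sparsity check (<= 2 non-zeros per group of 4 in a row).

def to_csr(m: np.ndarray):
    values, col_idx, row_ptr = [], [], [0]
    for row in m:
        for j, v in enumerate(row):
            if v != 0:
                values.append(int(v))
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's non-zeros
    return values, col_idx, row_ptr

def is_2_4_sparse(m: np.ndarray) -> bool:
    for row in m:
        for g in range(0, len(row), 4):
            if np.count_nonzero(row[g:g + 4]) > 2:
                return False
    return True

m = np.array([[0, 5, 0, 3],
              [7, 0, 0, 2]])
vals, cols, ptr = to_csr(m)
print(vals, cols, ptr)   # [5, 3, 7, 2] [1, 3, 0, 3] [0, 2, 4]
print(is_2_4_sparse(m))  # True
```

With random sparsity, `col_idx` is unpredictable and every fetch is a scattered read; the 2:4 pattern guarantees exactly where the (at most) two non-zeros per group can sit, so the hardware index becomes a fixed 2-bit mask.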

@fig-systolic-array (Line 1519)

A systolic array arranges processing elements in a grid pattern, where data flows rhythmically between neighboring units in a synchronized manner, enabling each operand to participate in multiple computations as it propagates through the array. This structured movement minimizes external memory accesses by maximizing local data reuse. A single weight value can contribute to dozens of operations as it moves through the processing elements, transforming the energy profile from memory-bound to compute-efficient execution. Kung and Leiserson [@kung1979systolic] first introduced systolic arrays, formalizing their use in parallel computing architectures for efficient matrix operations [@Kung1982]. Unlike general-purpose execution units, systolic arrays exploit spatial and temporal locality by reusing operands as they propagate through the grid. Google's TPU exemplifies this architectural approach: in the TPUv4, a 128$\times$128 systolic array of multiply-accumulate units processes matrix operations by streaming data through the array in a pipelined manner [@jouppi2017datacenter]. @fig-systolic-array captures this dataflow architecture.
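The dataflow can be modeled as a wavefront schedule. A minimal sketch of an output-stationary array (it models only the timing of which operands meet at which processing element, not the physical register transfers; matrix sizes are illustrative):

```python
import numpy as np

# Sketch: wavefront schedule of an output-stationary systolic array.
# With skewed input streams, PE(i, j) receives the operand pair
# A[i, k] and B[k, j] at cycle t = i + j + k, so each A row element
# sweeps rightward across a row of PEs and each B column element
# sweeps downward, giving the operand reuse described in the text.

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)   # one accumulator per PE
    for t in range(M + N + K - 2):        # total pipeline cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j             # operand pair arriving this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
assert np.array_equal(systolic_matmul(A, B), A @ B)
```

The `M + N + K - 2` cycle count shows the pipeline cost: results drain over a diagonal wavefront rather than appearing all at once, which is why keeping the array continuously fed matters.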

@fig-energy-hierarchy (Line 1798)

::: The underlying cause of this wall is physical: moving data costs orders of magnitude more energy than processing it. As visualized in @fig-energy-hierarchy, the "Horowitz Numbers" reveal the fundamental energy constants of silicon.
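The energy accounting behind this wall can be sketched numerically. The constants below are approximate 45 nm "Horowitz-style" figures in picojoules and should be read as order-of-magnitude assumptions, not datasheet values:

```python
# Sketch: order-of-magnitude energy accounting for a small layer.
# Constants are illustrative ~45 nm figures (pJ per 32-bit op/access).
E_MAC_PJ  = 4.6    # fp32 multiply (~3.7 pJ) + add (~0.9 pJ)
E_SRAM_PJ = 5.0    # small on-chip SRAM read
E_DRAM_PJ = 640.0  # off-chip DRAM read

macs = 1_000_000        # MACs in a small layer (assumed)
operands = 3 * macs     # two inputs + one partial sum per MAC (naive count)

naive  = macs * E_MAC_PJ + operands * E_DRAM_PJ  # every operand from DRAM
cached = macs * E_MAC_PJ + operands * E_SRAM_PJ  # operands staged on-chip
print(f"DRAM-resident: {naive / 1e6:8.1f} uJ")
print(f"SRAM-staged:   {cached / 1e6:8.1f} uJ")
print(f"data movement dominates by {naive / cached:.0f}x")
```

Even in this crude model, arithmetic is a rounding error next to DRAM traffic, which is exactly why the architectures above spend their transistor budget on memory hierarchy rather than more ALUs.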

@fig-compute-memory-imbalance (Line 1825)

@fig-rising-ridge (Line 1888)

The impact of this divergence is quantified in @fig-rising-ridge, which shows how the hardware's "Ridge Point" has skyrocketed, making sparse operations increasingly inefficient.

@fig-memory-wall (Line 1958)

Compute time $T_{\text{compute}}$ scales with the number of floating-point operations (FLOPs) divided by the peak hardware throughput, $P_{\text{peak}}$. When $T_{\text{mem}} > T_{\text{compute}}$, the system becomes memory-bound, meaning that the processing elements spend more time waiting for data than performing computations. This imbalance demonstrates the need for memory-optimized architectures and efficient data movement strategies to sustain high performance. @fig-memory-wall quantifies this disparity for specific models and hardware generations, showing how model parameter counts have outpaced memory bandwidth improvements. The gap between these curves, from AlexNet to trillion-parameter models, represents the engineering challenge that drives accelerator memory system design.
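This memory-bound versus compute-bound test reduces to a one-line comparison. A minimal sketch (the bandwidth and peak-throughput figures are illustrative assumptions, as are the kernel FLOP/byte counts):

```python
# Sketch: classify a kernel as memory- or compute-bound by comparing
# T_mem = bytes / bandwidth against T_compute = FLOPs / P_peak.
# Hardware numbers below are illustrative assumptions.

P_PEAK_FLOPS = 100e12  # 100 TFLOP/s peak throughput
BW_BYTES     = 2e12    # 2 TB/s memory bandwidth
ridge = P_PEAK_FLOPS / BW_BYTES  # FLOPs/byte needed to stay compute-bound

def bound(flops: float, bytes_moved: float) -> str:
    t_mem = bytes_moved / BW_BYTES
    t_compute = flops / P_PEAK_FLOPS
    return "memory-bound" if t_mem > t_compute else "compute-bound"

print("ridge point:", ridge, "FLOPs/byte")
# Matrix-vector (batch 1): ~2 FLOPs per byte of weights streamed in.
print("GEMV:", bound(flops=2e9, bytes_moved=1e9))
# Large tiled GEMM: each byte is reused across many output elements.
print("tiled GEMM:", bound(flops=1e12, bytes_moved=1e9))
```

The ridge point of 50 FLOPs/byte in this sketch explains the asymmetry: single-sample inference (GEMV-shaped) sits far below it and stalls on bandwidth, while batched training (GEMM-shaped) can sit above it and saturate the compute units.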

Irregular Memory Access

@fig-host-accelerator-data-movement (Line 2174)

Machine learning accelerators, such as GPUs and TPUs, achieve high computational throughput through parallel execution. However, their efficiency is often constrained by data movement between the host (CPU) and accelerator memory. Compared to many traditional workloads that keep most data within a single memory domain, AI workloads can require frequent transfers between CPU memory and accelerator memory, introducing latency, consuming bandwidth, and affecting overall performance. Host-accelerator data movement follows a structured sequence. Before computation begins, data is copied from CPU memory to the accelerator's memory. The CPU then issues execution instructions, and the accelerator processes the data in parallel. Once computation completes, the results are stored in accelerator memory and transferred back to the CPU. @fig-host-accelerator-data-movement details these sequential steps, each introducing potential inefficiencies that must be managed to optimize performance. ::: {#fig-host-accelerator-data-movement fig-env="figure" fig-pos="htb" fig-cap="Host-Accelerator Data Transfer: AI workloads require frequent data movement between CPU memory and accelerators. The four sequential steps of copying input data, issuing execution instructions, parallel computation, and transferring results each introduce potential performance bottlenecks." fig-alt="Four-step data flow diagram: (1) copy data from main memory to GPU memory, (2) CPU instructs GPU, (3) GPU executes in parallel, (4) results copy back to main memory."}
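A back-of-envelope model shows when the transfer steps dominate the four-step sequence. A minimal sketch (the link bandwidth, accelerator throughput, and payload sizes are all illustrative assumptions for a PCIe-class system):

```python
# Sketch: estimate the four-step host-accelerator sequence:
# (1) copy inputs in, (2) launch, (3) compute, (4) copy results out.
# All hardware figures below are illustrative assumptions.

PCIE_BYTES_S = 32e9     # host <-> device link bandwidth (bytes/s)
DEVICE_FLOPS = 100e12   # accelerator peak throughput

def step_times(in_bytes: float, out_bytes: float, flops: float):
    t_in  = in_bytes / PCIE_BYTES_S   # step 1: copy inputs to device
    t_run = flops / DEVICE_FLOPS      # steps 2-3: launch + parallel compute
    t_out = out_bytes / PCIE_BYTES_S  # step 4: copy results back
    return t_in, t_run, t_out

t_in, t_run, t_out = step_times(in_bytes=1e9, out_bytes=1e8, flops=2e12)
total = t_in + t_run + t_out
print(f"copy-in {t_in*1e3:.1f} ms, compute {t_run*1e3:.1f} ms, "
      f"copy-out {t_out*1e3:.1f} ms")
print(f"transfer share of total: {100 * (t_in + t_out) / total:.0f}%")
```

In this configuration the transfers outweigh the computation itself, which is why production systems overlap copies with kernels, pin host memory, and batch work to amortize each crossing of the link.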

@fig-systolic-array (Line 2894)

Memory Allocation. While computation placement determines where operations execute, memory allocation defines where data resides and how it flows through the memory hierarchy during execution. The primary goal is to keep frequently accessed data as close as possible to the processing elements, minimizing latency and power consumption. GPUs achieve this through a mix of global memory, shared memory, and registers with careful tiling strategies [@nvidia2020ampere]. TPUs use on-chip SRAM scratchpads where activations and weights must be preloaded to sustain systolic array execution (@fig-systolic-array), with weights streamed in perfect synchronization with input activations to maintain pipelined computation flow [@jouppi_tpu_2017]. Wafer-scale processors demand sophisticated memory partitioning to avoid excessive interconnect traffic [@Cerebras2021]. Unlike general-purpose computing, where caches abstract memory management, AI accelerators require explicit data placement strategies because poor allocation leads to three compounding penalties: increased memory latency when data must be fetched from higher-latency tiers, higher power consumption from off-chip accesses that cost orders of magnitude more energy than on-chip storage, and reduced computational throughput when processing elements stall waiting for data. The severity of these penalties varies by workload. CNNs rely on structured, localized access patterns and benefit from well-defined memory layouts that facilitate predictable reuse [@chen2016eyeriss]. Transformer models require frequent access to large parameter sets and intermediate activations, making them highly sensitive to memory bandwidth constraints. GNNs introduce the greatest challenge, as their irregular and sparse data structures produce unpredictable access patterns that resist static allocation strategies. @tbl-memory-allocation summarizes these allocation challenges. 
As model sizes continue to grow, accelerators must dynamically manage memory resources rather than relying on static allocation schemes, and memory capacity increasingly dictates how large a model can be deployed on a given accelerator.

@fig-tiling-diagram (Line 3389)

Each iteration requires loading elements from matrices A and B multiple times from memory, causing excessive data movement. As the size of the matrices increases, the memory bottleneck worsens, limiting performance. Tiling addresses this problem by ensuring that smaller portions of matrices are loaded into fast memory, reused efficiently, and only written back to main memory when necessary. This technique is especially important in AI accelerators, where memory accesses dominate execution time. @fig-tiling-diagram visualizes how breaking large matrices into smaller tiles enables computation to proceed efficiently by maximizing data reuse in fast memory. The following sections examine the principles of tiling, its different strategies, and the key trade-offs involved in selecting an effective tiling approach. ::: {#fig-tiling-diagram fig-env="figure" fig-pos="htb" fig-cap="Matrix Tiling: Partitioning large matrices into smaller tiles optimizes data reuse and reduces memory access overhead during computation. This technique improves performance on AI accelerators by enabling efficient loading and processing of data in fast memory, minimizing transfers from slower main memory." fig-alt="Three matrices A, B, C with highlighted tiles showing how matrix multiplication partitions into smaller blocks. Dimensions labeled M, N, K with corresponding tile sizes Mtile, Ntile, Ktile."}
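The tiling idea maps directly onto a blocked loop nest. A minimal sketch (tile sizes and matrix shapes are illustrative; production kernels choose tiles to fit the actual on-chip capacity):

```python
import numpy as np

# Sketch: tiled matrix multiply. Each (Mt x Kt) block of A and
# (Kt x Nt) block of B is loaded once and reused for a whole tile of C,
# cutting traffic to slow memory. Tile sizes are illustrative.

def tiled_matmul(A: np.ndarray, B: np.ndarray, Mt=2, Nt=2, Kt=2) -> np.ndarray:
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, Mt):
        for j0 in range(0, N, Nt):
            for k0 in range(0, K, Kt):
                # These small blocks are what would sit in fast on-chip memory.
                C[i0:i0+Mt, j0:j0+Nt] += (
                    A[i0:i0+Mt, k0:k0+Kt] @ B[k0:k0+Kt, j0:j0+Nt]
                )
    return C

A = np.arange(16, dtype=np.int64).reshape(4, 4)
B = np.arange(16, dtype=np.int64).reshape(4, 4)
assert np.array_equal(tiled_matmul(A, B), A @ B)
```

The result is bit-identical to the untiled product; only the order of memory traffic changes, with each fetched tile amortized over `Mt * Nt` output elements.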


introduction/introduction.qmd

@fig-ai-timeline (Line 192)

@fig-alexnet (Line 450)

Deep Learning Era: The Infrastructure Bottleneck

Deep learning (2012 to present) removed the human feature engineering requirement. Neural networks learn representations directly from raw data (pixels, audio waveforms), enabling "end-to-end" learning. This shift was unlocked not by new algorithms (CNNs existed in the 1980s) but by Systems Co-design. The 2012 AlexNet breakthrough [@krizhevsky2012imagenet], illustrated in @fig-alexnet, occurred because algorithmic structure (parallel matrix operations) matched hardware capabilities (GPUs). ::: {#fig-alexnet fig-env="figure" fig-pos="htb" fig-cap="AlexNet Architecture. The network that launched the deep learning revolution at ImageNet 2012. Two parallel GPU streams process 224x224 input images through convolutional layers (green blocks) that extract spatial features at decreasing resolutions, converging through three fully connected layers to 1,000 output classes. With 60 million parameters trained across two GTX 580 GPUs, AlexNet achieved 15.3% top-5 error, a 42% relative improvement over the second-place entry." fig-alt="3D diagram of AlexNet with two parallel GPU streams. Green blocks show convolutional layers decreasing from 224x224 input. Red kernels overlay green blocks. Right side shows three dense layers converging to 1000 outputs."}

@fig-ai-triad (Line 970)

::: These three components form the AI Triad: Data, Algorithm, and Machine. We formalize this interaction as the DAM Taxonomy, a diagnostic framework for identifying bottlenecks and understanding system constraints. @fig-ai-triad illustrates how these three components form an interdependent system. Each element shapes the possibilities of the others. The algorithm dictates both the computational demands for training and inference and the volume and structure of data required for effective learning. The data's scale and complexity influence what machines are needed for storage and processing while determining which algorithms are feasible. The machine's capabilities establish practical limits on both model scale and data processing capacity, creating a framework within which the other components must operate. ::: {#fig-ai-triad fig-env="figure" fig-pos="htb" fig-cap="The AI Triad. A triangle diagram with three vertices, Data, Algorithm, and Machine, connected by bidirectional arrows. Each node depicts its domain (database cylinders, neural network graph, and cloud infrastructure), illustrating how limitations in any one component constrain the capabilities of the others." fig-alt="Triangle diagram with three circles at vertices labeled Data, Algorithm, and Machine. Double-headed purple arrows connect all three nodes, showing bidirectional dependencies. Icons inside circles depict neural network, database cylinders, and cloud."}

@fig-evolution-efficiency (Line 1359)

  • Compute Efficiency (Physics): Maximizes hardware utilization by aligning the "Algorithm" logic with the "Machine" physics. This dimension encompasses the evolution from general-purpose CPUs to specialized accelerators (GPUs, TPUs) and the hardware-software co-design principles that translate theoretical TFLOPS into real-world speedups. Together, these dimensions provide the engineering tools to overcome the Data, Algorithm, and Machine walls that pure scaling alone cannot address. @fig-evolution-efficiency traces how these three pillars have co-evolved historically, each progressing through distinct eras at different rates. ::: {#fig-evolution-efficiency fig-env="figure" fig-pos="htb" fig-cap="Historical Efficiency Trends. A three-track timeline from 1980 to 2023 shows parallel progress in Algorithmic Efficiency (blue), Compute Efficiency (yellow), and Data Selection (green). Each track progresses through distinct eras: algorithms advance from early methods through deep learning to modern efficiency techniques; compute evolves from general-purpose CPUs through accelerated hardware to sustainable computing; data practices shift from scarcity through big data to data-centric AI." fig-alt="Timeline with three horizontal tracks from 1980 to 2023. Blue track shows Algorithmic Efficiency progressing through Deep Learning Era to Modern Efficiency. Yellow shows Compute Efficiency from General-Purpose through Accelerated to Sustainable Computing. Green shows Data Selection from Scarcity through Big Data to Data-Centric AI."}

@fig-algo-efficiency (Line 1459)

@fig-algo-efficiency visualizes the algorithmic efficiency trajectory, tracing how innovations from AlexNet through EfficientNet achieved the 44× improvement.
::: {#fig-algo-efficiency fig-env="figure" fig-pos="htb" fig-cap="**Algorithmic Efficiency Trajectory.** Training efficiency factor relative to AlexNet (2012 baseline) for ImageNet classification. Each point represents a model architecture that achieves comparable accuracy with fewer computational resources. The trajectory from AlexNet (1x) through VGG, ResNet, MobileNet, and ShuffleNet to EfficientNet (44x) demonstrates that algorithmic innovation has delivered a 44-fold reduction in required compute over eight years, independent of hardware improvements." fig-alt="Scatter plot showing training efficiency factor from 2012 to 2020. Red dots mark models from AlexNet at 1x to EfficientNet at 44x. Dashed trend line curves upward. Labels identify VGG, ResNet, MobileNet, ShuffleNet versions at their positions."}

@fig-ai-training-compute-growth (Line 1569)

::: @fig-ai-training-compute-growth shows the countervailing trend: the exponential growth in compute demand that makes efficiency optimization essential rather than optional.

@fig-algo-efficiency (Line 1573)

Taken together, @fig-algo-efficiency and @fig-ai-training-compute-growth reveal a seeming contradiction that defines the economics of modern AI development. ::: {.callout-perspective title="The Efficiency Paradox"}

@fig-ai-training-compute-growth (Line 1573)

Taken together, @fig-algo-efficiency and @fig-ai-training-compute-growth reveal a seeming contradiction that defines the economics of modern AI development. ::: {.callout-perspective title="The Efficiency Paradox"}

@fig-algo-efficiency (Line 1576)

::: {.callout-perspective title="The Efficiency Paradox"} Together, @fig-algo-efficiency and @fig-ai-training-compute-growth reveal a paradox central to ML systems engineering: algorithmic efficiency improved 44× while compute usage grew 300,000×. Efficiency gains enabled larger experiments, which demanded more compute, which motivated further efficiency research. This feedback loop, where efficiency enables scale and scale demands efficiency, defines the modern AI engineering landscape. Understanding this dynamic is essential for making informed decisions about where to invest optimization effort. :::

@fig-ai-training-compute-growth (Line 1576)

::: {.callout-perspective title="The Efficiency Paradox"} Together, @fig-algo-efficiency and @fig-ai-training-compute-growth reveal a paradox central to ML systems engineering: algorithmic efficiency improved 44× while compute usage grew 300,000×. Efficiency gains enabled larger experiments, which demanded more compute, which motivated further efficiency research. This feedback loop, where efficiency enables scale and scale demands efficiency, defines the modern AI engineering landscape. Understanding this dynamic is essential for making informed decisions about where to invest optimization effort. :::

@fig-ml_lifecycle_overview (Line 1617)

Machine learning systems depart from this paradigm. While traditional systems execute explicit programming logic, ML systems derive their behavior from data patterns discovered through training. This shift from code to data as the primary behavior driver introduces complexities that existing software engineering practices cannot address. We address these challenges and specialized workflows in @sec-ai-development-workflow. Unlike traditional software's linear progression from design through deployment, ML systems operate in continuous cycles. @fig-ml_lifecycle_overview illustrates this iterative pattern, where performance degradation triggers data collection, which feeds model training, evaluation, and redeployment. ::: {#fig-ml_lifecycle_overview fig-env="figure" fig-pos="htb" fig-cap="ML System Lifecycle. A six-box flowchart depicting Data Collection, Preparation, Model Training, Evaluation, Deployment, and Monitoring. Two feedback loops distinguish this cycle from linear software development: evaluation returns to preparation when results are insufficient, and monitoring triggers new data collection when performance degrades." fig-alt="Flowchart showing cyclical ML lifecycle. Six boxes: Data Collection, Preparation, Model Training, Evaluation, Deployment, Monitoring. Two loops: evaluation returns to preparation; monitoring triggers collection."}

@fig-pillars (Line 1773)

The challenges we have explored, from silent performance degradation and data drift to model complexity and ethical concerns, reveal why ML systems engineering has emerged as a distinct discipline. Traditional software engineering practices cannot address systems that degrade quietly rather than failing visibly. These challenges require systematic engineering practices spanning the entire system lifecycle, from initial data collection through continuous operation and evolution. This work organizes ML systems engineering around five interconnected disciplines that directly address the challenge categories we have identified. @fig-pillars visualizes these pillars, which represent the core engineering capabilities required to bridge the gap between research prototypes and production systems capable of operating reliably at scale. While these pillars organize the practice of ML engineering, they are supported by the foundational technical imperatives of Performance Optimization and Hardware Acceleration (covered in Part III), which provide the efficiency required to make large-scale training and deployment economically and physically viable. Five-Pillar Framework. Five labeled columns represent Data Engineering, Training Systems, Deployment Infrastructure, Operations and Monitoring, and Ethics and Governance. The pillars rest on a shared foundation labeled Performance Optimization and Hardware Acceleration, indicating the technical imperatives that support all five disciplines.{#fig-pillars fig-alt="Five pillars diagram: Data Engineering, Training Systems, Deployment Infrastructure, Operations and Monitoring, Ethics and Governance. Pillars rest on foundation labeled Performance Optimization and Hardware Acceleration."}


ml_systems/ml_systems.qmd

@fig-systems-gap (Line 55)

::: The following plot (@fig-systems-gap) visualizes this challenge as the Systems Gap: the divergence between what models demand and what hardware naturally provides.

@fig-cloud-edge-TinyML-comparison (Line 100)

TinyML pushes intelligence to the smallest devices—microcontrollers costing dollars and consuming milliwatts. When you need always-on sensing (wake-word detection, motion recognition) that runs for months on a coin-cell battery, TinyML is the only viable option. Why do these four paradigms exist? The answer lies not in engineering choices but in physical laws that no amount of optimization can overcome. The four paradigms span a continuous spectrum from centralized cloud infrastructure to distributed ultra-low-power devices, as @fig-cloud-edge-TinyML-comparison illustrates. ::: {#fig-cloud-edge-TinyML-comparison fig-env="figure" fig-pos="t" fig-cap="Distributed Intelligence Spectrum: Machine learning deployment spans from centralized cloud infrastructure to resource-constrained TinyML devices, each balancing processing location, device capability, and network dependence. Source: [@abiresearch2024tinyml]." fig-alt="Horizontal spectrum showing 5 deployment tiers from left to right: ultra-low-power devices and sensors, intelligent device, gateway, on-premise servers, and cloud. Arrows indicate TinyML, Edge AI, and Cloud AI spans across the spectrum."}

@fig-cloud-ml (Line 697)

::: @fig-cloud-ml breaks down Cloud ML across several key dimensions that define its computational paradigm. ::: {#fig-cloud-ml fig-env="figure" fig-pos="t" fig-cap="Cloud ML Decomposition. Characteristics, benefits, challenges, and representative applications of cloud machine learning, where centralized infrastructure and specialized hardware address scale, complexity, and resource management for large datasets and complex computations." fig-alt="Tree diagram with Cloud ML branching to four categories: Characteristics, Benefits, Challenges, and Examples. Each lists items like computational power, scalability, vendor lock-in, and virtual assistants."}

@fig-cloudml-example (Line 797)

Cloud Infrastructure and Scale

Cloud ML aggregates computational resources in data centers at unprecedented scale. @fig-cloudml-example captures Google's Cloud TPU data center, exemplifying the massive infrastructure that enables petaflop-scale training. @tbl-representative-systems quantifies how cloud systems provide orders-of-magnitude more compute and memory bandwidth than mobile devices, at correspondingly higher power and operational cost. Modern cloud accelerator systems operate at petaflops to exaflops of peak reduced-precision throughput and require megawatt-scale facility power when deployed in large clusters. These data center facilities enable computational workloads that are impractical on resource-constrained devices. The remote location of cloud resources introduces critical trade-offs: network round-trip latency of 100--500 ms eliminates real-time applications, while operational costs scale linearly with usage.

@fig-edge-ml (Line 926)

::: @fig-edge-ml breaks down edge deployment across four operational dimensions: characteristics that define the paradigm, benefits that justify adoption, challenges that constrain implementation, and representative applications that demonstrate real-world value. ::: {#fig-edge-ml fig-env="figure" fig-pos="t" fig-cap="Edge ML Decomposition. Characteristics, benefits, challenges, and representative applications of edge machine learning, where decentralized processing on nearby hardware reduces latency and network dependence at the cost of constrained compute and memory." fig-alt="Tree diagram with Edge ML branching to four categories: Characteristics, Benefits, Challenges, and Examples, listing items like decentralized processing, reduced latency, security concerns, and industrial IoT."}

@fig-edgeml-example (Line 1060)

Edge ML Benefits and Deployment Challenges

Distributed Processing Architecture. Edge ML spans wearables, industrial sensors, and smart home appliances that process data locally without depending on central servers. @fig-edgeml-example illustrates this diversity of edge devices that occupy the middle ground between cloud systems and mobile devices in computational resources, power consumption, and cost. Memory bandwidth at 25--100 GB/s enables models requiring 100 MB--1 GB parameters. The optimization techniques covered in @sec-model-compression achieve 2--4$\times$ speedup compared to cloud models. Local processing eliminates network round-trip latency, enabling <100 ms response times while generating substantial bandwidth savings: processing 1000 camera feeds locally avoids 1 Gbps uplink costs and reduces cloud expenses by $10,000--100,000 annually.
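The bandwidth arithmetic behind the camera-feed example can be checked directly. A minimal sketch (the per-feed bitrate and per-gigabyte cloud cost are illustrative assumptions):

```python
# Sketch: bandwidth and cost avoided by processing camera feeds locally.
# Per-feed bitrate and cloud cost per GB are illustrative assumptions.

FEEDS         = 1000
MBPS_PER_FEED = 1.0    # ~1 Mbps per compressed camera stream (assumed)
USD_PER_GB    = 0.01   # assumed blended cloud ingest/processing cost

uplink_gbps = FEEDS * MBPS_PER_FEED / 1000          # aggregate uplink
mb_per_s = FEEDS * MBPS_PER_FEED / 8                # Mbps -> MB/s
gb_per_year = mb_per_s * 3600 * 24 * 365 / 1000     # MB/s -> GB/year
print(f"avoided uplink: {uplink_gbps:.1f} Gbps")
print(f"avoided cloud cost: ${gb_per_year * USD_PER_GB:,.0f}/year")
```

Under these assumptions the aggregate stream saturates a 1 Gbps uplink and the avoided cloud cost lands in the tens of thousands of dollars per year, consistent with the range quoted above.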

@fig-mobile-ml (Line 1188)

::: @fig-mobile-ml contrasts the unique characteristics of mobile deployment against its enabling benefits and inherent challenges: on-device processing and sensor integration enable real-time responsiveness and enhanced privacy, while limited computational resources and battery constraints demand aggressive model optimization. ::: {#fig-mobile-ml fig-env="figure" fig-pos="t" fig-cap="Mobile ML Decomposition. Characteristics, benefits, challenges, and representative applications of mobile machine learning, where on-device processing and hardware acceleration balance computational efficiency, battery life, and model performance on smartphones and tablets." fig-alt="Tree diagram with Mobile ML branching to four categories: Characteristics, Benefits, Challenges, and Examples. Each lists items like on-device processing, real-time response, battery constraints, and voice recognition."}

@fig-TinyML-example (Line 1360)

Where mobile ML requires sophisticated hardware with gigabytes of memory and multi-core processors, Tiny Machine Learning operates on microcontrollers with kilobytes of RAM and single-digit dollar price points. This radical constraint forces a shift in machine learning deployment, prioritizing ultra-low power consumption and minimal cost over computational sophistication. TinyML brings intelligence to the smallest devices, from microcontrollers to embedded sensors, enabling real-time computation in severely resource-constrained environments [@banbury2021mlperftiny]. This paradigm excels in applications requiring ubiquitous sensing, autonomous operation, and maximal energy efficiency. TinyML systems power applications such as predictive maintenance, environmental monitoring, and simple gesture recognition while optimized for energy efficiency[^fn-energy-efficiency], often running for months or years on limited power sources such as coin-cell batteries[^fn-coin-cell], as exemplified by the device kits in @fig-TinyML-example. These systems deliver actionable insights in remote or disconnected environments where power, connectivity, and maintenance access are impractical.

@fig-tiny-ml (Line 1401)

::: @fig-tiny-ml maps TinyML's unique position at the resource-constrained extreme of the deployment spectrum, where milliwatt power budgets and kilobyte memory limits transform algorithmic possibilities: the paradigm achieves extremely low latency and always-on operation but demands specialized model compression techniques that fundamentally reshape what ML can accomplish. ::: {#fig-tiny-ml fig-env="figure" fig-pos="t" fig-cap="TinyML Decomposition. Characteristics, benefits, challenges, and representative applications of TinyML, where milliwatt power budgets and kilobyte memory limits enable always-on sensing and localized intelligence in embedded applications." fig-alt="Tree diagram with TinyML branching to four categories: Characteristics, Benefits, Challenges, and Examples, listing items like low-power operation, always-on capability, resource limitations, and predictive maintenance."}

@fig-op_char (Line 1546)

These quantitative trade-offs map directly to the Workload Archetypes introduced earlier. Compute Beasts and Sparse Scatter workloads naturally gravitate toward cloud deployment, where raw TFLOPS and memory capacity are abundant. Bandwidth Hogs span cloud and edge depending on latency requirements — cloud for batch processing, edge for interactive applications. Tiny Constraint workloads are exclusively TinyML, where the joules-per-inference metric dominates all other considerations. Mobile deployment occupies the middle ground, hosting efficiency-optimized variants of Compute Beast workloads (e.g., MobileNet instead of ResNet) that trade peak performance for sustainable power consumption. @fig-op_char visualizes performance and operational characteristics across paradigms through radar plots. Plot a) contrasts compute power and scalability where Cloud ML excels against latency and energy efficiency where TinyML dominates, with Edge and Mobile ML occupying intermediate positions. ::: {#fig-op_char fig-env="figure" fig-pos="t" fig-cap="Paradigm Comparison Radar Plots. Two radar plots quantify performance and operational characteristics across cloud, edge, mobile, and TinyML paradigms. The left plot contrasts compute power, latency, scalability, and energy efficiency; the right plot contrasts connectivity, privacy, real-time capability, and offline operation." fig-alt="Two radar plots with four overlapping polygons each. Left plot axes: compute power, latency, scalability, energy. Right plot axes: connectivity, privacy, real-time, offline capability."}

@fig-mlsys-playbook-flowchart (Line 1650)

Decision Framework

Selecting the appropriate deployment paradigm requires systematic evaluation of application constraints rather than organizational biases or technology trends. @fig-mlsys-playbook-flowchart provides a hierarchical decision framework that filters options through critical requirements: privacy, latency, computational demands, and cost constraints. ::: {#fig-mlsys-playbook-flowchart fig-env="figure" fig-pos="t" fig-cap="Deployment Decision Logic: This flowchart guides selection of an appropriate machine learning deployment paradigm by systematically evaluating privacy requirements and processing constraints, ultimately balancing performance, cost, and data security. Navigating the decision tree helps practitioners determine whether cloud, edge, mobile, or tiny machine learning best suits a given application." fig-alt="Decision flowchart with four layers: Privacy, Performance, Compute Needs, and Cost. Each layer filters toward deployment options: Cloud ML, Edge ML, Mobile ML, or TinyML based on constraints."}
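The flowchart's filtering order (privacy, then latency, then compute, then cost) can be sketched as a small function. This is an illustrative reading of the decision tree, not code from the book: the function name, parameter names, and every threshold value (100 ms, 50 GFLOPS, 0.1 GFLOPS) are assumptions chosen for the sketch.

```python
def select_paradigm(needs_local_privacy: bool,
                    latency_budget_ms: float,
                    compute_gflops: float,
                    battery_powered: bool) -> str:
    """Toy paradigm selector mirroring the privacy -> latency -> compute
    filtering order of the flowchart. Thresholds are illustrative only."""
    if not needs_local_privacy and latency_budget_ms > 100:
        return "Cloud ML"    # abundant compute; a network round-trip is acceptable
    if compute_gflops > 50:
        return "Edge ML"     # too heavy for handsets, too latency-sensitive for cloud
    if battery_powered and compute_gflops < 0.1:
        return "TinyML"      # milliwatt budget, always-on sensing
    return "Mobile ML"       # on-device, efficiency-optimized models

print(select_paradigm(False, 500, 200, False))  # batch analytics -> Cloud ML
print(select_paradigm(True, 10, 100, False))    # private, real-time -> Edge ML
print(select_paradigm(True, 5, 0.01, True))     # wake-word sensing -> TinyML
```

In practice such logic is a starting point for discussion, not a substitute for measuring the actual latency, power, and cost profiles of candidate deployments.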

@fig-mlsys-playbook-flowchart (Line 1819)

Hybrid Architectures: Combining Paradigms

The decision framework (@fig-mlsys-playbook-flowchart) helps select the best single paradigm for a given application. In practice, however, production systems rarely use just one paradigm. Voice assistants combine TinyML wake-word detection with mobile speech recognition and cloud natural language understanding. Autonomous vehicles pair edge inference for real-time perception with cloud training for model updates. These hybrid architectures leverage the strengths of multiple paradigms while mitigating their individual weaknesses. This section formalizes the integration strategies that make such combinations effective. ::: {.callout-definition title="Hybrid ML"}

@fig-hybrid (Line 1864)

Production System Integration

Real-world implementations integrate multiple design patterns into cohesive solutions. @fig-hybrid illustrates key interactions through specific connection types: "Deploy" paths show how models flow from cloud training to various devices, "Data" and "Results" show information flow from sensors through processing stages, and "Sync" demonstrates device coordination. Data generally flows upward from sensors through processing layers to cloud analytics, while model deployments flow downward from cloud training to inference points. ::: {#fig-hybrid fig-env="figure" fig-pos="t" fig-cap="Hybrid System Interactions: Data flows upward from sensors through processing layers to cloud analytics, while trained models deploy downward to edge, mobile, and TinyML inference points. Five connection types (deploy, data, results, assist, and sync) establish a distributed architecture where each paradigm contributes unique capabilities." fig-alt="System diagram with four ML paradigms: TinyML sensors, Edge inference, Mobile processing, and Cloud training. Arrows show deploy, data, results, sync, and assist flows between tiers."}

@fig-ml-systems-convergence (Line 1948)

Why Hybrid Approaches Work

The success of hybrid architectures stems from a deeper truth: despite their diversity, all ML deployment paradigms share core principles. @fig-ml-systems-convergence illustrates how implementations spanning cloud to tiny devices converge on core system challenges: managing data pipelines, balancing resource constraints, and implementing reliable architectures. ::: {#fig-ml-systems-convergence fig-env="figure" fig-pos="t" fig-cap="Convergence of ML Systems: Three-layer structure showing how diverse deployments converge. The top layer lists four paradigms (Cloud, Edge, Mobile, TinyML); the middle layer identifies shared foundations (data pipelines, resource management, architecture principles); and the bottom layer presents cross-cutting concerns (optimization, operations, trustworthy AI) that apply across all paradigms." fig-alt="Three-layer diagram. Top: Cloud, Edge, Mobile, TinyML implementations. Middle: data pipeline, resource management, architecture principles. Bottom: optimization, operations, trustworthy AI. Arrows connect layers."}


ops/ops.qmd

@fig-mlops-diagram (Line 87)

MLOps builds on DevOps but addresses the specific demands of ML system development and deployment. DevOps achieved remarkable success for traditional software by assuming deterministic behavior: the same code with the same inputs produces the same outputs. Machine learning systems violate this assumption fundamentally because they depend on training data distributions, learned parameters, and environmental conditions that shift over time. Where DevOps focuses on integrating and delivering deterministic software, MLOps must manage non-deterministic, data-dependent workflows spanning data acquisition, preprocessing, model training, evaluation, deployment, and continuous monitoring through an iterative cycle connecting design, model development, and operations (@fig-mlops-diagram). The following definition captures this discipline's scope: ::: {.callout-definition title="MLOps"}

@fig-technical-debt (Line 342)

The Broader Pattern: A fundamental law of systems engineering is that the cost of maintaining a system over its lifetime dwarfs the cost of building it. Dave Patterson frequently emphasizes that "measuring everything" is the only way to manage this complexity. In ML, technical debt is especially dangerous because it is often data-driven rather than code-driven: a perfect piece of code can still fail if the data it processes shifts. The Systems Conclusion: Automation is not just about speed; it is about capacity. A manual team hits a ceiling where it cannot deploy new models because it is drowning in the maintenance of old ones. MLOps is the engineering response: it replaces the manual "craft" of model maintenance with a systematic "factory" of observability and automation. If you don't build the monitoring infrastructure to make silent failures visible, you aren't just taking on debt; you are creating a system that is unmanageable by design. @fig-technical-debt illustrates how the ML code itself represents only a small fraction of a production ML system's complexity. :::

@fig-technical-debt-taxonomy (Line 384)

The calculation above reveals a fundamental truth: manual operations hit a capacity ceiling. But the cost of automation extends beyond engineering time to include the hidden complexity that accumulates in ML systems. These operational challenges manifest in several distinct patterns. Rather than cataloging every debt pattern, we focus on representative examples that illustrate MLOps engineering approaches. Each challenge emerges from ML's unique characteristics: reliance on data rather than deterministic logic, statistical rather than exact behavior, and implicit dependencies through data flows rather than explicit interfaces. @fig-technical-debt-taxonomy captures these technical debt patterns, demonstrating why traditional DevOps practices require extension for ML systems and motivating the infrastructure solutions presented in subsequent sections. ::: {#fig-technical-debt-taxonomy fig-env="figure" fig-pos="htb" fig-cap="ML Technical Debt Taxonomy. Machine learning systems accumulate distinct forms of technical debt from data dependencies, model interactions, and evolving requirements. Six primary debt patterns radiate from a central hub: boundary erosion undermines modularity, correction cascades propagate fixes through dependencies, feedback loops create hidden coupling, while data, configuration, and pipeline debt reflect poorly managed artifacts and workflows." fig-alt="Hub-and-spoke diagram with Hidden Technical Debt at center. Six debt categories radiate outward: Configuration Debt, Feedback Loops, Data Debt, Pipeline Debt, Correction Cascades, and Boundary Erosion, each annotated with specific failure patterns."}

@fig-correction-cascades-flowchart (Line 453)

If boundary erosion describes how ML systems lose their structural integrity, correction cascades describe what happens when teams attempt repairs. As machine learning systems evolve, they undergo iterative refinement to address performance issues, accommodate new requirements, or adapt to environmental changes. In well-engineered systems, such updates are localized. In ML systems, however, even small adjustments can trigger correction cascades: a sequence of dependent fixes that propagate backward and forward through the workflow. @fig-correction-cascades-flowchart visualizes how these cascading effects propagate through ML system development. Understanding the structure of these cascades helps teams anticipate and mitigate their impact. The diagram illustrates how cascades emerge across different stages of the ML lifecycle, from problem definition and data collection to model development and deployment. Each arc represents a corrective action, and the colors indicate different sources of instability: inadequate domain expertise, brittle real-world interfaces, misaligned incentives, and insufficient documentation. Red arrows represent cascading revisions, while the dotted arrow at the bottom highlights a full system restart, a drastic but sometimes necessary outcome.

@fig-ops-layers (Line 616)

@fig-ops-layers visualizes how these components form a layered architecture spanning ML models, frameworks, orchestration, infrastructure, and hardware. Understanding how these components interact enables practitioners to design systems that systematically address the technical debt patterns identified earlier while maintaining operational sustainability. ::: {#fig-ops-layers fig-env="figure" fig-pos="htb" fig-cap="MLOps Stack Layers. Five tiers organize the ML system stack: ML Models at the top, followed by Frameworks, Orchestration, Infrastructure, and Hardware. MLOps spans orchestration tasks (data management through model serving) and infrastructure tasks (job scheduling through monitoring), enabling automation, reproducibility, and scalable deployment." fig-alt="Layered architecture diagram. Top row: ML Models, Frameworks, Orchestration, Infrastructure, Hardware. MLOps section spans orchestration tasks (data management through model serving) and infrastructure tasks (job scheduling through monitoring)."}

@fig-ops-cicd (Line 840)

@fig-data-drift (Line 1332)

Training pipelines produce model candidates; validation determines which candidates merit production deployment. Before deployment, a machine learning model must undergo rigorous evaluation to ensure it meets predefined performance, robustness, and reliability criteria. MLOps reframes evaluation as a structured, repeatable process for validating operational readiness, incorporating pre-deployment assessment, post-deployment monitoring, and automated regression testing. The evaluation process begins with performance testing against a holdout test set sampled from the same distribution as production data. Core metrics such as accuracy, AUC, precision, recall, and F1 score [@rainio2024evaluation] are computed and tracked longitudinally to detect degradation from data drift [@ibm_data_drift]. @fig-data-drift demonstrates this degradation pattern, showing how changes in feature distributions correlate with declining model quality. ::: {#fig-data-drift fig-env="figure" fig-pos="htb" fig-cap="Data Drift Impact: Declining model performance over time results from data drift, where the characteristics of production data diverge from the training dataset. Monitoring key metrics longitudinally allows MLOps engineers to detect this drift and trigger model retraining or data pipeline adjustments to maintain accuracy." fig-alt="Three-panel visualization over time. Top: incoming data samples coded green or orange. Middle: feature distribution shifting from online to offline sales channel. Bottom: line graph showing model accuracy declining as distribution shifts increase."}
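One way to operationalize the longitudinal monitoring described above is the Population Stability Index, a common production drift metric. This is a generic sketch assuming NumPy; the metric choice, bin count, and the 0.1/0.25 rule-of-thumb thresholds are industry conventions, not taken from the text.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI: a summary of distribution shift between a reference (training)
    sample and a production sample. Illustrative rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # training-time distribution
prod_same = rng.normal(0.0, 1.0, 10_000)       # no drift
prod_shift = rng.normal(0.8, 1.2, 10_000)      # drifted distribution

print(population_stability_index(train_feature, prod_same))   # small: stable
print(population_stability_index(train_feature, prod_shift))  # large: retrain trigger
```

A monitoring job would compute this per feature on a schedule and page or trigger retraining when the index crosses the chosen threshold.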

@fig-rotting-asset-curve (Line 1923)

::: Because of drift, a deployed model behaves less like software (which doesn't break unless changed) and more like inventory (which decays over time). This is the Statistical Drift Invariant at work: the Degradation Equation $\text{Accuracy}(t) \approx \text{Accuracy}_0 - \lambda \cdot D(P_t \| P_0)$ introduced in @sec-deploy-invariants predicts that accuracy erodes in proportion to the distributional divergence $D$, regardless of code quality. Every monitoring strategy in this chapter exists to detect $D$ before it compounds into business impact. The following "Rotting Asset" plot (@fig-rotting-asset-curve) visualizes this entropy and compares two maintenance strategies: scheduled "Sawtooth" retraining versus trigger-based "Flatline" retraining.
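The degradation equation can be made concrete with a few lines of arithmetic. This sketch assumes NumPy and uses discrete KL divergence over binned feature histograms; the example distributions and the decay constant lambda = 0.5 are made up for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) for discrete distributions (e.g., binned feature histograms)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def predicted_accuracy(acc0, lam, p_t, p_0):
    """Degradation equation: Accuracy(t) ~= Accuracy_0 - lambda * D(P_t || P_0)."""
    return acc0 - lam * kl_divergence(p_t, p_0)

p_0 = [0.5, 0.3, 0.2]   # training-time feature distribution (binned)
p_t = [0.3, 0.3, 0.4]   # drifted production distribution

acc = predicted_accuracy(acc0=0.95, lam=0.5, p_t=p_t, p_0=p_0)
print(round(acc, 3))    # -> 0.888: drift alone costs ~6 points of accuracy
```

The point of the exercise is that the divergence term is measurable in production long before labeled outcomes confirm the accuracy drop.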

@fig-business-cost-curve (Line 2538)

Translating technical metrics into business impact requires consistent frameworks connecting model performance to operational outcomes. A 5% accuracy improvement appears modest in isolation, but contextualizing this as "reducing false fraud alerts from 1,000 to 800 daily customer friction incidents" provides actionable business context. This connection is not linear. As @fig-business-cost-curve illustrates, the optimal operating point for a model is rarely the point of highest accuracy. It is the point where the combined cost of False Positives (e.g., blocking a legitimate user) and False Negatives (e.g., missing fraud) is minimized.
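Finding that cost-minimizing operating point is a one-dimensional search over the decision threshold. The sketch below assumes NumPy; the per-event costs, the Beta-distributed score model, and the class sizes are all hypothetical numbers chosen to make the asymmetry visible.

```python
import numpy as np

# Hypothetical per-event business costs (assumptions, not from the text):
COST_FP = 5.0     # blocking a legitimate user (friction, support tickets)
COST_FN = 250.0   # missing a fraudulent transaction

rng = np.random.default_rng(1)
# Simulated model scores: fraud cases score higher on average.
scores_legit = rng.beta(2, 5, 20_000)
scores_fraud = rng.beta(5, 2, 1_000)

def total_cost(threshold):
    fp = np.sum(scores_legit >= threshold)   # legitimate users flagged
    fn = np.sum(scores_fraud < threshold)    # fraud cases missed
    return fp * COST_FP + fn * COST_FN

thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold: {best:.2f}")
```

Because a missed fraud costs far more than a false alert in this toy setup, the optimum sits well below the accuracy-maximizing threshold, which is exactly the point the curve makes.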

@fig-uptime-iceberg (Line 2719)

At high levels of maturity, ML systems exhibit properties commonly found in production-grade software systems: stateless services, contract-driven interfaces, environment isolation, and observable execution. Design patterns such as feature stores, model registries, and infrastructure-as-code become foundational. System behavior is not inferred from static assumptions but monitored in real time and adapted as needed. This enables feedback-driven development and supports closed-loop systems where data, models, and infrastructure co-evolve. In each case, operational maturity functions as an architectural force: it governs how complexity is managed, how change is absorbed, and how the system scales in the face of threats to service uptime. @fig-uptime-iceberg depicts this dependency stack as an iceberg, with visible uptime floating above hidden threats including data drift, concept drift, broken pipelines, schema changes, model bias, and underperforming segments. Design decisions that disregard these constraints may function under ideal conditions, but fail under real-world pressures such as latency requirements, drift, outages, or regulatory audits. Understanding this relationship between maturity and design is essential for building resilient machine learning systems that sustain performance over time. ::: {#fig-uptime-iceberg fig-env="figure" fig-pos="htb" fig-cap="Uptime Dependency Stack. An iceberg visualization where visible service uptime floats above the waterline, supported by hidden threats below: model accuracy degradation, data drift, concept drift, broken pipelines, schema changes, model bias, data outages, and underperforming segments. Labels group these threats into data health, model health, and service health categories." fig-alt="Iceberg diagram with uptime visible above waterline. Hidden below: model accuracy, data drift, concept drift, broken pipelines, schema changes, model bias, data outages, underperforming segments. Labels indicate data, model, and service health."}

@fig-clinaiops (Line 3010)

Feedback Loops

Three interlocking feedback loops enable safe, adaptive integration of machine learning into clinical practice. @fig-clinaiops visualizes these as a cyclical framework where patients contribute monitoring data, clinicians provide therapy regimens and approval limits, and AI systems generate alerts and recommendations. ::: {#fig-clinaiops fig-env="figure" fig-pos="htb" fig-cap="ClinAIOps Feedback Loops: The cyclical framework coordinates data flow between patients, clinicians, and AI systems to support continuous model improvement and safe clinical integration. These interconnected loops enable iterative refinement of AI models based on real-world performance and clinical feedback, fostering trust and accountability in healthcare applications. Source: [@chen2023framework]." fig-alt="Circular diagram with three nodes: patient, clinician, and AI system. Arrows form cyclic flow: patient provides monitoring data, clinician sets therapy regimen, AI generates alerts and recommendations. Inner and outer loops show feedback pathways."}

@fig-interactive-loop (Line 3185)


optimizations/model_compression.qmd

@fig-3-sections (Line 83)

Optimization Framework

Model optimization is not a single technique but a framework with three complementary dimensions, each addressing different bottlenecks. @fig-3-sections illustrates how these dimensions span from pure software concerns to hardware-level execution. ::: {#fig-3-sections fig-env="figure" fig-pos="htb" fig-cap="Optimization Stack: Model optimization progresses through three layers: efficient model representation, efficient numerics representation, and efficient hardware implementation." fig-alt="Three stacked rectangular boxes labeled from top to bottom: Efficient Model Representation, Efficient Numerics Representation, Efficient Hardware Implementation. A vertical arrow spans the stack with More software at top and More hardware at bottom."}

@fig-quantization-free-lunch (Line 138)

::: The following plot (@fig-quantization-free-lunch) visualizes this robustness, showing the "Free Lunch" where reducing precision has minimal impact on accuracy—until you hit the "Cliff."

@fig-sparse-matrix (Line 339)


where $\|\hat{W}\|_0$ is the **L0-norm** (the count of non-zero parameters). Since minimizing the L0-norm is NP-hard[^fn-np-hard], we use heuristics[^fn-heuristic] like **magnitude-based pruning**. @lst-pruning_example demonstrates this approach, removing weights with small absolute values to transform a dense weight matrix into the sparse representation shown in @fig-sparse-matrix.
[^fn-np-hard]: **NP-hard**: From computational complexity theory, "NP" stands for "nondeterministic polynomial time." A problem is NP-hard if solving it in polynomial time would imply P=NP, widely believed false. Finding the optimal sparse subnetwork requires examining exponentially many subsets, making exact solutions infeasible for networks with millions of parameters.
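A minimal sketch of the magnitude-based heuristic, assuming NumPy (this is not the book's @lst-pruning_example; the function name and the 50% sparsity target are illustrative):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).
    A greedy heuristic standing in for the NP-hard L0 minimization."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 4))
W_sparse = magnitude_prune(W, sparsity=0.5)
print(np.count_nonzero(W_sparse))  # 8 of 16 weights survive
```

The surviving weights are exactly the largest-magnitude half, the heuristic's proxy for "most important."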

@fig-channel-layer-pruning (Line 518)

  • Layer pruning removes entire layers from the network, significantly reducing depth. While this approach can yield significant efficiency gains, it requires careful balance to ensure the model retains sufficient capacity to capture complex patterns. @fig-channel-layer-pruning illustrates the differences between channel pruning and layer pruning. When a channel is pruned, the model's architecture must be adjusted to accommodate the structural change. Specifically, the number of input channels in subsequent layers must be modified, requiring alterations to the depths of the filters applied to the layer with the removed channel. In contrast, layer pruning removes all channels within a layer, necessitating more significant architectural modifications. In this case, connections between remaining layers must be reconfigured to bypass the removed layer. Regardless of the pruning approach, fine-tuning is important to adapt the remaining network and restore performance. ::: {#fig-channel-layer-pruning fig-env="figure" fig-pos="htb" fig-cap="Channel vs. Layer Pruning. Channel pruning adjusts filter sizes within layers, while layer pruning removes entire layers and necessitates reconnection of remaining network components. These approaches reduce model size and computational cost, but require fine-tuning to mitigate performance loss due to reduced model capacity." fig-alt="Side-by-side diagrams showing channel pruning (left) and layer pruning (right). Each shows three-stage CNN with feature maps as 3D blocks connected by dashed lines. Red highlights indicate pruned channels or layers."}

@fig-structured-unstructured (Line 817)

Hardware-aware pruning strategies, such as N:M structured sparsity, enforce specific patterns (e.g., ensuring 2 out of every 4 weights are zero) to align with specialized accelerator capabilities. The hardware implementation details of these patterns, including how they leverage sparse tensor cores, are covered in @sec-ai-acceleration. @fig-structured-unstructured illustrates the key differences between unstructured and structured pruning. On the left, unstructured pruning removes individual weights (depicted as dashed connections), creating a sparse weight matrix. This can disrupt the original network structure, as shown in the fully connected network where certain connections have been randomly pruned. While this reduces the number of active parameters, the resulting sparsity requires specialized execution kernels to fully utilize computational benefits. In contrast, structured pruning (depicted in the middle and right sections of @fig-structured-unstructured) removes entire neurons or filters while preserving the network's overall structure. In the middle section, a pruned fully connected network retains its fully connected nature but with fewer neurons. On the right, structured pruning is applied to a CNN by removing convolutional kernels or entire channels (dashed squares). This method maintains the CNN's core convolutional operations while reducing the computational load, making it more compatible with hardware accelerators.

@fig-structured-unstructured (Line 819)

@fig-structured-unstructured illustrates the key differences between unstructured and structured pruning. On the left, unstructured pruning removes individual weights (depicted as dashed connections), creating a sparse weight matrix. This can disrupt the original network structure, as shown in the fully connected network where certain connections have been randomly pruned. While this reduces the number of active parameters, the resulting sparsity requires specialized execution kernels to fully utilize computational benefits. In contrast, structured pruning (depicted in the middle and right sections of @fig-structured-unstructured) removes entire neurons or filters while preserving the network's overall structure. In the middle section, a pruned fully connected network retains its fully connected nature but with fewer neurons. On the right, structured pruning is applied to a CNN by removing convolutional kernels or entire channels (dashed squares). This method maintains the CNN's core convolutional operations while reducing the computational load, making it more compatible with hardware accelerators. ::: {#fig-structured-unstructured fig-env="figure" fig-pos="htb" fig-cap="Unstructured vs. Structured Pruning. Unstructured pruning (left) achieves sparsity by removing individual weights, requiring specialized hardware, while structured pruning (middle, right) removes entire neurons or filters, preserving network structure for standard hardware acceleration. Source: [@qi2021efficient]." fig-alt="Three-panel diagram. Left shows unstructured pruning with dashed connections in a neural network. Middle and right show structured pruning: fully connected network with pruned neurons and CNN with pruned filters shown as dashed squares."}

@fig-iterative-pruning (Line 1016)

Iterative pruning removes structure gradually through multiple cycles of pruning followed by fine-tuning. During each cycle, the algorithm removes a small subset of structures based on predefined importance metrics. The model then undergoes fine-tuning to adapt to these structural modifications before proceeding to the next pruning iteration. This gradual approach helps prevent sudden drops in accuracy while allowing the network to progressively adjust to reduced complexity. @fig-iterative-pruning illustrates this process using a convolutional neural network where six channels are pruned. Rather than removing all channels simultaneously, iterative pruning eliminates two channels per iteration over three cycles. Following each pruning step, the model undergoes fine-tuning to recover performance. The first iteration, which removes two channels, results in an accuracy decrease from 0.995 to 0.971, but subsequent fine-tuning restores accuracy to 0.992. After completing two additional pruning-tuning cycles, the final model achieves 0.991 accuracy, which represents only a 0.4% reduction from the original, while operating with 27% fewer channels. By distributing structural modifications across multiple iterations, the network maintains its performance capabilities while achieving improved computational efficiency. ::: {#fig-iterative-pruning fig-env="figure" fig-pos="htb" fig-cap="Iterative Pruning Performance: Three rows depict successive prune-then-fine-tune cycles, each removing two of the original 22 channels. Accuracy drops from 0.995 to 0.971 after the first prune, recovers to 0.992 after fine-tuning, and settles at 0.991 after all three cycles, a 0.4% loss with 27% fewer channels." fig-alt="Three-row workflow showing iterative pruning. Each row displays CNN architecture, prune step with red arrow, accuracy drop box, fine-tune gears icon, and accuracy recovery. Values progress from 0.995 to 0.991 final accuracy."}
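The prune-then-fine-tune cycle can be sketched as a loop. Everything below is a toy stand-in assuming NumPy: channels are columns of a random matrix, the accuracy numbers from the figure are not reproduced, and `evaluate` is a hypothetical proxy score where real code would run fine-tuning and validation.

```python
import numpy as np

def evaluate(weights):
    """Stand-in metric: retained weight energy. In practice this would be
    validation accuracy measured after fine-tuning."""
    return float(np.sum(weights ** 2) / TOTAL_ENERGY)

def prune_channels(weights, n):
    """Zero the n remaining channels (columns) with the smallest L2 norm."""
    norms = np.linalg.norm(weights, axis=0)
    norms[norms == 0] = np.inf              # skip already-pruned channels
    for idx in np.argsort(norms)[:n]:
        weights[:, idx] = 0.0
    return weights

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 22))               # 22 channels, as in the example
TOTAL_ENERGY = float(np.sum(W ** 2))

# Iterative schedule: remove 2 channels per cycle for 3 cycles (6 of 22, ~27%).
for cycle in range(3):
    W = prune_channels(W, n=2)
    # (a fine-tuning step would go here to recover accuracy)
    remaining = np.count_nonzero(np.linalg.norm(W, axis=0))
    print(f"cycle {cycle + 1}: {remaining} channels left, proxy score {evaluate(W):.3f}")
```

One-shot pruning is the same loop with a single iteration and `n=6`; the gradual schedule exists purely so fine-tuning can absorb each small structural change.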

@fig-oneshot-pruning (Line 1227)

One-shot pruning removes multiple architectural components in a single step, followed by an extensive fine-tuning phase to recover model accuracy. This aggressive approach compresses the model quickly but risks greater accuracy degradation, as the network must adapt to significant structural changes simultaneously. Consider applying one-shot pruning to the same network from the iterative pruning example. Instead of removing two channels at a time over multiple iterations, one-shot pruning eliminates all six channels simultaneously, as illustrated in @fig-oneshot-pruning. Removing 27% of the network's channels simultaneously causes the accuracy to drop significantly, from 0.995 to 0.914. Even after fine-tuning, the network only recovers to an accuracy of 0.943, which is a 5% degradation from the original unpruned network. While both iterative and one-shot pruning ultimately produce identical network structures, the gradual approach of iterative pruning better preserves model performance. ::: {#fig-oneshot-pruning fig-env="figure" fig-pos="htb" fig-cap="One-Shot Pruning Impact: All six channels (27%) are removed simultaneously, causing accuracy to drop from 0.995 to 0.914. Fine-tuning recovers only to 0.943, a 5% degradation compared to the 0.4% loss from iterative pruning, illustrating why gradual removal preserves accuracy more effectively." fig-alt="Single-row workflow showing one-shot pruning. CNN with six red-highlighted channels to prune, followed by accuracy drop from 0.995 to 0.914, fine-tuning gears, and partial recovery to 0.943."}

@fig-winning-ticket (Line 1436)

@fig-kd-targets (Line 1704)

Knowledge Distillation

Knowledge distillation trains a smaller student model using guidance from a larger, pre-trained teacher model. The core insight is that a well-trained teacher provides a richer learning signal than simple ground-truth labels. While a hard label is binary (e.g., [1, 0, 0] for cat), a teacher's probability distribution (e.g., [0.85, 0.10, 0.05]) reveals inter-class similarity, showing that a cat shares more features with a dog than a fox. @fig-kd-targets visualizes how this "dark knowledge" embedded in the teacher's learned representations guides the student to generalize better.
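Temperature scaling is what makes the "dark knowledge" visible. A minimal sketch assuming NumPy; the logits and temperatures are invented for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.asarray(z, float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = [8.0, 4.0, 1.0]   # hypothetical logits for cat / dog / fox

hard = softmax(teacher_logits, T=1.0)
soft = softmax(teacher_logits, T=4.0)
print("T=1:", np.round(hard, 3))   # near one-hot: hides class similarity
print("T=4:", np.round(soft, 3))   # softened: reveals cat~dog > cat~fox
```

At T=1 the dog and fox probabilities are nearly invisible; at T=4 the student can see that "dog" is a much closer miss than "fox."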

@fig-kd-overview (Line 1764)

Principles and Formalism

@fig-kd-overview illustrates the distillation workflow, which trains the student model to minimize a combination of two loss functions:

  1. Distillation Loss: Typically the Kullback-Leibler (KL) divergence[^fn-kl-divergence] between the teacher's softened output distribution and the student's distribution.
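The combined objective can be sketched in a few lines. This follows a common formulation (with the conventional T-squared scaling on the soft term); it assumes NumPy, uses invented logits, and is not presented as the book's exact loss:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Combined distillation objective:
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(hard, student).
    The T^2 factor keeps soft-loss gradients on the same scale as the hard loss."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * np.log(p_t / p_s)))
    ce = float(-np.log(softmax(student_logits)[hard_label]))
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

teacher = [8.0, 4.0, 1.0]
student_good = [6.0, 3.0, 0.5]    # roughly matches the teacher's ranking
student_bad = [1.0, 1.0, 6.0]     # disagrees with both teacher and label

print(kd_loss(student_good, teacher, hard_label=0))  # small combined loss
print(kd_loss(student_bad, teacher, hard_label=0))   # large combined loss
```

The blend weight alpha and temperature T are tuned per task; higher T transfers more inter-class structure, at the cost of a weaker signal about the top class.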

@fig-matrix-factorization (Line 1975)

@fig-tensor-decomposition (Line 2249)

::: @fig-tensor-decomposition illustrates how a 3D tensor can be decomposed into factor matrices. Common decomposition methods include:

  • CP decomposition: Expresses a tensor as a sum of rank-one components: $\mathcal{A} \approx \sum_{r=1}^{k} u_r \otimes v_r \otimes w_r$
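The CP reconstruction and its parameter savings can be checked directly. A sketch assuming NumPy; the tensor dimensions and rank are arbitrary choices for illustration:

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """Rebuild a 3-way tensor from CP factors, A ~= sum_r u_r (x) v_r (x) w_r,
    via einsum over the shared rank index r."""
    return np.einsum('ir,jr,kr->ijk', U, V, W)

# Build an exactly rank-2 tensor, then compare dense vs. factored storage.
rng = np.random.default_rng(0)
I, J, K, rank = 10, 12, 14, 2
U = rng.normal(size=(I, rank))
V = rng.normal(size=(J, rank))
W = rng.normal(size=(K, rank))
A = cp_reconstruct(U, V, W)

dense_params = I * J * K           # 1680 entries stored explicitly
cp_params = rank * (I + J + K)     # 72 entries in the factor matrices
print(f"dense: {dense_params}, CP: {cp_params}")  # ~23x fewer parameters
```

For low-rank tensors the savings grow with dimensionality, which is why these decompositions target the largest weight tensors in a network.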

@fig-nas-flow (Line 2290)

Pruning, knowledge distillation, and other techniques explored in previous sections rely on human expertise to determine optimal model configurations. While these manual approaches have led to significant advancements, selecting optimal architectures requires extensive experimentation, and even experienced practitioners may overlook more efficient designs [@elsken2019neural]. Neural Architecture Search (NAS) automates this process by systematically exploring large spaces of possible architectures to identify those that best balance accuracy, computational cost, memory efficiency, and inference latency. @fig-nas-flow illustrates the NAS process. NAS operates through three interconnected stages: defining the search space (architectural components and constraints), applying search strategies (reinforcement learning [@zoph2017neural], evolutionary algorithms, or gradient-based methods) to explore candidate architectures, and evaluating performance to ensure discovered designs satisfy accuracy and efficiency objectives. This automation enables the discovery of novel architectures that often match or surpass human-designed models while requiring substantially less expert effort.
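The three stages (search space, search strategy, evaluation) can be sketched with the simplest strategy, random search. Everything here is a toy: the search space, the `proxy_score` heuristic, and the trial budget are invented, where real NAS would train and evaluate each candidate.

```python
import random

# Illustrative search space (widths, depths, kernels are made-up values).
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [32, 64, 128, 256],
    "kernel": [3, 5, 7],
}

def proxy_score(arch):
    """Stand-in for train-and-evaluate: rewards capacity, penalizes compute cost.
    Real NAS would measure validation accuracy and latency here."""
    capacity = arch["depth"] * arch["width"]
    cost = arch["depth"] * arch["width"] * arch["kernel"] ** 2
    return capacity / 100 - cost / 50_000

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Search strategy: sample a candidate uniformly from the space.
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        s = proxy_score(arch)                 # evaluation stage
        if s > best_score:
            best_arch, best_score = arch, s
    return best_arch, best_score

arch, score = random_search()
print(arch, round(score, 3))
```

Reinforcement-learning, evolutionary, and gradient-based strategies all fit this same skeleton; they differ only in how the next candidate is proposed.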

@fig-quantized-energy (Line 2495)

Energy Costs

Beyond computational and memory benefits, the energy costs associated with different numerical precisions reinforce the case for reduced precision. @fig-quantized-energy quantifies these energy differences: a 32-bit floating-point addition (FAdd) consumes approximately 0.9 pJ, whereas a 16-bit floating-point addition requires only 0.4 pJ. Similarly, a 32-bit integer addition costs 0.1 pJ, while an 8-bit integer addition is just 0.03 pJ. These savings compound across large-scale models operating over billions of operations, supporting both cost reduction and sustainability goals. These energy savings take on a different character for models where memory capacity, not compute, is the binding constraint. DLRM embedding quantization illustrates this distinction.
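A quick back-of-the-envelope using the per-operation figures quoted above shows how the savings compound; the one-billion-operation count is an illustrative assumption.

```python
# Energy per addition in picojoules, as quoted in the text.
ENERGY_PJ = {
    "fp32_add": 0.9,
    "fp16_add": 0.4,
    "int32_add": 0.1,
    "int8_add": 0.03,
}

ops = 1e9  # one billion additions (illustrative workload size)

for op, pj in ENERGY_PJ.items():
    joules = ops * pj * 1e-12          # pJ -> J
    print(f"{op}: {joules:.5f} J per billion ops")

savings = ENERGY_PJ["fp32_add"] / ENERGY_PJ["int8_add"]
print(f"fp32 -> int8 addition saves {savings:.0f}x energy")
```

A 30x per-operation gap is negligible for one inference but decisive at datacenter scale or on a coin-cell battery.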

@fig-quantization_impact (Line 2616)

Performance Gains

@fig-quantization_impact illustrates the impact of quantization on both inference time and model size. The left bars in each category show inference time improvements when moving from FP32 to INT8, while the right bars depict the corresponding reduction in model size. Quantized models achieve up to $4\times$ faster inference while reducing storage requirements by the same factor, making them well suited for deployment in resource-constrained environments. ::: {#fig-quantization_impact fig-env="figure" fig-pos="htb" fig-cap="Quantization Impact: Moving from FP32 to INT8 reduces inference time by up to 4 times while decreasing model size by a factor of 4, making models more efficient for resource-constrained environments." fig-alt="Two stacked bar charts comparing FP32 and INT8. Left chart shows inference time in milliseconds for Inception, MobileNet, and ResNet. Right chart shows model size in megabytes. INT8 consistently smaller and faster."}

@fig-quantization (Line 2761)

Quantization Error Distribution: Histogram of quantization error weighted by probability density p(x), showing a bell-shaped curve centered near zero with tails that introduce quantization noise affecting model accuracy.{#fig-quantization width=80% fig-alt="Histogram showing quantization error distribution weighted by probability density. Bell-shaped curve centered near zero with tails extending to positive and negative errors, illustrating typical quantization noise pattern."} @fig-quantization reveals how the quantization error distribution varies across numerical formats, with each format introducing different levels of quantization noise that directly influence model accuracy and stability. Understanding this noise is essential, but practitioners ultimately care about end-to-end speedup, and the magnitude of that speedup depends on whether a workload is compute-bound or memory-bound. ::: {.callout-notebook title="The Quantization Speedup: Compute-Bound vs. Memory-Bound"}
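The shape of that error distribution is easy to reproduce. A sketch of symmetric per-tensor INT8 quantization, assuming NumPy; the bell-shaped weight tensor is simulated:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: scale by max |x|, round, clamp."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, 100_000)       # typical bell-shaped weight tensor

q, scale = quantize_int8(weights)
error = weights - q.astype(np.float32) * scale

# Rounding error is bounded by half a quantization step and centered near zero.
print(f"max |error| = {np.max(np.abs(error)):.2e}, step/2 = {scale / 2:.2e}")
print(f"mean error  = {np.mean(error):.2e}")
```

The near-zero mean and bounded magnitude are why well-conditioned layers tolerate INT8 with little accuracy loss; outlier-heavy activations, by contrast, inflate the scale and widen the error.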

@fig-3float (Line 2803)

FP16 and bfloat16 formats provide moderate efficiency gains while preserving model accuracy. Many AI accelerators, such as NVIDIA Tensor Cores and TPUs, include dedicated support for FP16 computations, enabling 2× faster matrix operations compared to FP32. BFloat16, in particular, retains the same 8-bit exponent as FP32 but with a reduced 7-bit mantissa, allowing it to maintain a similar dynamic range (10⁻³⁸ to 10³⁸) while sacrificing precision. In contrast, FP16, with its 5-bit exponent and 10-bit mantissa, has a significantly reduced dynamic range (10⁻⁵ to 10⁵), making it more suitable for inference rather than training. Since BFloat16 preserves the exponent size of FP32, it better handles extreme values encountered during training, whereas FP16 may struggle with underflow or overflow. This makes BFloat16 a more robust alternative for deep learning workloads that require a wide dynamic range. @fig-3float illustrates how bit-width allocations impact the trade-offs between precision and numerical range.
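The dynamic-range difference is easy to observe with NumPy, which supports FP16 natively (bfloat16 requires an extension package such as ml_dtypes, so it appears only in comments here):

```python
import numpy as np

# FP16 (5-bit exponent): representable magnitudes span roughly 6e-8 to 65504.
print(np.float16(70000.0))   # inf -> overflows FP16's maximum of 65504
print(np.float16(1e-8))      # 0.0 -> underflows below FP16's smallest subnormal

# FP32 handles both; bfloat16 would too, since it shares FP32's 8-bit exponent.
print(np.float32(70000.0))   # 70000.0
print(np.float32(1e-8))      # 1e-08
```

This is exactly the failure mode that makes FP16 training fragile: gradients and loss-scale values routinely leave the [10⁻⁵, 10⁵] window.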

@fig-quantization-roadmap (Line 2899)

Three approaches form a complexity ladder. Post-training quantization (PTQ) reduces precision after training, requiring no retraining and minimal engineering effort. Quantization-aware training (QAT) incorporates quantization effects into the training loop, enabling models to adapt to lower precision and retain higher accuracy. Mixed-precision training assigns different precision levels to different operations, matching precision to each layer's sensitivity. @fig-quantization-roadmap organizes quantization techniques into three progressive tiers based on implementation complexity, resource requirements, and target use cases. ::: {#fig-quantization-roadmap fig-env="figure" fig-pos="htb" fig-cap="Quantization Complexity Roadmap: Three progressive tiers of quantization techniques, from foundational approaches suitable for quick deployment to research frontier methods for extreme resource constraints, reflecting increasing implementation effort, resource requirements, and potential accuracy trade-offs." fig-alt="Tiered diagram with three levels. Foundation tier includes PTQ and basic INT8. Production tier adds QAT and mixed precision. Research frontier tier shows INT4, binary, and ternary quantization with icons for increasing complexity."}

@fig-ptq-calibration (Line 3156)

An important aspect of PTQ is the calibration step, which selects the clipping range [\alpha, \beta] for quantizing model weights and activations. The effectiveness of precision reduction depends heavily on this chosen range: without proper calibration, quantization may cause significant accuracy degradation. Calibration ensures that the chosen range minimizes information loss and preserves model performance. @fig-ptq-calibration illustrates the post-training quantization workflow. A calibration dataset, a representative subset of training or validation data, is passed through the pre-trained model to estimate the numerical distribution of activations and weights. This distribution then defines the clipping range for quantization. The quantization step converts model parameters to a lower-precision format, producing the final quantized model. ::: {#fig-ptq-calibration fig-env="figure" fig-pos="htb" fig-cap="Post-Training Quantization: Calibration with a representative dataset determines optimal quantization ranges for model weights and activations, minimizing information loss during quantization to create efficient, lower-precision models. This process converts a pre-trained model into a quantized version suitable for deployment on resource-constrained devices." fig-alt="Vertical flowchart with four boxes connected by arrows. Pre-trained model and Calibration data feed into Calibration step, which feeds into Quantization step, producing final Quantized model output."}

@fig-resnet-activations-histogram (Line 3191)

For example, consider quantizing activations that originally range between -6 and 6 to 8-bit integers. Simply using the full integer range of -128 to 127 might not be the most effective approach. Calibration passes a representative dataset through the model, observes the actual activation range, and uses that observed range to set a tighter quantization range, reducing information loss. Common calibration methods include Max (uses maximum absolute value, simple but susceptible to outliers), Entropy (minimizes KL divergence between original and quantized distributions, TensorRT's default), and Percentile (clips to a percentile, e.g., 99%, avoiding outlier impact). @fig-resnet-activations-histogram shows why outlier handling matters: ResNet50 activations exhibit long tails where outliers can skew the quantization range. Activation Distribution: ResNet50 layer activations exhibit a long tail, with outlier values that can lead to inefficient precision use if not handled carefully. Source: [@wu2020integer].{#fig-resnet-activations-histogram width=85% fig-alt="Histogram of ResNet50 activation values showing right-skewed distribution. Most values cluster near zero with long tail extending to outliers around 2.1, demonstrating challenge for quantization range selection."}
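The contrast between Max and Percentile calibration can be sketched with synthetic long-tailed activations (the values are illustrative, not ResNet50's actual distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly small activations plus a few large outliers, mimicking a long tail.
acts = np.concatenate([rng.normal(0.0, 0.1, 10_000), [2.1, 1.8, 1.9]])

max_clip = np.abs(acts).max()                 # Max calibration: outliers set the range.
pct_clip = np.percentile(np.abs(acts), 99.9)  # Percentile calibration: ignore the tail.

# A smaller clip gives a finer quantization step for the bulk of the values,
# at the cost of saturating the few clipped outliers.
print(f"max calibration:        clip={max_clip:.2f}  step={max_clip / 127:.5f}")
print(f"percentile calibration: clip={pct_clip:.2f}  step={pct_clip / 127:.5f}")
```

Entropy calibration sits between these extremes, searching for the clip that minimizes KL divergence rather than using a fixed rule.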

@fig-calibration-ranges (Line 3199)

Tuning Quantization Ranges

A key challenge in post-training quantization is selecting the appropriate calibration range [\alpha, \beta] to map floating-point values into a lower-precision representation. The choice of this range directly affects the quantization error and, consequently, the accuracy of the quantized model. @fig-calibration-ranges contrasts the two primary calibration strategies: symmetric calibration and asymmetric calibration. ::: {#fig-calibration-ranges fig-env="figure" fig-pos="htb" fig-cap="Calibration Range Selection: Symmetric calibration uses a fixed range around zero, while asymmetric calibration adapts the range to the data distribution, potentially minimizing quantization error and preserving model accuracy. Choosing an appropriate calibration strategy balances precision with the risk of saturation for outlier values." fig-alt="Two side-by-side mapping diagrams. Left shows symmetric calibration with range from -1 to 1 mapping to -127 to 127 with zero aligned. Right shows asymmetric calibration with range -0.5 to 1.5 mapping with shifted zero point."}

@fig-calibration-ranges (Line 3279)

::: @fig-calibration-ranges illustrates both approaches. Symmetric calibration (left) maps [-1, 1] to [-127, 127] with zero preserved, making it simpler to implement and well suited for zero-centered weight distributions. Asymmetric calibration (right) uses different ranges (\alpha = -0.5, \beta = 1.5), better utilizing the quantized range for skewed distributions at the cost of additional complexity. Most frameworks (TensorRT, PyTorch) support both modes. Granularity.
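The two parameterizations can be sketched directly, using the ranges from the figure ([-1, 1] and [-0.5, 1.5]); the sample arrays are illustrative:

```python
import numpy as np

def symmetric_params(x):
    # Zero-point fixed at 0; +/- max|x| maps onto [-127, 127].
    return np.abs(x).max() / 127.0, 0

def asymmetric_params(x):
    # Full [min, max] range maps onto [0, 255]; a zero-point shifts real zero.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    return scale, int(round(-lo / scale))

centered = np.array([-1.0, -0.2, 0.4, 1.0])     # zero-centered, like weights
skewed = np.array([-0.5, 0.0, 0.3, 0.9, 1.5])   # skewed, like the figure's example

print("symmetric on centered:", symmetric_params(centered))
print("symmetric on skewed:  ", symmetric_params(skewed))   # wastes codes below -0.5
print("asymmetric on skewed: ", asymmetric_params(skewed))  # finer scale, shifted zero
```

For the skewed data the asymmetric scale is finer (2.0/255 versus 1.5/127), which is exactly the better range utilization the figure depicts.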

@fig-quantization-granularity (Line 3284)

After determining the clipping range, the next optimization step is adjusting the granularity of that range to retain as much accuracy as possible. In CNNs, the input activations of a layer undergo convolution with multiple filters, each of which may have a unique range of values. The quantization process must account for these differences to preserve model performance. @fig-quantization-granularity demonstrates this variation: the range for Filter 1 is significantly smaller than that for Filter 3. The precision with which the clipping range [\alpha, \beta] is determined becomes an important factor in effective quantization. This variability in ranges is why different granularity-based quantization strategies are employed. ::: {#fig-quantization-granularity fig-env="figure" fig-pos="htb" fig-cap="Quantization Range Variation: Different convolutional filters exhibit unique activation ranges, necessitating per-filter quantization to minimize accuracy loss during quantization. Adjusting the granularity of clipping ranges, as shown by the differing scales for each filter, optimizes the trade-off between model size and performance. Source: [@gholami2021survey]." fig-alt="Four rows showing CNN filters with Gaussian weight distributions. Each filter has different clipping ranges shown as red and blue dashed lines. Layer-wise clipping uses same range; channel-wise uses per-filter ranges."}
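This per-filter range variation is why per-channel scales beat a single per-tensor scale; a sketch with four synthetic filters of very different spreads (spreads chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Four filters with very different weight ranges, as in the figure.
W = np.stack([rng.normal(0, s, 64) for s in (0.02, 0.1, 0.5, 1.0)])

per_tensor_scale = np.abs(W).max() / 127.0         # one clipping range for all filters
per_channel_scale = np.abs(W).max(axis=1) / 127.0  # one clipping range per filter

def quant_mse(w, scale):
    q = np.clip(np.round(w / scale), -128, 127)
    return np.mean((q * scale - w) ** 2)

err_tensor = np.mean([quant_mse(w, per_tensor_scale) for w in W])
err_channel = np.mean([quant_mse(w, s) for w, s in zip(W, per_channel_scale)])
print(f"per-tensor MSE:  {err_tensor:.2e}")   # narrow filters suffer coarse steps
print(f"per-channel MSE: {err_channel:.2e}")  # each filter gets a tight range
```

The widest filter dictates the shared scale, so the narrowest filter is quantized with a step far coarser than it needs; per-channel scales remove that coupling.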

@fig-weight-activations-quantization (Line 3518)

Weights vs. Activations

Weight Quantization involves converting the continuous, high-precision weights of a model into lower-precision values, such as converting 32-bit floating-point (Float32) weights to 8-bit integer (INT8) weights. @fig-weight-activations-quantization illustrates how weight quantization occurs in the second step (red squares) during the multiplication of inputs. This process significantly reduces the model size, decreasing both the memory required to store the model and the computational resources needed for inference. For example, a weight matrix in a neural network layer with Float32 weights like [0.215, -1.432, 0.902,\ldots] might be mapped to INT8 values such as [19, -127, 80, \ldots] (scaling so the largest magnitude maps to 127), leading to a significant reduction in memory usage. ::: {#fig-weight-activations-quantization fig-env="figure" fig-pos="htb" fig-cap="Quantization and Weight Precision: Color-coded matrix multiplication diagram showing three steps: blue squares represent input activations, red squares represent quantized weights, and green squares represent output activations. Reducing precision from float32 to INT8 lowers model size and computational cost at the potential expense of accuracy. Source: HarvardX." fig-alt="Matrix multiplication diagram with three steps. Blue squares show input activations, red squares show quantized weights, and green squares show output activations. Arrows indicate computation flow through multiply-accumulate operations."}
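A quick way to check such a mapping, assuming symmetric max-abs scaling (one common choice; frameworks may pick the scale differently):

```python
import numpy as np

w = np.array([0.215, -1.432, 0.902], dtype=np.float32)
scale = np.abs(w).max() / 127.0                 # largest magnitude maps to +/-127
q = np.round(w / scale).astype(np.int8)
print("quantized:", q)                          # all values inside [-128, 127]
print("recovered:", np.round(q * scale, 3))     # approximate original weights
```

Dequantizing (q * scale) recovers each weight to within half a quantization step, which is the per-weight error budget INT8 imposes.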

@fig-qat (Line 3663)

Quantization-Aware Training

QAT integrates quantization constraints directly into the training process, simulating low-precision arithmetic during forward passes to allow the model to adapt to quantization effects [@jacob2018quantization]. Production systems requiring less than 1% accuracy loss benefit most from this approach, which recovers accuracy through fine-tuning with quantization simulation at the cost of 20-50% additional training time. This approach is particularly important for models requiring fine-grained numerical precision, such as transformers used in NLP and speech recognition systems [@nagel2021whitepaper]. @fig-qat illustrates the QAT process: quantization is applied to a pre-trained model, followed by fine-tuning to adapt weights to low-precision constraints. ::: {#fig-qat fig-env="figure" fig-pos="htb" fig-cap="Quantization-Aware Training: Vertical flowchart showing the QAT pipeline: a pre-trained model passes through a quantization step that simulates low-precision arithmetic, then undergoes retraining with training data to adapt weights to quantization constraints, producing a final quantized model optimized for efficient inference." fig-alt="Vertical flowchart showing QAT process. Pre-trained model feeds into Quantization step, then Retraining/Finetuning step with Training data input, producing final Quantized model output."}
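The core mechanism is "fake quantization": the forward pass snaps weights to the low-precision grid while values remain stored in floating point, and the backward pass treats the rounding as identity (the straight-through estimator). A minimal sketch:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Forward-pass simulation of INT quantization: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(np.clip(w / scale, -qmax - 1, qmax)) * scale

w = np.linspace(-1.0, 1.0, 9)
w_q = fake_quantize(w)
print(w_q)  # weights snapped to the INT8 grid, still stored as floats

# During QAT, the gradient of fake_quantize is taken to be 1 (straight-through
# estimator), so the full-precision weights keep receiving updates even though
# the forward pass only ever "sees" their quantized versions.
```

Because the forward pass experiences quantization noise throughout fine-tuning, the optimizer settles into weights that remain accurate after the final, real conversion to INT8.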

@fig-ptq-qat (Line 3702)

::: In many cases, QAT can also build off PTQ (discussed in detail in the previous section), as shown in @fig-ptq-qat. Instead of starting from a full-precision model, PTQ is first applied to produce an initial quantized model using calibration data. This quantized model then serves as the starting point for QAT, where additional fine-tuning with training data helps the model better adapt to low-precision constraints. This hybrid approach combines PTQ's efficiency with QAT's accuracy preservation, reducing the degradation typically associated with post-training approaches alone. ::: {#fig-ptq-qat fig-env="figure" fig-pos="htb" fig-cap="PTQ-to-QAT Pipeline. Two grouped stages: the PTQ stage quantizes and calibrates a pretrained model using calibration data, then the QAT stage fine-tunes the result with training data. This hybrid approach combines PTQ's efficiency with QAT's accuracy preservation." fig-alt="Vertical flowchart with two grouped stages. PTQ stage shows pretrained model through quantize and calibrate steps. QAT stage shows fine-tuning step. Calibrate data feeds PTQ; Training data feeds QAT."}

@fig-early-exit-transformers (Line 4324)

Early exit models prove advantageous for resource-constrained devices, such as mobile processors and edge accelerators. By dynamically adjusting computational effort, these architectures reduce power consumption and processing latency, making them ideal for real-time decision-making [@hu2021triple]. When deployed on hardware accelerators such as GPUs and TPUs, early exit architectures can be further optimized by exploiting parallelism. For instance, different exit paths can be evaluated concurrently, thereby improving throughput while preserving the benefits of adaptive computation [@yu2023efficient]. This approach is illustrated in @fig-early-exit-transformers, where each transformer layer is followed by a classifier and an optional early exit mechanism based on confidence estimation or latency-to-accuracy trade-offs (LTE). At each stage, the system may choose to exit early if sufficient confidence is achieved, or continue processing through deeper layers, enabling dynamic allocation of computational resources. ::: {#fig-early-exit-transformers fig-env="figure" fig-pos="htb" fig-cap="Early Exit Architecture: Transformer layers dynamically adjust computation by classifying each layer's output and enabling early termination if sufficient confidence is reached, reducing latency and power consumption for resource-constrained devices. This approach allows for parallel evaluation of different exit paths, improving throughput on hardware accelerators like GPUs and TPUs. Source: [@xin-etal-2021-berxit]." fig-alt="Flowchart with input feeding n transformer layers in sequence. Each layer connects to a classifier, confidence estimator, and exit point. Arrows show continue paths for low confidence."}
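A sketch of confidence-based early exit, with toy stand-ins for the transformer layers and per-layer classifier heads (all shapes, weights, and thresholds here are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, layers, classifiers, threshold=0.9):
    """Run layers in order; return as soon as a classifier head is confident."""
    h = x
    for depth, (layer, clf) in enumerate(zip(layers, classifiers), start=1):
        h = layer(h)
        probs = softmax(clf(h))
        if probs.max() >= threshold:   # confident enough: exit here
            return probs, depth
    return probs, depth                # no early exit: used every layer

rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(0, 0.5, (4, 4)): np.tanh(W @ h) for _ in range(6)]
classifiers = [lambda h, W=rng.normal(0, 1.0, (3, 4)): W @ h for _ in range(6)]

probs, depth = early_exit_forward(rng.normal(0, 1, 4), layers, classifiers, 0.6)
print(f"exited after layer {depth}/6 with confidence {probs.max():.2f}")
```

Easy inputs exit shallow and hard inputs run deep, so average compute tracks input difficulty rather than worst-case depth.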

@fig-switch-transformer (Line 4724)

::: As @fig-switch-transformer illustrates, the Switch Transformer replaces the traditional feedforward layer with a Switching FFN Layer. For each token, a lightweight router selects a single expert from a pool of feedforward networks. The router outputs a probability distribution over available experts, and the highest-probability expert is activated per token. This design enables large models to scale parameter count without proportionally increasing inference cost. Gate-based conditional computation proves effective for multi-task and transfer learning settings, where inputs may benefit from specialized processing pathways. By enabling fine-grained control over model execution, such mechanisms allow for adaptive specialization across tasks while maintaining efficiency.
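A sketch of top-1 routing over a pool of experts (dimensions are illustrative, and the linear "experts" stand in for the full feedforward blocks a Switch Transformer uses):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

W_router = rng.normal(0, 1, (d_model, n_experts))
experts = [rng.normal(0, 0.1, (d_model, d_model)) for _ in range(n_experts)]
tokens = rng.normal(0, 1, (n_tokens, d_model))

probs = softmax(tokens @ W_router)   # router distribution over experts
chosen = probs.argmax(axis=-1)       # top-1 expert per token

# Each token runs through only its chosen expert, scaled by its gate value,
# so per-token compute stays constant as the expert pool (parameters) grows.
out = np.stack([probs[i, e] * (tokens[i] @ experts[e]) for i, e in enumerate(chosen)])
print("expert chosen per token:", chosen)
```

Adding experts grows total parameters linearly while each token still touches exactly one expert, which is the decoupling of capacity from inference cost described above.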

@fig-block-sparse-gemm (Line 4811)

Structured Patterns

Various sparsity formats have been developed, each with unique structural characteristics and implications. Two of the most prominent are block sparse matrices and N:M sparsity patterns. Block sparse matrices consist of isolated all-zero and dense non-zero submatrices, so a matrix operation on the large sparse matrix can be re-expressed as a smaller number of dense operations on submatrices, reducing overall arithmetic. This structure allows more efficient storage of the dense submatrices while maintaining shape compatibility for operations like matrix or vector products. For example, @fig-block-sparse-gemm shows how NVIDIA's cuSPARSE [@nvidia_cusparse_block] library supports sparse block matrix operations and storage. Several other works, such as Monarch matrices [@dao2022monarchexpressivestructuredmatrices], have extended this block-sparsity to strike an improved balance between matrix expressivity and compute/memory efficiency. ::: {#fig-block-sparse-gemm fig-env="figure" fig-pos="htb" fig-cap="Block Sparse Representation: NVIDIA's cusparse library efficiently stores block sparse matrices by exploiting dense submatrix structures, enabling accelerated matrix operations while maintaining compatibility with dense matrix computations through block indexing. This approach reduces memory footprint and arithmetic complexity for sparse linear algebra, important for scaling machine learning models. Source: NVIDIA." fig-alt="Grid of 3x3 matrix blocks with varying shades indicating dense submatrices. Adjacent index array shows non-zero block positions. Gray blocks represent zeros, colored blocks represent dense submatrices stored separately."}
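The storage idea can be sketched without any library: keep only the dense non-zero blocks plus an index of their positions, and compute a matrix-vector product from small dense block products (the layout and block size here are illustrative, not cuSPARSE's exact format):

```python
import numpy as np

BLOCK = 2
# (block_row, block_col) -> dense 2x2 submatrix; absent blocks are all-zero.
blocks = {
    (0, 0): np.array([[1.0, 2.0], [3.0, 4.0]]),
    (2, 1): np.eye(2),
}

def block_sparse_matvec(blocks, n_block_rows, x):
    """y = A @ x using only the stored dense blocks, skipping zero blocks."""
    y = np.zeros(n_block_rows * BLOCK)
    for (bi, bj), B in blocks.items():
        y[bi * BLOCK:(bi + 1) * BLOCK] += B @ x[bj * BLOCK:(bj + 1) * BLOCK]
    return y

y = block_sparse_matvec(blocks, 3, np.ones(6))
print(y)   # [3. 7. 0. 0. 1. 1.]
```

Only 2 of the 9 possible 2×2 blocks are stored or multiplied, which is the memory and arithmetic saving the figure illustrates.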

@fig-2-4-gemm (Line 5438)

::: Similarly, the N:M sparsity pattern is a structured sparsity format where, in every set of M consecutive elements (e.g., weights or activations), exactly N are non-zero and the remaining M - N are zero [@zhou2021learningnmfinegrainedstructured]. This deterministic pattern facilitates efficient hardware acceleration, as it allows for predictable memory access patterns and optimized computations. By enforcing this structure, models can achieve a balance between sparsity-induced efficiency gains and maintaining sufficient capacity for learning complex representations. @fig-2-4-gemm below shows a comparison between accelerating dense versus 2:4 sparsity matrix multiplication, a common sparsity pattern used in model training. Later works like STEP [@lu2023steplearningnmstructured] have examined learning more general N:M sparsity masks for accelerating deep learning inference under the same principles. ::: {#fig-2-4-gemm fig-env="figure" fig-pos="htb" fig-cap="2:4 Structured Sparsity GEMM. Left: standard dense matrix multiplication on Tensor Cores using full 8-element rows. Right: 2:4 sparse multiplication where each group of four elements retains only two non-zeros, with 2-bit indices selecting matching elements from the dense B matrix, halving compute. Source: PyTorch blog [@pytorch_sparsity_blog]." fig-alt="Side-by-side comparison of dense and 2:4 sparse GEMM on Tensor Cores. Left shows 8-element row multiplication. Right shows 4-element sparse row with 2-bit indices selecting matching elements from dense B matrix."}
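A magnitude-based sketch of enforcing the 2:4 pattern: in every group of four weights, keep the two largest magnitudes and zero the rest (one simple selection rule; learned masks as in STEP are more sophisticated):

```python
import numpy as np

def prune_2_4(w):
    """Zero the two smallest-magnitude entries in every group of four."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.array([0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01])
print(prune_2_4(w))   # exactly two non-zeros survive in each group of four
```

Because every group of four carries exactly two non-zeros, hardware can store just the survivors plus 2-bit position indices, which is what enables the halved compute in the figure.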

@fig-compression-methods (Line 6023)

Neural architecture search enables co-design approaches that optimize model structures specifically for quantization constraints, identifying architectures that maintain accuracy under low-precision operations. This co-design produces models inherently suited for subsequent optimization, improving the effectiveness of both quantization and pruning techniques. @fig-compression-methods compares how different compression strategies exhibit varying trade-offs between model size and accuracy loss. Pruning combined with quantization (red circles) achieves high compression ratios with minimal accuracy loss, while quantization alone (yellow squares) provides a reasonable balance. In contrast, SVD (green diamonds) requires a larger model size to maintain accuracy, illustrating how different techniques impact compression effectiveness. ::: {#fig-compression-methods fig-env="figure" fig-pos="htb" fig-cap="Combined Compression Effectiveness: Pruning combined with quantization (red circles) achieves the highest compression ratio at near-zero accuracy loss, followed by pruning alone and quantization alone, while SVD (green diamonds) requires the largest model size to maintain accuracy. Source: [@han2015deep]." fig-alt="Line graph of accuracy loss versus model size ratio. Four curves show pruning plus quantization achieving smallest size at near-zero loss, followed by pruning only, quantization only, and SVD requiring largest size to maintain accuracy."}

@fig-automl-comparison (Line 6231)

Automated Machine Learning (AutoML) aims to streamline this process by automating the search for optimal model configurations, building on the training methodologies from @sec-ai-training. AutoML frameworks leverage machine learning algorithms to optimize architectures, hyperparameters, model compression techniques, and other important parameters, reducing the need for human intervention [@Hutter2019]. By systematically exploring the vast design space of possible models, AutoML can improve efficiency while maintaining competitive accuracy, often discovering novel solutions that may be overlooked through manual tuning [@zoph2016neural]. AutoML does not replace human expertise but enhances it by providing a structured and scalable approach to model optimization. Rather than manually adjusting pruning thresholds, quantization strategies, or architecture designs, practitioners define high-level objectives (latency constraints, memory limits, accuracy targets) and allow AutoML systems to explore configurations that best satisfy these constraints [@Feurer2015]. As illustrated in @fig-automl-comparison, the key difference between traditional workflows and AutoML is that preprocessing, training, and evaluation are automated in the latter. ::: {#fig-automl-comparison fig-env="figure" fig-pos="htb" fig-cap="Traditional vs. AutoML Workflows. Left: a traditional ML cycle with five manual steps (data collection, preprocessing, training, evaluation, deployment). Right: an AutoML cycle where preprocessing, training, and evaluation are consolidated into a single automated node, reducing manual effort to problem definition and deployment." fig-alt="Two circular workflow diagrams side by side. Left shows traditional ML with five manual steps. Right shows AutoML with three steps where preprocessing, training, and evaluation are automated in a single AutoML node."}

@fig-color-mapping (Line 6411)

Beyond APIs and hardware-specific libraries, visualization tools help practitioners understand how pruning, quantization, and other optimizations affect model behavior. Quantization visualization: Error histograms reveal whether quantization errors are Gaussian or contain problematic outliers. Activation visualizations help detect overflow and saturation issues. @fig-color-mapping shows color-mapped AlexNet kernels. TensorFlow's Quantization Debugger, PyTorch's FX Graph Mode, and TensorRT Inspector provide these capabilities. Convolutional Kernel Weights: Color mapping reveals learned feature patterns in convolutional filters. Analyzing weight distributions helps diagnose issues like dead or saturated filters. Source: [@alexnet2012].{#fig-color-mapping width=70% fig-alt="Grid of 96 small color images showing AlexNet first-layer convolutional kernels. Patterns include oriented edges, color blobs, and Gabor-like filters learned from ImageNet training."}

@fig-sparse-heat-map (Line 6415)

Sparsity visualization: Heat maps show sparsity distribution across layers (@fig-sparse-heat-map). Darker regions indicate higher sparsity. Trend plots track sparsity progression across pruning iterations. TensorBoard, Netron, and SparseML provide these tools. Sparsity Distribution: Darker shades indicate higher sparsity where more weights were removed. Source: Numenta [@numenta_sparsity]{#fig-sparse-heat-map fig-alt="Heatmap visualization of a pruned neural network with weight matrix blocks. Darker regions indicate higher sparsity where more weights have been removed. Lighter regions show retained weights."}


responsible_engr/responsible_engr.qmd

@fig-fairness-frontier (Line 208)

@fig-governance-layers (Line 296)

The timing of responsibility interventions determines their effectiveness. An ethics review conducted before deployment can identify problems but faces limited remediation options. If the model has already been trained without fairness constraints, if the architecture cannot support interpretability requirements, if the data pipeline lacks demographic attributes for monitoring, the ethics review can only recommend rejection or acceptance of the existing system. Engineering involvement from project inception enables proactive design rather than reactive assessment. This engineering-centered approach does not diminish the importance of diverse perspectives in identifying potential harms. Product managers, user researchers, affected communities, and policy experts contribute essential knowledge about how systems fail socially despite technical success. Engineers translate these concerns into measurable requirements and testable properties that can be verified throughout the development lifecycle. Effective responsibility requires engineers who both listen to stakeholder concerns and possess the technical capability to implement appropriate safeguards. @fig-governance-layers illustrates how engineering practices nest within broader organizational, industry, and regulatory governance structures. Responsible AI Governance Layers. Nested governance structures surround engineering practice. At the center, engineering teams implement technical safeguards. Successive layers represent organizational safety culture, industry certification and external review, and government regulation. Technical excellence at the center enables compliance with requirements flowing inward from outer layers.{#fig-governance-layers fig-alt="Nested oval diagram showing governance layers from innermost to outermost: Team (reliable systems, software engineering), Organization (safety culture, organizational design), Industry (trustworthy certification, external reviews), and Government Regulation."}

@fig-fairness-threshold (Line 595)

: Fairness Metrics Summary: Comparison of fairness metrics across demographic groups reveals substantial disparities in how the model treats qualified applicants from each group. {#tbl-fairness-metrics-summary} @fig-fairness-threshold visualizes why aggregate metrics hide these disparities. When a single threshold is applied to populations with different score distributions, the same decision boundary produces vastly different outcomes for each group [@barocas2016big]. Threshold Effects on Subgroup Outcomes. A single classification threshold (vertical lines) applied to two subgroups with different score distributions produces disparate outcomes. Circles represent positive outcomes (loan repayment), crosses represent negative outcomes (default). The 75% threshold approves most of Subgroup A but rejects most of Subgroup B, even when qualified individuals exist in both groups. The 81.25% threshold shows how threshold adjustment changes the fairness-accuracy tradeoff. This visualization explains why aggregate accuracy can mask severe subgroup disparities.{#fig-fairness-threshold fig-alt="Diagram showing two subgroups A and B with different score distributions. Vertical threshold lines at 75% and 81.25% show how the same threshold produces different approval rates for each group."}
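The threshold effect is easy to reproduce with two synthetic score distributions (the means and threshold are chosen to mirror the figure, not drawn from real lending data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two subgroups with shifted score distributions; qualified people exist in both.
scores_a = rng.normal(80, 6, 1000)
scores_b = rng.normal(72, 6, 1000)

threshold = 75.0
rate_a = float(np.mean(scores_a >= threshold))
rate_b = float(np.mean(scores_b >= threshold))
print(f"approval rate A: {rate_a:.0%}   approval rate B: {rate_b:.0%}")
print(f"selection-rate ratio B/A: {rate_b / rate_a:.2f}")  # < 0.8 flags disparate impact
```

Overall accuracy can look fine here while the selection-rate ratio falls well below the common four-fifths (80%) heuristic, which is exactly how aggregate metrics mask subgroup disparities.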

@fig-interpretability-spectrum (Line 660)

: Explainability Requirements by Domain: Different applications require different levels of decision transparency. Credit and medical applications face regulatory requirements for individual explanations. Fraud detection may intentionally limit explainability to prevent gaming. The engineering challenge is matching explainability mechanisms to domain requirements. {#tbl-explainability-requirements} Engineering teams should select explainability approaches based on these domain requirements. Post-hoc explanation methods (LIME, SHAP) generate feature importance scores for individual predictions without requiring model architecture changes. Inherently interpretable models (linear models, decision trees, attention mechanisms) provide explanations as part of their structure but may sacrifice predictive performance. Concept-based explanations map model behavior to human-understandable concepts rather than raw features. The choice involves tradeoffs between explanation fidelity, computational cost, and model flexibility. @fig-interpretability-spectrum illustrates this spectrum of model interpretability.

@fig-data-governance-pillars (Line 1031)

::: Our Lighthouse KWS system illustrates how the privacy and consent concerns raised during acquisition intensify at the governance level: always-listening devices continuously process audio in users' homes, feature stores maintain voice pattern histories across millions of users, and edge storage caches models derived from population-wide training data. These capabilities create governance obligations around consent management, data minimization, access auditing, and deletion rights. @fig-data-governance-pillars maps these interconnected challenges across the ML lifecycle. ::: {#fig-data-governance-pillars fig-env="figure" fig-pos="htb" fig-cap="Data Governance Pillars: Robust data governance establishes ethical and reliable machine learning systems by prioritizing privacy, fairness, transparency, and accountability throughout the data lifecycle. These interconnected pillars address unique challenges in ML workflows, ensuring responsible data usage and auditable decision-making processes." fig-alt="Central stacked database icon surrounded by four governance elements: privacy shield, security lock, compliance checklist, and transparency document. Gear icons show interconnections between all elements."}

@fig-data-card (Line 1296)

Compliance requirements transform from legal obligations into system architecture constraints that shape pipeline design, storage choices, and operational procedures. GDPR's data minimization principle requires limiting collection and retention to what is necessary for stated purposes. For KWS systems, this means justifying why voice samples need retention beyond training, documenting retention periods in system design documents, and implementing automated deletion once periods expire. The "right to access" requires systems to retrieve all data associated with a user, consolidating results from distributed storage systems. Voice assistants operating globally face complex compliance landscapes because regulatory requirements vary by jurisdiction and apply differently based on user age and data sensitivity. European requirements for cross-border data transfer restrict storing EU users' voice data on servers outside designated countries unless specific safeguards exist, driving architectural decisions about regional data lakes, feature store replication strategies, and processing localization. Standardized documentation frameworks like data cards [@pushkarna2022data] translate these compliance requirements into operational artifacts. @fig-data-card demonstrates the structured format that makes compliance executable. Training pipelines check that input datasets have valid data cards before processing, and serving systems enforce that only models trained on compliant data can deploy to production. ::: {#fig-data-card fig-env="figure" fig-pos="t!" fig-cap="Data Governance Documentation: Data cards standardize critical dataset information, enabling transparency and accountability required for regulatory compliance with laws like GDPR and HIPAA. By providing a structured overview of dataset characteristics, intended uses, and potential risks, data cards facilitate responsible AI practices and support data subject rights." 
fig-alt="Sample data card template showing structured fields: dataset name and description at top, authorship and funding details in middle sections, and intended uses with potential risks at bottom."}


serving/serving.qmd

@fig-tail-latency-explosion (Line 82)

::: As shown in @fig-tail-latency-explosion, these physical constraints explain why production systems must run at relatively low utilization (40-60%) to keep tail latency (p99) stable.

@fig-intelligence-deflation (Line 101)

Beyond the technical limits of latency, the economics of serving have undergone a radical transformation. As models become more efficient and hardware becomes more specialized, the cost of "intelligence" is collapsing. This trend is visualized in @fig-intelligence-deflation, which tracks the plummeting price of token generation across model generations.

@fig-serving-inference-pipeline (Line 128)

::: Serving systems must execute a complete inference pipeline under latency constraints, not just the neural network computation. @fig-serving-inference-pipeline illustrates this pipeline: raw inputs flow through preprocessing (traditional computing), neural network inference (deep learning), and postprocessing (traditional computing) before producing final outputs. Each stage contributes to total latency, and bottlenecks can occur anywhere in the pipeline. ::: {#fig-serving-inference-pipeline fig-env="figure" fig-pos="htb" fig-cap="The Inference Pipeline: ML serving systems transform raw inputs into final outputs through sequential stages: preprocessing, neural network computation, and postprocessing. The neural network represents just one component; preprocessing and postprocessing rely on traditional computing and often dominate total latency in optimized systems." fig-alt="Flow diagram showing six connected boxes: Raw Input, Preprocessing, Neural Network, Raw Output, Postprocessing, Final Output. Preprocessing and postprocessing are labeled Traditional Computing; neural network is labeled Deep Learning."}

@fig-server-anatomy (Line 376)

The internal anatomy of these servers reveals how they bridge the gap between irregular user traffic and the highly regular, batch-oriented requirements of accelerators. The Request Pipeline. Every request traverses a multi-stage pipeline designed to maximize hardware throughput while minimizing latency overhead. @fig-server-anatomy visualizes this internal flow. ::: {#fig-server-anatomy fig-env="figure" fig-pos="htb" fig-cap="Inference Server Anatomy: A modern inference server decouples network handling from accelerator execution through a staged pipeline. Each stage isolates a concern, from absorbing bursty traffic to forming efficient batches, so the hardware accelerator stays highly utilized despite irregular arrival patterns." fig-alt="Flowchart showing 6-stage inference server pipeline: Client to Network Ingress to Request Queue (cylinder) to Dynamic Batcher, then down to Inference Runner to Accelerator. Arrows connect stages sequentially."}
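The dynamic-batching stage in the pipeline above can be sketched as a simple queue drain. This is a toy model with hypothetical names; production servers (e.g., Triton) add timeout-based flushing and per-model batch-size limits:

```python
# Toy dynamic batcher: drain the request queue into batches of at most
# max_batch_size so the accelerator can run them together.
from collections import deque

def form_batches(pending, max_batch_size=8):
    """Group pending requests into accelerator-sized batches."""
    batches = []
    while pending:
        take = min(max_batch_size, len(pending))
        batches.append([pending.popleft() for _ in range(take)])
    return batches

pending = deque(range(19))           # 19 queued requests
batches = form_batches(pending, max_batch_size=8)
print([len(b) for b in batches])     # → [8, 8, 3]
```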

@fig-serving-pipeline-timing (Line 688)

In a serialized serving system, the hardware sits idle during network I/O and CPU-based preprocessing. High-performance serving systems use Request Pipelining to overlap these stages, ensuring the GPU is fed a continuous stream of tensors. Overlapping I/O and Compute. @fig-serving-pipeline-timing contrasts serial execution with pipelined execution. In the serial case (A), each request must complete its entire lifecycle (Network $\rightarrow$ CPU Preprocessing $\rightarrow$ GPU Inference $\rightarrow$ Postprocessing) before the next request begins. Even with a fast GPU, the system throughput is limited by the slowest stage, and the GPU remains idle for more than 50% of the time. ::: {#fig-serving-pipeline-timing fig-env="figure" fig-pos="htb" fig-cap="Request Pipelining: Pipelining hides latency by overlapping independent operations across different hardware resources. In pipelined execution (B), the CPU processes the next request's data while the GPU executes the current request's inference. This increases the GPU duty cycle toward 100%, effectively doubling or tripling throughput on the same hardware without changing the model." fig-alt="Two timing diagrams. A (Serial): alternating CPU preprocessing, GPU inference, and idle blocks in sequence. B (Pipelined): two parallel rows where CPU preprocessing overlaps with GPU inference, eliminating idle time."}
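The serial-versus-pipelined contrast can be checked with back-of-envelope arithmetic. The stage times below are illustrative assumptions, not measurements from the figure:

```python
# Serial: every request runs all stages end to end before the next starts.
# Pipelined: in steady state, a new request completes every bottleneck period.
stages_ms = {"cpu_preprocess": 5.0, "gpu_inference": 4.0, "postprocess": 1.0}
n_requests = 100

serial_ms = n_requests * sum(stages_ms.values())
bottleneck = max(stages_ms.values())                 # slowest stage governs
pipelined_ms = sum(stages_ms.values()) + (n_requests - 1) * bottleneck

print(f"serial: {serial_ms:.0f} ms, pipelined: {pipelined_ms:.0f} ms")
```

With these numbers the pipeline roughly halves total time, and the GPU's duty cycle rises because it no longer waits out the CPU stages.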

@fig-throughput-latency-knee (Line 1230)

::: The following plot (@fig-throughput-latency-knee) visualizes this trade-off, showing the "Knee" of the curve. This is the optimal operating point where throughput is maximized before latency spikes due to queuing.


training/training.qmd

@fig-communication-tax (Line 121)

::: The challenge of maintaining high utilization at scale is illustrated in @fig-communication-tax. While compute-bound workloads scale efficiently, bandwidth-bound workloads like LLM training suffer from synchronization overhead that grows with cluster size, creating a "tax" that degrades effective throughput.
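The "communication tax" can be approximated with the standard ring all-reduce cost model, where synchronizing gradients costs 2(N-1)/N times the gradient size divided by link bandwidth. The compute time, gradient size, and bandwidth below are illustrative assumptions, not figures from the text:

```python
# Toy weak-scaling model: per-step time = compute + ring all-reduce.
# The all-reduce term approaches a bandwidth-bound constant as N grows,
# so efficiency degrades even though per-GPU compute stays fixed.
def step_time_s(n_gpus, compute_s=0.10, grad_gb=6.0, bw_gbps=50.0):
    allreduce_s = 2 * (n_gpus - 1) / n_gpus * grad_gb / bw_gbps
    return compute_s + allreduce_s

for n in (1, 8, 64):
    eff = step_time_s(1) / step_time_s(n)   # per-GPU scaling efficiency
    print(f"{n:3d} GPUs: step {step_time_s(n):.3f} s, efficiency {eff:.0%}")
```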

@fig-activation-perf (Line 433)

Benchmarking Activation Functions

The selection of an activation function directly influences training throughput and hardware efficiency. @fig-activation-perf quantifies these performance differences through CPU benchmarks on Apple M2 hardware, revealing that Tanh executes in 0.61 seconds compared to Sigmoid's 1.10 seconds, a 1.8$\times$ speedup. ::: {#fig-activation-perf fig-env="figure" fig-pos="htb" fig-cap="Activation Function Execution Time: CPU benchmarks on Apple M2 hardware reveal significant variation: Tanh completes in 0.61 seconds, ReLU in 0.78 seconds, Softmax in 0.91 seconds, and Sigmoid in 1.10 seconds. These differences directly affect training throughput and real-time inference latency, making activation function selection a system-level design decision." fig-alt="Bar chart comparing CPU execution times: Sigmoid at 1.1 seconds, Tanh at 0.61 seconds, ReLU at 0.78 seconds, and Softmax at 0.91 seconds."}
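The benchmark idea behind the figure can be re-run in a few lines. Absolute times in the figure are from an Apple M2 and will differ on your machine; this sketch only shows how such numbers are obtained:

```python
# Minimal activation-function timing harness over a large FP32 array.
import time
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

def bench(fn, reps=20):
    start = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return time.perf_counter() - start

timings = {
    "relu":    bench(lambda v: np.maximum(v, 0.0)),
    "tanh":    bench(np.tanh),
    "sigmoid": bench(lambda v: 1.0 / (1.0 + np.exp(-v))),
}
for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {t:.3f} s")
```

Transcendental functions (tanh, exp) cost more per element than the comparison inside ReLU, which is the systems-level reason activation choice affects throughput.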

@fig-training-roofline (Line 1092)

: Training Operation Classifications. Different operations in the training pipeline have vastly different arithmetic intensities, determining whether they are limited by compute throughput or memory bandwidth. {#tbl-training-arithmetic-intensity} @fig-training-roofline visualizes these relationships on a roofline diagram. Operations to the left of the ridge point (the "knee" where the sloped memory-bound region meets the flat compute-bound region) are limited by memory bandwidth; operations to the right are limited by compute throughput. The figure shows how GPT-2 training operations distribute across this landscape. ::: {#fig-training-roofline fig-env="figure" fig-pos="htb" fig-cap="Training Roofline Model: GPT-2 training operations mapped against arithmetic intensity on a log-log roofline diagram. Matrix multiplications operate in the compute-bound regime (right of the ridge point), while normalization and activation operations fall in the memory-bound region (left). FlashAttention shifts standard attention from below to above the ridge point, demonstrating how algorithmic redesign can move operations into a more efficient regime." fig-alt="Log-log plot showing roofline model with memory-bound slope and compute-bound ceiling. Points show different training operations: MatMul above ridge point, LayerNorm and Softmax below. Arrow shows FlashAttention improvement."}
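The arithmetic-intensity axis of the roofline can be computed directly. The sizes below are illustrative, assuming FP16 (2 bytes per element):

```python
# Arithmetic intensity = FLOPs / bytes moved. Matmul reuses each operand
# many times; elementwise ops touch each element once, so they sit far
# left of the ridge point on the roofline.
def matmul_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                               # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    return flops_per_elem / (2 * bytes_per_elem)        # one read, one write

print(f"matmul 4096^3: {matmul_intensity(4096, 4096, 4096):.0f} FLOPs/byte")
print(f"elementwise:   {elementwise_intensity():.2f} FLOPs/byte")
```

A large square matmul lands above 1000 FLOPs/byte (compute-bound on any modern accelerator), while an elementwise activation sits below 1 FLOP/byte (memory-bound), matching the classifications in the table and figure.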

@fig-training-roofline (Line 1178)

This analysis guides optimization strategy selection. For memory-bound operations, reducing data movement through operator fusion, reduced precision, or algorithmic improvements like FlashAttention provides the largest gains. For compute-bound operations, increasing throughput through Tensor Cores, parallelism, or quantization matters more. See @sec-ai-acceleration for detailed roofline model analysis and hardware-specific optimization strategies. @fig-training-roofline shows standard attention in the memory-bound region while FlashAttention appears in the compute-bound region; this shift represents the core insight of IO-aware algorithm design. By never materializing the full $N \times N$ attention matrix and instead processing in tiles that fit in fast SRAM, FlashAttention reduces memory traffic from $O(N^2)$ to $O(N)$, achieving 2-4× speedups [@dao2022flashattention]. We examine the algorithm, its implementation, and when to use it in detail in @sec-ai-training-flash-attention-ioaware-attention-optimization-3da0. The arithmetic intensity analysis above reveals which operations constrain training performance and why: matrix multiplications are compute-bound while normalization and activation functions are memory-bound, each requiring different optimization strategies. FlashAttention exemplifies how understanding these bottlenecks enables algorithmic solutions that shift operations from one regime to another. But optimizing individual operations is insufficient. Training systems must orchestrate data loading, computation, and parameter updates as a unified pipeline, and the architecture of this pipeline determines whether optimizations like FlashAttention translate into actual throughput gains.
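The memory-traffic reduction can be sanity-checked with simplified byte counting. This sketch counts only the dominant terms (the score-matrix traffic for standard attention; one pass over Q, K, V, and the output for the IO-aware version) and ignores constant factors:

```python
# Simplified HBM traffic comparison at sequence length N, head dim d, FP16.
def standard_scores_bytes(N, bytes_per_elem=2):
    # Write scores + read for softmax + write probs + read probs for PV.
    return 4 * N * N * bytes_per_elem

def flash_attention_bytes(N, d, bytes_per_elem=2):
    # Q, K, V read once and output written once; tiles stay in SRAM.
    return 4 * N * d * bytes_per_elem

N, d = 8192, 64
ratio = standard_scores_bytes(N) / flash_attention_bytes(N, d)
print(f"HBM traffic ratio at N={N}: {ratio:.0f}x")
```

Under this counting the ratio is simply N/d, so the advantage grows linearly with sequence length, which is why the technique matters most for long contexts.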

@fig-training-pipeline (Line 1186)

The mathematical operations examined above define what training systems must compute—for GPT-2, approximately 10 trillion FLOPs per training step distributed across attention, feedforward, and normalization operations. @sec-ai-frameworks introduced how frameworks like PyTorch and TensorFlow provide APIs for defining models and executing forward passes; here we examine the system-level orchestration that makes those API calls efficient. Pipeline architecture determines how to coordinate these computations across real hardware with finite memory and bandwidth constraints, managing data loading, preprocessing, GPU transfers, and parameter updates as a unified system rather than isolated operations. @fig-training-pipeline maps the complete training pipeline architecture, showing how three main components interconnect: the data pipeline for ingestion and preprocessing, the training loop that handles model updates, and the evaluation pipeline for assessing performance. Processed batches flow from the data pipeline to the training loop, and evaluation metrics provide feedback to guide the training process. ::: {#fig-training-pipeline fig-env="figure" fig-pos="htb" fig-cap="Training System Overview: Machine learning systems organize training through interconnected data, training, and evaluation pipelines. Data flows sequentially through these components, with evaluation metrics providing feedback to guide iterative model refinement and ensure reproducible results." fig-alt="Block diagram with three connected boxes: Data Pipeline, Training Loop, and Evaluation Pipeline. Arrows show data flow with feedback from evaluation."}

@fig-training-pipeline (Line 1228)

Architectural Overview

As @fig-training-pipeline illustrates, the system-level orchestration described above organizes into three interconnected components. The data pipeline ingests raw data and transforms it into a format suitable for the model. This data passes to the training loop, where the model performs its core computations. Periodically, the evaluation pipeline assesses performance using a separate validation dataset. This modular organization enables efficient resource utilization and clear separation of concerns. Data Pipeline.

@fig-training-loop (Line 1234)

Training Loop. The training loop is the computational core of the pipeline, where the model learns from the prepared data. @fig-training-loop illustrates how this process unfolds through three sequential steps on a single GPU: the forward pass generates predictions from input data, gradient computation propagates error signals backward through the network, and parameter updates apply the optimizer to minimize the loss function. ::: {#fig-training-loop fig-env="figure" fig-pos="htb" fig-cap="Single-GPU Training Loop: The three sequential steps of one training iteration: the forward pass generates predictions, gradient computation propagates error signals backward, and the optimizer applies parameter updates. GPUs parallelize the underlying matrix operations, accelerating both the forward and backward passes." fig-alt="Neural network diagram showing data cylinders feeding into a network of connected nodes. A GPU box at bottom processes the forward and backward pass computations."}
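The three steps of the loop can be made concrete with a minimal single-device example. This is a pure-NumPy linear model, not the book's GPU implementation, chosen so each step is visible:

```python
# Minimal training loop: (1) forward pass, (2) gradient computation,
# (3) optimizer update — repeated until the loss is minimized.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(4)
lr = 0.1
for step in range(200):
    pred = X @ w                           # 1. forward pass
    grad = 2 * X.T @ (pred - y) / len(y)   # 2. gradient of the MSE loss
    w -= lr * grad                         # 3. SGD parameter update

print(np.round(w, 2))
```

On a GPU the same three steps run as batched matrix operations; the structure of the loop is unchanged.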

@fig-data-pipeline (Line 1451)

::: The data pipeline running on the CPU bridges raw data storage and GPU computation. @fig-data-pipeline breaks down this architecture into three distinct zones: the storage zone houses raw data on disk, the CPU preprocessing zone handles format conversion, processing, and batching, and the GPU training zone distributes preprocessed batches across multiple accelerators for parallel computation. In the storage zone, raw data resides on disk, typically in formats like image files for computer vision tasks or text files for natural language processing. The CPU preprocessing zone handles the transformation of this raw data through multiple stages. For example, in an image recognition model, these stages include:

@fig-galore-llm-memory-breakdown (Line 1798)

The memory scaling analysis from @sec-ai-training-optimization-algorithm-system-implications-f9f2 (where SGD requires $1\times$, momentum requires $2\times$, and Adam requires $3\times$ the parameter memory) manifests concretely during each training iteration. Each parameter update involves reading current values, accessing gradients, computing the update rule, and writing modified parameters back to memory. For Adam, this includes updating and accessing the momentum and variance buffers, creating substantial memory traffic for large models. At billion-parameter scale, optimizer state dominates the memory budget. As quantified in the GPT-2 worked example (@sec-ai-training-optimization-algorithm-system-implications-f9f2), a 1.5B parameter model requires 24 GB for optimizer state alone in FP32, before accounting for activations. This challenge has motivated memory-efficient optimizer variants. @fig-galore-llm-memory-breakdown demonstrates how GaLoRE addresses this constraint: by computing updates in a compressed space [@zhao2024galorememoryefficientllmtraining], the technique reduces the memory footprint dominated by optimizer states to a fraction of its original size, enabling training of larger models on fixed hardware. ::: {#fig-galore-llm-memory-breakdown fig-env="figure" fig-pos="htb" fig-cap="Memory Footprint Breakdown: Memory usage of LLaMA-7B across four optimizer configurations, decomposed into weights, activations, optimizer state, weight gradients, and other components. The dashed red line marks the RTX 4090 24 GB memory limit, illustrating how standard Adam exceeds single-GPU capacity while GaLoRE compression reduces optimizer state enough to fit within this budget." fig-alt="Stacked horizontal bar chart comparing memory usage across four optimizers for LLaMA-7B. Shows components: others, weight gradient, optimization, activation, and weight. Dashed red line marks RTX 4090 memory limit at 24 GB."}
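The 24 GB figure quoted above can be reproduced with simple arithmetic. One consistent accounting (an assumption about what the worked example includes) is four full FP32 copies of the parameters: weights, gradients, and Adam's momentum and variance buffers:

```python
# FP32 stores 4 bytes per value; Adam training keeps four parameter-sized
# buffers resident (weights, gradients, momentum, variance).
def adam_training_state_gb(n_params, bytes_per_val=4, copies=4):
    return n_params * bytes_per_val * copies / 1e9

print(f"{adam_training_state_gb(1.5e9):.0f} GB")  # → 24 GB for GPT-2 1.5B
```

Dropping to SGD removes two of those copies, which is the $3\times$ versus $1\times$ scaling in optimizer-state terms.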

@fig-linear-scaling-failure (Line 1872)

A larger batch size provides a more accurate estimate of the true gradient, allowing for larger learning steps. However, simply increasing the batch size without adjusting the learning rate leads to the "Linear Scaling Failure". If you double the batch size, you perform half as many updates per epoch. If the learning rate remains constant, the model effectively travels "half the distance" in weight space, causing underfitting. The following plot (@fig-linear-scaling-failure) visualizes this Generalization Gap and how the Linear Scaling Rule ($\text{LR}_{new} = k \times \text{LR}_{base}$) corrects it.
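The rule itself is a one-liner: when the batch size grows by a factor of k, scale the learning rate by k (in practice usually combined with a warmup schedule, which this sketch omits):

```python
# Linear Scaling Rule: LR_new = k * LR_base, where k = new_batch / base_batch.
def scaled_lr(base_lr, base_batch, new_batch):
    k = new_batch / base_batch
    return base_lr * k

print(scaled_lr(0.1, 256, 1024))  # 4x batch → 4x learning rate
```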

@fig-tf-bottleneck-trace (Line 2006)

Profiling to Identify Bottlenecks

Profiling tools reveal which bottleneck dominates your workload. @fig-tf-bottleneck-trace captures a data-bound pathology through TensorFlow's profiler: the gaps in GPU activity (white regions between compute blocks) reveal that the device frequently waits for input data, with utilization dropping to zero during data loading phases. ::: {#fig-tf-bottleneck-trace fig-env="figure" fig-pos="htb" fig-cap="Data-Bound Profiler Trace: TensorFlow profiler output capturing a data loading bottleneck during training. The gaps in GPU activity (white regions between compute blocks) indicate periods where the device idles while waiting for input data, with utilization dropping to zero during data loading phases." fig-alt="TensorFlow profiler screenshot showing GPU activity timeline. Colored blocks indicate computation periods with white gaps revealing idle time when GPU waits for data loading to complete."}

@fig-optimization-flowchart (Line 2055)

This systematic framework—profile, select, compose—applies three core optimization techniques to the primary bottleneck categories. Prefetching and overlapping targets data movement latency by coordinating data transfer with computation. Mixed-precision training addresses both computational throughput and memory constraints through reduced precision arithmetic. Gradient accumulation and checkpointing manages memory constraints by trading computation for memory usage. These techniques are not mutually exclusive; effective optimization often combines multiple approaches to achieve cumulative benefits. In practice, high-impact, low-complexity optimizations like data prefetching should be implemented first, while complex optimizations such as gradient checkpointing require cost-benefit analysis that accounts for development effort and debugging complexity. @fig-optimization-flowchart provides a visual decision tree that operationalizes this systematic framework. Starting from profiling results, the flowchart guides practitioners through bottleneck identification to technique selection, ensuring optimization effort targets the actual constraint rather than perceived issues. ::: {#fig-optimization-flowchart fig-env="figure" fig-pos="htb" fig-cap="Training Optimization Decision Flowchart: Systematic approach to optimization selection based on profiling results. Begin by measuring GPU utilization, then follow the decision path to identify whether the bottleneck is data-bound, memory-bound, or compute-bound. Each path leads to specific techniques that address the identified constraint." fig-alt="Flowchart showing optimization decision tree starting from Profile Training Run, branching based on GPU utilization and memory pressure to different optimization techniques."}
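The flowchart's branching logic can be caricatured as a tiny dispatch function. The thresholds below are illustrative assumptions, not values from the text; real decisions come from reading the profiler trace:

```python
# Toy version of the profile-then-select decision tree.
def choose_optimization(gpu_utilization, memory_pressure):
    if gpu_utilization < 0.6:
        return "data-bound: add prefetching and overlapping"
    if memory_pressure:
        return "memory-bound: gradient accumulation / checkpointing"
    return "compute-bound: mixed-precision training"

print(choose_optimization(0.35, False))
print(choose_optimization(0.95, True))
```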

@fig-fetching-naive (Line 2134)

Prefetching and overlapping techniques illustrate the systematic framework in action, targeting data movement latency bottlenecks by coordinating data transfer with computation. This optimization proves most effective when profiling reveals that computational units remain idle while waiting for data transfers to complete. Training machine learning models involves significant data movement between storage, memory, and computational units. The data pipeline consists of sequential transfers: from disk storage to CPU memory, CPU memory to GPU memory, and through the GPU processing units. @fig-fetching-naive exposes the inefficiency of sequential data transfer: the GPU remains idle during file operations (Open 1, Open 2), and training steps cannot begin until read operations complete, leaving expensive compute resources underutilized for significant portions of each epoch. ::: {#fig-fetching-naive fig-env="figure" fig-pos="htb" fig-cap="Sequential Data Fetching: File open, read, and train operations execute serially across two epochs, with the GPU remaining idle during all file operations. The full sequential pipeline spans approximately 90 seconds, establishing the baseline that overlapped prefetching improves upon." fig-alt="Gantt chart showing sequential data pipeline over two epochs. Four rows: Open, Read, Train, and Epoch. Operations execute serially with gaps between phases, spanning from 00:00 to 01:30."}

@fig-fetching-naive (Line 2231)

Prefetching addresses these inefficiencies by loading data into memory before its scheduled computation time. During the processing of the current batch, the system loads and prepares subsequent batches, maintaining a consistent supply of ready data [@tensorflow_data_2015]. Overlapping builds upon prefetching by coordinating multiple pipeline stages to execute concurrently. The system processes the current batch while simultaneously preparing future batches through data loading and preprocessing operations. Compare @fig-fetching-naive with @fig-fetching-optimized: the optimized pipeline completes two epochs in approximately 55 seconds compared to 90 seconds with sequential fetching, a 40% speedup achieved by overlapping read and train operations within each time slice. ::: {#fig-fetching-optimized fig-env="figure" fig-pos="htb" fig-cap="Overlapped Data Prefetching: Read and train operations execute concurrently, with each time slice overlapping data loading for the next batch with computation on the current batch. Two epochs complete in approximately 55 seconds compared to 90 seconds with sequential fetching, a 40% speedup." fig-alt="Gantt chart showing optimized pipeline with overlapping operations. Read and Train execute in parallel across time slices. Two epochs complete in approximately 55 seconds total."}
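The overlap described above can be sketched with a background loader thread feeding a bounded queue. This is a hand-rolled illustration with hypothetical names; real pipelines use `tf.data`'s `prefetch` or PyTorch `DataLoader` workers:

```python
# Prefetching sketch: a loader thread prepares batches while the main
# thread trains, so loading overlaps with computation.
import queue
import threading
import time

def loader(n_batches, out_q):
    for i in range(n_batches):
        time.sleep(0.01)             # simulate disk read + preprocessing
        out_q.put(f"batch-{i}")
    out_q.put(None)                  # sentinel: no more data

prefetch_q = queue.Queue(maxsize=2)  # bounded buffer of ready batches
threading.Thread(target=loader, args=(5, prefetch_q), daemon=True).start()

trained = []
while (batch := prefetch_q.get()) is not None:
    time.sleep(0.01)                 # simulate the training step
    trained.append(batch)            # loader keeps working meanwhile

print(trained)
```

Because loading and training each take about the same time here, the overlapped version finishes in roughly half the wall-clock time of a strictly serial loop, mirroring the speedup between the two figures.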

@fig-fetching-optimized (Line 2231)

Prefetching addresses these inefficiencies by loading data into memory before its scheduled computation time. During the processing of the current batch, the system loads and prepares subsequent batches, maintaining a consistent supply of ready data [@tensorflow_data_2015]. Overlapping builds upon prefetching by coordinating multiple pipeline stages to execute concurrently. The system processes the current batch while simultaneously preparing future batches through data loading and preprocessing operations. Compare @fig-fetching-naive with @fig-fetching-optimized: the optimized pipeline completes two epochs in approximately 55 seconds compared to 90 seconds with sequential fetching, a 40% speedup achieved by overlapping read and train operations within each time slice. ::: {#fig-fetching-optimized fig-env="figure" fig-pos="htb" fig-cap="Overlapped Data Prefetching: Read and train operations execute concurrently, with each time slice overlapping data loading for the next batch with computation on the current batch. Two epochs complete in approximately 55 seconds compared to 90 seconds with sequential fetching, a 40% speedup." fig-alt="Gantt chart showing optimized pipeline with overlapping operations. Read and Train execute in parallel across time slices. Two epochs complete in approximately 55 seconds total."}

@fig-mixed-precision (Line 2454)

The choice between formats depends on model characteristics. Models with gradient outliers, common in transformer architectures, generally benefit from BF16's wider dynamic range. Models with well-conditioned gradients may prefer FP16's greater mantissa precision. Regardless of the reduced-precision format chosen for forward and backward passes, certain operations require FP32 precision: loss accumulation, softmax denominators, normalization variance computation, and optimizer state. These requirements stem from the numerical sensitivity of these operations rather than arbitrary convention. @fig-mixed-precision traces the data flow through mixed-precision training's seven-step cycle: FP32 master weights (step 7) convert to FP16 for the forward pass (step 1), loss is scaled (step 2) before backpropagation (step 3), scaled FP16 gradients copy to FP32 (step 4), loss scaling is removed (step 5), and gradients update the FP32 master weights (step 6), completing the cycle that achieves 16x Tensor Core speedup while preserving numerical stability through strategic precision management. ::: {#fig-mixed-precision fig-env="figure" fig-pos="htb" fig-cap="Mixed Precision Training: The seven-step cycle: (1) FP32 master weights convert to FP16 for the forward pass, (2) loss is scaled to prevent gradient underflow, (3) backpropagation computes scaled FP16 gradients, (4) gradients copy to FP32, (5) loss scaling is removed, (6) FP32 gradients update master weights, and (7) the cycle repeats. This approach achieves Tensor Core speedups while preserving numerical stability." fig-alt="Flowchart showing 7-step mixed precision training cycle. FP32 master weights convert to FP16 for forward pass, loss scaling protects gradients during backpropagation, then gradients update FP32 weights."}
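Step 2 of the cycle, loss scaling, exists because small FP32 gradients underflow to zero when cast to FP16. A minimal NumPy demonstration (illustrative magnitudes; real training scales the loss before backpropagation rather than individual gradients):

```python
# Why loss scaling matters: a gradient below FP16's subnormal range
# vanishes on a direct cast but survives a scale-cast-unscale round trip.
import numpy as np

grad_fp32 = np.float32(1e-8)             # small but meaningful gradient

naive = np.float16(grad_fp32)            # direct cast: underflows to 0
scale = np.float32(1024.0)
scaled = np.float16(grad_fp32 * scale)   # scaled value is representable
recovered = np.float32(scaled) / scale   # unscale in FP32 (step 5)

print(naive, recovered)
```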

@fig-grad-accumulation (Line 3016)

Gradient Accumulation

Gradient accumulation simulates larger batch sizes by splitting a single effective batch into smaller "micro-batches." @fig-grad-accumulation illustrates this process: three independent batches (green, red, blue) each compute their own loss ($L_1$, $L_2$, $L_3$) and gradients ($\delta_1$, $\delta_2$, $\delta_3$), which then sum to produce the combined gradient $\delta_1+\delta_2+\delta_3$ used for a single parameter update. This approach achieves the same gradient as training with a batch three times larger, without requiring the memory to hold all samples simultaneously. ::: {#fig-grad-accumulation fig-env="figure" fig-pos="htb" fig-cap="Gradient Accumulation: Three micro-batches each compute independent losses and gradients, which sum into a single combined gradient for one parameter update. This simulates training with a batch three times larger without requiring the memory to hold all samples simultaneously." fig-alt="Block diagram showing three batches computing individual losses and gradients. Arrows flow from Batch 1, 2, 3 through Losses to Gradients boxes, then combine into a single summed gradient output."}
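The equivalence claim is easy to verify numerically. For a mean-squared-error loss, averaging the three micro-batch gradients reproduces the full-batch gradient exactly (a NumPy sketch, not the framework implementation):

```python
# Gradient accumulation: average of micro-batch gradients == full-batch
# gradient for an MSE loss over equal-sized micro-batches.
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(96, 3)), rng.normal(size=96)
w = rng.normal(size=3)

def grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)
accum = np.zeros(3)
for Xb, yb in zip(np.split(X, 3), np.split(y, 3)):  # three micro-batches
    accum += grad(Xb, yb, w) / 3                    # accumulate, then update once

print(np.allclose(full, accum))
```

Frameworks implement the same idea by deferring `optimizer.step()` until the accumulated micro-batches cover the effective batch.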

@fig-activation-checkpointing (Line 3121)

Activation checkpointing reduces memory usage during the backward pass by discarding and selectively recomputing activations. In standard training, activations from the forward pass are stored in memory for use in gradient computations during backpropagation. However, these activations can consume significant memory, particularly in deep networks. With checkpointing, only a subset of the activations is retained during the forward pass. @fig-activation-checkpointing visualizes this memory-compute tradeoff: during the forward pass (top row), only checkpoint nodes (green, solid) are retained while intermediate nodes (white, dashed) are discarded. During the backward pass (bottom row), these discarded activations are recomputed on demand (brown nodes) from the nearest checkpoint, trading approximately 33% additional compute for memory savings that can exceed 70% in deep networks. The implementation involves three steps. First, split the model into segments. Second, retain activations only at the boundaries of these segments during the forward pass. Third, recompute activations for intermediate layers during the backward pass when needed.
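The three steps above can be sketched directly. This is a toy forward pass with a stand-in layer function; real frameworks do this inside autograd (e.g., `torch.utils.checkpoint`):

```python
# Checkpointing sketch: retain activations only at segment boundaries,
# then recompute a segment's interior on demand during the backward pass.
import numpy as np

def layer(x, i):
    return np.tanh(x + i)            # stand-in for a real layer

def forward_with_checkpoints(x, n_layers=8, every=4):
    checkpoints = {0: x}             # step 1-2: split and retain boundaries
    for i in range(n_layers):
        x = layer(x, i)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = x   # interior activations are discarded
    return x, checkpoints

def recompute_segment(checkpoints, start, end):
    x = checkpoints[start]           # step 3: rebuild from nearest checkpoint
    acts = []
    for i in range(start, end):
        x = layer(x, i)
        acts.append(x)
    return acts

x0 = np.ones(4)
out, ckpts = forward_with_checkpoints(x0)
acts = recompute_segment(ckpts, 4, 8)
print(np.allclose(acts[-1], out))    # recomputation matches the forward pass
```

Here only 3 activations are retained instead of 8, at the cost of re-running half the forward pass during backpropagation, the compute-for-memory trade described above.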

@fig-evolution-systems (Line 3653)

The Evolution of Training Infrastructure

Computing system architectures have evolved through distinct generations, each building upon previous advances while introducing specialized optimizations for emerging application requirements (@fig-evolution-systems). This progression demonstrates how hardware adaptation to application needs shapes modern machine learning systems. ::: {#fig-evolution-systems fig-env="figure" fig-pos="htb" fig-cap="Computing System Evolution: Hardware advancements continuously adapted to the increasing demands of machine learning workloads, transitioning from centralized mainframes to specialized architectures optimized for parallel processing and massive datasets." fig-alt="Timeline spanning 1950s to 2020s showing evolution from mainframes through HPC and warehouse-scale computing to AI hypercomputing with GPUs and TPUs."}

@fig-train-data-parallelism (Line 3730)

@fig-model-parallelism (Line 3775)

::: Model Parallelism partitions the model itself across GPUs, which becomes necessary when the model exceeds single-GPU memory. AlexNet used a simple form: certain layers resided on GPU 1, others on GPU 2, with activations passing between them. @fig-model-parallelism shows this sequential flow: data moves through model partitions on different devices, with gradients flowing backward during training. ::: {#fig-model-parallelism fig-env="figure" fig-pos="htb" fig-cap="Model Parallelism: The model is partitioned across devices, with intermediate activations passing between them. This enables training models larger than single-GPU memory at the cost of sequential dependencies." fig-alt="Diagram showing input flowing through model parts on different devices, with forward pass going right and backward pass returning left."}

@fig-layers-blocks (Line 3805)

::: In practice, model parallelism typically partitions by layers. @fig-layers-blocks shows how a 24-layer transformer might be distributed: Device 1 handles blocks 1--6, Device 2 handles blocks 7--12, and so forth. This layer-wise partitioning minimizes cross-device communication to the boundaries between partitions. ::: {#fig-layers-blocks fig-env="figure" fig-pos="htb" fig-cap="Layer-wise Partitioning: A 24-layer transformer distributed across four devices, with each device responsible for six consecutive transformer blocks. Communication occurs only at partition boundaries." fig-alt="Diagram showing transformer blocks 1-6 on GPU 1, blocks 7-12 on GPU 2, blocks 13-18 on GPU 3, and blocks 19-24 on GPU 4."}
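The block-to-device assignment in the figure reduces to slicing a layer list into contiguous runs (a minimal sketch that assumes the layer count divides evenly across devices):

```python
# Contiguous layer partitioning: device d owns layers [d*per, (d+1)*per),
# so cross-device communication happens only at partition boundaries.
def partition_layers(n_layers, n_devices):
    per = n_layers // n_devices      # assumes even divisibility
    return {d: list(range(d * per, (d + 1) * per)) for d in range(n_devices)}

placement = partition_layers(24, 4)
print(placement[0])   # → [0, 1, 2, 3, 4, 5]
print(placement[3])   # → [18, 19, 20, 21, 22, 23]
```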


workflow/workflow.qmd

@fig-ml-lifecycle (Line 106)

Throughout this chapter, we use lifecycle to describe the stages themselves and workflow to describe the engineering discipline of orchestrating them; the lifecycle is what you traverse, the workflow is how you manage the traversal. @fig-ml-lifecycle visualizes two parallel pipelines that characterize the complete lifecycle. The data pipeline (green, top row) transforms raw inputs through collection, ingestion, analysis, labeling, validation, and preparation into ML-ready datasets. The model development pipeline (blue, bottom row) takes these datasets through training, evaluation, validation, and deployment to create production systems. Their interconnections reveal the distinctive character of ML development. The curved feedback arrows show how deployment insights trigger data refinements, creating continuous improvement cycles that distinguish ML from traditional linear development.

@fig-ds-time (Line 194)

Understanding the lifecycle conceptually is necessary but insufficient for engineering decisions. Quantitative characterization reveals where effort and compute actually go in ML projects, exposing which stages bottleneck development and where optimization investments yield the highest returns. Time allocation across stages follows a consistent pattern across industries. Data-related activities (collection, cleaning, labeling, validation, and preparation) consume 60-80% of total project time [@crowdflower2016data]. Model development and training, despite receiving the most research attention, typically represents only 10-20% of effort. The remaining 10-20% goes to deployment, integration, and initial monitoring setup. This distribution surprises teams accustomed to traditional software where implementation dominates. In ML projects, the "source code" is the data, and preparing that source code is the primary engineering activity. @fig-ds-time quantifies this breakdown, revealing that data cleaning alone accounts for 60% of practitioner effort. ::: {#fig-ds-time fig-env="figure" fig-pos="htb" fig-cap="Data Scientist Time Allocation: Data preparation consumes up to 60% of data science effort, with data collection accounting for an additional 19%. Model-focused activities such as pattern mining, training set construction, and algorithm refinement together represent roughly 18% of total time. Source: CrowdFlower 2016 Data Science Report." fig-alt="Pie chart showing data scientist time allocation: 60% cleaning and organizing data, 19% collecting datasets, 9% mining for patterns, 5% building training sets, 4% refining algorithms, 3% other tasks."}

@fig-ml-lifecycle (Line 243)

::: Iteration cycles characterize successful ML projects. @fig-ml-lifecycle shows the feedback loops that drive these iterations. Production-ready ML systems typically require 4-8 complete iteration cycles, where each cycle may revisit multiple stages. The distribution of iteration causes reveals where to invest in quality:

  • Data quality issues drive approximately 60% of iterations (missing labels, distribution mismatch, preprocessing errors)

@fig-lifecycle-overview (Line 343)

Where traditional software follows requirements through implementation to testing, ML systems require a fundamentally different organization that accommodates iterative experimentation, data-driven evolution, and continuous feedback. This section presents the six-stage framework that captures these differences. @fig-lifecycle-overview presents a simplified view of the six stages that structure this approach. Problem Definition establishes objectives and constraints. Data Collection and Preparation encompasses the data pipeline. Model Development and Training creates models. Evaluation and Validation ensures quality. Deployment and Integration brings systems to production. Monitoring and Maintenance ensures continued effectiveness. The prominent feedback loop emphasizes that insights from later stages inform earlier phases, capturing the cyclical nature that distinguishes ML from linear software development. To make these stages concrete, consider how they apply to MobileNetV2 (@sec-dnn-architectures), one of our Lighthouse Examples targeting mobile deployment. A DR screening model optimized for rural clinic deployment would face similar constraints: limited device memory, strict power budgets, and the need for real-time inference without reliable connectivity. The Lighthouse Examples illustrate these distinct workload pressures that any real system encounters as it moves from training to deployment. For MobileNetV2 specifically, Problem Definition establishes the constraint: <14 MB model size, <300 MFLOPs, real-time inference on mobile GPUs. Data Collection must account for on-device preprocessing limitations. Model Development uses depthwise separable convolutions[^fn-depthwise-sep] specifically designed to meet the FLOP budget. Evaluation validates not just accuracy but latency on target devices. Deployment targets mobile NPUs with quantization. Monitoring tracks performance across diverse device populations. 
Each stage's decisions propagate through subsequent stages, and the workflow framework makes these dependencies explicit.
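
The MFLOP budget above is met largely by replacing standard convolutions with depthwise separable ones. A minimal sketch of the arithmetic, using an illustrative layer shape (not MobileNetV2's published configuration):

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard KxK convolution over an HxW feature map."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    """Depthwise KxK convolution followed by a 1x1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel
std = conv_flops(56, 56, 128, 128, 3)
sep = depthwise_separable_flops(56, 56, 128, 128, 3)
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, "
      f"reduction: {std / sep:.1f}x")
```

The ratio simplifies to (c_out · k²) / (k² + c_out), so savings grow with channel count, which is why the technique dominates mobile architectures.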

@fig-lifecycle-overview (Line 403)

::: @fig-lifecycle-overview presents a linear narrative, but experienced practitioners recognize that these stages interconnect closely. Each stage corresponds to specific terms in the performance equation, and this mapping reveals what we call the Iron Law of Workflow: decisions made during data collection constrain what is achievable during model development, which in turn determines deployment requirements. The following perspective formalizes how each lifecycle stage maps to the Iron Law of ML Systems: ::: {.callout-perspective title="The Iron Law of Workflow"}

@fig-eye-dr (Line 446)

Before examining the stage interfaces and detailed workflows, we introduce a case study that will ground our discussion throughout this chapter. Diabetic retinopathy (DR) screening systems [@gulshan2016deep] provide an ideal lens because the problem appears straightforward (image classification) but reveals deep complexity in deployment; the development journey from research to clinical use is well documented; and the challenges span every lifecycle stage from data collection through monitoring. Diabetic retinopathy affects over 100 million people worldwide and is a leading cause of preventable blindness^23^. @fig-eye-dr illustrates the clinical challenge: detecting characteristic hemorrhages (dark red spots) that indicate disease progression. Rural areas in developing countries have approximately one ophthalmologist per 100,000+ people, making AI-assisted screening not just convenient but medically essential.

@fig-lifecycle-overview (Line 490)

: Stage Interface Specification: Each lifecycle stage has explicit input requirements, output deliverables, and quality invariants that must hold for the stage to be considered complete. Violations of these contracts create technical debt that compounds through subsequent stages. The deployment paradigm selection in Problem Definition (Cloud, Edge, Mobile, or TinyML from @sec-ml-system-architecture) constrains all downstream stages, as a TinyML target imposes different data, model, and monitoring requirements than a Cloud target. {#tbl-stage-interface} This specification reveals why ML projects experience the iteration cycles shown in @fig-lifecycle-overview. When a downstream stage discovers that an upstream contract was violated (for example, evaluation reveals the training data distribution does not match production), the project must iterate back to fix the root cause. Teams that validate contracts at each stage transition catch violations early, when correction costs are lowest. ::: {.callout-example title="Auditing Stage Transitions"}
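
The input/output/invariant contracts described here can be sketched as a lightweight runtime check at each stage transition. The names below (`StageContract`, the example invariants) are illustrative assumptions, not an API from the book:

```python
from dataclasses import dataclass, field

@dataclass
class StageContract:
    """Explicit interface for one lifecycle stage (illustrative sketch)."""
    name: str
    required_inputs: set
    invariants: dict = field(default_factory=dict)  # check name -> predicate

    def validate(self, artifacts: dict) -> list:
        """Return contract violations; an empty list means the stage may proceed."""
        violations = [f"missing input: {k}"
                      for k in self.required_inputs - artifacts.keys()]
        for check_name, predicate in self.invariants.items():
            if not predicate(artifacts):
                violations.append(f"invariant failed: {check_name}")
        return violations

# Example: Model Development expects prepared splits plus the size budget
# chosen during Problem Definition (e.g. 14 MB for a mobile target).
model_dev = StageContract(
    name="Model Development",
    required_inputs={"train_set", "val_set", "size_budget_mb"},
    invariants={
        "train/val split present": lambda a: len(a.get("val_set", [])) > 0,
        "budget is positive": lambda a: a.get("size_budget_mb", 0) > 0,
    },
)

artifacts = {"train_set": [1, 2, 3], "size_budget_mb": 14}  # val_set missing
print(model_dev.validate(artifacts))
```

Running the check at every stage boundary is what lets teams catch contract violations when correction costs are still low.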

@fig-lifecycle-overview (Line 523)

Problem Definition Stage

Machine learning system development begins with a challenge distinct from traditional software development: define not just what the system should do, but how it should learn to do it. Conventional software requirements translate directly into implementation rules, while ML systems require teams to consider how the system will learn from data while operating within real-world constraints^24^. This first stage, positioned at the left of @fig-lifecycle-overview, lays the foundation for all subsequent phases in the ML lifecycle.

@fig-ml-lifecycle-feedback (Line 643)

The data collection experiences in such systems directly inform model development approaches. The infrastructure constraints discovered during data collection establish requirements for model efficiency that drive architectural decisions. Limited bandwidth, diverse hardware, and intermittent connectivity all shape what architectures become feasible. The distributed federated learning approach required by privacy constraints influences training pipeline design. The quality variations observed across different clinic environments shape validation strategies and robustness requirements. This coupling between data collection insights and model development strategies exemplifies how integrated lifecycle planning trumps sequential stage optimization. @fig-ml-lifecycle-feedback maps these critical feedback loops that enable continuous system improvement. Data gaps identified during evaluation flow back to data collection; for instance, evaluation might reveal that the DR model underperforms on images from older fundus cameras, triggering targeted data collection from clinics using that equipment. Validation issues inform model training adjustments, as when validation across diverse patient populations reveals lower sensitivity for patients with cataracts, driving data augmentation strategies that simulate lens opacities. Performance insights from production monitoring trigger refinements across the pipeline, such as detecting accuracy drift in clinics that upgraded their imaging equipment and using those insights to update preprocessing steps accordingly. The foundation established during data collection both enables and constrains the technical approaches available for creating effective models. This dynamic becomes apparent as we now transition to model development. ::: {#fig-ml-lifecycle-feedback fig-env="figure" fig-pos="htb" fig-cap="Feedback Paths Across Lifecycle Stages: Six labeled feedback arrows connect the lifecycle stages. 
Data gaps identified during evaluation flow back to collection. Validation issues inform training adjustments. Performance insights from monitoring trigger pipeline refinements. Model updates propagate from monitoring to training. Data quality issues feed back to preparation. Deployment constraints propagate backward to influence model design." fig-alt="Diagram with 6 boxes: Data Collection, Preparation, Training, Evaluation, Deployment, Monitoring. Labeled feedback arrows show data gaps, validation issues, performance insights, and deployment constraints flowing between stages."}
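
One way such drift feedback is triggered in practice is a distribution check on incoming features. The sketch below uses the Population Stability Index, a common drift statistic; the synthetic samples stand in for, say, image-brightness features before and after a clinic's camera upgrade:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the baseline range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)    # floor to avoid log(0), a common convention
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)   # feature distribution when the model shipped
same = rng.normal(0.0, 1.0, 5_000)       # fresh sample, no drift
shifted = rng.normal(0.5, 1.2, 5_000)    # e.g. a clinic upgraded its imaging equipment

print(f"PSI, no drift: {psi(baseline, same):.3f}")
print(f"PSI, drifted:  {psi(baseline, shifted):.3f}")   # > 0.2 commonly triggers review
```

A monitoring job that alerts when PSI crosses a threshold is one concrete mechanism behind the "performance insights from monitoring" arrow.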

@fig-ml-lifecycle (Line 959)

Integrating Systems Thinking Principles

@fig-cascades (Line 990)

This propagation operates bidirectionally, creating dynamic constraint networks rather than linear dependencies. When rural clinic deployment reveals tight bandwidth limitations, teams must redesign data preprocessing pipelines to reduce transmitted data by large factors. This requires model architectures optimized for compressed inputs, which influences training strategies that account for data degradation. Understanding these cascading relationships enables teams to make architectural decisions that accommodate rather than fight against systemic constraints. The Constraint Propagation Principle quantifies what experienced ML engineers know intuitively: decisions made in ignorance of downstream constraints create compounding technical debt^25^. The stage interface specification (@tbl-stage-interface) operationalizes this principle by making constraints explicit at each stage boundary, enabling early detection before propagation costs escalate. When propagation occurs specifically through data quality failures, the resulting pattern is known as a data cascade; @sec-data-engineering-ml formalizes this failure mode and illustrates its stages in @fig-cascades.

@fig-ml-lifecycle (Line 1031)

Pitfall: Treating data preparation as a one-time preprocessing step. Teams assume they can "finish" data preparation and move on to modeling. In production, data distributions shift continuously. The two-pipeline architecture (@fig-ml-lifecycle) shows data and model pipelines run in parallel with continuous feedback, not sequentially. Data quality decisions cascade through model training, validation, and deployment. Data issues compound rather than remain isolated. Data quality issues account for 60-80% of production ML failures. Recommendation systems see 10-15% of features requiring updates monthly. Models degrade 5-10% within months as distributions shift, requiring emergency retraining that costs 3-5× more than proactive monitoring. Organizations that build continuous data validation pipelines from the start detect drift within days rather than months, maintaining accuracy within 2-3% of development baselines. Fallacy: Passing model evaluation means the system is ready for deployment.

@fig-ml-lifecycle (Line 1035)

Fallacy: Passing model evaluation means the system is ready for deployment. Engineers treat the model development pipeline as the entire workflow, assuming strong evaluation metrics mean the system is complete. The two-pipeline architecture (@fig-ml-lifecycle) shows this ignores half the lifecycle: data pipeline feedback loops, deployment integration, and production monitoring remain unaddressed. The diabetic retinopathy screening case study (@sec-ai-development-workflow-case-study-diabetic-retinopathy-screening-7d71) demonstrates the gap: the model passed evaluation but required additional validation to handle equipment variations across clinics, operator skill differences, and demographic diversity absent from curated development data. Evaluation metrics measure algorithm quality in isolation; production readiness requires verifying the complete system, including data freshness, preprocessing consistency, latency under load, and failure recovery. By the Constraint Propagation Principle, a deployment-stage discovery costs $2^{5-1} = 16\times$ the effort of catching it during evaluation design. Teams that equate strong evaluation metrics with deployment readiness consistently underestimate the integration effort by 3-5×. Pitfall: Skipping validation stages to accelerate timelines.
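
The cost model quoted above can be made concrete in a few lines; the stage list and the per-stage doubling follow the chapter's Iron Law framing:

```python
STAGES = ["Problem Definition", "Data Collection", "Model Development",
          "Evaluation", "Deployment", "Monitoring"]

def correction_cost(introduced_at: str, discovered_at: str) -> int:
    """Relative cost of a fix, doubling for each stage a defect survives undetected."""
    gap = STAGES.index(discovered_at) - STAGES.index(introduced_at)
    if gap < 0:
        raise ValueError("a defect cannot be discovered before it is introduced")
    return 2 ** gap

# A flaw introduced in Problem Definition but only surfaced at Deployment
# has survived four stage transitions, so it costs 2^4 = 16x a same-stage fix.
print(correction_cost("Problem Definition", "Deployment"))
```

The exponent is the number of stage boundaries crossed, which is why contract validation at each transition pays for itself.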

@fig-ml-lifecycle (Line 1047)

Summary

This chapter established the ML lifecycle as the systematic framework for engineering machine learning systems, the mental roadmap organizing how data, models, and deployment infrastructure interconnect throughout development. @fig-ml-lifecycle captures this framework through two parallel pipelines. The data pipeline transforms raw inputs through collection, ingestion, analysis, labeling, validation, and preparation into ML-ready datasets. The model development pipeline takes these datasets through training, evaluation, validation, and deployment to create production systems. Their interconnections reveal the distinctive nature of ML development. The feedback arrows show how deployment insights trigger data refinements, creating the continuous improvement cycles that distinguish ML from traditional linear development. Understanding this framework explains why machine learning systems demand specialized approaches distinct from traditional software. ML workflows replace deterministic specifications with probabilistic optimization, static behavior with dynamic adaptation, and isolated development with continuous feedback loops. This systematic perspective recognizes that success emerges not from perfecting individual stages in isolation, but from understanding how data quality affects model performance, how deployment constraints shape training strategies, and how production insights inform each subsequent development iteration.



  1. GLUE: General Language Understanding Evaluation, a collection of nine English sentence understanding tasks including sentiment analysis, textual entailment, and similarity. Introduced in 2018, GLUE provided standardized evaluation with a human baseline of 87.1% and became obsolete when BERT achieved 80.2% in 2019, leading to the more challenging SuperGLUE benchmark. As @fig-benchmark-components illustrates in its workflow progression, strategic dataset selection shapes all subsequent implementation steps and ultimately determines the benchmark's effectiveness. In the audio anomaly detection example, the dataset must include representative waveform samples of normal operation alongside comprehensive examples of various anomalous conditions. Notable examples include datasets like ToyADMOS for industrial manufacturing anomalies and Google Speech Commands for general sound recognition. Regardless of the specific dataset chosen, the data volume must suffice for both model training and validation, while incorporating real-world signal characteristics and noise patterns that reflect deployment conditions. The selection of benchmark datasets directly shapes experimental outcomes and model evaluation. Effective datasets must balance two key requirements: accurately representing real-world challenges while maintaining sufficient complexity to differentiate model performance meaningfully. While research often utilizes simplified datasets like ToyADMOS[^fn-toyadmos] [@koizumi2019toyadmos], these controlled environments, though valuable for methodological development, may not fully capture real-world deployment complexities. ↩︎

  2. AlexNet: Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto, this 8-layer neural network revolutionized computer vision in 2012. With 60 million parameters trained on two GTX 580 GPUs, AlexNet introduced key innovations in neural network design that became standard techniques in modern AI. ↩︎

  3. Extract, Transform, Load (ETL): A data processing pattern where raw data is extracted from sources, transformed (cleaned, aggregated, validated) in a separate processing layer, then loaded into storage. For example, a traditional ETL pipeline might extract customer purchase logs, transform them by removing duplicates and aggregating daily totals in Apache Spark, then load only the aggregated results into a data warehouse. Ensures only high-quality data reaches storage but requires reprocessing all data when transformation logic changes. ↩︎

  4. Mel-frequency cepstral coefficients (MFCCs): The "mel" in MFCC derives from "melody," coined by Stanley Smith Stevens in 1937 to create a perceptual pitch scale where equal distances sound equally different to humans. "Cepstral" is a playful reversal of "spectral," invented by Bogert, Healy, and Tukey (1963) since cepstral analysis operates on the spectrum of a spectrum. MFCCs apply mel-scale filtering (emphasizing frequencies humans hear best) followed by cepstral analysis, typically extracting 13-39 coefficients encoding timbral properties essential for speech recognition. ↩︎

  5. Forced alignment: The term "forced" distinguishes this from "free" alignment where the system must also recognize what was said. In forced alignment, the transcription is known, so the algorithm is "forced" to align specific words to the audio rather than choosing among alternatives. Developed alongside early speech recognition research in the 1970s, forced alignment uses dynamic programming (Viterbi algorithm) to compute optimal alignment paths matching phonetic sequences to audio frames with millisecond precision. Tools like Montreal Forced Aligner enable extracting individual keywords for KWS training. ↩︎

  6. Active Learning: The concept traces to statistical experimental design, but Dana Angluin's work on learning from queries [@angluin1988queries] established theoretical foundations for machine learning. The term "active" contrasts with "passive" learning from pre-labeled data—the learner actively queries an oracle[^fn-oracle-etymology] rather than passively receiving examples. Early work in the 1990s demonstrated that active selection could achieve the same accuracy as passive learning with exponentially fewer labels in favorable cases. ↩︎

  7. Logits: Short for "log-odds units." In statistics, the logit function is the inverse of the sigmoid (logistic) function, transforming a probability $p \in (0,1)$ into a real number in $(-\infty, \infty)$ representing the log-odds $\log(\frac{p}{1-p})$. In deep learning, we use the term loosely to refer to the raw, unnormalized scores output by the last layer before the Softmax activation converts them into probabilities. ↩︎

  8. GEMM (General Matrix Multiply): A fundamental operation underlying many neural network layers. GEMM performs $C = \alpha AB + \beta C$ and has been optimized for decades. In many dense models, a large fraction of runtime is spent in GEMM-like kernels, and well-tuned libraries can approach peak throughput on suitable workloads. This is why matrix-kernel efficiency often dominates end-to-end performance in practice. ↩︎

  9. ImageNet Revolution: AlexNet's dramatic victory in the 2012 ImageNet challenge [@krizhevsky2012imagenet] reduced top-5 error from 26.2% (the runner-up) to 15.3%, a 10.9 percentage point improvement that sparked the deep learning renaissance. ImageNet's 14+ million labeled images across more than 21,000 categories (with the ILSVRC competition subset using 1,000 classes) provided the scale needed to train deep CNNs, proving that "big data + big compute + big models" could achieve unprecedented performance. ↩︎

  10. Yann LeCun and CNNs: LeCun's 1989 LeNet architecture was inspired by Hubel and Wiesel's discovery of simple and complex cells in cat visual cortex [@hubel1962receptive]. LeNet-5 achieved approximately 0.9% error rate on MNIST in 1998 and was deployed by banks to read millions of checks daily, among the first large-scale commercial applications of neural networks. ↩︎

  11. Feature Map: The output of applying a convolutional filter to an input, representing detected features at different spatial locations. A 64-filter layer produces 64 feature maps, each highlighting different patterns like edges, textures, or shapes. Feature maps become more abstract (detecting objects, faces) in deeper layers compared to early layers (detecting edges, colors). ↩︎

  12. im2col (Image to Column): A data layout transformation that converts convolution operations into matrix multiplications by unfolding image patches into columns. This approach trades memory consumption (through data duplication) for computational efficiency, enabling CNNs to leverage decades of GEMM optimizations and achieving substantial speedups. ↩︎

  13. HBM (introduced in @sec-dnn-architectures) achieves 2-10x higher bandwidth than GDDR memory through 3D die stacking with thousands of TSVs. From a hardware architecture perspective, HBM's 2-3 TB/s bandwidth (vs. 500-700 GB/s for GDDR6X) transforms memory-bound ML workloads toward compute-bound performance. The trade-off is higher manufacturing cost, limiting HBM to data center accelerators where bandwidth justifies the premium. ↩︎

  14. Systolic Array: Named from the Greek "systole" (contraction), referring to the rhythmic pumping of the heart. H.T. Kung and Charles Leiserson chose this term at CMU in 1979 because data pulses through the processing element grid in waves, like blood through cardiac chambers. Each element contracts (computes) and passes data to neighbors in a coordinated rhythm. Google's TPUs revived systolic designs for neural networks by pairing them with software stacks optimized for dense linear algebra, achieving high throughput when workloads map well to this rhythmic dataflow pattern. ↩︎

  15. Tensor Processing Unit (TPU): Google's custom ASIC designed specifically for tensor operations, first used internally in 2015 for neural network inference [@jouppi2017datacenter]. The name derives from "tensor," coined by mathematician William Rowan Hamilton in 1846 from Latin tendere (to stretch), describing mathematical objects that transform under coordinate changes. Neural networks are fundamentally tensor computations: weights are matrices (rank-2 tensors), batched inputs form higher-rank tensors. A single TPU v4 Pod contains 4,096 chips and delivers over 1 exaflop of peak performance [@jouppi2023tpu]. ↩︎

  16. IoT Device Growth: Explosive growth from 8.4B connected devices (2017) to projected 25.4B by 2030 [@mckinsey2021iot]. Daily data generation approaches 2.5 quintillion bytes, with 90% requiring real-time processing. Network bandwidth and cloud costs make edge processing economically essential; uploading raw sensor data would cost $10-100 per device monthly. ↩︎

  17. Microcontrollers: Single-chip computers with integrated CPU, memory, and peripherals, typically operating at 1-100MHz with 32KB-2MB RAM. Arduino Uno uses an ATmega328P with 32KB flash and 2KB RAM, while ESP32 provides WiFi capability with 520KB RAM, still thousands of times less than a smartphone. ↩︎

  18. Distillation: Borrowed from chemistry, where distillation separates mixtures by selective evaporation and condensation, extracting the essence while leaving impurities behind. Hinton et al. [@hinton2015distilling] chose this metaphor deliberately: the teacher's "dark knowledge" about class relationships is the essence being extracted, while the massive parameter count is the impurity left behind. The temperature parameter T in the softmax even mirrors the literal temperature control in chemical distillation. ↩︎

  19. Hardware-Aware NAS: Architecture search directly optimizing for target hardware latency rather than proxy metrics like FLOPs. MnasNet (2019) uses actual measured latency in the search objective, finding architectures with 1.8× speedup over MobileNetV2 at higher accuracy. Platform-specific search discovers that optimal architectures differ significantly between mobile CPUs, GPUs, and TPUs. ↩︎

  20. Floating-Point Dynamic Range: The dynamic range of a floating-point format is determined by its exponent bit-width and bias. FP32 and BFloat16 both use an 8-bit exponent with a bias of 127, resulting in an exponent range of $[-126, 127]$ and an approximate numerical range of $10^{-38}$ to $10^{38}$. FP16, with a 5-bit exponent and a bias of 15, has an exponent range of $[-14, 15]$, leading to a more constrained numerical range of roughly $10^{-5}$ to $10^{5}$. This reduced range in FP16 can lead to numerical instability in training, whereas BFloat16 retains FP32's broader range, making it more suitable for training deep neural networks. ↩︎

  21. LIME and SHAP: Two dominant post-hoc explainability methods with different computational trade-offs. LIME (Local Interpretable Model-agnostic Explanations) [@ribeiro2016why] fits a simple interpretable model around each prediction, offering faster computation but potentially inconsistent explanations. SHAP derives its name from SHapley Additive exPlanations, honoring Lloyd Shapley, the mathematician who introduced Shapley values in his 1953 game theory work on fair allocation of cooperative gains. Lundberg and Lee [@lundberg2017unified] adapted this framework to compute feature contributions, providing mathematically consistent explanations but with exponential worst-case complexity. Shapley received the 2012 Nobel Prize in Economics for this foundational work. Systems implication: SHAP may add 10-100x inference latency, making LIME preferable for real-time applications. ↩︎

  22. Pipeline: Borrowed from the oil industry, where pipelines transported crude oil from wells to refineries starting in the 1860s. The computing metaphor emerged in the 1960s at IBM to describe data flowing through connected processing stages, just as oil flows through physical pipes. In ML, the metaphor extends naturally: raw data enters one end, flows through transformation stages, and emerges as trained models or predictions. The term captures the key insight that ML development requires continuous flow rather than discrete steps. ↩︎

  23. Diabetic Retinopathy Global Impact: Affects 93-103 million people worldwide, with 22-35% of diabetic patients developing retinopathy [@who2019classification]. In developing countries, up to 90% of vision loss from diabetes is preventable with early detection, but access to specialists remains severely limited [@rajkomar2019machine]. ↩︎

  24. ML vs. Traditional Problem Definition: Traditional software problems are defined by deterministic specifications ("if input X, then output Y"), but ML problems are defined by examples and desired behaviors. This shift means that ML projects face higher failure rates, with industry surveys suggesting 70-90% of ML projects fail to reach production deployment, many during problem formulation and requirements phases, compared to lower failure rates in traditional software projects [@standish2020chaos]. The challenge lies in translating business objectives into learning objectives, something that did not exist in software engineering until the rise of data-driven systems in the 2000s [@amershi2019software]. ↩︎

  25. ML Technical Debt: A concept from Sculley et al.'s influential 2015 paper "Hidden Technical Debt in Machine Learning Systems" [@sculley2015hidden], which identified that ML systems accumulate debt faster than traditional software due to entanglement (changing one feature affects all others), hidden feedback loops (model predictions influence future training data), and undeclared consumers (downstream systems depending on model outputs without explicit contracts). The paper found that ML code often represents less than 5% of a production ML system, with configuration, data pipelines, and serving infrastructure dominating complexity. @sec-machine-learning-operations-mlops addresses debt management strategies. ↩︎
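
The logits and distillation footnotes above lend themselves to a small worked example: a temperature-scaled softmax, showing how $T > 1$ softens a near one-hot distribution and exposes the teacher's "dark knowledge" about non-target classes. A minimal sketch (the particular logit values are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.2]                  # raw last-layer scores for three classes
hard = softmax(logits, temperature=1.0)   # nearly one-hot: top class dominates
soft = softmax(logits, temperature=4.0)   # relative class similarities become visible
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

The softened targets are what a student model trains against in distillation; at $T = 1$ the same function is just the standard softmax applied to logits.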