mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
style(vol1): fix remaining multiplication notation violations
Second pass catching ~37 additional instances missed in the initial cleanup, including prose in frameworks, glossary definitions, footnotes, fig-caps, fig-alts, table cells, and callout content. All remaining `Nx` patterns are now exclusively inside Python code blocks (comments, docstrings, f-strings) or are mathematical variable expressions (e.g., derivative = 2x), which are correct as-is.
@@ -537,7 +537,7 @@ This glossary defines key terms used throughout this book. Terms are organized a
 : 32-bit floating-point numerical representation that provides standard precision for mathematical computations but requires more memory and computational resources than lower-precision formats.

 **fp32 to int8**
-: A common quantization transformation that converts 32-bit floating point weights and activations to 8-bit integers, achieving roughly 4x memory reduction while maintaining acceptable accuracy for many models.
+: A common quantization transformation that converts 32-bit floating point weights and activations to 8-bit integers, achieving roughly 4× memory reduction while maintaining acceptable accuracy for many models.

 **framework decomposition**
 : The systematic breakdown of neural network frameworks into hardware-mappable components, enabling efficient distribution of operations across processing elements.
@@ -666,7 +666,7 @@ This glossary defines key terms used throughout this book. Terms are organized a
 : The interface between software and hardware that defines the set of instructions a processor can execute, including data types and addressing modes.

 **int8**
-: 8-bit integer numerical representation used in quantized neural networks, where model weights and activations are represented using 8-bit integers instead of 32-bit floating point, reducing memory usage by roughly 4x and accelerating inference on specialized hardware while attempting to maintain model accuracy.
+: 8-bit integer numerical representation used in quantized neural networks, where model weights and activations are represented using 8-bit integers instead of 32-bit floating point, reducing memory usage by roughly 4× and accelerating inference on specialized hardware while attempting to maintain model accuracy.

 **internet of things**
 : A network of physical objects embedded with sensors, software, and other technologies that connect and exchange data with other devices and systems over the internet.
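As an editorial aside to the fp32-to-int8 definition in this hunk, the "roughly 4×" figure can be checked numerically. This is a minimal sketch of symmetric per-tensor quantization — the scheme, tensor size, and seed are illustrative assumptions, not the book's code:

```python
import numpy as np

# Symmetric per-tensor quantization: map the largest |weight| to 127.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1024).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to estimate the accuracy cost of the memory savings.
dequantized = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - dequantized).max()

# 4 bytes per weight vs. 1 byte per weight: the "roughly 4x" in the text.
memory_ratio = weights_fp32.nbytes / weights_int8.nbytes
```

The per-weight error is bounded by one quantization step (`scale`), which is why accuracy usually survives the 4× compression.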
@@ -2526,7 +2526,7 @@ Data locality becomes critical at this scale. At 10 GB/s peak throughput, transf

 Single-machine processing\index{Single-Machine Processing!scalability} suffices for surprisingly large workloads when engineered carefully. Modern servers with 256 gigabytes RAM can process datasets of several terabytes using out-of-core processing that streams data from disk. Libraries like Dask\index{Dask!out-of-core processing} or Vaex\index{Lazy Evaluation!data processing} enable pandas-like APIs that automatically stream and parallelize computations across multiple cores. Before investing in distributed processing infrastructure, teams should exhaust single-machine optimization: using efficient data formats (Parquet[^fn-parquet] instead of CSV), minimizing memory allocations, leveraging vectorized operations, and exploiting multi-core parallelism. The operational simplicity of single-machine processing—no network coordination, no partial failures, simple debugging—makes it preferable when performance is adequate.

-[^fn-parquet]: **Parquet**: Named after the herringbone wood flooring pattern, this columnar storage format (developed by Cloudera and Twitter, 2013) stores data in nested column structures that visually resemble parquet flooring tiles. The name reflects how data is interlocked column-by-column rather than row-by-row. For ML systems, this columnar layout enables reading only required features and achieves 5-10x I/O reduction compared to row-based formats like CSV.
+[^fn-parquet]: **Parquet**: Named after the herringbone wood flooring pattern, this columnar storage format (developed by Cloudera and Twitter, 2013) stores data in nested column structures that visually resemble parquet flooring tiles. The name reflects how data is interlocked column-by-column rather than row-by-row. For ML systems, this columnar layout enables reading only required features and achieves 5–10× I/O reduction compared to row-based formats like CSV.

 Distributed processing frameworks become necessary when data volumes or computational requirements exceed single-machine capacity, but the speedup achievable through parallelization faces inherent limits described by **Amdahl's Law**. Let $S$ be the serial fraction, $p$ the parallelizable fraction, and $N$ the number of workers. @eq-amdahl-data gives the bound:
@@ -3324,7 +3324,7 @@ The patterns in this section—data debt accumulation, diagnostic debugging, and

 **Fallacy:** *More data always improves model performance.*

-Beyond a threshold, additional data yields diminishing returns while costs scale linearly. A dataset of 10 million examples may provide only marginal accuracy gains over 1 million examples, yet incur 10x the storage, labeling, and processing costs. The Information Entropy concept from @sec-data-engineering-physics-data-cdcb explains why: if new examples are redundant (low entropy), they add mass without information. Smart data selection—active learning, deduplication, curriculum design—often outperforms naive data accumulation.
+Beyond a threshold, additional data yields diminishing returns while costs scale linearly. A dataset of 10 million examples may provide only marginal accuracy gains over 1 million examples, yet incur 10× the storage, labeling, and processing costs. The Information Entropy concept from @sec-data-engineering-physics-data-cdcb explains why: if new examples are redundant (low entropy), they add mass without information. Smart data selection—active learning, deduplication, curriculum design—often outperforms naive data accumulation.

 **Pitfall:** *Treating data preprocessing as a one-time task.*
@@ -3360,7 +3360,7 @@ The technical architecture of data systems demonstrates how engineering decision

 * **Labeling costs dominate and require substantial resource allocation.** Labeling typically costs 1,000–3,000× more than model training compute. Labeling is the serial bottleneck that parallelization cannot solve.

-* **Storage hierarchy determines iteration speed.** The 70x throughput gap between local NVMe (7 GB/s) and cloud object storage (100 MB/s) determines whether iterations occur daily or weekly.
+* **Storage hierarchy determines iteration speed.** The 70× throughput gap between local NVMe (7 GB/s) and cloud object storage (100 MB/s) determines whether iterations occur daily or weekly.

 * **Data debt compounds and requires continuous remediation.** Documentation, schema, quality, and freshness debt accumulate with compound interest. Allocate sustained engineering capacity to prevent remediation from overwhelming new feature work.
@@ -661,8 +661,8 @@ ratio_str = QualityMultiplier.ratio_str
 **The Physics of Noise**: Why is one clean sample worth 100 noisy ones?

 **The Math**: Classical learning theory (for convex optimization with SGD) tells us that convergence rates depend on label noise. While deep learning operates in a non-convex regime, the qualitative relationship holds broadly.

-1. **Clean Data**: Convergence rate is typically $O(1/N)$. To halve the error, you need **2x** data.
-2. **Noisy Data**: Convergence rate drops to $O(1/\sqrt{N})$. To halve the error, you need **4x** data.
+1. **Clean Data**: Convergence rate is typically $O(1/N)$. To halve the error, you need **2×** data.
+2. **Noisy Data**: Convergence rate drops to $O(1/\sqrt{N})$. To halve the error, you need **4×** data.

 **The Multiplier**:
 To reach a target error $\epsilon$:
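The 2× and 4× claims in this hunk follow from inverting the two convergence rates. A small editorial sketch (the constant `c` is an illustrative assumption):

```python
# error ~ c / N for clean labels      =>  N_clean(eps) = c / eps
# error ~ c / sqrt(N) for noisy ones  =>  N_noisy(eps) = (c / eps) ** 2
def n_clean(eps, c=1.0):
    return c / eps

def n_noisy(eps, c=1.0):
    return (c / eps) ** 2

# Halving the target error doubles the clean-data requirement (2x)
# but quadruples the noisy-data requirement (4x).
clean_ratio = n_clean(0.05) / n_clean(0.10)
noisy_ratio = n_noisy(0.05) / n_noisy(0.10)
```

The gap between the two ratios widens as the error target shrinks, which is the quantitative core of "one clean sample is worth many noisy ones."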
@@ -1527,7 +1527,7 @@ Contrast the two bar charts in @fig-amortization-comparison to see this cost str
 ```{python}
 #| label: fig-amortization-comparison
 #| echo: false
-#| fig-cap: "**Cost Amortization in Foundation Models**: Training from scratch (left) requires 1,000 GPU-hours per task (10,000 total for 10 tasks). The foundation model approach (right) pays 10,000 GPU-hours upfront for pre-training but reduces each subsequent task to just 50 GPU-hours. At 10 tasks the totals are comparable (10,000 vs 10,500), but the per-task marginal cost drops by 20x, and the crossover favoring the foundation model occurs around 11 tasks."
+#| fig-cap: "**Cost Amortization in Foundation Models**: Training from scratch (left) requires 1,000 GPU-hours per task (10,000 total for 10 tasks). The foundation model approach (right) pays 10,000 GPU-hours upfront for pre-training but reduces each subsequent task to just 50 GPU-hours. At 10 tasks the totals are comparable (10,000 vs 10,500), but the per-task marginal cost drops by 20×, and the crossover favoring the foundation model occurs around 11 tasks."
 #| fig-alt: "Two bar charts side by side. Left (Train from Scratch) shows 10 equal bars of 1,000 GPU-hours each, totaling 10,000 hours. Right (Foundation Model) shows one tall pre-training bar of 10,000 GPU-hours followed by 10 short fine-tuning bars of 50 GPU-hours each, totaling 10,500 hours."

 import numpy as np
@@ -2218,9 +2218,9 @@ The selection inequality addresses compute overhead, but data selection introduc

 | **Storage Tier** | **Sequential Throughput** | **Random I/O (IOPS)** | **Random Throughput (approx)** | **Random Penalty** |
 |:-----------------|--------------------------:|----------------------:|-------------------------------:|-------------------:|
-| **HDD (7.2k)**   | ~150 MB/s                 | ~80                   | ~0.3 MB/s                      | **500x**           |
-| **SATA SSD**     | ~550 MB/s                 | ~10k                  | ~40 MB/s                       | **14x**            |
-| **NVMe SSD**     | ~3,500 MB/s               | ~500k                 | ~2,000 MB/s                    | **1.75x**          |
+| **HDD (7.2k)**   | ~150 MB/s                 | ~80                   | ~0.3 MB/s                      | **500×**           |
+| **SATA SSD**     | ~550 MB/s                 | ~10k                  | ~40 MB/s                       | **14×**            |
+| **NVMe SSD**     | ~3,500 MB/s               | ~500k                 | ~2,000 MB/s                    | **1.75×**          |
 | **Cloud (S3)**   | ~100 MB/s (per conn)      | ~10–50 ms (lat)       | Very Low (per conn)            | **Extreme**        |

 : **The Cost of Randomness.** Comparative I/O throughput for sequential vs. random 4KB reads across different storage tiers. Standard data loaders optimize for sequential throughput, while data selection strategies often incur the random access penalty. {#tbl-io-performance .striped .hover}
@@ -1484,11 +1484,11 @@ Neural network training universally uses reverse mode (covered next), but forwar
 \index{Dual Numbers!forward mode AD}
 Forward mode automatic differentiation computes derivatives alongside the original computation, tracking how changes propagate from input to output. This approach mirrors manual derivative computation, making it intuitive to understand and implement.

-Forward mode's memory requirements are its strength: the method stores only the original value, a single derivative value, and temporary results. Memory usage stays constant regardless of computation depth, making forward mode particularly suitable for embedded systems, real-time applications, and memory-bandwidth-limited systems. However, this comes with a computational cost. Forward mode doubles the Ops term (in **Iron Law** terms) for each input parameter whose derivative is requested. For a model with $N$ parameters, forward mode multiplies total computation by $N$, because each parameter requires a separate forward pass. Reverse mode, by contrast, adds a constant factor of approximately 2 to 3x regardless of $N$. This asymmetry explains why forward mode is never used for training neural networks, where $N$ ranges from millions to hundreds of billions. This combination of computational scaling with input count but constant memory creates a specific niche: forward mode excels in scenarios with few inputs but many outputs, such as sensitivity analysis, feature importance computation, and online learning with single-example updates.
+Forward mode's memory requirements are its strength: the method stores only the original value, a single derivative value, and temporary results. Memory usage stays constant regardless of computation depth, making forward mode particularly suitable for embedded systems, real-time applications, and memory-bandwidth-limited systems. However, this comes with a computational cost. Forward mode doubles the Ops term (in **Iron Law** terms) for each input parameter whose derivative is requested. For a model with $N$ parameters, forward mode multiplies total computation by $N$, because each parameter requires a separate forward pass. Reverse mode, by contrast, adds a constant factor of approximately 2 to 3× regardless of $N$. This asymmetry explains why forward mode is never used for training neural networks, where $N$ ranges from millions to hundreds of billions. This combination of computational scaling with input count but constant memory creates a specific niche: forward mode excels in scenarios with few inputs but many outputs, such as sensitivity analysis, feature importance computation, and online learning with single-example updates.

 To see the mechanism concretely, consider computing both the value and derivative of $f(x) = x^2 \sin(x)$. @lst-forward_mode_ad shows how forward mode propagates derivative computations alongside every operation, applying the chain rule and product rule at each step:

-::: {#lst-forward_mode_ad lst-cap="**Forward Mode AD**: Propagates derivatives forward through the computation graph, computing one directional derivative per forward pass with 2x computational overhead."}
+::: {#lst-forward_mode_ad lst-cap="**Forward Mode AD**: Propagates derivatives forward through the computation graph, computing one directional derivative per forward pass with 2× computational overhead."}
 ```{.python}
 def f(x): # Computing both value and derivative
     # Step 1: x -> x²
@@ -1530,7 +1530,7 @@ dresult = (
 ```
 :::

-The dual number trace demonstrates the 2x computational overhead per input: every arithmetic operation (multiply, sine, product rule combination) is performed twice, once for the value and once for the derivative. For this single-input function, the overhead is acceptable. For a neural network with $N = 100{,}000{,}000$ parameters, computing all gradients would require 100 million such passes, which is why forward mode is restricted to the few-input applications described above.
+The dual number trace demonstrates the 2× computational overhead per input: every arithmetic operation (multiply, sine, product rule combination) is performed twice, once for the value and once for the derivative. For this single-input function, the overhead is acceptable. For a neural network with $N = 100{,}000{,}000$ parameters, computing all gradients would require 100 million such passes, which is why forward mode is restricted to the few-input applications described above.

 Forward mode's strength in single-input analysis becomes its fatal weakness for training. A neural network has one scalar loss but millions of parameters, and forward mode would require a separate pass for each one---an intractable $O(N)$ cost that explains why no production framework uses forward mode for training. Forward mode remains useful for targeted analyses such as sensitivity analysis (how does changing one pixel affect the prediction?) and feature importance (which input dimensions most influence the output?), where the number of inputs of interest is small.
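The dual-number trace these hunks discuss can be condensed into a runnable sketch. This is an editorial illustration for the same $f(x) = x^2 \sin(x)$ — the `Dual` class is an assumed minimal implementation, not the book's listing:

```python
import math

class Dual:
    """Carries a value and its directional derivative through every op."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'  -- each op does double work
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):                  # f(x) = x^2 * sin(x)
    return x * x * sin(x)

y = f(Dual(2.0, 1.0))      # seed dx/dx = 1
# y.val holds f(2); y.dot holds f'(2) = 2x*sin(x) + x^2*cos(x) at x = 2
```

Every operation computes a value term and a derivative term, which is the 2× overhead per input; differentiating with respect to $N$ inputs requires $N$ such seeded passes, which is the $O(N)$ cost the text describes.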
@@ -1945,7 +1945,7 @@ These three principles connect directly to the framework's role as a compiler fo
 #### Mixed-Precision Training Support {#sec-ml-frameworks-mixedprecision-training-support-d31d}

 \index{Mixed Precision!FP16 vs. FP32 trade-offs}
-Mixed precision exploits a hardware asymmetry to improve two Iron Law terms simultaneously: Tensor Cores execute FP16 matrix multiplications at 2x the throughput of FP32 (increasing effective $O/R_{peak}$), while FP16 activations halve the memory footprint (reducing $D_{vol}$). Improving both terms simultaneously is rare; most optimizations improve one at the expense of the other.
+Mixed precision exploits a hardware asymmetry to improve two Iron Law terms simultaneously: Tensor Cores execute FP16 matrix multiplications at 2× the throughput of FP32 (increasing effective $O/R_{peak}$), while FP16 activations halve the memory footprint (reducing $D_{vol}$). Improving both terms simultaneously is rare; most optimizations improve one at the expense of the other.

 Frameworks exploit this through automatic mixed-precision APIs that select reduced precision for compute-intensive operations while maintaining FP32 where numerical stability demands it. Inside these APIs, frameworks automatically apply precision rules: matrix multiplications and convolutions use FP16 for bandwidth efficiency, while numerically sensitive operations like softmax and layer normalization remain in FP32. This selective precision maintains accuracy while achieving speedups on modern GPUs with specialized hardware units. Because FP16 has a narrower dynamic range than FP32, gradients can underflow to zero during backpropagation. Loss scaling addresses this by multiplying the loss by a large factor before the backward pass, then dividing gradients by the same factor afterward.
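The loss-scaling rationale in this hunk — small FP16 gradients underflowing to zero — can be demonstrated directly with NumPy. The gradient magnitude and scale factor below are illustrative assumptions:

```python
import numpy as np

grad_fp32 = np.float32(1e-8)   # a tiny but meaningful gradient

# Naive FP16 cast: 1e-8 is below FP16's smallest subnormal (~6e-8),
# so the gradient silently vanishes.
naive = np.float16(grad_fp32)

# Loss scaling: multiply before the cast, divide back afterward in FP32.
scale = np.float32(2.0 ** 16)
scaled_fp16 = np.float16(grad_fp32 * scale)   # ~6.55e-4, representable
recovered = np.float32(scaled_fp16) / scale
```

Scaling shifts the gradient into FP16's representable range for the backward pass, and the FP32 division afterward restores its true magnitude.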
@@ -2144,7 +2144,7 @@ The execution and differentiation problems together enable the training loop: th
 \index{Abstraction Problem!definition}
 \index{Hardware Abstraction!framework design}
 \index{Hardware Abstraction!two dimensions (data, execution)}
-The hardware diversity described above is not merely inconvenient; it is architecturally fundamental. A GPU offers 1,000x the parallelism of a CPU but has different memory semantics. A TPU provides higher throughput but requires static shapes. A microcontroller has kilobytes where a server has gigabytes. The abstraction problem asks: how should frameworks hide this complexity behind a single programming interface while still enabling efficient utilization of each target's unique capabilities?
+The hardware diversity described above is not merely inconvenient; it is architecturally fundamental. A GPU offers 1,000× the parallelism of a CPU but has different memory semantics. A TPU provides higher throughput but requires static shapes. A microcontroller has kilobytes where a server has gigabytes. The abstraction problem asks: how should frameworks hide this complexity behind a single programming interface while still enabling efficient utilization of each target's unique capabilities?

 The problem decomposes into two interacting dimensions. The first is *data representation*: how should frameworks represent tensors, parameters, and computational state in ways that work across hardware? The second is *execution mapping*: how should high-level operations translate to hardware-specific implementations? These dimensions are not independent concerns. The way data is represented (memory layout, precision, device placement) directly affects what execution strategies are possible. A tensor stored in row-major format on a GPU requires different kernels than one in column-major format on a CPU. A model quantized to INT8 enables entirely different execution paths than FP32.
@@ -2578,7 +2578,7 @@ The cost of moving data between devices varies by orders of magnitude depending
 | **NVLink 3.0** | `{python} nvlink_a100_gbs_str` GB/s bidirectional | `{python} nvlink_4mb_ms_str` ms | Comparable to GPU compute |
 | **GPU Memory** | `{python} a100_bw_gbs_str` GB/s | `{python} hbm_4mb_ms_str` ms | Optimal |

-: **Device Transfer Overhead.** Transfer time for a 4 MB tensor across different interconnects. PCIe bandwidth shown is unidirectional (typical for GPU transfers), with full-duplex operation providing 2x total bandwidth. NVLink bandwidth is bidirectional (300 GB/s per direction). Transfer times dominate for small operations, making device placement critical for performance. {#tbl-device-transfer-overhead}
+: **Device Transfer Overhead.** Transfer time for a 4 MB tensor across different interconnects. PCIe bandwidth shown is unidirectional (typical for GPU transfers), with full-duplex operation providing 2× total bandwidth. NVLink bandwidth is bidirectional (300 GB/s per direction). Transfer times dominate for small operations, making device placement critical for performance. {#tbl-device-transfer-overhead}

 These numbers connect directly to the **Iron Law** of performance. Every cross-device transfer inflates the data movement term ($D_{vol}/BW$) at a fraction of the available on-device bandwidth. A PCIe 4.0 transfer at `{python} pcie4_gbs_str` GB/s means moving a 1 GB activation tensor adds approximately `{python} pcie4_1gb_ms_str` ms to the data movement cost, equivalent to roughly `{python} pcie4_1gb_equiv_ops_str` trillion operations on a GPU delivering `{python} A100BLAS.dense_tflops_str` TFLOPS. For a model forward pass taking 0.5 ms on GPU, transferring inputs and outputs over PCIe 3.0 doubles the total latency. When batches are small or models are lightweight, transfer overhead can exceed computation time entirely.
@@ -3082,9 +3082,9 @@ resnet_gflops_str = ResNetGFLOPS.resnet_gflops_str
 \index{GEMM!arithmetic intensity}
 With hardware abstraction managing the platform-specific details, frameworks build a layer of mathematical operations on top. General Matrix Multiply (GEMM)\index{GEMM!matrix multiplication}\index{Tensor Operations!GEMM} dominates ML computation (see @sec-algorithm-foundations-general-matrix-multiply-gemm-b55d for arithmetic intensity analysis and the roofline implications). The operation C = $\alpha$AB + $\beta$C accounts for the vast majority of arithmetic in neural networks: a single ResNet-50 forward pass performs approximately `{python} resnet_gflops_str` billion floating-point operations, nearly all of which reduce to GEMM. Frameworks optimize GEMM through cache-aware tiling (splitting matrices into blocks that fit in L1/L2 cache), loop unrolling for instruction-level parallelism, and shape-specific kernels. Fully connected layers use standard dense GEMM, while convolutional layers use im2col transformations that reshape input patches into matrix columns, converting convolution into GEMM.

-Beyond GEMM, frameworks implement BLAS operations\index{BLAS!vector and matrix operations} (AXPY for vector addition, GEMV for matrix-vector products) and element-wise operations\index{Element-wise Operations!memory bandwidth} (activation functions, normalization). Element-wise operations are individually cheap but collectively expensive due to memory bandwidth. Each operation reads and writes the full tensor, so a sequence of five element-wise operations on a 100 MB tensor moves 1 GB of data. Fusing those five operations into a single kernel reduces memory traffic to 200 MB, a 5x bandwidth savings that directly translates to faster execution.
+Beyond GEMM, frameworks implement BLAS operations\index{BLAS!vector and matrix operations} (AXPY for vector addition, GEMV for matrix-vector products) and element-wise operations\index{Element-wise Operations!memory bandwidth} (activation functions, normalization). Element-wise operations are individually cheap but collectively expensive due to memory bandwidth. Each operation reads and writes the full tensor, so a sequence of five element-wise operations on a 100 MB tensor moves 1 GB of data. Fusing those five operations into a single kernel reduces memory traffic to 200 MB, a 5× bandwidth savings that directly translates to faster execution.

-Numerical precision adds another dimension. Training in FP32 uses 4 bytes per parameter; quantizing to INT8 reduces this to 1 byte, cutting memory by 4x and enabling 2-4x throughput improvements on hardware with INT8 acceleration. Training typically requires FP32 for gradient stability, while inference runs at FP16 or INT8 with minimal accuracy loss. Frameworks maintain separate kernel implementations for each precision format and handle mixed-precision workflows where different layers operate at different bit widths within a single forward pass.
+Numerical precision adds another dimension. Training in FP32 uses 4 bytes per parameter; quantizing to INT8 reduces this to 1 byte, cutting memory by 4× and enabling 2–4× throughput improvements on hardware with INT8 acceleration. Training typically requires FP32 for gradient stability, while inference runs at FP16 or INT8 with minimal accuracy loss. Frameworks maintain separate kernel implementations for each precision format and handle mixed-precision workflows where different layers operate at different bit widths within a single forward pass.

 #### System-Level Operations {#sec-ml-frameworks-systemlevel-operations-6a1c}
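The fusion arithmetic in this hunk ("1 GB of data" vs. "200 MB, a 5× savings") is worth making explicit. An editorial sketch of the traffic model, using the sizes from the text:

```python
TENSOR_MB = 100   # working tensor from the example in the text
N_OPS = 5         # chain of element-wise operations

# Unfused: each op reads the full tensor and writes the full tensor.
unfused_traffic_mb = N_OPS * 2 * TENSOR_MB   # 1,000 MB moved

# Fused: one read of the input, one write of the final result.
fused_traffic_mb = 2 * TENSOR_MB             # 200 MB moved

savings = unfused_traffic_mb / fused_traffic_mb
```

Because element-wise chains are bandwidth-bound, the 5× reduction in bytes moved translates almost directly into a 5× reduction in wall-clock time.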
@@ -3545,13 +3545,13 @@ How large are these differences in practice? @tbl-framework-efficiency-matrix co
 | **Framework** | **Inference** **Latency (ms)** | **Memory** **Usage (MB)** | **Energy** **(mJ/inference)** | **Model Size** **Reduction** | **Hardware** **Utilization (%)** |
 |:--------------------------|-------------------------------:|--------------------------:|------------------------------:|-----------------------------:|---------------------------------:|
 | **TensorFlow** | 45 | 2,100 | 850 | None | 35 |
-| **TensorFlow Lite** | 12 | 180 | 120 | 4x (quantized) | 65 |
-| **TensorFlow Lite Micro** | 8 | 32 | 45 | 8x (pruned+quant) | 75 |
+| **TensorFlow Lite** | 12 | 180 | 120 | 4× (quantized) | 65 |
+| **TensorFlow Lite Micro** | 8 | 32 | 45 | 8× (pruned+quant) | 75 |
 | **PyTorch** | 52 | 1,800 | 920 | None | 32 |
-| **PyTorch Mobile** | 18 | 220 | 180 | 3x (quantized) | 58 |
-| **ONNX Runtime** | 15 | 340 | 210 | 2x (optimized) | 72 |
-| **TensorRT** | 3 | 450 | 65 | 2x (precision opt) | 88 |
-| **Apache TVM** | 6 | 280 | 95 | 3x (compiled) | 82 |
+| **PyTorch Mobile** | 18 | 220 | 180 | 3× (quantized) | 58 |
+| **ONNX Runtime** | 15 | 340 | 210 | 2× (optimized) | 72 |
+| **TensorRT** | 3 | 450 | 65 | 2× (precision opt) | 88 |
+| **Apache TVM** | 6 | 280 | 95 | 3× (compiled) | 82 |

 : **Framework Efficiency Comparison.** Quantitative comparison of major ML frameworks across efficiency dimensions using ResNet-50 inference on representative hardware (NVIDIA A100 GPU for server frameworks, ARM Cortex-A78 for mobile). Metrics reflect production workloads with accuracy maintained within 1% of baseline. Hardware utilization represents percentage of theoretical peak performance on typical operations. {#tbl-framework-efficiency-matrix}
@@ -3559,7 +3559,7 @@ How large are these differences in practice? @tbl-framework-efficiency-matrix co
 \index{Apache TVM!ML compiler}
 The efficiency data reveals several important patterns. First, specialized inference frameworks (TensorRT, Apache TVM) achieve 10–15× lower latency than general-purpose training frameworks (PyTorch, TensorFlow) on identical hardware, demonstrating that framework selection has quantitative performance implications beyond qualitative design preferences. Second, mobile-optimized variants (TF Lite, PyTorch Mobile) reduce memory requirements by 10× compared to their full counterparts while maintaining accuracy within 1% through quantization and graph optimization. Third, hardware utilization varies dramatically: TensorRT achieves 88% GPU utilization through aggressive kernel fusion while vanilla PyTorch achieves only 32%, a 2.75× efficiency gap that directly translates to cost differences in production deployment.

-These efficiency gaps, significant in the data center, become existential as we move beyond the server room. A 17x latency difference between PyTorch and TensorRT is an optimization opportunity on a cloud GPU; on a microcontroller with 256 KB of RAM, a framework that requires 1.8 GB of memory simply cannot run at all. The question shifts from "which framework is fastest?" to "which framework fits?"
+These efficiency gaps, significant in the data center, become existential as we move beyond the server room. A 17× latency difference between PyTorch and TensorRT is an optimization opportunity on a cloud GPU; on a microcontroller with 256 KB of RAM, a framework that requires 1.8 GB of memory simply cannot run at all. The question shifts from "which framework is fastest?" to "which framework fits?"

 ## Deployment Targets {#sec-ml-frameworks-deployment-targets-13f1}
@@ -3931,7 +3931,7 @@ This detailed trace through a single training step demonstrates how deeply the t

 ## Fallacies and Pitfalls {#sec-ml-frameworks-fallacies-pitfalls-61ef}

-Framework selection involves subtle trade-offs where intuitions from conventional software engineering fail. The memory wall, kernel fusion constraints, and deployment target diversity create pitfalls that waste months of engineering effort and cause production systems to miss latency targets by 10x or more.
+Framework selection involves subtle trade-offs where intuitions from conventional software engineering fail. The memory wall, kernel fusion constraints, and deployment target diversity create pitfalls that waste months of engineering effort and cause production systems to miss latency targets by 10× or more.

 ```{python}
 #| label: framework-gaps-calc
@@ -92,10 +92,10 @@ Hardware acceleration does not speed up the entire system; it *only speeds up th
 $$ Speedup = \frac{1}{(1 - p) + \frac{p}{S}} $$ {#eq-amdahl}

 * **$p$ (Parallel Fraction):** The matrix multiplications (typically 90–99% of an ML workload).
-* **$S$ (Speedup):** The raw speed advantage of the GPU/TPU over the CPU (typically 100x-1000x).
+* **$S$ (Speedup):** The raw speed advantage of the GPU/TPU over the CPU (typically 100–1,000×).
 * **$1-p$ (Serial Fraction):** Data loading, Python overhead, and kernel launch latency.\index{Serial Fraction!Amdahl bottleneck}\index{Kernel Launch Latency!serial overhead}

-**The Pitfall:** If data loading takes 10% of the time ($p=0.9$), even an **infinite speed** accelerator ($S=\infty$) can only achieve a **10x** total speedup. The "boring" serial part dominates the "exciting" AI part.
+**The Pitfall:** If data loading takes 10% of the time ($p=0.9$), even an **infinite speed** accelerator ($S=\infty$) can only achieve a **10×** total speedup. The "boring" serial part dominates the "exciting" AI part.
 :::

 \index{Acceleration Wall!diminishing returns}
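The pitfall in this hunk follows directly from the equation tagged @eq-amdahl. A one-function editorial sketch (the 100× comparison point is an illustrative assumption):

```python
def amdahl_speedup(p, s):
    """Total speedup when a fraction p of the work is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)

# p = 0.9: even a near-infinite accelerator caps total speedup just below 10x.
capped = amdahl_speedup(0.9, 1e12)

# A realistic 100x accelerator gets slightly less still.
realistic = amdahl_speedup(0.9, 100.0)   # ~9.17x
```

The gap between `capped` and `realistic` is small precisely because the serial 10% dominates: past a point, a faster accelerator buys almost nothing.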
@@ -3585,7 +3585,7 @@ The mapping strategies from the preceding section establish *where* computations
Three questions structure all dataflow decisions:

1. **Which data stays local?** Weight-stationary, output-stationary, and input-stationary strategies each make different choices about what to cache near compute units, trading off different memory access patterns.
-2. **How is data organized?** Tensor layouts (NHWC vs. NCHW) determine whether memory accesses align with hardware preferences, with performance impacts of 2-5x.
+2. **How is data organized?** Tensor layouts (NHWC vs. NCHW) determine whether memory accesses align with hardware preferences, with performance impacts of 2–5×.
3. **How are operations combined?** Kernel fusion and tiling restructure computation to minimize memory traffic, often achieving 2–10× speedups through reduced data movement alone.
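The traffic saving behind question 3 can be sketched with a back-of-the-envelope model. The function names and the three-op chain are illustrative assumptions, and the model ignores caches and multi-input operations:

```python
BYTES = 4  # fp32

def traffic_unfused(n_elems, n_ops):
    # Each standalone elementwise kernel reads its input from DRAM
    # and writes its result back to DRAM.
    return n_ops * 2 * n_elems * BYTES

def traffic_fused(n_elems, n_ops):
    # One fused kernel reads the input once, keeps intermediates
    # in registers, and writes the final result once.
    return 2 * n_elems * BYTES

n = 1_000_000  # tensor elements; e.g. scale -> bias add -> ReLU (3 ops)
print(traffic_unfused(n, 3) // traffic_fused(n, 3))  # 3: fusion cuts DRAM traffic 3x
```

Under this simplified model, fusing a chain of k elementwise ops divides memory traffic by k, which is where the 2–10× range comes from for typical chains.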

By mastering these patterns, we can reason about 90% of dataflow optimization decisions without exhaustive search. We examine each question in turn, then see how they combine for specific neural network architectures including ResNet-50, GPT-2, and MLPs.

@@ -546,7 +546,7 @@ This hybrid approach combined human-engineered features with statistical learnin
4. Post-processing
:::

-[^fn-viola-jones]: **Viola-Jones Algorithm**: A groundbreaking computer vision algorithm that detected faces in real-time by using simple rectangular patterns (comparing brightness of eye regions versus cheek regions) and making decisions in stages, filtering out non-faces quickly. The cascade approach reduced computation 10-100x by rejecting easy negatives early, making real-time vision feasible on CPUs. This compute-saving pattern appears throughout edge ML systems where power budgets matter.
+[^fn-viola-jones]: **Viola-Jones Algorithm**: A groundbreaking computer vision algorithm that detected faces in real-time by using simple rectangular patterns (comparing brightness of eye regions versus cheek regions) and making decisions in stages, filtering out non-faces quickly. The cascade approach reduced computation 10–100× by rejecting easy negatives early, making real-time vision feasible on CPUs. This compute-saving pattern appears throughout edge ML systems where power budgets matter.
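The footnote's 10–100× saving follows directly from cascade arithmetic: cheap early stages reject most windows before expensive stages run. A sketch with made-up stage sizes and rejection rates (not Viola-Jones's actual numbers):

```python
def cascade_cost(n_windows, stage_costs, pass_rates):
    """Expected feature evaluations for a detection cascade.

    stage_costs[i]: features evaluated per window at stage i.
    pass_rates[i]: fraction of windows surviving stage i.
    """
    total, surviving = 0.0, float(n_windows)
    for cost, rate in zip(stage_costs, pass_rates):
        total += surviving * cost   # every surviving window pays this stage
        surviving *= rate           # most non-faces are rejected here
    return total

windows = 100_000
stages = [2, 10, 50, 200]           # cheap stages first, expensive stages last
rates = [0.5, 0.2, 0.1, 0.05]       # aggressive early rejection
full = windows * sum(stages)        # monolithic detector: all features, all windows
print(full / cascade_cost(windows, stages, rates))  # ~19x cheaper with these rates
```

Tuning stage order and rejection rates moves the saving across the 10–100× range the footnote describes.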

```{python}
#| echo: false
@@ -1835,7 +1835,7 @@ To appreciate the magnitude of these gains, consider the trajectory in @fig-algo
#| label: fig-algo-efficiency
#| echo: false
#| fig-cap: "**Algorithmic Efficiency Trajectory.** Training efficiency factor relative to AlexNet (2012 baseline) for ImageNet classification. Each point represents a model architecture that achieves comparable accuracy with fewer computational resources. The trajectory from AlexNet (1×) through VGG, ResNet, MobileNet, and ShuffleNet to EfficientNet (44×) demonstrates that algorithmic innovation has delivered a 44-fold reduction in required compute over eight years, independent of hardware improvements."
-#| fig-alt: "Scatter plot showing training efficiency factor from 2012 to 2020. Red dots mark models from AlexNet at 1x to EfficientNet at 44x. Dashed trend line curves upward. Labels identify VGG, ResNet, MobileNet, ShuffleNet versions at their positions."
+#| fig-alt: "Scatter plot showing training efficiency factor from 2012 to 2020. Red dots mark models from AlexNet at 1× to EfficientNet at 44×. Dashed trend line curves upward. Labels identify VGG, ResNet, MobileNet, ShuffleNet versions at their positions."

# ┌─────────────────────────────────────────────────────────────────────────────
# │ ALGORITHMIC EFFICIENCY TRAJECTORY (FIGURE)

@@ -515,7 +515,7 @@ The ML lifecycle is not a straight line; it is a spiral of continuous refinement
- [ ] **Problem Definition**: Have you defined success metrics that actually map to business value?
- [ ] **Data**: Is your data pipeline reproducible? Can you trace a model prediction back to the training data version?
- [ ] **Modeling**: Are you iterating fast enough? (The **Iteration Tax** says speed matters as much as quality).
-- [ ] **Deployment**: Have you accounted for the **Constraint Propagation Principle**? (A constraint ignored at stage 1 costs 16x to fix at stage 5).
+- [ ] **Deployment**: Have you accounted for the **Constraint Propagation Principle**? (A constraint ignored at stage 1 costs 16× to fix at stage 5).
:::

\index{Iron Law of Workflow!definition}

@@ -553,7 +553,7 @@ While the preceding sections established the technical foundations of deep learn
```{python}
#| label: fig-trends
#| echo: false
-#| fig-cap: "**Computational Growth**: Log-scale scatter plot showing training compute in FLOPS from 1952 to 2025. Computational power grew at a 1.4x rate from 1952 to 2010, then accelerated to a doubling every 3.4 months from 2012 to 2025. Large-scale models after 2015 followed an even faster 10-month doubling cycle, addressing the historical bottleneck of training complex neural networks."
+#| fig-cap: "**Computational Growth**: Log-scale scatter plot showing training compute in FLOPS from 1952 to 2025. Computational power grew at a 1.4× rate from 1952 to 2010, then accelerated to a doubling every 3.4 months from 2012 to 2025. Large-scale models after 2015 followed an even faster 10-month doubling cycle, addressing the historical bottleneck of training complex neural networks."
#| fig-alt: "Log-scale scatter plot showing training compute in FLOPS from 1950 to 2025. Points represent AI models, with different colors for pre-deep-learning era and deep learning era. Trend lines show the acceleration of compute usage over time."

# ┌─────────────────────────────────────────────────────────────────────────────

@@ -1494,7 +1494,7 @@ Training large models requires managing the memory wall (the bandwidth bottlenec

**The Bottleneck**

-- [ ] **Activation Memory**: Do you understand why activations (stored for backprop) dominate memory usage, often exceeding parameter size by 10x?
+- [ ] **Activation Memory**: Do you understand why activations (stored for backprop) dominate memory usage, often exceeding parameter size by 10×?
- [ ] **Optimization Strategy**: Can you explain how **Gradient Checkpointing** trades compute (re-calculating activations) for memory capacity?

**Scaling Limits**
@@ -4019,7 +4019,7 @@ Not all operations are equally expensive to recompute, which motivates *selectiv

Gradient accumulation[^fn-gradient-accumulation-training] simulates larger batch sizes without increasing memory requirements for storing the full batch. Larger batch sizes improve gradient estimates, leading to more stable convergence and faster training. This flexibility proves particularly valuable when training on high-resolution data where even a single batch may exceed available memory.

-[^fn-gradient-accumulation-training]: **Gradient Accumulation Impact**: Enables effective batch sizes of 2048+ on single GPUs with only 32--64 micro-batch size, essential for transformer training. BERT-Large training uses effective batch size of 256 (accumulated over 8 steps) achieving 99.5% of full-batch performance while reducing memory requirements by 8x. The technique trades 10--15% compute overhead for massive memory savings.
+[^fn-gradient-accumulation-training]: **Gradient Accumulation Impact**: Enables effective batch sizes of 2048+ on single GPUs with only 32--64 micro-batch size, essential for transformer training. BERT-Large training uses effective batch size of 256 (accumulated over 8 steps) achieving 99.5% of full-batch performance while reducing memory requirements by 8×. The technique trades 10--15% compute overhead for massive memory savings.
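The equivalence the footnote relies on can be checked in a few lines of NumPy. This is a sketch with a linear least-squares model, not the book's training code: the gradient accumulated over 8 micro-batches of 32 matches the full-batch gradient over 256 samples exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 8)), rng.normal(size=256)
w = rng.normal(size=8)

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2) w.r.t. w.
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient computed in one shot:
g_full = grad(X, y, w)

# Same gradient accumulated over 8 micro-batches of 32, averaged at the end:
accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, 8), np.split(y, 8)):
    accum += grad(Xb, yb, w)
g_accum = accum / 8

print(np.allclose(g_full, g_accum))  # True: same update, 1/8 the peak batch memory
```

Only the micro-batch activations ever reside in memory at once, which is where the footnote's 8× memory reduction comes from.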
[^fn-training-activation-checkpointing]: **Activation Checkpointing Trade-offs**: Reduces memory usage by 50--90% at the cost of 15--30% additional compute time due to recomputation. For training GPT-3 on V100s, checkpointing enables 2.8× larger models (from 1.3 B to 3.7 B parameters) within `{python} TrainingHardware.v100_mem_str` GB memory constraints, making it essential for memory-bound large model training despite the compute penalty.
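A simplified counting model makes the checkpointing trade-off concrete. The function and the every-k-layers policy below are illustrative assumptions (real checkpointing schedules are more sophisticated):

```python
import math

def activations_stored(n_layers, checkpoint_every=None):
    """Peak activations held during the backward pass (simplified model).

    Without checkpointing, every layer's activation is stored. With
    checkpointing, only every k-th activation is kept, plus one in-flight
    segment of k activations that is recomputed on demand.
    """
    if checkpoint_every is None:
        return n_layers
    return math.ceil(n_layers / checkpoint_every) + checkpoint_every

layers = 100
print(activations_stored(layers))       # 100 activations stored
print(activations_stored(layers, 10))   # 20: roughly 5x less activation memory
```

The recompute cost is roughly one extra forward pass over each segment, matching the footnote's 15--30% compute overhead; choosing k near the square root of the depth minimizes the stored count in this model.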
@@ -4755,7 +4755,7 @@ Hybrid strategies combine these approaches because each has different scaling ch

The implementation details—gradient synchronization algorithms (AllReduce\index{AllReduce!gradient synchronization}\index{Ring AllReduce!bandwidth optimization}[^fn-allreduce], ring-reduce), communication patterns (parameter server, peer-to-peer), fault tolerance mechanisms, and scaling efficiency analysis for training runs spanning thousands of GPUs—constitute a specialized domain that builds on the foundations established here.

-[^fn-allreduce]: **AllReduce**: A collective communication primitive that aggregates data across all participating devices and distributes the result back to each. For gradient synchronization, AllReduce sums gradients from all GPUs so each has the identical averaged gradient. Ring AllReduce [@patarasuk2009bandwidth], popularized by Baidu in 2017, achieves bandwidth-optimal performance by passing data in a ring topology, requiring only 2(N-1)/N of the data volume (approaching 2x for large N) regardless of participant count, making it the standard for data-parallel training.
+[^fn-allreduce]: **AllReduce**: A collective communication primitive that aggregates data across all participating devices and distributes the result back to each. For gradient synchronization, AllReduce sums gradients from all GPUs so each has the identical averaged gradient. Ring AllReduce [@patarasuk2009bandwidth], popularized by Baidu in 2017, achieves bandwidth-optimal performance by passing data in a ring topology, requiring only 2(N-1)/N of the data volume (approaching 2× for large N) regardless of participant count, making it the standard for data-parallel training.
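The two-phase ring algorithm (reduce-scatter, then all-gather) and its 2(N-1)/N volume can be verified with a small simulation. This sketch is illustrative, not a production implementation, and models communication as sequential chunk hand-offs around the ring:

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulate ring AllReduce over N nodes: reduce-scatter, then all-gather.

    Returns the reduced array held by each node and the per-node send volume
    as a fraction of the full data size.
    """
    n = len(arrays)
    # Each node splits its gradient into n chunks.
    data = [list(np.array_split(a, n)) for a in arrays]
    sends = 0
    # Reduce-scatter: after n-1 steps, node i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            data[(i + 1) % n][c] = data[(i + 1) % n][c] + data[i][c]
            sends += 1
    # All-gather: circulate each completed chunk around the ring for n-1 steps.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step + 1) % n
            data[(i + 1) % n][c] = data[i][c]
            sends += 1
    # sends/n chunks per node, each 1/n of the array: 2*(n-1)/n of the data.
    volume_fraction = sends / n / n
    return [np.concatenate(d) for d in data], volume_fraction

grads = [np.full(8, float(i)) for i in range(4)]   # 4 "GPUs" with gradients 0..3
reduced, frac = ring_allreduce(grads)
print(np.allclose(reduced[0], 6.0), frac)  # every node holds the sum; frac = 1.5
```

With 4 participants each node sends 2(4-1)/4 = 1.5 array-volumes regardless of how many other nodes join, which is the bandwidth-optimality property the footnote describes.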

### The Evolution of Training Infrastructure {#sec-model-training-evolution-training-infrastructure-f3a6}
