Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-04-30 01:29:07 -05:00
fix: replace directional figure/table refs with explicit @tbl- cross-refs
Replace 'above'/'below' references with stable @tbl- and @fig- IDs
to avoid broken refs when content or layout changes.
Vol 1: model_compression.qmd - deployment gap table
Vol 2: collective_communication, compute_infrastructure, introduction,
ops_scale
@@ -284,9 +284,9 @@ According to the **Iron Law** ($T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot

| **Operation**   | **Bit-Width** | **Relative Energy** |
|:----------------|--------------:|--------------------:|
| **Integer Add** | 8-bit         | 1$\times$           |
| **Float Add**   | 32-bit        | 30$\times$          |
| **DRAM Read**   | 64-bit        | **40,000$\times$**  |

**For Inference**: Moving from FP32 to INT8 doesn't just save 4$\times$ memory; it can reduce the **energy per inference** by up to **`{python} int8_energy_reduction_str`$\times$** on hardware with dedicated INT8 units, depending on the compute-to-memory ratio of the workload.
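
To make the energy arithmetic concrete, here is a back-of-envelope sketch. It is illustrative only: the per-operation costs are the relative figures from the table above, while the layer shape (MAC and parameter counts) is an assumed example rather than a value from the chapter.

```python
# Back-of-envelope sketch (assumed layer shape; relative energies from the
# table above: 8-bit add = 1x, 32-bit float add = 30x, 64-bit DRAM read = 40,000x).
REL_ENERGY = {"int8_math": 1, "fp32_math": 30, "dram_read_64b": 40_000}

def layer_energy(macs, weight_bytes, math_cost):
    """Relative energy: arithmetic plus weight traffic counted as 64-bit DRAM reads."""
    dram_reads = weight_bytes * 8 / 64
    return macs * math_cost + dram_reads * REL_ENERGY["dram_read_64b"]

macs = 1_000_000     # hypothetical layer: one million multiply-accumulates
params = 250_000     # hypothetical parameter count for that layer

fp32 = layer_energy(macs, params * 4, REL_ENERGY["fp32_math"])  # 4 bytes per weight
int8 = layer_energy(macs, params * 1, REL_ENERGY["int8_math"])  # 1 byte per weight
print(f"estimated energy reduction: {fp32 / int8:.1f}x")
```

For this memory-dominated example the ratio comes out near 4$\times$; compute-dominated layers see a different split, which is why the headline figure is conditioned on the compute-to-memory ratio.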
@@ -494,7 +494,7 @@ TinyML\index{TinyML!model compression}[^fn-microcontroller-constraints] makes op
[^fn-microcontroller-constraints]: **Microcontroller Constraints**: Microcontrollers operate under severe constraints relative to servers and modern accelerators, often with *kilobytes to low megabytes* of RAM and limited persistent storage. A practical mental model is that you may have \(10^3\) to \(10^6\) bytes of memory available for the entire pipeline, which is why "model optimization" is often a prerequisite rather than an optional improvement in embedded deployments.

-The deployment gap table below quantifies this mismatch using the Lighthouse models from @sec-ml-systems. The gap between model requirements and device capabilities explains *why* compression is not optional for resource-constrained deployment: without it, the models simply cannot run.
+@tbl-model-vs-device quantifies this mismatch using the Lighthouse models from @sec-ml-systems. The gap between model requirements and device capabilities explains *why* compression is not optional for resource-constrained deployment: without it, the models simply cannot run.

```{python}
#| label: model-device-comparison
@@ -633,16 +633,16 @@ Optimization is a search for the **Pareto Frontier**.
@tbl-optimization-tradeoffs summarizes the key optimization techniques, their systems benefits, and their ML costs. These are empirical relationships—actual results depend on model architecture, task, and careful implementation.

| **Technique**                | **Systems Gain**                         | **ML Cost**          | **Typical Impact**   | **Region** |
|:-----------------------------|:-----------------------------------------|:---------------------|:---------------------|:----------:|
| **Operator Fusion**          | 10–30% latency reduction                 | None                 | No accuracy loss     | 1          |
| **FP32 → BF16**              | 2$\times$ memory, ~2$\times$ throughput  | Minimal              | <0.1% accuracy drop  | 1          |
| **FP16 → INT8**              | 2$\times$ memory, 2–4$\times$ throughput | Quantization error   | 0.5–1% accuracy drop | 2          |
| **50% Pruning**              | ~2$\times$ smaller model                 | Capacity loss        | 0.5–1% accuracy drop | 2          |
| **Knowledge Distillation**   | 2–10$\times$ smaller student             | Capability ceiling   | 1–3% accuracy drop   | 2          |
| **4-bit Quantization**       | 4$\times$ memory reduction               | Significant error    | 2–5% accuracy drop   | 2–3        |
| **90% Pruning**              | ~10$\times$ smaller model                | Severe capacity loss | 5–15% accuracy drop  | 3          |
| **↑ Batch Size (8$\times$)** | Higher throughput, better GPU util       | Generalization gap   | Requires LR scaling  | —          |

: **The Optimization Tradeoffs.** Region 1 = Free Lunch, Region 2 = Efficient Trade, Region 3 = Danger Zone. Batch size affects training dynamics rather than model quality directly. These ranges are empirical guidelines from published benchmarks [@jacob2018quantization; @han2015deep; @hinton2015distilling]; actual results vary with architecture, task, and implementation quality. {#tbl-optimization-tradeoffs}
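
Since the surrounding section frames optimization as a search for the Pareto frontier, here is a small illustrative sketch. The (accuracy-drop, speedup) points are assumed values loosely echoing the table, not the chapter's data; the filter keeps only techniques that no other candidate beats on both axes.

```python
# Illustrative Pareto-frontier filter over (accuracy_drop_pct, speedup) pairs.
# Values are assumed examples loosely based on @tbl-optimization-tradeoffs.
candidates = {
    "Operator Fusion":    (0.0, 1.2),
    "FP32 -> BF16":       (0.1, 2.0),
    "FP16 -> INT8":       (1.0, 3.0),
    "50% Pruning":        (1.0, 2.0),
    "4-bit Quantization": (3.5, 4.0),
    "90% Pruning":        (10.0, 10.0),
}

def pareto(points):
    """Keep techniques not dominated by one with lower accuracy cost AND higher gain."""
    frontier = {}
    for name, (cost, gain) in points.items():
        dominated = any(c <= cost and g >= gain and (c, g) != (cost, gain)
                        for c, g in points.values())
        if not dominated:
            frontier[name] = (cost, gain)
    return frontier

print("Pareto-efficient:", sorted(pareto(candidates)))
```

In this toy data, 50% pruning drops out because INT8 quantization offers more speedup at the same accuracy cost; everything else survives on the frontier.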
@@ -3250,16 +3250,16 @@ To appreciate how precision loss manifests in practice, examine the representati
@tbl-numerics compares commonly used numerical precision formats in machine learning, each exhibiting distinct trade-offs in storage efficiency, computational speed, and energy consumption. Emerging formats like FP8\index{FP8} and TF32\index{TF32 (TensorFloat-32)} have been introduced to further optimize performance, especially on AI accelerators.

| **Precision Format**                       | **Bit-Width** | **Storage Reduction (vs FP32)** | **Compute Speed (vs FP32)**                  | **Power Consumption** | **Use Cases**                                                |
|:-------------------------------------------|--------------:|--------------------------------:|---------------------------------------------:|:----------------------|:-------------------------------------------------------------|
| **FP32 (Single-Precision Floating Point)** | 32-bit        | Baseline (1$\times$)            | Baseline (1$\times$)                         | High                  | Training & inference (general-purpose)                       |
| **FP16 (Half-Precision Floating Point)**   | 16-bit        | 2$\times$ smaller               | 2$\times$ faster on FP16-optimized hardware  | Lower                 | Accelerated training, inference (NVIDIA Tensor Cores, TPUs)  |
| **bfloat16 (Brain Floating Point)**        | 16-bit        | 2$\times$ smaller               | Similar speed to FP16, better dynamic range  | Lower                 | Training on TPUs, transformer-based models                   |
| **TF32 (TensorFloat-32)**                  | 19-bit        | Similar to FP16                 | Up to 8$\times$ faster on NVIDIA Ampere GPUs | Lower                 | Training on NVIDIA GPUs                                      |
| **FP8 (Floating-Point 8-bit)**             | 8-bit         | 4$\times$ smaller               | Faster than INT8 in some cases               | Significantly lower   | Efficient training/inference (H100, AI accelerators)         |
| **INT8 (8-bit Integer)**                   | 8-bit         | 4$\times$ smaller               | 4–8$\times$ faster than FP32                 | Significantly lower   | Quantized inference (Edge AI, mobile AI, NPUs)               |
| **INT4 (4-bit Integer)**                   | 4-bit         | 8$\times$ smaller               | Hardware-dependent                           | Extremely low         | Ultra-low-power AI, experimental quantization                |
| **Binary/Ternary (1-bit / 2-bit)**         | 1–2-bit       | 16–32$\times$ smaller           | Highly hardware-dependent                    | Lowest                | Extreme efficiency (binary/ternary neural networks)          |

: **Numerical Precision Formats**: Comparison of precision formats by bit width, memory reduction, computational efficiency, accuracy retention, and typical use cases across deployment contexts. {#tbl-numerics}
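
To connect the bit-widths to concrete storage numbers, here is a short illustrative sketch. The 7B-parameter model size is an assumed example, not a figure from the chapter, and TF32 is omitted because it is a compute format rather than a storage format.

```python
# Illustrative weight-storage footprint for a hypothetical 7B-parameter model
# under the fixed-bit-width formats of @tbl-numerics.
BITS = {"FP32": 32, "FP16": 16, "bfloat16": 16, "FP8": 8,
        "INT8": 8, "INT4": 4, "Binary": 1}

params = 7_000_000_000  # assumed model size

for fmt, bits in BITS.items():
    gib = params * bits / 8 / 2**30
    print(f"{fmt:>8}: {gib:6.1f} GiB  ({32 // bits}x smaller than FP32)")
```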
@@ -313,7 +313,7 @@ Understanding this mapping is essential: the *what* of parallelism directly dete
Expert Parallelism refers to the Mixture of Experts (MoE)[^fn-moe] architecture pattern.

-The table above shows that different parallelism strategies impose fundamentally different communication patterns. Data parallelism and FSDP generate large, bandwidth-bound messages that benefit from ring-based algorithms and hierarchical decomposition. Tensor and pipeline parallelism generate small, latency-bound messages that benefit from tree-based algorithms and low-overhead software stacks. Expert parallelism generates all-to-all traffic patterns that stress the network's bisection bandwidth. To reason quantitatively about these differences, we need a model of network performance.
+@tbl-parallelism-communication-mapping shows that different parallelism strategies impose fundamentally different communication patterns. Data parallelism and FSDP generate large, bandwidth-bound messages that benefit from ring-based algorithms and hierarchical decomposition. Tensor and pipeline parallelism generate small, latency-bound messages that benefit from tree-based algorithms and low-overhead software stacks. Expert parallelism generates all-to-all traffic patterns that stress the network's bisection bandwidth. To reason quantitatively about these differences, we need a model of network performance.

## Mapping the Terrain: Network Performance Modeling {#sec-communication-collective-operations-collective-operations-network-performance-modeling-0d8e}
@@ -528,12 +528,12 @@ The alpha-beta model provides useful first-order predictions, but real communica

| **Message Size** | **$\alpha$-$\beta$ Prediction** | **Measured NCCL** | **Ratio (Measured/Predicted)** | **Explanation**                          |
|:-----------------|--------------------------------:|------------------:|-------------------------------:|:-----------------------------------------|
| 1 KB             | ~3.1 μs                         | ~25 μs            | ~8.0$\times$                   | NCCL protocol setup dominates            |
| 64 KB            | ~4.3 μs                         | ~30 μs            | ~7.0$\times$                   | Still latency-bound; NCCL overhead       |
| 1 MB             | ~23 μs                          | ~40 μs            | ~1.7$\times$                   | Transitioning to bandwidth-bound         |
| 64 MB            | ~1.3 ms                         | ~1.6 ms           | ~1.2$\times$                   | NCCL approaches theoretical bandwidth    |
| 1 GB             | ~20 ms                          | ~23 ms            | ~1.15$\times$                  | Bandwidth-dominant; NCCL nearly optimal  |
| 10 GB            | ~200 ms                         | ~215 ms           | ~1.08$\times$                  | Large payloads saturate the wire         |

: **$\alpha$-$\beta$ Predictions vs. Measured NCCL Performance**: For small messages, NCCL's protocol overhead (memory registration, channel setup, kernel launch) inflates the effective latency by 7--8$\times$ over the bare-wire alpha. For large messages, NCCL achieves within 8--15% of theoretical bandwidth, validating the model's predictions in the bandwidth-bound regime. Measurements represent Ring AllReduce on 8-node DGX H100 clusters with InfiniBand NDR 400G. {#tbl-nccl-vs-theory}
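
To make the prediction column reproducible, here is a minimal sketch of the alpha-beta cost $T(m) = \alpha + m/\beta$, written with $\beta$ as bandwidth in bytes per second. The constants ($\alpha \approx 3.1$ μs, $\beta \approx 50$ GB/s) are assumed values chosen to be consistent with the table's prediction column; the sketch is a first-order estimate and does not model NCCL's protocol overhead.

```python
# Sketch of the alpha-beta transfer-time model T(m) = alpha + m / beta.
# Constants are assumptions consistent with @tbl-nccl-vs-theory's predictions.
ALPHA_S = 3.1e-6    # per-message startup latency (assumed)
BETA_BPS = 50e9     # effective link bandwidth in bytes/s (assumed)

def alpha_beta_time(message_bytes: float) -> float:
    """First-order transfer time for a single message."""
    return ALPHA_S + message_bytes / BETA_BPS

for label, size in [("1 KB", 2**10), ("1 MB", 2**20),
                    ("64 MB", 64 * 2**20), ("1 GB", 2**30)]:
    print(f"{label:>6}: {alpha_beta_time(size) * 1e6:10.1f} us")
```

Small messages are dominated by the fixed $\alpha$ term, large messages by the $m/\beta$ term, which is exactly the latency-bound versus bandwidth-bound split the table illustrates.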
@@ -1005,8 +1005,8 @@ The algorithms above assume a flat network where every link has the same bandwid
The algorithms above (Ring, Tree, Butterfly, Double Binary Tree) all assume a **flat** network where every link has the same bandwidth. This assumption holds within a single node (where all GPUs are connected by NVLink at equal bandwidth) but fails spectacularly in multi-node clusters. Real clusters are **hierarchical**, with fundamentally different bandwidths at each tier, as @tbl-bandwidth-hierarchy quantifies:

| **Tier**       | **Interconnect**    | **Bandwidth** | **Relative Speed**   |
|:---------------|--------------------:|--------------:|---------------------:|
| **Intra-Node** | NVLink 4.0          | ~900 GB/s     | 18$\times$ faster    |
| **Inter-Node** | InfiniBand NDR 400G | ~50 GB/s      | 1$\times$ (baseline) |
@@ -1499,10 +1499,10 @@ Gradient compression is not free; it trades reduced communication for increased

| **Method**                | **Compression Ratio** | **Convergence Impact**         | **Best Use Case**              |
|:--------------------------|----------------------:|-------------------------------:|:-------------------------------|
| **FP16**                  | 2$\times$             | Negligible                     | Default for all training       |
| **INT8 + Error FB**       | 4$\times$             | Minor slowdown (~5-10%)        | Bandwidth-constrained clusters |
| **Top-K (1%) + Error FB** | 100$\times$           | Moderate slowdown (~10-20%)    | Cross-datacenter training      |
| **1-bit + Error FB**      | 32$\times$            | Significant slowdown (~20-30%) | Extreme bandwidth constraints  |
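
As a minimal illustration of how the "Error FB" entries work, here is a sketch of Top-K gradient sparsification with error feedback in NumPy. It is an assumed, simplified implementation for a single worker, not the chapter's or any library's code: the un-sent residual is carried into the next step so the compression error does not accumulate.

```python
import numpy as np

def topk_with_error_feedback(grad, residual, k_fraction=0.01):
    """Return (sparse values, indices, new residual) for one worker's gradient."""
    corrected = grad + residual                        # apply carried-over error
    k = max(1, int(k_fraction * corrected.size))
    idx = np.argpartition(np.abs(corrected), -k)[-k:]  # largest-magnitude entries
    values = corrected[idx]
    new_residual = corrected.copy()
    new_residual[idx] = 0.0                            # sent entries leave the residual
    return values, idx, new_residual

# Toy usage: compress a 1M-element gradient to ~1% of its entries.
rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000).astype(np.float32)
r = np.zeros_like(g)
vals, idx, r = topk_with_error_feedback(g, r)
print(f"sent {vals.size} of {g.size} values "
      f"({g.size / vals.size:.0f}x compression before index overhead)")
```

The nominal 100$\times$ ratio for Top-K (1%) counts only the values; in practice the transmitted indices add overhead, which is one reason measured ratios fall short of the headline number.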
**When to Use Compression:**
@@ -596,7 +596,7 @@ A subtlety that affects fleet consistency is the **silicon lottery** -- the manu
: **NVIDIA GPU Architecture Evolution**. Each generation approximately doubles efficiency (TFLOPS/W) while introducing architectural features targeted at the dominant ML workload pattern of its era. The NVLink bandwidth doubles each generation, tracking the growth in model sizes that require increasingly aggressive tensor parallelism. {#tbl-gpu-evolution}

-The table above compresses four hardware generations into a few columns. @fig-accelerator-efficiency-wall unpacks two of those columns, raw throughput and power efficiency, to reveal a divergence that shapes every infrastructure decision in this volume.
+@tbl-gpu-evolution compresses four hardware generations into a few columns. @fig-accelerator-efficiency-wall unpacks two of those columns, raw throughput and power efficiency, to reveal a divergence that shapes every infrastructure decision in this volume.

::: {#fig-accelerator-efficiency-wall fig-env="figure" fig-pos="htb" fig-cap="**The Accelerator Efficiency Wall**. FP16 throughput (blue, left axis) and power efficiency (green, right axis) for six generations of NVIDIA datacenter GPUs, both on logarithmic scales. Raw throughput has grown 236$\\times$ from the P100 to the B200, but efficiency (TFLOPS per watt) has grown only 70$\\times$ over the same period. The shaded region highlights the widening gap that the power grid, cooling systems, and datacenter infrastructure must absorb." fig-alt="Dual-axis log-scale plot showing GPU FP16 TFLOPS and TFLOPS per watt from 2016 to 2024 across six NVIDIA GPU generations."}
```{python}
@@ -756,12 +756,12 @@ The entire HBM assembly sits on the same silicon interposer as the processor die
The interposer itself is a passive silicon substrate with etched wiring layers that connect the HBM stacks to the processor, forming what is effectively a miniature PCB made of silicon rather than fiberglass. The interposer's silicon construction allows much finer trace widths and tighter pitches than a fiberglass PCB, enabling thousands of parallel connections between the HBM stacks and the processor die in a physical area of just a few hundred square millimeters.

| **Metric**          | **Host DRAM** (DDR5)        | **Accelerator HBM** (HBM3e)  | **Scaling Factor**        |
|:--------------------|:---------------------------:|:----------------------------:|:-------------------------:|
| **Mechanism**       | 2D PCB Traces               | **3D Die Stacking**          | -                         |
| **Placement**       | Socketed DIMMs              | **On-package (Substrate)**   | **Physical Proximity**    |
| **Bandwidth**       | ~`{python} ddr_bw_str` GB/s | **~`{python} h100_bw` TB/s** | **~50$\times$ Faster**    |
| **Interface Width** | 64-bit                      | **1024-bit per stack**       | **16$\times$ Wider**      |
| **Energy**          | ~20 pJ/bit                  | **~2 pJ/bit**                | **10$\times$ Efficiency** |

: **HBM vs. Standard DRAM Comparison**. HBM achieves its bandwidth advantage through three simultaneous innovations: 3D die stacking (more bits per package), TSV interconnects (shorter signal paths), and on-package placement (proximity to the processor). {#tbl-hbm-comparison}
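
As a quick illustration of the energy row, this sketch converts the approximate per-bit figures from @tbl-hbm-comparison (~20 pJ/bit for DDR5, ~2 pJ/bit for HBM3e) into energy per gibibyte moved; the conversion itself is simple arithmetic, not a chapter result.

```python
# Energy to move 1 GiB at the table's approximate per-bit costs.
PJ_PER_BIT = {"DDR5 (~20 pJ/bit)": 20, "HBM3e (~2 pJ/bit)": 2}
bytes_moved = 2**30  # 1 GiB

for mem, pj in PJ_PER_BIT.items():
    joules = bytes_moved * 8 * pj * 1e-12
    print(f"{mem}: {joules * 1000:6.1f} mJ per GiB moved")
```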
@@ -1607,11 +1607,11 @@ This pipelining ensures that each layer's computation can proceed without waitin
The following table summarizes the memory tiers available within a DGX H100 node:

| **Memory Tier** | **Capacity**             | **Bandwidth**     | **Latency** | **Role**              |
|:----------------|:------------------------:|:-----------------:|:-----------:|:----------------------|
| **GPU HBM3**    | 80 GB$\times$ 8 = 640 GB | 3.35 TB/s per GPU | <1 μs       | Active computation    |
| **Host DDR5**   | 2 TB                     | ~50 GB/s          | ~100 ns     | Optimizer state       |
| **NVMe SSD**    | 8--30 TB                 | 5--7 GB/s         | ~10 μs      | Overflow, checkpoints |

: **Node Memory Hierarchy**. Each tier trades capacity for bandwidth. The training framework's memory manager must orchestrate data flow across these tiers to fit models whose total state exceeds the aggregate HBM capacity. {#tbl-node-memory}
@@ -1758,15 +1758,15 @@ The cumulative efficiency across all five stages is typically 85--90%, meaning t
To make this concrete, consider the power budget for a single rack containing four DGX H100 nodes:

| **Component**                      | **Power (kW)** | **% of Rack Total** |
|:-----------------------------------|:--------------:|:-------------------:|
| **GPU compute (32$\times$ 700 W)** | 22.4           | 67%                 |
| **Host CPUs and DRAM**             | 3.2            | 10%                 |
| **NVSwitch fabric**                | 1.6            | 5%                  |
| **InfiniBand HCAs**                | 0.8            | 2%                  |
| **Power conversion losses**        | 2.8            | 8%                  |
| **Cooling overhead (PUE 1.1)**     | 2.7            | 8%                  |
| **Total**                          | **33.5**       | **100%**            |

: **Power Budget for a Four-Node DGX H100 Rack**. GPUs consume two thirds of total rack power, but the remaining one third (host systems, networking, power conversion, and cooling) cannot be eliminated and must be budgeted for when sizing the facility's electrical infrastructure. The power conversion losses (8%) represent the cumulative inefficiency of the five-stage delivery chain. {#tbl-rack-power}
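
The budget can be cross-checked with a few lines of arithmetic; this sketch simply recomputes the total and percentage shares from the component values listed in @tbl-rack-power.

```python
# Recompute the rack power budget in @tbl-rack-power from its components.
components_kw = {
    "GPU compute (32 x 700 W)":  32 * 0.700,
    "Host CPUs and DRAM":        3.2,
    "NVSwitch fabric":           1.6,
    "InfiniBand HCAs":           0.8,
    "Power conversion losses":   2.8,
    "Cooling overhead (PUE 1.1)": 2.7,
}
total = sum(components_kw.values())
for name, kw in components_kw.items():
    print(f"{name:<28} {kw:5.1f} kW  {kw / total:5.1%}")
print(f"{'Total':<28} {total:5.1f} kW")
```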
@@ -2417,10 +2417,10 @@ The trajectory of power efficiency across accelerator generations provides a qua

| **Generation**  | **TFLOPS (FP16)** | **TDP (W)** | **TFLOPS/W**       | **Relative Efficiency** |
|:----------------|------------------:|------------:|:------------------:|:-----------------------:|
| **V100 (2017)** | 125               | 300         | `{python} v100_ef` | 1.0$\times$             |
| **A100 (2020)** | 312               | 400         | `{python} a100_ef` | 1.9$\times$             |
| **H100 (2022)** | 1979              | 700         | `{python} h100_ef` | 6.8$\times$             |
| **B200 (2024)** | 4500              | 1000        | `{python} b200_ef` | 10.8$\times$            |

: **Power Efficiency Across GPU Generations**. Each generation delivers substantially more computation per watt, meaning that for a fixed power budget, newer hardware provides multiplicatively more throughput. A facility that draws 10 MW can train models roughly 10$\times$ faster with B200s than with V100s, without any increase in electricity cost. {#tbl-power-efficiency}
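
The efficiency column is simply throughput divided by TDP; this sketch recomputes it from the table's FP16 TFLOPS and TDP figures (the book fills the column via inline `{python}` variables such as `v100_ef`, which are not reproduced here).

```python
# Recompute TFLOPS/W and relative efficiency from @tbl-power-efficiency's columns.
gpus = {"V100": (125, 300), "A100": (312, 400),
        "H100": (1979, 700), "B200": (4500, 1000)}

baseline = gpus["V100"][0] / gpus["V100"][1]
for name, (tflops, tdp_w) in gpus.items():
    eff = tflops / tdp_w
    print(f"{name}: {eff:.2f} TFLOPS/W  ({eff / baseline:.1f}x vs V100)")
```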
@@ -200,9 +200,11 @@ GPT-3 (2020) consumed an estimated `{python} gpt3_training_ops_sci` FLOPS during

| **PaLM**  | 2022 | 6144 TPUs   | ~60 days  | ~10²⁴ |
| **GPT-4** | 2023 | ~25000 GPUs | ~100 days | ~10²⁵ |

: **Training Compute Evolution** {#tbl-training-compute-evolution}

:::

-The table above captures the growth in training compute, but an equally important dimension is the growth in *cluster size* itself. @fig-cluster-size-explosion traces this trajectory by plotting the number of accelerators used to train landmark models over the past decade.
+@tbl-training-compute-evolution captures the growth in training compute, but an equally important dimension is the growth in *cluster size* itself. @fig-cluster-size-explosion traces this trajectory by plotting the number of accelerators used to train landmark models over the past decade.

::: {#fig-cluster-size-explosion fig-env="figure" fig-pos="htb" fig-cap="**The Cluster Size Explosion**. Number of accelerators used to train landmark models, 2012--2024. Verified counts from published papers are shown as filled circles; the GPT-3 estimate (hollow marker) reflects approximate cluster size from Microsoft infrastructure announcements rather than a precise published count. The dashed trend line indicates approximately 4$\\times$ annual growth in cluster size, a rate that outpaces Moore's Law and drives every infrastructure challenge in this volume." fig-alt="Scatter plot with log-scale y-axis showing accelerator count versus year from 2012 to 2025. Points rise from 2 GPUs for AlexNet in 2012 to 16384 for Llama 3 in 2024."}
```{python}
@@ -2679,9 +2679,9 @@ These controls should inform rather than block. The goal is cost awareness, not
Model selection should explicitly consider cost alongside accuracy. @tbl-ops-scale-cost-quality illustrates the diminishing returns: moving from small to medium model yields 3% accuracy gain for 10$\times$ training cost increase, while medium to large yields only 1% additional accuracy for another 10$\times$ cost, a pattern that should inform deployment decisions:

| **Model**  | **Accuracy** | **Training Cost** | **Serving Cost/1K** | **Value Judgment**                |
|:-----------|-------------:|------------------:|--------------------:|:----------------------------------|
| **Small**  | 92%          | \$500             | \$0.10              | Baseline                          |
| **Medium** | 95%          | \$5,000           | \$0.50              | 3% accuracy for 10$\times$ cost   |
| **Large**  | 96%          | \$50,000          | \$2.00              | Additional 1% for 10$\times$ more |
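
A quick way to see the diminishing returns is to compute the incremental training cost per additional point of accuracy; this sketch does so directly from the table's values (illustrative arithmetic only).

```python
# Incremental training cost per point of accuracy, from @tbl-ops-scale-cost-quality.
models = [("Small", 0.92, 500), ("Medium", 0.95, 5_000), ("Large", 0.96, 50_000)]

for (prev_name, prev_acc, prev_cost), (name, acc, cost) in zip(models, models[1:]):
    delta_pts = (acc - prev_acc) * 100
    print(f"{prev_name} -> {name}: +{delta_pts:.0f} pts accuracy "
          f"for ${cost - prev_cost:,} more training "
          f"(${(cost - prev_cost) / delta_pts:,.0f} per point)")
```

Per the table's figures, each accuracy point costs roughly $1,500 going from Small to Medium but about $45,000 going from Medium to Large.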
@@ -2924,14 +2924,14 @@ A critical but often overlooked factor: failed experiments have real cost. If 90
- Dedicated ML platform team (15 engineers)
- Hybrid cloud/on-premise infrastructure

| **Cost Component** | **Startup**        | **Production Company** | **Scaling Factor**                     |
|:-------------------|-------------------:|-----------------------:|---------------------------------------:|
| **Training**       | \$5,000/month      | \$150,000/month        | 30x (more models, larger)              |
| **Inference**      | \$2,000/month      | \$400,000/month        | 200x (100$\times$ users, optimization) |
| **Data**           | \$500/month        | \$80,000/month         | 160x (superlinear with users)          |
| **Iteration**      | \$40,000/month     | \$350,000/month        | 8.75x (team size, experiments)         |
| **Total TCO**      | **\$47,500/month** | **\$980,000/month**    | **20.6x**                              |
| **Dominant Cost**  | Iteration (84%)    | Inference (41%)        |                                        |

: **TCO Comparison: Startup vs. Production Company**. Cost structure shifts dramatically with scale. Startups are dominated by iteration costs (engineering salaries for experimentation), while production companies see inference costs dominate as serving volume grows. The 100$\times$ user increase yields only 20$\times$ TCO increase due to optimization effects, but note the superlinear 160$\times$ scaling in data costs. {#tbl-ops-tco-comparison}
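
The totals and "dominant cost" shares follow directly from the component rows; this sketch recomputes them from the values in @tbl-ops-tco-comparison.

```python
# Recompute monthly totals and dominant-cost shares from the TCO table's components.
tco = {
    "Startup":    {"Training": 5_000,   "Inference": 2_000,
                   "Data": 500,         "Iteration": 40_000},
    "Production": {"Training": 150_000, "Inference": 400_000,
                   "Data": 80_000,      "Iteration": 350_000},
}
for org, costs in tco.items():
    total = sum(costs.values())
    dominant = max(costs, key=costs.get)
    print(f"{org}: ${total:,}/month, dominant = {dominant} "
          f"({costs[dominant] / total:.0%})")
```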
@@ -3081,7 +3081,7 @@ For a platform managing over 10,000 models, this centralized dual-store architec
Feature freshness represents the delay between real-world events and their reflection in feature values. @tbl-ops-scale-feature-freshness maps four feature types to their freshness requirements: static features like user demographics tolerate day-scale staleness with batch computation, while real-time features capturing the last user action demand seconds-scale freshness through streaming or on-demand computation.

-This freshness requirement is particularly acute for recommendation systems, as illustrated below:
+This freshness requirement is particularly acute for recommendation systems, as the following callout and @tbl-ops-scale-feature-freshness illustrate:

::: {.callout-note title="Archetype B: The Staleness Tax"}
**Archetype B (The Global Real-Time Recommendation Engine)** is uniquely sensitive to freshness. Unlike Archetype A (where grammar rules don't change), Archetype B's "ground truth" changes every second. If a user clicks a video about *baking*, and the feature store has a 10-minute lag, the next 100 recommendations will miss this new intent. This "staleness tax" directly degrades engagement, forcing Archetype B systems to adopt expensive streaming pipelines over cheaper batch ones.