---
engine: jupyter
---

# Machine Foundations {#sec-machine-foundations}

## Purpose {.unnumbered}

_What reference numbers and physical laws should every ML systems engineer carry into design decisions?_

In ML systems, performance failures often masquerade as software problems: a training step “mysteriously” slows down, a serving stack misses its service-level agreement (SLA), or an accelerator upgrade fails to deliver the expected speedup. Many of these surprises are not bugs—they are the predictable consequences of physics (latency, bandwidth, energy) and architecture (memory hierarchy, precision, parallel scaling).

This appendix collects the reference numbers and compact models that let you do quick, quantitative reasoning. It begins with a quick-reference section of “numbers to know,” then summarizes the tools used throughout the book: roofline analysis, dimensional analysis, scaling laws, and precision trade-offs.

:::: {.callout-tip title="Learning Objectives"}

- Recall the core latency, bandwidth, and energy reference numbers that anchor system intuition
- Apply the **Roofline Model** to distinguish compute-bound from memory-bound workloads
- Use **Amdahl's** and **Gustafson's Laws** to reason about scaling limits
- Explain how **memory hierarchy** and **precision** choices shape performance and energy
- Identify common fallacies that violate physical constraints

::::

## How to Use This Appendix {.unnumbered}

This appendix is designed as a reference. When you are stuck, use it to turn a vague symptom ("it’s slow") into a specific constraint ("memory-bound at batch size 1") and then choose the lever that can actually move it.

Conventions used here follow the book-wide notation (for example, we reserve $B$ for batch size and use $\text{BW}$ for bandwidth).

- **Sanity-check feasibility**: Start with @sec-machine-foundations-numbers-know-b531 for order-of-magnitude numbers.
- **Diagnose the dominant ceiling**: Use the Roofline Model in @sec-machine-foundations-roofline-model-2529 to decide whether you are compute-bound or memory-bound.
- **Reason about scaling limits**: Use Amdahl’s and Gustafson’s Laws to understand why adding accelerators may not reduce time-to-train.
- **Choose the right precision**: Use @sec-machine-foundations-floatingpoint-format-comparison-1836 to reason about FP32 vs. BF16/FP16 vs. INT8 as a systems trade-off.
- **Cross-reference for depth**: When you want the full narrative, jump back to @sec-hardware-acceleration, @sec-model-training, and @sec-model-serving.

## Numbers to Know {#sec-machine-foundations-numbers-know-b531}

```{python}
#| echo: false
#| label: numbers-to-know-setup
# ┌─────────────────────────────────────────────────────────────────────────────
# │ NUMBERS TO KNOW SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-numbers-know-b531 — all reference tables
# │ (@tbl-energy-ratios-ref, @tbl-memory-ratios-ref,
# │ @tbl-scaling-rules-ref, @tbl-memory-current-ref,
# │ @tbl-compute-current-ref, @tbl-ridge-current-ref)
# │
# │ Goal: Compute all invariant ratios and current hardware specs for the
# │ "Numbers Every ML Systems Engineer Should Know" reference section.
# │ Show: Energy ratios (~580$\times$ DRAM vs FP16 compute), latency hierarchy,
# │ bandwidth specs, ridge points for A100/H100.
# │ How: Scalar extraction via .m_as(unit); ratios remain stable across
# │ hardware generations.
# │
# │ Imports: mlsysim.core.constants (*), mlsysim.book (fmt)
# │ Exports: dram_vs_compute, fp32_vs_int8, fp32_vs_fp16, l1_vs_reg,
# │ hbm_vs_l1, ssd_vs_l1, network_vs_local, gpu_bw_vs_pcie,
# │ lat_l1_ns, lat_l2_ns, lat_hbm_ns, lat_pcie_ns, lat_ib_ns,
# │ lat_ssd_ns, bw_hbm_h100, bw_pcie5, bw_dram, bw_nvme,
# │ flops_h100_fp16, flops_h100_fp8, flops_a100_fp16,
# │ flops_mobile_int8, dc_mobile_ratio, ridge_a100, ridge_h100
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.core.constants import *
from mlsysim.fmt import fmt

# ┌── LEGO ───────────────────────────────────────────────
class NumbersToKnow:
    """Namespace for machine-foundations reference numbers (ratios and hardware specs)."""

    # Philosophy: RATIOS and RULES are eternal; absolute numbers are snapshots.
    # We emphasize what will still be true in 10 years.

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    # (All inputs drawn from mlsysim.core.constants via wildcard import above)

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # Step 1: Physics Constants (Eternal)
    speed_of_light_km_ms = int(SPEED_OF_LIGHT_FIBER_KM_S.m_as(ureg.kilometer / second) / 1000)

    # Step 2: Energy Ratios (Stable across process nodes)
    dram_vs_compute = int(ENERGY_DRAM_ACCESS_PJ.m_as(ureg.picojoule) / ENERGY_FLOP_FP16_PJ.m_as(ureg.picojoule))
    fp32_vs_int8 = int(ENERGY_FLOP_FP32_PJ.m_as(ureg.picojoule) / ENERGY_FLOP_INT8_PJ.m_as(ureg.picojoule))
    fp32_vs_fp16 = round(ENERGY_FLOP_FP32_PJ.m_as(ureg.picojoule) / ENERGY_FLOP_FP16_PJ.m_as(ureg.picojoule), 1)
    l1_vs_reg = int(ENERGY_SRAM_L1_PJ.m_as(ureg.picojoule) / ENERGY_REG_PJ.m_as(ureg.picojoule))

    # Step 3: Memory Hierarchy Ratios (Stable)
    hbm_vs_l1 = int(LATENCY_HBM3.m_as(NS) / LATENCY_L1_REGISTER.m_as(NS))
    ssd_vs_l1 = int(LATENCY_NVME_SSD.m_as(NS) / LATENCY_L1_REGISTER.m_as(NS))
    network_vs_local = int(LATENCY_INFINIBAND.m_as(NS) / LATENCY_HBM3.m_as(NS))
    gpu_bw_vs_pcie = int(H100_MEM_BW.m_as(GB / second) / PCIE_GEN5_BW.m_as(GB / second))

    # Step 4: Current Hardware Reference (circa 2024)
    lat_l1_ns = int(LATENCY_L1_REGISTER.m_as(NS))
    lat_l2_ns = int(LATENCY_L2_CACHE.m_as(NS))
    lat_hbm_ns = int(LATENCY_HBM3.m_as(NS))
    lat_pcie_ns = int(LATENCY_PCIE_GEN5.m_as(NS))
    lat_ib_ns = int(LATENCY_INFINIBAND.m_as(NS))
    lat_ssd_ns = int(LATENCY_NVME_SSD.m_as(NS))

    bw_hbm_h100 = f"{H100_MEM_BW.m_as(TB / second):.1f}"
    bw_pcie5 = int(PCIE_GEN5_BW.m_as(GB / second))
    bw_dram = int(SYSTEM_MEMORY_BW.m_as(GB / second))
    bw_nvme = f"{NVME_SEQUENTIAL_BW.m_as(GB / second):.1f}"

    flops_h100_fp16 = int(H100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second))
    flops_h100_fp8 = int(H100_FLOPS_FP8_TENSOR.m_as(TFLOPs / second))
    flops_a100_fp16 = int(A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second))
    flops_mobile_int8 = int(MOBILE_NPU_TOPS_INT8.m_as(TFLOPs / second))
    dc_mobile_ratio = int(flops_h100_fp16 / flops_mobile_int8)

    ridge_a100 = int(A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second) / A100_MEM_BW.m_as(TB / second))
    ridge_h100 = int(H100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second) / H100_MEM_BW.m_as(TB / second))

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — all values are monotone functions of constants.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    # All exports are the computed class attributes above.
```

Just as Jeff Dean's "Latency Numbers Every Programmer Should Know"[^fn-jeff-dean-latency] shaped a generation of systems engineers, these reference numbers provide the order-of-magnitude intuition essential for ML systems design. While absolute values evolve with hardware generations, the *ratios* between categories remain remarkably stable. **Memorize the relationships; use the specific numbers as sanity checks.**

::: {.callout-takeaways title="Three Numbers That Matter Most"}
If you memorize nothing else from this section, memorize these:

1. **~600$\times$ energy ratio**: DRAM access costs ~`{python} NumbersToKnow.dram_vs_compute`$\times$ more energy than an FP16 multiply-add. This is why arithmetic intensity is everything.

2. **16 bytes/parameter for training**: Model weights (2B FP16) + master weights (4B FP32) + optimizer states (8B Adam). A 7B model needs 112 GB just to start training.

3. **~200 km/ms speed of light in fiber**: Cross-country latency is ~40 ms. No optimization can reduce this—it is physics.
:::

[^fn-jeff-dean-latency]: **Jeff Dean** is a Google Senior Fellow and one of the architects of Google's distributed systems infrastructure, including MapReduce, BigTable, and TensorFlow. His latency numbers, originally presented with Peter Norvig around 2010, became a canonical reference for systems engineers. The numbers have been updated over the years as hardware evolved, but the *hierarchy* of latencies remains remarkably stable. See Colin Scott's interactive visualization at <https://colin-scott.github.io/personal_website/research/interactive_latency.html>.

### The Invariants: Numbers That Won't Change {.unnumbered}

These relationships are governed by physics or arithmetic—they will still be true in 2035.

#### Speed of Light Tax {.unnumbered}

@tbl-speed-of-light-ref shows the irreducible latency floor for any distributed system.

| Distance | Round-Trip Latency | Implication |
|:-------------------|-------------------:|:------------------------------|
| Same datacenter | ~1 ms | Distributed training feasible |
| Cross-country (US) | ~40 ms | Edge needed for <100 ms apps |
| Cross-Atlantic | ~60 ms | CDN required for global users |
| Cross-Pacific | ~100 ms | Data locality is critical |

: Light in fiber travels ~200 km/ms. These latencies are physics—no optimization can reduce them. {#tbl-speed-of-light-ref}

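The round-trip floors in @tbl-speed-of-light-ref follow directly from the ~200 km/ms figure. A minimal sketch (illustrative one-way distances, propagation only; real routes add switching and serialization delay):

```python
# Illustrative propagation-only floor; distances are rough one-way path lengths.
SPEED_IN_FIBER_KM_PER_MS = 200

def rtt_floor_ms(one_way_km: float) -> float:
    """Lower bound on round-trip time: out and back at ~200 km/ms."""
    return 2 * one_way_km / SPEED_IN_FIBER_KM_PER_MS

for label, km in [("Cross-country (US)", 4_000), ("Cross-Atlantic", 6_000), ("Cross-Pacific", 9_000)]:
    print(f"{label:20s} >= {rtt_floor_ms(km):4.0f} ms")
# Cross-country (US) >= 40 ms, matching the ~40 ms entry in the table above.
```
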
#### Energy Hierarchy {.unnumbered}

@tbl-energy-ratios-ref quantifies the energy cost of data movement versus computation—the fundamental reason why arithmetic intensity dominates ML performance optimization.[^fn-horowitz-energy-app]

| Relationship | Ratio | Why It's Stable |
|:-----------------------------|--------------------------------------------------:|:--------------------------------------|
| DRAM access vs. FP16 compute | ~`{python} NumbersToKnow.dram_vs_compute`$\times$ | Wire capacitance scales with distance |
| FP32 vs. INT8 energy | ~`{python} NumbersToKnow.fp32_vs_int8`$\times$ | Bit width determines switching energy |
| FP32 vs. FP16 energy | ~`{python} NumbersToKnow.fp32_vs_fp16`$\times$ | Halving bits roughly halves energy |
| L1 SRAM vs. register | ~`{python} NumbersToKnow.l1_vs_reg`$\times$ | Distance to ALU |

: **The Energy Wall.** Moving data costs ~580$\times$ more energy than computing on it. This ratio is physics, not engineering. {#tbl-energy-ratios-ref}

[^fn-horowitz-energy-app]: Energy numbers from Horowitz's classic "Computing's Energy Problem" (ISSCC 2014, 45nm process). While absolute values scale with process node, the *ratios* between memory access and compute remain remarkably stable because wire capacitance (distance) dominates.

#### Memory Hierarchy {.unnumbered}

@tbl-memory-ratios-ref shows how each level of the memory hierarchy costs roughly 10--100$\times$ more latency than the one above it.

| Relationship | Ratio | Why It Persists |
|:-----------------------------------------------|----------------------------------------------------------:|:----------------------------------|
| Accelerator memory (HBM) vs. L1 cache | ~`{python} NumbersToKnow.hbm_vs_l1`$\times$ slower | On-chip vs. off-chip |
| SSD vs. L1 cache | ~`{python} NumbersToKnow.ssd_vs_l1`$\times$ slower | Electrical vs. mechanical/flash |
| Network vs. local memory | ~`{python} NumbersToKnow.network_vs_local`$\times$ slower | Speed of light + switching |
| Accelerator memory BW vs. CPU↔Accelerator link | ~`{python} NumbersToKnow.gpu_bw_vs_pcie`$\times$ faster | Architectural investment priority |

: **The Latency Hierarchy.** Each level costs roughly 10--100$\times$ more than the one above it. {#tbl-memory-ratios-ref}

#### Scaling Laws {.unnumbered}

@tbl-scaling-rules-ref collects the arithmetic relationships that govern memory and compute requirements for training and inference.[^fn-training-memory]

| Rule | Formula | Example |
|:------------------------------|:--------------------------------------------------|:-------------------------------------------|
| Inference memory (FP16) | 2 bytes$\times$ parameters | 7B params → 14 GB |
| Inference memory (INT8) | 1 byte$\times$ parameters | 7B params → 7 GB |
| Training memory (Adam) | 16 bytes$\times$ parameters | 7B params → 112 GB |
| Inference FLOPs (transformer) | ~2$\times$ parameters per token | 7B model → ~14 GFLOPs/token |
| Training FLOPs | ~6$\times$ parameters$\times$ tokens | 7B on 1T tokens → $4 \times 10^{22}$ FLOPs |
| Datacenter vs. edge compute | ~`{python} NumbersToKnow.dc_mobile_ratio`$\times$ | Compute per watt$\times$ power budget |

: **Scaling Rules.** These are arithmetic, not hardware-specific. Training memory includes FP16 weights (2B), FP32 master weights (4B), and Adam optimizer states (8B for momentum + variance). {#tbl-scaling-rules-ref}

[^fn-training-memory]: The 16 bytes/parameter rule assumes mixed-precision training with Adam. ZeRO optimization can reduce per-accelerator memory by sharding optimizer states across accelerators, but the total memory across all accelerators remains ~16$\times$ parameters.

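The rules in @tbl-scaling-rules-ref are simple enough to keep in a helper. A minimal sketch (illustrative only; the byte and FLOP counts are the rules of thumb above, not measurements of any particular framework):

```python
# Illustrative encoding of the scaling rules of thumb from @tbl-scaling-rules-ref.
def scaling_rules(params: float, tokens: float) -> dict:
    return {
        "inference_gb_fp16": 2 * params / 1e9,        # 2 bytes per parameter
        "inference_gb_int8": 1 * params / 1e9,        # 1 byte per parameter
        "training_gb_adam": 16 * params / 1e9,        # weights + master copy + optimizer states
        "inference_flops_per_token": 2 * params,      # ~2 * P
        "training_flops_total": 6 * params * tokens,  # ~6 * P * D
    }

# 7B parameters on 1T tokens: ~14 GB FP16 inference, ~112 GB of training state,
# and ~4.2e22 training FLOPs, matching the table's example column.
print(scaling_rules(params=7e9, tokens=1e12))
```
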
### Latency Budgets: The Non-Negotiables {.unnumbered}

These budgets are set by physics (safety) or psychology (human perception)—not by engineering choice. Unlike hardware specs that improve each generation, these are *constraints* your system must meet (@tbl-latency-targets-ref).

| Application | Budget | Constraint |
|:-------------------|-----------:|:-------------------------------------|
| Autonomous braking | <10 ms | At 100 km/h, 10 ms = 28 cm of travel |
| Voice assistant | <100 ms | Human perception of "instant" |
| Web search | <200 ms | User patience threshold |
| Video streaming | <1 s | Buffer tolerance |
| Batch training | hours–days | Throughput dominates latency |

: **Latency Targets.** Miss these and the application fails, regardless of accuracy. {#tbl-latency-targets-ref}

### Current Hardware Reference (c. 2024) {.unnumbered}

These numbers reflect the current generation. Use them for back-of-envelope calculations, but expect them to improve ~2$\times$ every 2–3 years.

#### Memory Latency and Bandwidth {.unnumbered}

@tbl-memory-current-ref captures the full latency and bandwidth hierarchy for current-generation hardware.

| Level | Latency | Bandwidth |
|:---------------------|-----------------------------------------:|------------------------------------------:|
| Register | ~0.3 ns | — |
| L1 Cache | ~`{python} NumbersToKnow.lat_l1_ns` ns | — |
| L2 Cache | ~`{python} NumbersToKnow.lat_l2_ns` ns | — |
| GPU HBM3 | ~`{python} NumbersToKnow.lat_hbm_ns` ns | `{python} NumbersToKnow.bw_hbm_h100` TB/s |
| PCIe Gen5 (CPU↔GPU) | ~`{python} NumbersToKnow.lat_pcie_ns` ns | `{python} NumbersToKnow.bw_pcie5` GB/s |
| CPU DRAM | ~100 ns | `{python} NumbersToKnow.bw_dram` GB/s |
| InfiniBand (network) | ~`{python} NumbersToKnow.lat_ib_ns` ns | 50 GB/s |
| NVMe SSD | ~`{python} NumbersToKnow.lat_ssd_ns` ns | `{python} NumbersToKnow.bw_nvme` GB/s |

: **Memory Hierarchy (c. 2024).** Specific values for current hardware. {#tbl-memory-current-ref}

#### Compute Throughput {.unnumbered}

@tbl-compute-current-ref shows the raw throughput available at each tier of the deployment hierarchy.

| Platform | FP16/BF16 | INT8 | Power |
|:----------------------|------------------------------------------------:|------------------------------------------------:|------:|
| Datacenter GPU (H100) | `{python} NumbersToKnow.flops_h100_fp16` TFLOPS | `{python} NumbersToKnow.flops_h100_fp8` TOPS | 700W |
| Datacenter GPU (A100) | `{python} NumbersToKnow.flops_a100_fp16` TFLOPS | 624 TOPS | 400W |
| Mobile NPU | — | `{python} NumbersToKnow.flops_mobile_int8` TOPS | 3–5W |

: **Compute Reference (c. 2024).** Datacenter is ~28$\times$ more powerful than mobile—this ratio persists across generations. {#tbl-compute-current-ref}

#### Roofline Ridge Points {.unnumbered}

@tbl-ridge-current-ref defines the arithmetic intensity thresholds that determine whether a workload is memory-bound or compute-bound.

| Accelerator | Ridge Point | Implication |
|:------------|---------------------------------------------:|:-----------------------------|
| A100 (FP16) | `{python} NumbersToKnow.ridge_a100` ops/byte | Below → memory-bound |
| H100 (FP16) | `{python} NumbersToKnow.ridge_h100` ops/byte | Higher bar for compute-bound |

: **Arithmetic Intensity Thresholds (c. 2024).** Most inference workloads are <10 ops/byte—firmly memory-bound. {#tbl-ridge-current-ref}

::: {.callout-perspective title="A Note on Terminology: GPUs and Accelerators"}
Throughout this book, we often use "accelerator" when discussing hardware acceleration. However, the principles—roofline analysis, memory hierarchies, numerical precision, and performance modeling—apply equally to **GPUs**, **TPUs**, **NPUs**, **custom ASICs**, and other specialized AI accelerators. We use "accelerator" as the universal term, but readers should understand these concepts apply to GPUs unless we explicitly discuss vendor-specific features (e.g., CUDA, NVLink).
:::

Knowing the numbers is only the first step. The real power comes from having compact models that tell you *which* number matters for *your* specific bottleneck. The next section provides exactly these diagnostic tools—starting with the Roofline Model, which translates raw hardware specs into actionable performance ceilings.

## Physics of Computing {#sec-machine-foundations-physics-computing-c77f}

Raw hardware specs—TFLOP/s, TB/s, watt budgets—are necessary but insufficient for performance reasoning. Without compact analytical models, an engineer cannot distinguish a compute-bound workload from a memory-bound one, or predict whether doubling GPUs will halve training time. The models in this section provide exactly these diagnostic tools.

::: {.callout-perspective title="Why This Matters"}
You have trained a model that achieves good accuracy, but inference takes 200 ms when your SLA requires 50 ms. Where do you start? Performance analysis models give you a systematic way to diagnose whether you are limited by computation, memory bandwidth, or something else entirely. Without these tools, optimization is guesswork.
:::

### The Roofline Model {#sec-machine-foundations-roofline-model-2529}

The Roofline Model [@williams2009roofline] answers a deceptively simple question: *how fast can this workload possibly run on this hardware?* The answer depends on whether you run out of compute or memory bandwidth first.

Every operation has an **arithmetic intensity**: the ratio of computations performed to bytes moved from memory. Matrix multiplication has high arithmetic intensity because you can reuse each loaded element many times. Element-wise operations like ReLU have low intensity because you load a number, do one operation, and write it back. As @fig-roofline illustrates, each workload is bounded by either memory bandwidth or compute throughput, and its arithmetic intensity determines which ceiling it hits first.

::: {#fig-roofline fig-env="figure" fig-pos="htb" fig-cap="**The Roofline Model**: Performance ceiling for a hypothetical accelerator. The sloped line represents memory bandwidth limits; the horizontal line represents peak compute. Every workload can be plotted on this diagram to determine its optimization strategy." fig-alt="A plot with arithmetic intensity on the x-axis and performance on the y-axis. Two lines form a roofline shape: a diagonal line rising from the origin labeled Memory Bound, and a horizontal line labeled Compute Bound. They meet at the Ridge Point."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=1.0]
\tikzset{
  Axis/.style={line width=1.0pt, draw=GrayLine, ->, >=Latex},
  Guide/.style={dashed, draw=GrayLine!60, line width=0.6pt},
  Label/.style={text=TextBlack, align=center, font=\footnotesize\usefont{T1}{phv}{m}{n}},
  Dot/.style={circle, fill=#1, draw=white, line width=0.5pt, minimum size=5pt, inner sep=0pt}
}
\draw[step=0.5, gray!15, very thin] (0,0) grid (6,4);
\draw[Axis] (0,0) -- (6,0) node[right,text=TextBlack] {Arithmetic Intensity (Ops/Byte)};
\draw[Axis] (0,0) -- (0,4.2) node[above, text=black] {Performance (FLOP/s)};
\draw[BlueLine, line width=2pt] (0,0) -- (3,3);
\draw[RedLine, line width=2pt] (3,3) -- (5.8,3);
\node[Label, text=BlueLine, rotate=45, anchor=south, yshift=2pt] at (1.5, 1.5) {\textbf{Memory Bound}};
\node[Label, text=RedLine, anchor=south, yshift=2pt] at (4.4, 3) {\textbf{Compute Bound}};
\draw[Guide] (3,0) -- (3,3);
\node[Dot=TextBlack] at (3,3) {};
\node[below, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=TextBlack] at (3,0) {Ridge Point};
\end{tikzpicture}
```
:::

The **ridge point** determines the hardware's balance. If your workload's intensity is below this point, you are **memory-bound** (sloped region). If it is above, you are **compute-bound** (flat region).
$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}} $$

$$ \text{Ridge Point} = \frac{\text{Peak FLOP/s}}{\text{Memory Bandwidth}} $$

::: {.callout-tip title="Batch Size Controls Arithmetic Intensity"}
For matrix multiplications, arithmetic intensity scales with the batch dimension. When you compute $Y = XW$ where $X$ is $(B \times D_{\text{in}})$ and $W$ is $(D_{\text{in}} \times D_{\text{out}})$:

- **FLOPs**: $2 \times B \times D_{\text{in}} \times D_{\text{out}}$ (multiply-adds)
- **Bytes**: Weights are loaded once: $D_{\text{in}} \times D_{\text{out}} \times \text{bytes}_{\text{precision}}$

Doubling the batch size $B$ doubles FLOPs while keeping weight loads constant—directly increasing arithmetic intensity. This is why inference serving batches requests: batch size 1 is almost always memory-bound, while batch size 64+ can approach the compute ceiling, as the sketch below illustrates.
:::
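A minimal sketch of this effect (illustrative layer dimensions; it counts only the weight traffic, which dominates when $B \ll D_{\text{in}}, D_{\text{out}}$):

```python
# Illustrative sketch: arithmetic intensity of Y = X @ W as batch size grows,
# counting FP16 weight traffic only (activation traffic ignored for simplicity).
def gemm_intensity(batch: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * batch * d_in * d_out               # multiply-adds
    weight_bytes = d_in * d_out * bytes_per_elem   # weights loaded once per pass
    return flops / weight_bytes

for b in (1, 8, 64, 512):
    print(f"batch={b:4d}  intensity={gemm_intensity(b, 4096, 4096):6.1f} FLOP/byte")
# With 2-byte weights the intensity equals the batch size: 1 FLOP/byte at batch 1
# (memory-bound on any modern accelerator), 512 FLOP/byte at batch 512
# (above typical ridge points of ~100-300 FLOP/byte).
```
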
```{python}
#| echo: false
#| label: appendix-machine-setup
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ROOFLINE AND HARDWARE CHEAT SHEET SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-roofline-model-2529 (A100 analysis),
# │ @tbl-latency-hierarchy, @tbl-hardware-cheatsheet
# │
# │ Goal: Compute roofline ridge point, GEMM/ReLU intensity examples, energy
# │ wall values, and full hardware specs for H100 and TPU v5p.
# │ Show: A100 ridge ~208 FLOP/byte; n=4096 GEMM is compute-bound (~1365);
# │ ReLU at 0.25 op/byte is deeply memory-bound (~0.12% utilization).
# │ How: Scalar extraction via .m_as(unit); values computed as class
# │ attributes on AppendixMachineSetup and referenced inline in prose.
# │
# │ Imports: mlsysim.core.constants (*), mlsysim.book (fmt)
# │ Exports: a100_fp16, a100_bw_tb, n_gemm, ridge_point, gemm_intensity,
# │ relu_intensity, relu_util_str, dram_pj, flop_pj, energy_ratio_str,
# │ l1_ns, l2_ns, hbm_ns, nvlink_ns, pcie_ns, ib_ns, ssd_ns,
# │ h100_flops, h100_bw, h100_cap, h100_nvlink, h100_l2_mb,
# │ tpuv5_flops, tpuv5_bw, tpuv5_cap, tpuv5_ici, tpuv5_l2_mb
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.core.constants import *
from mlsysim.fmt import fmt

# ┌── LEGO ───────────────────────────────────────────────
class AppendixMachineSetup:
    """Namespace for roofline analysis, energy wall, latency table, and hardware cheatsheet."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    n_gemm_value = 4096
    relu_intensity_value = 0.25
    h100_l2_mb_value = 50
    tpuv5_l2_mb_value = 100

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    a100_fp16_raw_value = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second)
    a100_bw_raw_value = A100_MEM_BW.m_as(TB / second)
    ridge_point_value = int(a100_fp16_raw_value / a100_bw_raw_value)

    gemm_intensity_value = int(n_gemm_value / 3)
    relu_achieved_tflops_value = relu_intensity_value * a100_bw_raw_value
    relu_utilization_value = relu_achieved_tflops_value / a100_fp16_raw_value * 100

    dram_pj_value = int(ENERGY_DRAM_ACCESS_PJ.m_as(ureg.picojoule))
    flop_pj_value = ENERGY_FLOP_FP16_PJ.m_as(ureg.picojoule)
    energy_ratio_value = int(dram_pj_value / flop_pj_value)

    l1_ns_value = int(LATENCY_L1_REGISTER.m_as(NS))
    l2_ns_value = int(LATENCY_L2_CACHE.m_as(NS))
    hbm_ns_value = int(LATENCY_HBM3.m_as(NS))
    nvlink_ns_value = int(LATENCY_NVLINK.m_as(NS))
    pcie_ns_value = int(LATENCY_PCIE_GEN5.m_as(NS))
    ib_ns_value = int(LATENCY_INFINIBAND.m_as(NS))
    ssd_ns_value = int(LATENCY_NVME_SSD.m_as(NS))

    h100_flops_value = int(H100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second))
    h100_bw_value = H100_MEM_BW.m_as(TB / second)
    h100_cap_value = int(H100_MEM_CAPACITY.m_as(GiB))
    h100_nvlink_value = int(NVLINK_H100_BW.m_as(GB / second))

    tpuv5_flops_value = int(TPUV5P_FLOPS_BF16.m_as(TFLOPs / second))
    tpuv5_bw_value = TPUV5P_MEM_BW.m_as(TB / second)
    tpuv5_cap_value = int(TPUV5P_MEM_CAPACITY.m_as(GiB))
    tpuv5_ici_value = int(TPUV5P_ICI_BW.m_as(GB / second))

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — all values are monotone functions of constants.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    a100_fp16 = fmt(A100_FLOPS_FP16_TENSOR, "TFLOP/s", precision=0)
    a100_bw_tb = fmt(A100_MEM_BW, "TB/s", precision=1)

    n_gemm = n_gemm_value
    ridge_point = ridge_point_value
    gemm_intensity = gemm_intensity_value
    relu_intensity = relu_intensity_value
    relu_util_str = fmt(relu_utilization_value, precision=2, commas=False)

    dram_pj = dram_pj_value
    flop_pj = f"{flop_pj_value:.0f}"
    energy_ratio_str = f"{energy_ratio_value}"

    l1_ns = l1_ns_value
    l2_ns = l2_ns_value
    hbm_ns = hbm_ns_value
    nvlink_ns = nvlink_ns_value
    pcie_ns = pcie_ns_value
    ib_ns = ib_ns_value
    ssd_ns = ssd_ns_value

    h100_flops = h100_flops_value
    h100_bw = f"{h100_bw_value:.2f}"
    h100_cap = h100_cap_value
    h100_nvlink = h100_nvlink_value
    h100_l2_mb = h100_l2_mb_value

    tpuv5_flops = tpuv5_flops_value
    tpuv5_bw = f"{tpuv5_bw_value:.2f}"
    tpuv5_cap = tpuv5_cap_value
    tpuv5_ici = tpuv5_ici_value
    tpuv5_l2_mb = tpuv5_l2_mb_value

    # --- Unified Hierarchy Stats (Reference Table) ---
    bw_hbm_str = f"{H100_MEM_BW.m_as(GB/second):,.0f}"
    bw_nvlink_str = f"{NVLINK_H100_BW.m_as(GB/second):,.0f}"
    bw_pcie_str = f"{PCIE_GEN5_BW.m_as(GB/second):,.0f}"
    bw_dram_str = f"{SYSTEM_MEMORY_BW.m_as(GB/second):,.0f}"
    bw_ssd_str = f"{NVME_SEQUENTIAL_BW.m_as(GB/second):.1f}"
    bw_net_str = f"{INFINIBAND_NDR_BW_GBS}"

    e_reg = "0.01"
    e_l1 = "0.5"
    e_l2 = "2.0"
    e_dram = "640"
    e_ssd = "~5,000"   # ~1.2 uJ per 1KB read -> ~5nJ per 32b
    e_net = "~10,000"  # ~1uJ per 1KB packet (header overhead)
```

#### A Concrete Example: The A100 Analysis {#sec-machine-foundations-concrete-example-a100-analysis-5b30}

Consider an NVIDIA A100 GPU with FP16 Tensor Core performance of `{python} AppendixMachineSetup.a100_fp16` TFLOP/s and HBM2e bandwidth of `{python} AppendixMachineSetup.a100_bw_tb` TB/s. The ridge point is `{python} AppendixMachineSetup.a100_fp16` / `{python} AppendixMachineSetup.a100_bw_tb` = `{python} AppendixMachineSetup.ridge_point` FLOP/byte (the Tera prefixes cancel, yielding FLOP/byte).

Now compare two common operations:

**GEMM (Matrix Multiplication)**: For two `{python} AppendixMachineSetup.n_gemm`$\times$ `{python} AppendixMachineSetup.n_gemm` matrices, arithmetic intensity is approximately `{python} AppendixMachineSetup.gemm_intensity` FLOP/byte. Since `{python} AppendixMachineSetup.gemm_intensity` > `{python} AppendixMachineSetup.ridge_point`, this operation is compute-bound. You are using the hardware efficiently.

**ReLU (Element-wise)**: For a `{python} AppendixMachineSetup.n_gemm`$\times$ `{python} AppendixMachineSetup.n_gemm` tensor, intensity is approximately `{python} AppendixMachineSetup.relu_intensity` op/byte. Since `{python} AppendixMachineSetup.relu_intensity` ≪ `{python} AppendixMachineSetup.ridge_point`, this operation is severely memory-bound, achieving only about `{python} AppendixMachineSetup.relu_util_str`% of peak TFLOP/s. The hardware is mostly waiting for data.

This explains why modern frameworks fuse operations: combining ReLU with the preceding MatMul avoids writing intermediate results to memory, effectively increasing arithmetic intensity.

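A minimal sketch of the roofline bound behind these numbers (the peak and bandwidth figures below are approximate A100-class values consistent with the ~208 FLOP/byte ridge point above, not exact datasheet entries):

```python
# Illustrative roofline bound: attainable throughput = min(peak, intensity * BW).
PEAK_TFLOPS = 312   # approx. FP16 Tensor Core peak
BW_TBS = 1.5        # approx. HBM bandwidth in TB/s (ridge = 312 / 1.5 ~ 208 FLOP/byte)

def attainable_tflops(intensity: float) -> float:
    return min(PEAK_TFLOPS, intensity * BW_TBS)

for name, intensity in [("GEMM 4096^3 (~n/3)", 4096 / 3), ("ReLU (element-wise)", 0.25)]:
    bound = attainable_tflops(intensity)
    print(f"{name:22s} {bound:7.2f} TFLOP/s ({100 * bound / PEAK_TFLOPS:.2f}% of peak)")
# The standalone ReLU is capped near 0.4 TFLOP/s (~0.12% of peak); fusing it into
# the preceding GEMM removes that separate memory round-trip entirely.
```
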
### Dimensional Analysis {#sec-machine-foundations-dimensional-analysis-76b3}

The Roofline Model helps diagnose *where* a bottleneck lies. But before applying any performance equation, we should verify that it is physically meaningful. Dimensional analysis provides this sanity check: any valid equation must be **dimensionally homogeneous**—every term must resolve to the same units. If they do not, the equation contains an error.

Consider the Iron Law of ML Systems (Principle \ref{pri-iron-law}) introduced in @sec-introduction-iron-law-ml-systems-c32a:
$$ T = \frac{D_{\text{vol}}}{BW} + \frac{O}{R_{\text{peak}} \cdot \eta} + L_{\text{lat}} $$

We verify correctness by confirming that every term resolves to **Time (seconds)**:
$$ T [s] = \underbrace{ \frac{D_{\text{vol}} [\text{Bytes}]}{BW [\text{Bytes/s}]} }_{\text{Seconds}} + \underbrace{ \frac{O [\text{FLOPs}]}{R_{\text{peak}} [\text{FLOPs/s}] \cdot \eta [1]} }_{\text{Seconds}} + \underbrace{ L_{\text{lat}} [s] }_{\text{Seconds}} $$

* **Data Term**: $\frac{\text{Bytes}}{\text{Bytes/s}} = \text{Bytes} \times \frac{\text{s}}{\text{Bytes}} = \mathbf{s}$
* **Compute Term**: $\frac{\text{FLOPs}}{\text{FLOPs/s}} = \text{FLOPs} \times \frac{\text{s}}{\text{FLOPs}} = \mathbf{s}$
* **Overhead Term**: Already in seconds.

The equation is physically consistent. Apply this technique to any systems equation you encounter: if the dimensions do not match, the formula is wrong. Note also that you cannot directly trade "FLOPs" for "Bandwidth"—they have different units. Any such trade-off must convert through Time, which is precisely what the Iron Law quantifies.

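A short numeric check makes the same point. A minimal sketch with made-up values, each annotated by its units, shows every term landing in seconds:

```python
# Illustrative dimensional check of the Iron Law (all numbers are made up).
D_vol = 80e9     # data moved            [Bytes]
BW = 2e12        # memory bandwidth      [Bytes/s]
O = 1e15         # operations            [FLOPs]
R_peak = 300e12  # peak throughput       [FLOPs/s]
eta = 0.4        # achieved utilization  [dimensionless]
L_lat = 5e-6     # fixed overhead        [s]

data_term = D_vol / BW             # Bytes / (Bytes/s)  -> seconds
compute_term = O / (R_peak * eta)  # FLOPs / (FLOPs/s)  -> seconds
T = data_term + compute_term + L_lat
print(f"data {data_term:.4f} s + compute {compute_term:.4f} s + overhead {L_lat:.6f} s = {T:.4f} s")
```
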
The fundamental limits of scaling across multiple devices are the subject of the following section.

### Amdahl's Law and Gustafson's Law {#sec-machine-foundations-amdahls-law-gustafsons-law-b741}

Parallelization is the primary tool for scaling ML, but its limits depend on *how* you scale. These two laws frame the fundamental tension in parallel computing. **Amdahl's Law** is the pessimist's view, governing how much faster a *fixed* task can run (optimizing latency). **Gustafson's Law** is the optimist's view, governing how much *more* work we can do in the same time (optimizing throughput).

#### Strong Scaling (Amdahl's Law) {#sec-machine-foundations-strong-scaling-amdahls-law-c6c2}

**Strong scaling** answers the question: *If I add more processors to a fixed-size problem, how much faster will it run?*

Amdahl's Law [@amdahl1967validity] states that the speedup is limited by the serial portion of the task.[^fn-amdahl] If a fraction $s$ of your task is serial (cannot be parallelized) and $p = 1-s$ is parallelizable, the maximum speedup with $n$ processors is:

[^fn-amdahl]: **Gene Amdahl** (1922–2015) was a legendary computer architect at IBM, where he was the chief architect of the System/360. He later founded Amdahl Corporation to compete with IBM in the mainframe market.

$$ \text{Speedup}(n) = \frac{1}{s + \frac{1-s}{n}} $$

As $n \to \infty$, the term $\frac{1-s}{n} \to 0$, and the speedup converges to $1/s$.

```{python}
#| label: amdahl-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-strong-scaling-amdahls-law-c6c2
# │
# │ Goal: Define constants for Amdahl's Law prose lead-in text.
# │ Show: "5%" serial overhead, "95%" parallel fraction — inline in prose.
# │ How: Pure scalar constants; string formatting for inline Python refs.
# │
# │ Imports: (none — pure scalars)
# │ Exports: s_pct_str, p_pct_str, n_8_str, s_str, p_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class AmdahlSetup:
    """Namespace for Amdahl's Law lead-in constants."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    s_value = 0.05  # Serial fraction
    n_8_value = 8   # Number of processors for example

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    p_value = 1 - s_value  # Parallel fraction

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are definitional constants.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    s_pct_str = "5"
    p_pct_str = "95"
    n_8_str = "8"
    s_str = "0.05"
    p_str = "0.95"
```

To see Amdahl's Law in action, suppose `{python} AmdahlSetup.s_pct_str`% of your training step is serial overhead (e.g., Python GIL, kernel launch latency) and `{python} AmdahlSetup.p_pct_str`% is parallelizable matrix math:

```{python}
#| label: amdahl-example
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-strong-scaling-amdahls-law-c6c2
# │
# │ Goal: Compute Amdahl's Law speedup at n=8 and n=infinity.
# │ Show: "~5.9x" at 8 GPUs, "20x" max — inline in bullet-list example.
# │ How: Amdahl formula: 1 / (s + p/n); limit as n→∞ is 1/s.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: amdahl_8_str, amdahl_inf_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt

# ┌── LEGO ───────────────────────────────────────────────
class AmdahlExample:
    """Namespace for Amdahl's Law speedup at 8 processors and theoretical max."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    s_value = 0.05
    n_8_value = 8

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    p_value = 1 - s_value
    amdahl_8_value = 1 / (s_value + p_value / n_8_value)
    amdahl_inf_value = 1 / s_value

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are monotone functions of inputs.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    amdahl_8_str = fmt(amdahl_8_value, precision=1, commas=False)
    amdahl_inf_str = fmt(amdahl_inf_value, precision=0, commas=False)
```

* With $n=1$, speedup is 1.
* With n=`{python} AmdahlSetup.n_8_str`, speedup is 1/(`{python} AmdahlSetup.s_str` + `{python} AmdahlSetup.p_str`/`{python} AmdahlSetup.n_8_str`) ≈ `{python} AmdahlExample.amdahl_8_str`$\times$.
* With n=infinity, speedup is capped at 1/`{python} AmdahlSetup.s_str` = `{python} AmdahlExample.amdahl_inf_str`$\times$.

No matter how many accelerators you buy, you cannot make this fixed workload run faster than `{python} AmdahlExample.amdahl_inf_str`$\times$.

#### Weak Scaling (Gustafson's Law) {#sec-machine-foundations-weak-scaling-gustafsons-law-eb7e}

**Weak scaling** answers the question: *If I add more processors, how much larger of a problem can I solve in the same amount of time?*

This is the reality of Large Language Models. We do not use 1,000 accelerators to train GPT-4 on a laptop-sized dataset in milliseconds; we use them to train on a dataset 1,000$\times$ larger in reasonable time.

Gustafson's Law [@gustafson1988reevaluating] models this "scaled speedup":[^fn-gustafson]

[^fn-gustafson]: **John Gustafson** is a computer scientist known for his work in parallel computing and for introducing the Unum (universal number) format. His law was a direct response to the perceived "limits" of **Amdahl's Law** when applied to massive scale.

$$ \text{Scaled Speedup}(n) = n - s(n - 1) $$

Here, the parallel part of the workload grows linearly with $n$, while the serial part $s$ remains fixed.

```{python}
#| label: gustafson-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ GUSTAFSON SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-weak-scaling-gustafsons-law-eb7e
# │
# │ Goal: Define constants for Gustafson's Law prose lead-in text.
# │ Show: "5%" serial overhead — inline in prose before the example.
# │ How: Pure scalar constants; string formatting for inline Python refs.
# │
# │ Imports: (none — pure scalars)
# │ Exports: s_g_pct_str, s_g_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class GustafsonSetup:
    """Namespace for Gustafson's Law lead-in constants."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    s_g_value = 0.05  # Serial fraction (same as Amdahl example)

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # No derived values — only string formatting.

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are definitional constants.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    s_g_pct_str = "5"
    s_g_str = "0.05"
```

Using the same `{python} GustafsonSetup.s_g_pct_str`% serial overhead ($s$ = `{python} GustafsonSetup.s_g_str`), Gustafson's Law tells a very different story:

```{python}
#| label: gustafson-example
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ GUSTAFSON EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-weak-scaling-gustafsons-law-eb7e
# │
# │ Goal: Compute Gustafson's Law scaled speedup at n=8 and n=1000.
# │ Show: "~7.65x" at 8 GPUs, "~950x" at 1000 GPUs — inline in bullet list.
# │ How: Gustafson formula: n - s*(n-1).
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: gustafson_8_str, gustafson_8_serial, gustafson_1000_str,
# │ n_8_g_str, n_8_g_minus_1_str, n_1000_str, n_1000_minus_1_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt

# ┌── LEGO ───────────────────────────────────────────────
class GustafsonExample:
    """Namespace for Gustafson's Law scaled speedup at 8 and 1000 processors."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    s_g_value = 0.05
    n_8_g_value = 8
    n_1000_value = 1000

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    gustafson_8_value = n_8_g_value - s_g_value * (n_8_g_value - 1)
    gustafson_8_serial_value = s_g_value * (n_8_g_value - 1)
    gustafson_1000_value = n_1000_value - s_g_value * (n_1000_value - 1)

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are monotone functions of inputs.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    gustafson_8_str = fmt(gustafson_8_value, precision=2, commas=False)
    gustafson_8_serial = f"{gustafson_8_serial_value:.2f}"
    gustafson_1000_str = fmt(gustafson_1000_value, precision=0, commas=False)
    n_8_g_str = str(n_8_g_value)
    n_8_g_minus_1_str = str(n_8_g_value - 1)
    n_1000_str = str(n_1000_value)
    n_1000_minus_1_str = str(n_1000_value - 1)
```

* With $n=1$, speedup is 1.
* With n=`{python} GustafsonExample.n_8_g_str`, Scaled Speedup is `{python} GustafsonExample.n_8_g_str` - `{python} GustafsonSetup.s_g_str`$\times$ (`{python} GustafsonExample.n_8_g_minus_1_str`) = `{python} GustafsonExample.n_8_g_str` - `{python} GustafsonExample.gustafson_8_serial` = `{python} GustafsonExample.gustafson_8_str`$\times$.
* With n=`{python} GustafsonExample.n_1000_str`, Scaled Speedup is `{python} GustafsonExample.n_1000_str` - `{python} GustafsonSetup.s_g_str`$\times$ (`{python} GustafsonExample.n_1000_minus_1_str`) ≈ `{python} GustafsonExample.gustafson_1000_str`$\times$.

In weak scaling, efficiency remains high because the useful work (training the model) scales up to dwarf the fixed overheads.

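The contrast between the two regimes is easy to see side by side. A minimal sketch using the same 5% serial fraction as the examples above:

```python
# Illustrative comparison of strong vs. weak scaling at s = 0.05.
def amdahl_speedup(s: float, n: int) -> float:
    """Strong scaling: fixed problem size, speedup bounded by 1/s."""
    return 1 / (s + (1 - s) / n)

def gustafson_speedup(s: float, n: int) -> float:
    """Weak scaling: parallel work grows with n, serial fraction stays fixed."""
    return n - s * (n - 1)

for n in (8, 64, 1000):
    print(f"n={n:5d}  Amdahl {amdahl_speedup(0.05, n):6.2f}x   Gustafson {gustafson_speedup(0.05, n):7.2f}x")
# Amdahl saturates near 1/s = 20x; Gustafson keeps growing (~950x at n = 1000).
```
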
```{python}
#| label: training-time-example
#| echo: false
from mlsysim.core.constants import (
    BILLION, TRILLION, SECONDS_PER_MINUTE, SEC_PER_DAY,
    A100_FLOPS_FP16_TENSOR, TFLOPs, second
)
from mlsysim.fmt import fmt, check, md_math, sci_latex

# =============================================================================
# PURPOSE
# =============================================================================
# Purpose: Estimate training time from model scale and utilization
# Used in: Training time equation example

# ┌── LEGO ───────────────────────────────────────────────
class TrainingTimeRef:
    """
    Reference calculation for the Training Time Equation.
    Scenario: Training 1B model on 20B tokens using 1 A100.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    p_params = 1 * BILLION
    d_tokens = 20 * BILLION
    n_gpus = 1
    x_flops = A100_FLOPS_FP16_TENSOR.m_as(TFLOPs / second)
    u_mfu = 0.40

    # Rename variables for internal logic
    p_params_value = p_params
    d_tokens_value = d_tokens
    n_gpus_value = n_gpus
    x_flops_value = x_flops * TRILLION
    u_mfu_value = u_mfu

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    total_flops = 6 * p_params_value * d_tokens_value
    # Step 1: x_flops_value is in TFLOPS (1e12)
    throughput = n_gpus_value * (x_flops_value) * u_mfu_value

    t_seconds_value = total_flops / throughput
    t_minutes_value = t_seconds_value / SECONDS_PER_MINUTE
    t_days_value = t_seconds_value / SEC_PER_DAY

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(t_seconds_value > 0, "Training time must be positive.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    T_seconds_str = fmt(t_seconds_value, precision=0, commas=False)
    T_minutes_str = fmt(t_minutes_value, precision=0, commas=False)
    T_days_str = fmt(t_days_value, precision=0, commas=False)
    p_params_str = f"{p_params/BILLION:.0f}B"
    d_tokens_str = f"{d_tokens/BILLION:.0f}B"
    n_gpus_str = str(int(n_gpus))
    u_mfu_pct_str = str(int(u_mfu * 100))
    eq_total_flops = md_math(f"\\text{{Total FLOPs}} = 6 \\times {sci_latex(p_params, 0)} \\times {sci_latex(d_tokens, 0)} = {sci_latex(total_flops, 1)} \\text{{ FLOPs}}")
    eq_throughput = md_math(f"\\text{{Throughput}} = {int(n_gpus)} \\times ({sci_latex(x_flops*TRILLION, 0)}) \\times {u_mfu:.2f} \\approx {sci_latex(throughput, 2)} \\text{{ FLOP/s}}")
    eq_time = md_math(f"T = \\frac{{{sci_latex(total_flops, 1)}}}{{{sci_latex(throughput, 2)}}} \\approx {fmt(t_seconds_value, precision=0, commas=True)} \\text{{ seconds}} \\approx {fmt(t_minutes_value, precision=0, commas=True)} \\text{{ minutes}}")

```

::: {.callout-notebook title="The Training Time Equation"}

Just as classical architecture has an "Iron Law" of performance, Large Language Model training has a fundamental governing equation. To estimate training time $T$:
$$ T \approx \frac{6 \cdot P \cdot D}{N \cdot X \cdot U} $$
Where:

* **$6$**: The factor deriving from the forward pass ($2P$ FLOPs per token) and backward pass ($4P$ FLOPs per token).
* **$P$**: Number of model parameters.
* **$D$**: Number of training tokens.
* **$N$**: Number of accelerators (GPUs).
* **$X$**: Peak FLOP/s of one accelerator.
* **$U$**: Model FLOPs Utilization (MFU), typically 30%–50%.

**Example**: Training a **`{python} TrainingTimeRef.p_params_str` parameter** model on **`{python} TrainingTimeRef.d_tokens_str` tokens** using **`{python} TrainingTimeRef.n_gpus_str` A100** (`{python} AppendixMachineSetup.a100_fp16` TFLOPS) at **`{python} TrainingTimeRef.u_mfu_pct_str`% utilization**.
`{python} TrainingTimeRef.eq_total_flops`
`{python} TrainingTimeRef.eq_throughput`
`{python} TrainingTimeRef.eq_time`

The computed result: **`{python} TrainingTimeRef.T_seconds_str` seconds** (≈ `{python} TrainingTimeRef.T_minutes_str` minutes, or about `{python} TrainingTimeRef.T_days_str` days).

:::
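The same estimate is convenient as a reusable helper. A minimal sketch (the ~312 TFLOP/s figure is the A100 FP16 peak used in the example above; parameters, tokens, and MFU are whatever you plug in):

```python
# Illustrative helper for the training-time equation T ~= 6*P*D / (N*X*U).
def training_time_days(params: float, tokens: float, n_accel: int,
                       peak_flops_per_accel: float, mfu: float) -> float:
    """Estimated wall-clock days; peak_flops_per_accel is in FLOP/s."""
    seconds = 6 * params * tokens / (n_accel * peak_flops_per_accel * mfu)
    return seconds / 86_400

# 1B parameters, 20B tokens, one ~312 TFLOP/s accelerator at 40% MFU: ~11 days.
print(f"{training_time_days(1e9, 20e9, 1, 312e12, 0.40):.1f} days")
```
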
::: {.callout-checkpoint title="Check Your Understanding: Performance Models"}

1. A new accelerator doubles compute throughput but keeps memory bandwidth the same. For a workload that is **memory-bound** on the current hardware, how much speedup do you expect? What about a **compute-bound** workload?

2. Your training pipeline has 10% serial overhead. Using Amdahl's Law, what is the maximum possible speedup regardless of how many accelerators you add? Using Gustafson's Law with 256 accelerators, what is the scaled speedup?

3. An inference service must handle 500 queries per second (QPS) at 100 ms latency. Using Little's Law, how many concurrent requests must the system support? If each request needs 2 GB of KV cache memory, what is the minimum accelerator memory required?

:::

### Little's Law {#sec-machine-foundations-littles-law-21a3}

For capacity planning in inference systems, **Little's Law** [@little1961proof] relates concurrency ($L$), arrival rate ($\lambda$), and latency ($W$):[^fn-little]

[^fn-little]: **John Little** is an Institute Professor at MIT and a pioneer in the field of operations research. His law, proved in 1961, is fundamental to queuing theory and is used across fields from manufacturing to computer network analysis.

$$ L = \lambda \times W $$

```{python}
#| label: littles-law-example
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LITTLE'S LAW EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-littles-law-21a3
# │
# │ Goal: Apply Little's Law to size inference concurrency and memory.
# │ Show: "50" concurrent requests, "24" GPU memory cap at 1 GB/req — inline.
# │ How: L = λ × W; max_concurrent = gpu_mem / mem_per_req.
# │
# │ Imports: mlsysim.book (fmt, md_math)
# │ Exports: L_concurrent_str, lambda_qps_str, lambda_qps_raw_str,
# │ w_latency_ms_str, w_latency_s_str, mem_per_req_gb_str,
# │ gpu_mem_gb_str, max_concurrent_str, eq_max_throughput
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt, md_math

# ┌── LEGO ───────────────────────────────────────────────
class LittlesLawExample:
    """Namespace for Little's Law concurrency sizing for inference systems."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    lambda_qps_value = 1000    # queries per second
    w_latency_s_value = 0.050  # 50 ms in seconds
    mem_per_req_gb_value = 1
    gpu_mem_gb_value = 24

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    l_concurrent_value = lambda_qps_value * w_latency_s_value
    max_concurrent_value = int(gpu_mem_gb_value / mem_per_req_gb_value)
    max_throughput_value = max_concurrent_value / w_latency_s_value

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are monotone functions of inputs.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    L_concurrent_str = fmt(l_concurrent_value, precision=0, commas=False)
    lambda_qps_str = fmt(lambda_qps_value, precision=0)
    lambda_qps_raw_str = fmt(lambda_qps_value, precision=0, commas=False)
    w_latency_ms_str = str(int(w_latency_s_value * 1000))
    w_latency_s_str = f"{w_latency_s_value:.2f}"
    mem_per_req_gb_str = str(mem_per_req_gb_value)
    gpu_mem_gb_str = str(gpu_mem_gb_value)
    max_concurrent_str = str(max_concurrent_value)
    eq_max_throughput = md_math(f"L/W = {max_concurrent_value} / {w_latency_s_value} = {int(max_throughput_value)} \\text{{ QPS}}")
```

To see this in practice, consider sustaining `{python} LittlesLawExample.lambda_qps_str` queries per second (QPS) with `{python} LittlesLawExample.w_latency_ms_str` ms average latency. The law tells us the system must support `{python} LittlesLawExample.lambda_qps_raw_str`$\times$ `{python} LittlesLawExample.w_latency_s_str` = `{python} LittlesLawExample.L_concurrent_str` concurrent requests.

This directly determines how to size inference worker pools. If serving one request requires `{python} LittlesLawExample.mem_per_req_gb_str` GB of temporary memory (KV cache, activations), handling `{python} LittlesLawExample.L_concurrent_str` concurrent requests requires `{python} LittlesLawExample.L_concurrent_str` GB of memory. If your accelerator only has `{python} LittlesLawExample.gpu_mem_gb_str` GB, you are physically limited to `{python} LittlesLawExample.max_concurrent_str` concurrent requests. Your maximum throughput is capped at `{python} LittlesLawExample.eq_max_throughput`, regardless of how many requests arrive.

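A minimal sketch of this sizing logic (same illustrative numbers as above):

```python
# Illustrative Little's Law sizing for an inference service.
def required_concurrency(qps: float, latency_s: float) -> float:
    return qps * latency_s                    # L = lambda * W

def memory_capped_qps(accel_mem_gb: float, mem_per_req_gb: float, latency_s: float) -> float:
    max_concurrent = accel_mem_gb // mem_per_req_gb
    return max_concurrent / latency_s         # throughput ceiling = L_max / W

print(required_concurrency(qps=1000, latency_s=0.050))                        # 50 concurrent requests
print(memory_capped_qps(accel_mem_gb=24, mem_per_req_gb=1, latency_s=0.050))  # 480 QPS ceiling
```
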
These physics-based models—Roofline, Amdahl, Gustafson, and Little—diagnose *where* bottlenecks lie. But translating those diagnoses into actionable optimizations requires understanding the concrete hardware structures that impose them: caches, memory buses, and interconnects.

## Computer Architecture Essentials {#sec-machine-foundations-computer-architecture-essentials-10c2}

A GPU advertises 1,000 TFLOP/s, yet your kernel achieves only 30 TFLOP/s. The missing 97% is not a software bug—it is the cost of moving data through a memory hierarchy that spans five orders of magnitude in latency. While physics sets theoretical performance bounds, computer architecture defines the machinery that determines how close a real workload can get. The following discussion covers the latency, bandwidth, and energy trade-offs that shape system design.

### Latencies Every Programmer Should Know {#sec-machine-foundations-latencies-every-programmer-know-e89f}

The first step in systems intuition is understanding the cost of distance. @tbl-latency-hierarchy quantifies how long the processor waits for data from different levels of the memory hierarchy. If accessing a register is like picking up a pencil from your desk, fetching from HBM is walking across the office, and fetching from disk is flying to the moon.

| **Component** | **Latency (ns)** | **Cycles (Approx)** | **Relative "Distance"** |
|:-------------------------|----------------------------------------------:|--------------------:|:-----------------------:|
| **Register** | ~0.3 ns | 1 cycle | 10 seconds |
| **L1 Cache** | ~`{python} AppendixMachineSetup.l1_ns` ns | 3–4 cycles | 1 minute |
| **L2 Cache** | ~`{python} AppendixMachineSetup.l2_ns` ns | 12 cycles | 4 minutes |
| **HBM3 (GPU Memory)** | ~`{python} AppendixMachineSetup.hbm_ns` ns | 1,000 cycles | 5 hours |
| **NVLink (GPU-GPU)** | ~`{python} AppendixMachineSetup.nvlink_ns` ns | 1,500 cycles | 8 hours |
| **PCIe (CPU-GPU)** | ~`{python} AppendixMachineSetup.pcie_ns` ns | 3,000 cycles | 1 day |
| **InfiniBand (Network)** | ~`{python} AppendixMachineSetup.ib_ns` ns | 15,000 cycles | 1 week |
| **SSD (NVMe)** | ~`{python} AppendixMachineSetup.ssd_ns` ns | 300,000 cycles | 3 months |

: **The Latency Hierarchy.** Access times for modern AI hardware. Note the massive jump from SRAM (Cache) to HBM. Any kernel that misses cache pays a heavy penalty. {#tbl-latency-hierarchy}

### The AI Hardware Cheat Sheet (Modern Reference) {#sec-machine-foundations-ai-hardware-cheat-sheet-modern-reference-4dc1}

While latency tells us how long we wait for the *first* byte, bandwidth tells us how many bytes follow. @tbl-hardware-cheatsheet provides the constants for back-of-the-envelope "Roofline" calculations. These represent the "standard units of compute" for the current era of machine learning.

| **Spec** | **NVIDIA H100 (SXM)** | **Google TPU v5p** | **System Impact** |
|:---------------------|----------------------------------------------------------:|-----------------------------------------------------:|:----------------------------------------|
| **FP16/BF16 Peak** | `{python} AppendixMachineSetup.h100_flops` TFLOPS | `{python} AppendixMachineSetup.tpuv5_flops` TFLOPS | The "Speed Limit" ($R_{peak}$) |
| **Memory Bandwidth** | `{python} AppendixMachineSetup.h100_bw` TB/s | `{python} AppendixMachineSetup.tpuv5_bw` TB/s | The "Width of the Pipe" ($BW$) |
| **HBM Capacity** | `{python} AppendixMachineSetup.h100_cap` GB | `{python} AppendixMachineSetup.tpuv5_cap` GB | Max Model Size ($P$) / Batch Size ($B$) |
| **L2/SRAM Cache** | `{python} AppendixMachineSetup.h100_l2_mb` MB | ~`{python} AppendixMachineSetup.tpuv5_l2_mb` MB | Critical for Operator Fusion |
| **Interconnect** | `{python} AppendixMachineSetup.h100_nvlink` GB/s (NVLink) | `{python} AppendixMachineSetup.tpuv5_ici` GB/s (ICI) | Determines Model Parallelism Scaling |

: **Reference Specs.** Key constants for quantitative analysis. Always check specific datasheets, but these serve as standard units of compute. {#tbl-hardware-cheatsheet}


### The Memory Hierarchy {#sec-machine-foundations-memory-hierarchy-2278}

Computer systems use a hierarchy because no single technology provides both high capacity and low latency. Examine the pyramid in @fig-memory-hierarchy to see how each level balances this tradeoff: every technique that keeps data higher in the pyramid (registers/cache) directly improves performance.

::: {#fig-memory-hierarchy fig-env="figure" fig-pos="htb" fig-cap="**The Memory Hierarchy**: Performance depends on data proximity. Accessing HBM is ~100$\times$ slower than registers; accessing SSD is ~100,000$\times$ slower." fig-alt="Pyramid showing Registers at top, followed by Cache, HBM/DRAM, and Storage at bottom."}
```{.tikz}
\begin{tikzpicture}[line cap=round, line join=round, font=\usefont{T1}{phv}{m}{n}\small]
% --- parameters ---
\def\H{6.2} % triangle height
\def\W{4.6} % half triangle width

% --- levels (bottom to top) ---
\def\yone{1.5}
\def\ytwo{3}
\def\ythree{4.5}
% --- macro: compute the half-width at height #1 and store it in #2 ---
\newcommand{\halfwidthat}[2]{%
  \pgfmathsetmacro#2{\W*(1-#1/\H)}%
}
% --- calculate the required widths ---
\halfwidthat{0}{\wzero}
\halfwidthat{\yone}{\wone}
\halfwidthat{\ytwo}{\wtwo}
\halfwidthat{\ythree}{\wthree}
% --- vertical centers of each band (x=0 by symmetry) ---
\pgfmathsetmacro{\ycA}{0.5*(0+\yone)}
\pgfmathsetmacro{\ycB}{0.5*(\yone+\ytwo)}
\pgfmathsetmacro{\ycC}{0.5*(\ytwo+\ythree)}
\pgfmathsetmacro{\ycD}{0.46*(\ythree+\H)}
% --- bands, drawn top to bottom ---
\filldraw[fill=RedFill, draw=RedLine, line width=1pt]
  (-\wthree,\ythree) -- (\wthree,\ythree) -- (0,\H) -- cycle;

\filldraw[fill=YellowFill, draw=YellowLine, line width=1pt]
  (-\wtwo,\ytwo) -- (\wtwo,\ytwo) -- (\wthree,\ythree) -- (-\wthree,\ythree) -- cycle;

\filldraw[fill=BlueFill, draw=BlueLine, line width=1pt]
  (-\wone,\yone) -- (\wone,\yone) -- (\wtwo,\ytwo) -- (-\wtwo,\ytwo) -- cycle;

\filldraw[fill=GreenFill, draw=GreenD, line width=1pt]
  (-\wzero,0) -- (\wzero,0) -- (\wone,\yone) -- (-\wone,\yone) -- cycle;

% --- text ---
\node[font=\usefont{T1}{phv}{b}{n}\small,text=GreenD] at (0,\ycA) {Storage (SSD / Disk)};
\node[font=\usefont{T1}{phv}{b}{n}\small,text=BlueLine] at (0,\ycB) {HBM / DRAM};
\node[font=\usefont{T1}{phv}{b}{n}\small,text=YellowLine] at (0,\ycC) {L1 / L2 / L3 Cache};
\node[font=\usefont{T1}{phv}{b}{n}\small,text=RedLine] at (0,\ycD) {Registers};
%
\coordinate(D)at($(\W,0)+(0.65,0)$);
\coordinate(L)at($(-\W,0)+(-0.65,0)$);
\coordinate(V)at($(0,\H)+(0,0)$);
\path[green](D)|-coordinate(D1)(V);
\path[green](L)|-coordinate(L1)(V);
%
\draw[->,>=Latex,line width=1pt,draw=black!40](D)--
  node[align=center,right]{Faster Speed\\ Lower Latency}(D1);
\draw[->,>=Latex,line width=1pt,draw=black!40](L1)--
  node[align=center,left]{Larger Capacity\\ Lower Cost}(L);
% --- outline ---
%\draw[thick] (-\W,0) -- (0,\H) -- (\W,0) -- cycle;
\end{tikzpicture}
```
:::

The memory hierarchy is the fundamental physical constraint of machine learning systems. @tbl-physical-hierarchy-ref consolidates the physical properties—latency, bandwidth, and energy—across the entire stack.

| **Layer**             | **Technology** | **Latency**                                    | **Bandwidth**                                       | **Energy (per 32b)**                        |
|:----------------------|:---------------|-----------------------------------------------:|-----------------------------------------------------:|---------------------------------------------:|
| **Registers**         | Flip-Flops     | ~0.3 ns                                        | —                                                   | `{python} AppendixMachineSetup.e_reg` pJ    |
| **L1 Cache**          | SRAM           | ~`{python} AppendixMachineSetup.l1_ns` ns      | —                                                   | `{python} AppendixMachineSetup.e_l1` pJ     |
| **L2 Cache**          | SRAM           | ~`{python} AppendixMachineSetup.l2_ns` ns      | —                                                   | `{python} AppendixMachineSetup.e_l2` pJ     |
| **Memory (Local)**    | HBM3           | ~`{python} AppendixMachineSetup.hbm_ns` ns     | `{python} AppendixMachineSetup.bw_hbm_str` GB/s     | `{python} AppendixMachineSetup.e_dram` pJ   |
| **Interconnect**      | NVLink 4.0     | ~`{python} AppendixMachineSetup.nvlink_ns` ns  | `{python} AppendixMachineSetup.bw_nvlink_str` GB/s  | ~`{python} AppendixMachineSetup.e_dram` pJ  |
| **Host Link**         | PCIe Gen5      | ~`{python} AppendixMachineSetup.pcie_ns` ns    | `{python} AppendixMachineSetup.bw_pcie_str` GB/s    | ~`{python} AppendixMachineSetup.e_dram` pJ  |
| **System RAM**        | DDR5           | ~100 ns                                        | `{python} AppendixMachineSetup.bw_dram_str` GB/s    | ~`{python} AppendixMachineSetup.e_dram` pJ  |
| **Network (Fabric)**  | InfiniBand NDR | ~`{python} AppendixMachineSetup.ib_ns` ns      | `{python} AppendixMachineSetup.bw_net_str` GB/s     | `{python} AppendixMachineSetup.e_net` pJ    |
| **Storage (Local)**   | NVMe SSD       | ~`{python} AppendixMachineSetup.ssd_ns` ns     | `{python} AppendixMachineSetup.bw_ssd_str` GB/s     | `{python} AppendixMachineSetup.e_ssd` pJ    |

: **Physical Properties of the Memory Hierarchy (c. 2024)**: Consolidating latency, bandwidth, and energy across the memory hierarchy. The hierarchy spans five orders of magnitude in latency and six orders of magnitude in energy per access. For the ML engineer, this table defines the "Silicon Contract": every optimization that moves data one layer higher in the hierarchy delivers an order-of-magnitude dividend in performance. {#tbl-physical-hierarchy-ref}

The hierarchy's energy costs reveal why data movement dominates modern system design.

::: {.callout-notebook title="The High Cost of Data Movement"}
Fetching a 32-bit value from DRAM costs roughly **`{python} AppendixMachineSetup.energy_ratio_str`$\times$ more energy** than performing a floating-point operation on it (e.g., ~`{python} AppendixMachineSetup.dram_pj` pJ vs ~`{python} AppendixMachineSetup.flop_pj` pJ). This "Energy Wall" means that maximizing **arithmetic intensity** (doing many ops per loaded byte) is the only way to be energy efficient.
:::
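
To see how lopsided this is, here is a minimal back-of-the-envelope sketch for a dot product streamed from DRAM. The picojoule constants are assumed, order-of-magnitude values in the spirit of the table above; the exact figures depend on the process node and memory technology.

```python
# Rough energy split for a dot product of two N-element FP32 vectors in DRAM.
# Both energy constants are assumed, order-of-magnitude placeholders.
PJ_PER_FLOP = 1.0         # ~1 pJ per FP32 operation (assumed)
PJ_PER_DRAM_WORD = 640.0  # ~640 pJ per 32-bit DRAM access (assumed)

n = 1_000_000
flops = 2 * n            # one multiply and one add per element pair
dram_accesses = 2 * n    # one load from each input vector

compute_uj = flops * PJ_PER_FLOP / 1e6
memory_uj = dram_accesses * PJ_PER_DRAM_WORD / 1e6

print(f"Compute energy: {compute_uj:8.1f} uJ")
print(f"Memory energy:  {memory_uj:8.1f} uJ")
print(f"Data movement costs {memory_uj / compute_uj:.0f}x more than the math")
```

The ratio only improves if each loaded value is reused many times, which is exactly what raising arithmetic intensity means.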

### Bandwidth vs. Latency {#sec-machine-foundations-bandwidth-vs-latency-5320}

Bandwidth (throughput) and latency (delay) are distinct constraints. Total transfer time follows:

$$ T = \text{Latency} + \frac{\text{Data Size}}{\text{Bandwidth}} $$

For small transfers (e.g., single-token inference), latency dominates. For large transfers (e.g., loading weights), bandwidth dominates.

```{python}
#| label: bandwidth-latency-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BANDWIDTH-LATENCY SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-bandwidth-vs-latency-5320
# │
# │ Goal: Define constants for bandwidth-latency prose lead-in text.
# │ Show: "10 Gbps", "10 ms ping", "1 KB" — inline before the example.
# │ How: Pure scalar constants; string formatting for inline Python refs.
# │
# │ Imports: (none — pure scalars)
# │ Exports: bw_gbps_str, ping_ms_str, data_kb_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class BandwidthLatencySetup:
    """Namespace for bandwidth-latency lead-in constants."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    bw_gbps_value = 10   # Link bandwidth in Gbps
    ping_ms_value = 10   # Network latency in ms
    data_kb_value = 1    # Packet size in KB

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    # No derived values — only string formatting.

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are definitional constants.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    bw_gbps_str = "10"
    ping_ms_str = "10"
    data_kb_str = "1"
```

Consider sending data over a `{python} BandwidthLatencySetup.bw_gbps_str` Gbps link with `{python} BandwidthLatencySetup.ping_ms_str` ms ping (latency). The dominant bottleneck depends entirely on the transfer size:

```{python}
#| label: bandwidth-latency-example
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BANDWIDTH-LATENCY EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-machine-foundations-bandwidth-vs-latency-5320
# │
# │ Goal: Estimate transmission time for small (latency-bound) and large
# │       (bandwidth-bound) transfers to illustrate the T = lat + size/bw formula.
# │ Show: "0.8 μs" for 1 KB packet vs "~810 ms" for 1 GB checkpoint — inline.
# │ How: tx_time = (data_bytes * 8) / bw_bits; large_time = size / bw + ping.
# │
# │ Imports: mlsysim.core.constants (MILLION, BILLION), mlsysim.book (fmt, md_math)
# │ Exports: tx_time_us_str, large_data_gb_str, eq_large_tx, eq_large_total
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt, md_math

# ┌── LEGO ───────────────────────────────────────────────
class BandwidthLatencyExample:
    """Namespace for latency-bound vs. bandwidth-bound transmission time example."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    data_kb_value = 1e3    # 1 KB in bytes
    bw_gbps_value = 10e9   # 10 Gbps in bits/s
    large_data_gb_value = 1
    ping_ms_value = 10

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    tx_time_s_value = (data_kb_value * 8) / bw_gbps_value
    tx_time_us_value = tx_time_s_value * MILLION

    large_data_bits_value = large_data_gb_value * 8e9
    large_tx_time_s_value = large_data_bits_value / bw_gbps_value
    total_large_time_ms_value = ping_ms_value + large_tx_time_s_value * 1000

    # ┌── 3. GUARD (Invariants) ──────────────────────────────────────────
    # No check() calls needed — values are monotone functions of inputs.

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    tx_time_us_str = fmt(tx_time_us_value, precision=1, commas=False)
    large_data_gb_str = str(large_data_gb_value)
    eq_large_tx = md_math(f"{large_data_gb_value}\\text{{GB}} / {int(bw_gbps_value/BILLION)}\\text{{Gbps}} \\approx {int(large_tx_time_s_value*1000)}\\text{{ms}}")
    eq_large_total = md_math(f"\\approx {int(ping_ms_value)}\\text{{ms}} + {int(large_tx_time_s_value*1000)}\\text{{ms}} = {int(total_large_time_ms_value)}\\text{{ms}}")
```

* **Latency-Bound (`{python} BandwidthLatencySetup.data_kb_str` KB Packet)**:
    * Transmission: `{python} BandwidthLatencySetup.data_kb_str` KB / `{python} BandwidthLatencySetup.bw_gbps_str` Gbps ≈ `{python} BandwidthLatencyExample.tx_time_us_str` μs.
    * Total Time ≈ `{python} BandwidthLatencySetup.ping_ms_str` ms + `{python} BandwidthLatencyExample.tx_time_us_str` μs ≈ `{python} BandwidthLatencySetup.ping_ms_str` ms.
    * *Result*: The bandwidth is irrelevant; the speed of light (ping) is the bottleneck.

* **Bandwidth-Bound (`{python} BandwidthLatencyExample.large_data_gb_str` GB Checkpoint)**:
    * Transmission: `{python} BandwidthLatencyExample.eq_large_tx`.
    * Total Time `{python} BandwidthLatencyExample.eq_large_total`.
    * *Result*: The ping is negligible; the pipe size is the bottleneck. The sketch after this list reproduces both calculations.
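
A minimal sketch of the same arithmetic, using the link parameters assumed above, is handy for checking your own numbers:

```python
# Total transfer time: T = latency + size / bandwidth.
def transfer_time_s(size_bytes, bandwidth_bps, latency_s):
    return latency_s + (size_bytes * 8) / bandwidth_bps

LINK_BW = 10e9   # 10 Gbps link (assumed, as above)
PING = 10e-3     # 10 ms latency (assumed, as above)

small = transfer_time_s(1e3, LINK_BW, PING)  # 1 KB packet
large = transfer_time_s(1e9, LINK_BW, PING)  # 1 GB checkpoint

print(f"1 KB packet:     {small * 1e3:7.2f} ms  (latency-bound)")
print(f"1 GB checkpoint: {large * 1e3:7.2f} ms  (bandwidth-bound)")
```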

Architecture determines how fast data can move, but there is another lever that directly controls *how much* data must move: the numerical precision of each value. Halving precision from FP32 to FP16 halves the bytes per parameter, which doubles effective bandwidth for free—if the model can tolerate the reduced precision. Understanding these trade-offs requires a closer look at how numbers are represented in hardware.

## Numerical Representations {#sec-machine-foundations-numerical-representations-c889}

While statistics helps us understand data distributions, numerical representations determine how we store the values themselves. In ML systems, the choice of precision (FP32 vs. BF16 vs. INT8) is a direct trade-off between statistical fidelity and hardware throughput.

::: {.callout-perspective title="Why This Matters"}
Your production model runs at 50 QPS in FP32, but your target is 200 QPS. Switching to INT8 could get you there, but will accuracy suffer? Understanding numerical formats lets you make this trade-off quantitatively rather than hoping for the best.
:::

### Floating-Point Format Comparison {#sec-machine-foundations-floatingpoint-format-comparison-1836}

The IEEE 754 standard and its AI-specific derivatives define different trade-offs between dynamic range (the span of representable values) and precision (how finely you can represent values within that range). @tbl-numerical-formats summarizes the key formats and their use cases, while @fig-float-formats visualizes the bit allocations.

| **Format** | **Bits** | **Exponent** | **Mantissa** | **Dynamic Range**                      | **Typical Use Case**                            |
|:-----------|---------:|-------------:|-------------:|----------------------------------------:|:-------------------------------------------------|
| **FP32**   | 32       | 8            | 23           | $\sim 10^{-38}$ to $10^{38}$           | Training (full precision), reference inference  |
| **FP16**   | 16       | 5            | 10           | $\sim 10^{-5}$ to $6.5 \times 10^{4}$  | Training with loss scaling, inference           |
| **BF16**   | 16       | 8            | 7            | Same as FP32                           | Training (preferred), avoids loss scaling       |
| **FP8**    | 8        | 4 or 5       | 3 or 2       | Varies                                 | Inference on newest hardware (H100+)            |
| **INT8**   | 8        | N/A          | N/A          | -128 to 127                            | Inference after quantization                    |

: **Numerical Format Comparison**: Each format trades off precision, dynamic range, memory footprint, and compute throughput. BF16 has emerged as the preferred training format because it matches FP32's range while using half the memory. {#tbl-numerical-formats}

::: {#fig-float-formats fig-env="figure" fig-pos="htb" fig-cap="**Numerical Format Bit Layouts**: A visual comparison of bit allocations. Note how **BF16** (Brain Float 16) preserves the 8-bit exponent of **FP32**, ensuring the same dynamic range for training stability. **FP16** trades range for precision, often requiring loss scaling to prevent underflow." fig-alt="Stacked horizontal bars showing bit breakdown. FP32: 1 Sign, 8 Exp, 23 Mantissa. BF16: 1 Sign, 8 Exp, 7 Mantissa. FP16: 1 Sign, 5 Exp, 10 Mantissa. INT8: 8 Integer bits."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}, scale=1.0]
\tikzset{
  BitBox/.style={draw=white, line width=0.8pt, minimum height=0.6cm, align=center, font=\scriptsize\bfseries\usefont{T1}{phv}{m}{n}, text=white},
  Label/.style={text=TextBlack, font=\small\bfseries\usefont{T1}{phv}{m}{n}, anchor=east}
}

% Colors
\definecolor{SignColor}{HTML}{D9534F} % Red
\definecolor{ExpColor}{HTML}{5BC0DE}  % Blue
\definecolor{MantColor}{HTML}{F0AD4E} % Orange
\definecolor{IntColor}{HTML}{5CB85C}  % Green

% FP32
\node[Label] at (-0.2, 3) {FP32 (32-bit)};
\node[BitBox, fill=SignColor, minimum width=0.3cm] (fp32_s) at (0.15, 3) {S};
\node[BitBox, fill=ExpColor, minimum width=2.4cm, right=0pt of fp32_s] (fp32_e) {Exponent (8)};
\node[BitBox, fill=MantColor, minimum width=6.9cm, right=0pt of fp32_e] (fp32_m) {Mantissa (23)};

% BF16
\node[Label] at (-0.2, 2) {BF16 (16-bit)};
\node[BitBox, fill=SignColor, minimum width=0.3cm] (bf16_s) at (0.15, 2) {S};
\node[BitBox, fill=ExpColor, minimum width=2.4cm, right=0pt of bf16_s] (bf16_e) {Exponent (8)};
\node[BitBox, fill=MantColor, minimum width=2.1cm, right=0pt of bf16_e] (bf16_m) {Mant (7)};
\node[right=0.2cm of bf16_m, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] {Matches FP32 Range};

% FP16
\node[Label] at (-0.2, 1) {FP16 (16-bit)};
\node[BitBox, fill=SignColor, minimum width=0.3cm] (fp16_s) at (0.15, 1) {S};
\node[BitBox, fill=ExpColor, minimum width=1.5cm, right=0pt of fp16_s] (fp16_e) {Exp (5)};
\node[BitBox, fill=MantColor, minimum width=3.0cm, right=0pt of fp16_e] (fp16_m) {Mantissa (10)};

% INT8
\node[Label] at (-0.2, 0) {INT8 (8-bit)};
\node[BitBox, fill=IntColor, minimum width=2.4cm] (int8) at (1.2, 0) {Integer (8)};

% Grid / Scale markers (approximate)
\draw[gray!30, dashed] (0, -0.5) -- (0, 3.5);
\draw[gray!30, dashed] (9.6, -0.5) -- (9.6, 3.5);
\node[below, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] at (0, -0.5) {Bit 31/15/7};
\node[below, font=\scriptsize\usefont{T1}{phv}{m}{n}, text=gray] at (9.6, -0.5) {Bit 0};

\end{tikzpicture}
```
:::

Beyond bit width, the allocation of bits between exponent and mantissa determines what range of values each format can represent.

::: {.callout-perspective title="The Dynamic Range Wall"}

The choice of numerical format is a direct application of the **Iron Law of ML Systems** (Principle \ref{pri-iron-law}). Reducing precision from FP32 to BF16 or FP16 halves the **Data Movement** term in the denominator, potentially doubling throughput on memory-bound workloads. However, the *type* of 16-bit format determines the engineering complexity:

* **Dynamic Range (The Exponent)**: BF16 preserves the 8-bit exponent of FP32. This means it can represent the same range of extremely large and extremely small values (gradients).
* **Precision (The Mantissa)**: FP16 has a larger 10-bit mantissa than BF16 (7 bits), offering higher precision for values within its range. But its 5-bit exponent is a major constraint; gradients often "vanish" to zero (underflow) because the exponent cannot represent them. To solve this, FP16 training requires **Loss Scaling**, an operational overhead where gradients are multiplied by a large constant to push them into the representable range.
* **Energy Efficiency**: INT8 operations are significantly more energy-efficient than floating-point equivalents because they utilize simpler integer ALUs and require less silicon area. Moving to INT8 for inference is the primary lever for deploying LLMs on battery-constrained edge devices.

:::

Among these formats, BF16[^fn-bf16] deserves special attention [@google_bfloat16]. By matching FP32's 8-bit exponent while truncating the mantissa to just 7 bits, BF16 preserves the full dynamic range needed for gradient representation. This avoids the underflow problems that plague FP16 training, eliminating the need for complex loss scaling. Most modern training uses BF16 for this reason—it is effectively a "drop-in" half-precision replacement for FP32 that just works.

[^fn-bf16]: **BF16** was originally introduced with the Google TPUv2 and has since been adopted by Intel, Arm, and NVIDIA (starting with Ampere architectures).
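
The memory consequence is easy to quantify. The sketch below estimates weight memory for a hypothetical 7-billion-parameter model at each precision (weights only; activations, optimizer state, and KV caches add more on top).

```python
# Weight memory footprint by numerical format (weights only).
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "INT8": 1}

num_params = 7e9  # hypothetical 7B-parameter model (assumed)
for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {num_params * nbytes / 1e9:5.1f} GB of weights")

# Halving precision halves the bytes that cross every level of the memory
# hierarchy, which is why BF16 roughly doubles effective bandwidth.
```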

### Integer Quantization {#sec-machine-foundations-integer-quantization-5442}

Quantization maps continuous floating-point values to discrete integers, typically INT8. The key challenge is choosing how to map the floating-point range to integers. Two approaches dominate.

Symmetric quantization centers the mapping at zero:
$$ x_{\text{int}} = \text{round}\left(\frac{x}{\alpha} \times 127\right) $$
where $\alpha$ is the scale factor (typically the maximum absolute value). This works well for weight distributions centered around zero.

Asymmetric quantization handles distributions that are not centered (common after ReLU, which produces only non-negative values) by shifting the range before scaling. If $x_{\min}$ is the minimum of the range and $\alpha$ is the range width ($x_{\max} - x_{\min}$):
$$ x_{\text{int}} = \text{round}\left(\frac{x - x_{\min}}{\alpha} \times 255\right) $$

The choice between symmetric and asymmetric quantization depends on your tensor's distribution and has measurable accuracy implications.
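
A minimal NumPy sketch of both schemes is shown below. It is illustrative only: production quantizers also track zero-points explicitly, use per-channel scales, and calibrate ranges on representative data.

```python
import numpy as np

def quantize_symmetric(x):
    """Zero-centered INT8 mapping; suits roughly symmetric weight distributions."""
    alpha = np.max(np.abs(x))                      # scale: max absolute value
    q = np.round(x / alpha * 127).astype(np.int8)
    return q, alpha

def quantize_asymmetric(x):
    """Shift-then-scale UINT8 mapping; suits non-negative post-ReLU activations."""
    x_min, x_max = float(x.min()), float(x.max())
    alpha = x_max - x_min                          # range width
    q = np.round((x - x_min) / alpha * 255).astype(np.uint8)
    return q, alpha, x_min

weights = np.random.randn(1024).astype(np.float32)   # roughly zero-centered
activations = np.maximum(weights, 0.0)                # post-ReLU: non-negative only

q_w, scale_w = quantize_symmetric(weights)
q_a, scale_a, offset_a = quantize_asymmetric(activations)

# Round-trip error introduced by storing the weights in 8 bits.
reconstructed = q_w.astype(np.float32) / 127 * scale_w
print(f"Max symmetric round-trip error: {np.abs(reconstructed - weights).max():.4f}")
```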

With the full toolkit assembled—reference numbers, performance models, architectural constraints, and numerical trade-offs—it is worth pausing to address the most common ways engineers misapply these concepts. The following section catalogs fallacies and pitfalls that violate the physical and architectural principles covered above.

## Fallacies and Pitfalls {#sec-machine-foundations-fallacies-pitfalls-f9b1}

Even experienced engineers fall into traps when reasoning about hardware performance. The following misconceptions violate the physical and architectural principles covered in this appendix.

::: {.callout-warning title="Fallacy: Doubling accelerators halves training time."}
This assumes perfect strong scaling (Amdahl's Law). In practice, communication overhead (all-reduce) grows with $N$, and batch size constraints may limit parallelism. At large scale, you often hit diminishing returns unless you also scale the problem size (weak scaling).
:::
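
A minimal sketch makes the flattening visible. It combines Amdahl's Law with a communication term that grows with the number of accelerators; the serial fraction and the communication constant are illustrative assumptions, not measurements.

```python
# Strong-scaling estimate: Amdahl's Law plus a per-step all-reduce term.
# serial_frac and comm_cost are assumed, illustrative values.
def estimated_speedup(n_gpus, serial_frac=0.05, comm_cost=0.002):
    parallel_time = (1 - serial_frac) / n_gpus
    comm_time = comm_cost * (n_gpus - 1)   # grows with the number of workers
    return 1.0 / (serial_frac + parallel_time + comm_time)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:3d} accelerators -> {estimated_speedup(n):5.2f}x speedup")
```

Past a certain point the communication term dominates and the curve bends back down; only weak scaling (growing the problem with the machine) keeps utilization high.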

A related misconception concerns numerical precision.

::: {.callout-warning title="Fallacy: Higher precision (FP32) is always better."}
For deep learning, FP32 often hurts performance without improving convergence. It consumes 2$\times$ memory bandwidth and energy compared to BF16. Since neural networks are resilient to noise, the extra mantissa bits in FP32 are often modeling random variance rather than signal.
:::

With the common misconceptions addressed, use the reference numbers and models in this appendix as your first line of defense whenever a system behaves unexpectedly. A quick back-of-envelope calculation often reveals whether the culprit is physics, architecture, or a genuine software bug.

:::: {.callout-takeaways title="Numbers Every Engineer Should Know"}

- **Energy dominates**: Moving data costs ~600$\times$ more energy than computing on it. Arithmetic intensity—the ratio of compute to data movement—is the single most important metric for ML workload performance.
- **The Roofline Model** reveals whether a workload is compute-bound or memory-bound. Most inference workloads fall below the ridge point and are memory-bound; batch size is the primary lever to shift toward compute-bound operation.
- **Amdahl's Law** caps strong-scaling speedup at $1/s$ (where $s$ is the serial fraction). **Gustafson's Law** shows that scaling the problem alongside hardware yields near-linear throughput gains—the paradigm that makes large-scale training feasible.
- **Little's Law** ($L = \lambda W$) directly sizes inference infrastructure: concurrency, memory, and maximum throughput are all linked by this simple identity.
- **Memory hierarchy** spans five orders of magnitude in latency (register at ~0.3 ns to SSD at ~100,000 ns). Keeping data close to compute is not an optimization—it is the optimization.
- **Numerical precision** is a systems lever, not just a modeling choice. BF16 matches FP32's dynamic range at half the memory cost; INT8 quantization can deliver 2--4$\times$ inference speedup with careful calibration.
- **Physics is non-negotiable**: speed of light sets latency floors, energy ratios set efficiency ceilings, and no amount of software optimization can violate these constraints.

::::

## Further Reading {.unnumbered}

- **Hardware acceleration**: @sec-hardware-acceleration
- **Training system design**: @sec-model-training
- **Serving system design**: @sec-model-serving
- **Framework internals and kernels**: @sec-ml-frameworks