Mirror of https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-09 07:15:51 -05:00
feat: add unified memory hierarchy reference and data locality invariant
@@ -416,6 +416,21 @@ class AppendixMachineSetup:
    tpuv5_cap = tpuv5_cap_value
    tpuv5_ici = tpuv5_ici_value
    tpuv5_l2_mb = tpuv5_l2_mb_value

    # --- Unified Hierarchy Stats (Reference Table) ---
    bw_hbm_str = f"{H100_MEM_BW.m_as(GB/second):,.0f}"
    bw_nvlink_str = f"{NVLINK_H100_BW.m_as(GB/second):,.0f}"
    bw_pcie_str = f"{PCIE_GEN5_BW.m_as(GB/second):,.0f}"
    bw_dram_str = f"{SYSTEM_MEMORY_BW.m_as(GB/second):,.0f}"
    bw_ssd_str = f"{NVME_SEQUENTIAL_BW.m_as(GB/second):.1f}"
    bw_net_str = f"{INFINIBAND_NDR_BW_GBS}"

    e_reg = "0.01"
    e_l1 = "0.5"
    e_l2 = "2.0"
    e_dram = "640"
    e_ssd = "~5,000"   # ~1.2 uJ per 1 KB read -> ~5 nJ per 32b
    e_net = "~10,000"  # ~1 uJ per 1 KB packet (header overhead)
```
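
The comment on `e_ssd` compresses a unit conversion; spelling it out (taking 1 KB as 1,000 bytes) gives a quick sanity check on the ~5,000 pJ figure:

```python
# Energy per 32-bit word read from NVMe, expanding the e_ssd comment above.
uj_per_kb = 1.2                       # ~1.2 uJ to read 1 KB from flash
pj_per_byte = uj_per_kb * 1e6 / 1000  # 1 uJ = 1e6 pJ; 1 KB taken as 1,000 bytes
pj_per_32b = pj_per_byte * 4          # 32 bits = 4 bytes

print(f"~{pj_per_32b:,.0f} pJ per 32b read")  # ~4,800 pJ, i.e. the table's ~5,000
```

The network figure (`e_net`, ~10,000 pJ) comes out larger than the same naive conversion would suggest because, per the source comment, it folds in per-packet header overhead.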
#### A Concrete Example: The A100 Analysis {#sec-machine-foundations-concrete-example-a100-analysis-5b30}
@@ -905,6 +920,22 @@ node[align=center,left]{Larger Capacity\\ Lower Cost}(L);
```
:::

The memory hierarchy is the fundamental physical constraint of machine learning systems. @tbl-physical-hierarchy-ref consolidates the physical properties—latency, bandwidth, and energy—across the entire stack.

| **Layer**             | **Technology** | **Latency**                                    | **Bandwidth**                                       | **Energy (per 32b)**                       |
|:----------------------|:---------------|-----------------------------------------------:|----------------------------------------------------:|-------------------------------------------:|
| **Registers**         | Flip-Flops     | ~0.3 ns                                        | —                                                   | `{python} AppendixMachineSetup.e_reg` pJ   |
| **L1 Cache**          | SRAM           | ~`{python} AppendixMachineSetup.l1_ns` ns      | —                                                   | `{python} AppendixMachineSetup.e_l1` pJ    |
| **L2 Cache**          | SRAM           | ~`{python} AppendixMachineSetup.l2_ns` ns      | —                                                   | `{python} AppendixMachineSetup.e_l2` pJ    |
| **Memory (Local)**    | HBM3           | ~`{python} AppendixMachineSetup.hbm_ns` ns     | `{python} AppendixMachineSetup.bw_hbm_str` GB/s     | `{python} AppendixMachineSetup.e_dram` pJ  |
| **Interconnect**      | NVLink 4.0     | ~`{python} AppendixMachineSetup.nvlink_ns` ns  | `{python} AppendixMachineSetup.bw_nvlink_str` GB/s  | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Host Link**         | PCIe Gen5      | ~`{python} AppendixMachineSetup.pcie_ns` ns    | `{python} AppendixMachineSetup.bw_pcie_str` GB/s    | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **System RAM**        | DDR5           | ~100 ns                                        | `{python} AppendixMachineSetup.bw_dram_str` GB/s    | ~`{python} AppendixMachineSetup.e_dram` pJ |
| **Network (Fabric)**  | InfiniBand NDR | ~`{python} AppendixMachineSetup.ib_ns` ns      | `{python} AppendixMachineSetup.bw_net_str` GB/s     | `{python} AppendixMachineSetup.e_net` pJ   |
| **Storage (Local)**   | NVMe SSD       | ~`{python} AppendixMachineSetup.ssd_ns` ns     | `{python} AppendixMachineSetup.bw_ssd_str` GB/s     | `{python} AppendixMachineSetup.e_ssd` pJ   |

: **Physical Properties of the Memory Hierarchy (c. 2024)**: Consolidating latency, bandwidth, and energy across the memory hierarchy. The hierarchy spans five orders of magnitude in latency and six orders of magnitude in energy per access. For the ML engineer, this table defines the "Silicon Contract": every optimization that moves data one layer higher in the hierarchy delivers an order-of-magnitude dividend in performance. {#tbl-physical-hierarchy-ref}

The hierarchy's energy costs reveal why data movement dominates modern system design.
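
As a quick check on the caption's orders-of-magnitude claim, the table's endpoints can be compared directly. The register values are taken from the table; the NVMe latency is an assumed ~100 µs, since the rendered `ssd_ns` value is not shown in this diff:

```python
import math

reg_latency_ns = 0.3      # register access latency, from the table
ssd_latency_ns = 100_000  # assumption: ~100 us NVMe read (not shown in this diff)
reg_energy_pj = 0.01      # register access energy, from the table
net_energy_pj = 10_000    # network fabric energy per 32b, from the table

latency_span = math.log10(ssd_latency_ns / reg_latency_ns)
energy_span = math.log10(net_energy_pj / reg_energy_pj)

print(f"latency spans ~{latency_span:.1f} orders of magnitude")  # ~5.5
print(f"energy spans ~{energy_span:.1f} orders of magnitude")    # 6.0
```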

::: {.callout-notebook title="The High Cost of Data Movement"}
@@ -1988,6 +1988,37 @@ class BandwidthBottleneck:
    video_width_str = fmt(width, precision=0, commas=False)
    video_height_str = fmt(height, precision=0, commas=False)
    bytes_per_pixel_str = fmt(bpp, precision=0, commas=False)


class DataLocalityInvariant:
    """
    Namespace for the Data Locality Invariant.

    Scenario: 4K video stream vs. cloud offload.
    """
    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    width = 3840
    height = 2160
    bpp = 3
    fps = 60
    net_bw_mbps = 100  # Home broadband
    cloud_lat_ms = 100
    edge_inf_ms = 10

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    frame_mb = (width * height * bpp) / 1e6
    tx_time_ms = (frame_mb * 8 / net_bw_mbps) * 1000
    remote_total_ms = cloud_lat_ms + edge_inf_ms

    # Step 1: Decision
    must_be_local = tx_time_ms > remote_total_ms

    # ┌── 3. GUARD (Invariants) ────────────────────────────────────────────
    check(must_be_local, "4K video should require locality at 100Mbps!")

    # ┌── 4. OUTPUT (Formatting) ───────────────────────────────────────────
    frame_mb_str = fmt(frame_mb, precision=0)
    tx_time_ms_str = fmt(tx_time_ms, precision=0)
    remote_ms_str = fmt(remote_total_ms, precision=0)
    net_bw_str = f"{net_bw_mbps}"
```
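
`fmt` and `check` are the book's own helpers, so the block above does not run on its own; the same arithmetic in plain Python:

```python
# Standalone restatement of the DataLocalityInvariant numbers above.
width, height, bpp = 3840, 2160, 3  # one uncompressed 4K RGB frame
net_bw_mbps = 100                   # home broadband uplink
cloud_lat_ms = 100                  # network round trip
edge_inf_ms = 10                    # inference time

frame_mb = width * height * bpp / 1e6           # ~24.9 MB per frame
tx_time_ms = frame_mb * 8 / net_bw_mbps * 1000  # ~1,991 ms to push one frame
remote_total_ms = cloud_lat_ms + edge_inf_ms    # 110 ms

assert tx_time_ms > remote_total_ms, "4K video should require locality at 100 Mbps"
print(f"transmit: {tx_time_ms:,.0f} ms vs. remote: {remote_total_ms} ms")
```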

::: {.callout-notebook title="The Bandwidth Bottleneck"}
@@ -2092,7 +2123,41 @@ plt.show()
```
:::
### The Data Locality Invariant {#sec-ml-systems-data-locality-invariant}
\index{Data Locality Invariant!definition} \index{bandwidth-latency trade-off}The decision between local edge processing and remote cloud processing is governed by the **Data Locality Invariant**. This principle establishes that data *must* stay local when the time to transmit it exceeds the total time for remote processing (including network latency and remote compute).

::: {.callout-definition title="The Data Locality Invariant"}

***The Data Locality Invariant*** states that a workload necessitates local processing whenever the transmission delay ($D_{vol}/BW_{net}$) dominates the remote response time:

$$\text{Data Locality} \iff \frac{D_{vol}}{BW_{net}} > L_{net} + \frac{O}{R_{peak, remote}}$$

1. **Significance (Quantitative):** It defines the **Locality Crossover**, the point where adding cloud compute (increasing $R_{peak}$) yields zero benefit because the "Pipe" ($BW_{net}$) is too narrow for the "Volume" ($D_{vol}$).
2. **Distinction (Durable):** Unlike **The Iron Law**, which optimizes for **Time**, the Locality Invariant optimizes for **Architectural Feasibility** by identifying when network physics forbids remote offloading.
3. **Common Pitfall:** A frequent misconception is that 5G/6G "solves" locality. While these improve $BW_{net}$, they do not reduce $L_{net}$ below the Light Barrier, meaning latency-critical tasks remain inherently local.

:::

::: {.callout-notebook title="Napkin Math: The Locality Crossover"}

\index{locality crossover!worked example}**Problem**: Should a drone's object-avoidance system (4K, 60 FPS) offload to the cloud?

**The Variables**:

- **Data ($D_{vol}$)**: 4K frame ≈ `{python} DataLocalityInvariant.frame_mb_str` MB.
- **Bandwidth ($BW_{net}$)**: `{python} DataLocalityInvariant.net_bw_str` Mbps home broadband (uplink).
- **Remote Latency ($L_{net}$)**: `{python} DataLocalityInvariant.remote_ms_str` ms (round trip + remote compute).

**The Calculation**:

1. **Transmission Time**: `{python} DataLocalityInvariant.frame_mb_str` MB $\times$ 8 bits/byte $\div$ `{python} DataLocalityInvariant.net_bw_str` Mbps = **`{python} DataLocalityInvariant.tx_time_ms_str` ms**.
2. **Remote Response**: **`{python} DataLocalityInvariant.remote_ms_str` ms**.

**The Systems Conclusion**: Since `{python} DataLocalityInvariant.tx_time_ms_str` ms $\gg$ `{python} DataLocalityInvariant.remote_ms_str` ms, the system is **Bandwidth Blocked**. The cloud could have an infinite processor ($R_{peak} = \infty$), but the drone would still crash because it cannot move the bits fast enough. This workload is **Locality Mandatory**.
:::
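
The same numbers also locate the crossover bandwidth. The helper below is an illustrative sketch (not from the book), with the $O/R_{peak, remote}$ term collapsed into a single remote-compute time:

```python
def must_stay_local(d_vol_mb, bw_net_mbps, l_net_ms, remote_compute_ms):
    """Data Locality Invariant: does transmit time exceed remote response time?"""
    tx_ms = d_vol_mb * 8 / bw_net_mbps * 1000  # D_vol / BW_net, in ms
    return tx_ms > l_net_ms + remote_compute_ms

frame_mb = 3840 * 2160 * 3 / 1e6  # the drone's 4K frame, ~24.9 MB

# At 100 Mbps the invariant holds: the workload is locality mandatory.
print(must_stay_local(frame_mb, 100, 100, 10))  # True

# Crossover: the uplink speed at which remote offload first breaks even.
crossover_mbps = frame_mb * 8 * 1000 / (100 + 10)
print(f"crossover ≈ {crossover_mbps:,.0f} Mbps")  # roughly 1.8 Gbps
```

Even a 1 Gbps uplink leaves the drone on the wrong side of the crossover; only the ~110 ms remote-response budget, not remote compute power, sets the bar.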

Edge ML provides quantifiable benefits that address key cloud limitations. The most immediate is latency: response times drop from 100--500 ms in cloud deployments to 1--50 ms at the edge, enabling safety-critical applications that demand real-time response. Bandwidth savings compound this advantage—a retail store with 50 cameras streaming video can reduce transmission requirements from 100 Mbps (costing \$1,000--2,000 monthly) to less than 1 Mbps by processing locally and transmitting only metadata, a 99% reduction. Privacy strengthens in turn, because local processing eliminates transmission risks and simplifies regulatory compliance. For industrial deployments, operational resilience is the decisive advantage: systems continue functioning during network outages, a property essential for manufacturing, healthcare, and building management applications where downtime carries immediate cost.
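
The retail-store claim reduces to one line of arithmetic (rates as stated in the paragraph above):

```python
stream_mbps = 100   # 50 cameras streaming raw video to the cloud
metadata_mbps = 1   # upper bound after local processing (metadata only)

reduction_pct = (1 - metadata_mbps / stream_mbps) * 100
print(f"{reduction_pct:.0f}% reduction in transmitted bandwidth")  # 99%
```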

These benefits carry corresponding limitations that compound as deployments scale. Limited computational resources[^fn-edge-resource-limits] sharply constrain model complexity: edge servers often provide an order of magnitude or more less processing throughput than cloud infrastructure, limiting deployable models to millions rather than billions of parameters. Managing distributed networks introduces complexity that scales nonlinearly with deployment size, because coordinating version control and updates across thousands of devices requires sophisticated orchestration systems[^fn-edge-fleet-ops], and hardware heterogeneity across diverse platforms demands different optimization strategies for each target.